
Assessing the function of genetic variants in candidate gene association studies

Abstract

Knowledge of inherited genetic variation has a fundamental impact on understanding human disease. Unfortunately, our understanding of the functional significance of many inherited genetic variants is limited. New approaches to assessing functional significance of inherited genetic variation, which combine molecular genetics, epidemiology and bioinformatics, promise to enhance reproducibility and plausibility of associations between genotypes and disease.
REVIEWS
NATURE REVIEWS | GENETICS VOLUME 5 | AUGUST 2004 | 589
The human genome contains about ten million SNPs,
with an estimated two common missense variants per
gene (REF. 1). At least five million SNPs have already been
reported in public databases (REFS 2,3). Many of these variants
might be involved in human disease aetiology, but it is
often difficult to assess their function on the basis of
nucleotide sequence alone. This is particularly true
when variants do not alter an amino acid or do not dis-
rupt a well-characterized motif that affects protein
function or structure. In addition, only a small subset
of variants that affect the phenotype will confer small
to moderate effects on phenotypes that are causally
related to disease risk. So, an important challenge that
faces molecular epidemiological association studies of
candidate disease-susceptibility genes is to define the
variants that are functionally implicated in disease.
This is particularly urgent because the amount of
genomic information that is available greatly exceeds
the information about the function of variants that are
used in human disease studies. There is insufficient
guidance for molecular epidemiologists to optimally
select variants for an epidemiology study, highlighting
the need for methods that prioritize the choice of
genetic variants to be genotyped in molecular epidemiological
studies (REF. 4). Here, we bring together examples of
experimental, population genetics and evolutionary
approaches to assessing variant function, and evaluate
the potential for using this information to improve
inferences from disease association studies.
Genomic variation and molecular epidemiology
The protein coding sequences of the human genome
contain approximately 100,000–300,000 common
SNPs, and additional SNPs lie within putative regula-
tory regions of genes that might be relevant for studies
of human health and disease (REF. 1). Regulatory and coding
SNPs are of particular interest to molecular epidemio-
logical association studies. Non-synonymous SNPs
(nsSNPs), or missense variants, translate into amino-
acid polymorphisms in the proteins they encode.
Regulatory SNPs (rSNPs) can affect the expression,
tissue-specificity or function of relevant proteins. It
seems that both nsSNPs and rSNPs are relatively rare
compared with the total number of SNPs in the human
genome (REF. 5). The rarity of nsSNPs might be a consequence
of selection against the functional disruptions of
amino-acid variation. Some molecular functional diver-
sity is attributable to the effects on protein function
caused by nsSNPs. For example, the kinetic parameters
of enzymes, the DNA-binding properties of proteins
that regulate transcription, the signal transduction
activities of transmembrane receptors, and the architec-
tural roles of structural proteins are all susceptible to
perturbation by nsSNPs and their associated amino-acid
polymorphisms.
In addition to ongoing efforts to identify and charac-
terize these genetic variants, molecular epidemiological
disease association studies are under way to better under-
stand the role of inherited genetic variation in disease.
Timothy R. Rebbeck*, Margaret Spitz
and Xifeng Wu
*Department of Biostatistics
and Epidemiology, and
Abramson Cancer Center,
University of Pennsylvania
School of Medicine, 904
Blockley Hall, 423 Guardian
Drive, Philadelphia,
Pennsylvania 19104, USA.
Department of
Epidemiology,
M. D. Anderson Cancer
Center, 1515 Holcombe
Boulevard, Houston,
Texas 77030, USA.
Correspondence to T.R.R.
e-mail: trebbeck@
cceb.med.upenn.edu
doi:10.1038/nrg1403
LINKAGE DISEQUILIBRIUM
The observation that two or
more alleles, usually at loci that
are physically close together on a
chromosome, are not inherited
independently but are observed
to occur together more
frequently than predicted under
Mendel’s law of independent
assortment.
NIFEDIPINE
A calcium-blocker drug (also
called Procardia) that was one of
the first drugs recognized to be
metabolized by CYP3A4, and for
which a regulatory element
specific to the CYP3A4 gene was
named.
MENARCHE
The first occurrence of
menstruation in a woman.
tissue-specific effects or experimental conditions. For
example, the functional effect of a variant might be small
in magnitude in an experimental system, but might
become more important in a specific human tissue or on
exposure to relevant environmental agents. Similarly,
large experimental effects might be observed that reflect
negligible in vivo effects in humans. Likewise, effects that
are small in magnitude in a single-time-point experiment
might be amplified or only have a phenotypically rele-
vant effect over long time periods. Indeed, most genetic
variants that are studied in complex traits and disease
would be expected to have small effects on function. So,
the use of model systems that are appropriate to detect
large effects (for example, as might be observed in
high-penetrance ‘inborn errors of metabolism’ settings) might
not be optimal.
As outlined above, experimental evidence of genetic
variant function might not be consistent with results of
epidemiological studies. For example, CYP3A4 metabo-
lizes drugs and other compounds, including steroid
hormones, that are important in the aetiology of many
common diseases (REF. 11). A variant has been identified in the
CYP3A4 promoter (denoted CYP3A4*1B) that consists
of an A→G nucleotide substitution at position –290
(denoted A–290G) in the NIFEDIPINE-specific element
(NFSE) (REF. 12). Subsequently, the CYP3A4*1B variant was
epidemiologically associated with various characteris-
tics, including a more advanced stage of prostate
tumours (REFS 13,14), decreased risk for treatment-related
leukaemia (REF. 15), early age of MENARCHE (REFS 16,17) and increased
plasma levels of insulin-like growth factor-I among
users of oral contraception (REF. 18). Although these data provide
evidence for functionally significant effects of
CYP3A4*1B, the basic science literature has not consis-
tently supported the hypothesis that CYP3A4*1B has a
functionally significant effect. Hashimoto et al. (REF. 12) identified
several regulatory elements, including a putative
repressor fragment and an NFSE in the CYP3A4
promoter, which indicates that variants in this promoter
region might affect CYP3A4 transcription (FIG. 1).
Lamba et al. (REF. 19) reported that CYP3A4*1B alleles were
found significantly more frequently in Caucasians with
low CYP3A4 protein levels than in those with higher
levels of the protein.
Many authors (REFS 20–26) have studied the relationship of
CYP3A4 expression or function to the CYP3A4*1B promoter
(FIG. 1). Most of them concluded that there were
no biologically meaningful effects, given the small magnitude
of effects that were observed. However, most
studies reported consistently elevated expression associated
with CYP3A4*1B (a 20%–200% increase over the
consensus CYP3A4*1A) (REFS 20–26).
On the basis of these data, it is possible that
CYP3A4*1B has only a small to moderate phenotypic
effect. These small phenotypic effects will probably not
have a clinically meaningful impact on drug disposi-
tion. It is not clear, however, whether this magnitude of
phenotypic perturbation is sufficient to alter metabo-
lism on environmental exposure to steroid hormones
or other agents that might confer disease risk over the
lifetime of an individual. For example, would a 20%
A major challenge for epidemiologists undertaking
candidate gene–disease association studies is to choose
target SNPs that are most likely to affect the phenotype
and that ultimately contribute to disease development.
Variants in biologically plausible candidate genes are
usually selected for study on the basis of both variant
allele frequency and the functional effect of the variant on
relevant traits. Although there is often sufficient informa-
tion to assess the allele frequency of a candidate variant,
understanding the functional significance of genetic
variants is usually more difficult.
Knowledge of gene and SNP function is crucial to
direct the appropriate design and interpretation of can-
didate gene association studies. This is in contrast to
genome-wide scans and other approaches that rely on
LINKAGE DISEQUILIBRIUM between loci to identify
genotype–disease associations, for which knowledge of function is
not required. Most epidemiologists are not trained in
experimental laboratory methods and do not maintain
laboratories in which detailed laboratory assessment of
variant function in target tissues or individuals of interest
can be made. They must therefore often rely on making
an assessment of the function of a candidate gene or
SNP from the published literature. However, published
experimental evidence might not be adequate to guide
the design or interpretation of a molecular epidemio-
logical study. For example, inadequate information
about the function of a genetic variant impedes the ability
to evaluate whether an observed association reflects
a causative event with a plausible biological
mechanism, whether the association is a reflection of
linkage disequilibrium with a truly causative variant, or
whether the association represents a false positive result.
The lack of reproducibility of many association studies
might reflect the number of studies that involve genetic
variants with no functional significance (REFS 6,7).
Experimental approaches
Inferences about function of a genetic variant can be
made using experimental systems, including in vitro sys-
tems and in vivo animal models. These include the effect
of variants on regulatory-region control of DNA
expression, RNA stability or degradation, protein struc-
ture, protein denaturing, protein expression, other mea-
sures of in vivo and in vitro control of protein levels, and
tissue specificity (REFS 8–10). Because the scope of experimental
approaches for assessing SNP function is large, the pur-
pose of this section is not to provide a comprehensive
review of experimental approaches for evaluating SNP
function, but instead to provide a context in which epi-
demiological studies can use experimental data in the
design and interpretation of association studies that
involve candidate genes.
These approaches can provide the strongest evidence
for the functional role of a genetic variant, but can also be
difficult to interpret in the context of studies of complex
human traits. The extent of the effects of genetic variants
on relevant phenotypes involved in the disease process is
probably small. These effects may be highly dependent
on the context in which the genetic variant is acting,
including influence from environmental exposure,
further compounded if the magnitude of functional
effects of individual variants is small, leading to a greater
potential for artificial masking or enhancement of func-
tional effects in experimental systems. Therefore, the
ability to detect functionally relevant effects of genetic
variants might be highly dependent on the context of
the experimental system used, and experimental sys-
tems might not reflect relevant in vivo effects in
humans. Given the differences between the relevant
phenotypes in humans and experimental systems,
information obtained from the latter might not be con-
sistent with results of epidemiological association stud-
ies. Therefore, it might not be possible to make clear
inferences about in vivo genotype function in humans
from experimental data.
Inconsistencies between experimental data and
epidemiological studies can also reflect a potential
study bias that is inherent to some epidemiological
investigations. Poor study design, including insuffi-
cient statistical power, could influence the results of an
epidemiological study to produce false positive or
false negative inferences (BOX 1).
In addition to more traditional experimental
approaches that are used to assess SNP function, a wide
variety of novel approaches for assessing gene function
have been proposed, including gene tagging (REF. 27), gene
trapping (REF. 28), N-ethyl-N-nitrosourea (ENU) mutagenesis (REF. 29),
proteomics methods (REF. 30) and evaluation of epigenetic
mechanisms (REF. 31). The HaploCHIP method (REF. 32) is an illustrative
example. HaploCHIP analyzes SNPs that affect
gene regulation in vivo using chromatin immunopre-
cipitation and mass spectrometry to identify differential
protein–DNA binding in vivo that is associated with
greater metabolism of testosterone by CYP3A4*1B over
the course of a man’s lifetime (as indicated by the data
in FIG. 1) sufficiently increase the risk of prostate cancer
to the extent that it could be detected in an epidemio-
logical context? If so, the apparent discrepancy between
epidemiological associations of this genetic variant and
the functional effects of this variant in experimental
systems might not exist.
Variation in experimental approaches used to assess
the function of genetic variants in complex metabolic
systems could affect inferences about function (see FIG. 1
and BOX 1). Results might vary depending on the specific
experimental conditions, unrecognized effects of
regulatory elements, and post-transcriptional or post-
translational processing. Hashimoto et al. (REF. 12) suggest
that removing the repressor region upstream of the
CYP3A4*1B variant sequence might unmask a promoter
effect. Studies that evaluate regulatory region genetic
variants in CYP3A4 might differ substantially depending
on whether this region is present or absent in the experi-
mental systems. In fact, various constructs were used to
evaluate in vitro functional effects of CYP3A4*1B (FIG. 1),
and this experimental variability could have influenced
the inferences made in these studies. The inclusion of
enhancer elements that might be required in expression
assays can similarly affect expression. Regulation of
expression might only become apparent under exposure
to the relevant compounds in the proper cellular context.
The use of primary cells versus cell culture systems can
also affect inferences about function. The use of different
cell or tissue types (for example, see FIG. 1) might reflect
different regulatory influences that affect the functional
assessment of a genetic variant. These effects might be
Schematic labels: putative repressor region, NFSE, *1B, ATG.

Cell system          Effect of CYP3A4*1B versus CYP3A4*1A              Refs
Human hepatocytes    190% Increase in testosterone 6β-oxidation         20
MCF7                 90% Increase in luciferase expression              21
HepG2                40% Increase in luciferase expression              21
Human hepatocytes    40% Increase in nifedipine oxidase activity        22
Human hepatocytes    110% Increase in CYP3A4 protein expression         22
HepG2                No change in luciferase expression                 23
HepG2 + enhancer     20% Increase in luciferase expression              23
MCF7                 20% Increase in luciferase expression              25
HepG2                40% Increase in luciferase expression              25
Human hepatocytes    20–90% Increase in luciferase expression           25
In response to xenobiotics:
HepG2                20–117% Increase in transcriptional activation     26
HuH7                 75–147% Increase in transcriptional activation     26
CaCo-2               4–19% Increase in transcriptional activation       26

Construct end positions shown on the bars, in printed order: –490 +10; +53–362; +53–362; –1019 –6; –570 +22; –570 +22; –572 –6; –572 –6; –572 –6; –1203 –61; –1203 –61; –1203 –61; –1019 –6.

Figure 1 | In vitro studies of the effect of CYP3A4*1B compared with CYP3A4*1A. At the top of the figure on the left is a
schematic representation of the structure of the CYP3A4 region. Blue bars represent the genomic regions that are used in the
CYP3A4-containing constructs for each study. The positive and negative numbers on each of the bars represent the position of the
terminal base pairs of these regions relative to the position of the SNP. The studies summarized here encompass substantial
variability in experimental conditions. Nonetheless, there is a small but consistent effect of increased CYP3A4 expression in the
presence of CYP3A4*1B. NFSE, nifedipine-specific element.
pleiotropic effects on disease risk. This should be a main
focus of future studies attempting to assess the func-
tional significance of genetic variants.
Population genetics approaches
Knowledge of population genetic structure might
provide insight into the functional relevance of a
genetic variant on a disease trait. For example, genetic
variants with a functionally significant impact on rele-
vant phenotypes, including disease endpoints, might be
more likely to deviate from expected allele and genotype
frequencies compared with alleles that are functionally
neutral. It remains unclear whether there is evidence of
selection for or against low penetrance alleles over evo-
lutionary time, although it is clear that selection has
influenced the pattern of genetic variability in the
human genome (REFS 33,34). However, a number of diseases that
could confer selective pressures on genes of interest are
relatively new with respect to evolutionary genetics history.
For example, diabetes and obesity might confer
disease risks that could lead to selective pressure on
allele or genotype frequencies, but these effects still
occur relatively late in life and have been a problem to
human health on a population level for only a few gen-
erations. So, whilst substantial new information about
evolutionary genetics history and low penetrance SNPs
is becoming available, the link between this knowledge
and the functional importance of specific SNPs has yet
to be fully elucidated.
Deviations from expected allele or genotype frequen-
cies across relevant phenotypic groups could be used to
identify alleles that are more likely to be functionally asso-
ciated with disease aetiology. For example, alleles that are
truly causative of a disease state would be expected to be
disproportionately over-represented among cases with
the disease but under-represented among the disease-free
controls (REFS 35,36). As a result, deviations from
HARDY-WEINBERG PROPORTIONS or other measures of population
frequency might provide clues to the role of genes in the
aetiology of this disease. For example, Hoh et al. (REF. 37) developed
a so-called ‘set association’ approach that evaluates
sets of SNPs at various positions in the genome.
Information about allelic association and Hardy-Weinberg
equilibrium is combined over multiple
markers in the genome. A genome-wide TEST STATISTIC is
generated by summing contributions from many
SNPs located in different genomic regions. The method
performs a significance test on several sets of loci
simultaneously, while using conservative measures of
inference to control for the potential of false positive
associations. Hoh et al. (REF. 37) applied their approach to an
SNP association study of RESTENOSIS. Among 779 patients
with heart disease, 342 showed restenosis (cases) 6
months after ANGIOPLASTY. The remaining patients, on
whom angioplasty was not performed, were the controls.
Eighty-nine SNP markers were genotyped in 62
candidate genes for each individual. The algorithm identified
nine genes that conferred susceptibility to the disease.
Unfortunately, despite promising results, the
method is sensitive to genotyping errors. Population
ADMIXTURE, in which cases and controls belong to different
allelic variants of a gene as a surrogate measure of tran-
scriptional activity. This allele-specific quantification
method uses haplotype-specific chromatin immunopre-
cipitation (CHIP) to measure the amount of phosphory-
lated RNA polymerase II that is bound to different alleles,
thereby estimating the differences in protein binding
between the alleles. This assay can be adapted for high-
throughput analysis because of its sensitivity and ability
to quantify the relative abundance of two different alleles
in a sample of immunoprecipitated chromatin.
Using this approach, Knight et al. (REF. 32) showed that
there was a close correlation between the level of bound
phosphorylated (active) RNA polymerase II at the
imprinted small nuclear ribonucleoprotein polypeptide
N locus and allele-specific expression. They also used
this method to identify an SNP of the cytokine tumour
necrosis factor (TNF), which plays a pivotal role in
inflammation, immunity and apoptosis, although the
involvement of this SNP in modulating TNF transcrip-
tion is yet to be shown. Application of the HaploCHIP
approach to the TNF/lymphotoxin-α (LTA) locus iden-
tified functionally important haplotypes that correlate
with allele-specific transcription of LTA (REF. 32).
Despite reports of both traditional and new ways of
assessing the function of a genetic variant, little has been
done to determine whether genotypic differences that
lead to small phenotypic perturbations are functionally
significant in the context of complex human disease. No
criteria have been established to evaluate the impor-
tance of genetic variants with small but potentially
HARDY-WEINBERG PROPORTIONS
The binomial distribution of
genotypes (that is, frequencies of
genotypes AA, Aa and aa will be
p², 2pq and q², respectively,
where p is the frequency of
allele A, and q is the frequency
of allele a) that results in a
population when there are no
external pressures that cause
deviations from p², 2pq and q².
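As a rough illustration of how a deviation from Hardy-Weinberg proportions might be quantified, the following Python sketch computes expected genotype counts from observed counts and a chi-square statistic with one degree of freedom. The function name and the genotype counts are hypothetical and are not taken from any study discussed here; the sketch assumes a single biallelic locus.

```python
import math

def hardy_weinberg_test(n_AA, n_Aa, n_aa):
    """Chi-square goodness-of-fit test against Hardy-Weinberg proportions.

    Hypothetical helper for illustration; assumes one biallelic locus.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # frequency of allele A
    q = 1.0 - p                        # frequency of allele a
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p, expected, chi2, p_value

# Made-up genotype counts: 30 AA, 40 Aa, 30 aa.
p, expected, chi2, p_value = hardy_weinberg_test(30, 40, 30)
print(p, expected, chi2, round(p_value, 4))
```

With these made-up counts the statistic is 4.0, slightly above the conventional 3.84 threshold for one degree of freedom, so the sample would be flagged as deviating from Hardy-Weinberg proportions at the 5% level.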
TEST STATISTIC
A quantity whose value is used
to decide whether or not the null
hypothesis should be rejected,
usually based on quantities
computed using observed data.
RESTENOSIS
The constriction, narrowing or
blockage of a coronary artery
after an initial treatment such as
angioplasty aimed at removing
this blockage.
ANGIOPLASTY
An operation that is used to
repair a damaged blood vessel or
unblock a coronary artery.
Box 1 | Factors influencing consistency of gene–disease associations
Variables affecting inferences from experimental studies:
• In vitro or in vivo system studied
  - Cell type studied
  - Cultured versus fresh cells studied
  - Genetic background of the system
• DNA constructs
  - DNA segments that are included in functional (for example, expression) constructs
  - Use of additional promoter or enhancer elements
• Exposures
  - Use of compounds that induce or repress expression
  - Influence of diet or other exposures on animal studies
Variables affecting epidemiological inferences:
• Inclusion/exclusion criteria for study subject selection
• Sample size and statistical power
• Candidate gene choice
  - A biologically plausible candidate gene
  - Functional relevance of the candidate genetic variant
  - Frequency of allelic variant
• Statistical analysis
  - Consideration of confounding variables, including ethnicity, gender or age
  - Whether an appropriate statistical model was applied (for example, were interactions
    considered in addition to main effects of genes?)
  - Violation of model assumptions
multiple alignment information of these sequences to
estimate ‘tolerance indices’ that predict tolerated and
deleterious (that is, intolerant) substitutions for every
position of the query sequence. Substitutions at each
position with normalized tolerance indices that are
below a chosen cut-off point are predicted to be delete-
rious. Substitutions that are greater than or equal to the
cut-off point are predicted as being tolerated (that is,
putatively non-functional). Using three examples
(LacI, HIV-I protease and bacteriophage T4 lysozyme),
Ng et al. (REF. 48) showed that a high proportion of substitutions
that are predicted to be deleterious by SIFT did
affect phenotypes in experimental assays. However,
some positions that were predicted to be intolerant by
SIFT were tolerated in experimental assays. In these
cases, the positions are usually involved in an unknown
function that the assay does not detect.
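The cut-off rule described above can be sketched in a few lines of Python. This is only an illustration of the thresholding step, not SIFT itself: the tolerance indices below are invented, the substitution labels are hypothetical, and the cut-off of 0.05 is an assumed value chosen for the example.

```python
def classify_substitutions(tolerance_indices, cutoff=0.05):
    """Label each substitution by comparing its normalized tolerance index to a cut-off.

    Indices below the cut-off are predicted deleterious; indices greater
    than or equal to it are predicted tolerated. The cut-off is an assumption.
    """
    return {
        substitution: "deleterious" if index < cutoff else "tolerated"
        for substitution, index in tolerance_indices.items()
    }

# Invented tolerance indices for three hypothetical substitutions.
example = {"L45P": 0.01, "A102V": 0.05, "S210T": 0.62}
predictions = classify_substitutions(example)
print(predictions)
```

Note that a substitution whose index equals the cut-off is treated as tolerated, which follows the "greater than or equal to" wording of the rule.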
There is also evidence that SIFT fails to identify
residues that are vital for protein function but that have
not been conserved throughout the family (REF. 48). SIFT predictions
are based on sequence data alone and do not require
knowledge of protein structure or function. So, substitu-
tions in uncharacterized proteins can be evaluated by
SIFT only when homologous sequences are provided.
Recently, Zhu et al. (REF. 6) compared the relationship of
tolerance to amino-acid change, as predicted by SIFT,
with the reported association with SNPs in cancer-related
genes (FIG. 2). For the study, 166 published
case-control studies that reported associations in 46
SNPs, in 39 different genes, in 16 different cancer
sites, were chosen. All of these SNPs are located in
biologically plausible cancer-related genes, such as
those involved in DNA repair, carcinogen metabolism
or CELL CYCLE CHECKPOINTS. The putative functional significance
of these affected SNPs was calculated using tolerance
indices from SIFT by comparing sequences
from different species. The analyses showed a signifi-
cant inverse correlation between estimates of cancer
risk as assessed by the ODDS RATIO (OR) and the tolerance
indices of the amino-acid variants. These findings
indicate that SNPs that alter conserved amino acids
are more likely to be associated with cancer susceptibility.
So, using a molecular evolutionary approach
might help identify SNPs to be genotyped in future
molecular epidemiological studies.
Bayesian phylogenetic analysis is another compara-
tive evolutionary method that identifies potential func-
tionally important amino-acid sites that disrupt gene
function (REF. 52). This approach aligns amino-acid sequences
for different species and constructs a phylogenetic tree.
It then obtains maximum likelihood estimates of
nucleotide substitution rates, identifies conserved
regions and calculates ancestral sequences. Finally,
regions that evolve under positive selection are iden-
tified and compared with the distribution of con-
served sites with missense changes that have been
reported in public databases. Fleming et al. (REF. 52) used this
approach, as well as the SIFT program, to identify mis-
sense changes in exon 11 of the BRCA1 gene. The phy-
logenetic approach inferred that 38 of 139 missense
changes affected function in exon 11. By contrast, SIFT
ethnic groups with different SNP allele frequencies, can
also adversely distort the test results. Therefore,
although incorporating population genetics informa-
tion can provide further information to a genotype–dis-
ease association study, additional research is required to
determine the conditions under which these approaches
will be optimally applied.
Evolutionary and structural approaches
Natural selection shapes patterns of genetic variation in
populations such that favourable variants increase in
frequency relative to less favourable variants over time.
Mutations in functionally important sites can be elimi-
nated or kept at low frequencies. So, nucleotide or
amino-acid residues that have been conserved across
species or within a gene family are more likely to be
involved in regulation of vital functions, expression or
tissue specificity than residues that are not conserved.
Knowledge of the functional domains can therefore be
useful when assessing the functional impact of an
amino-acid change. The value of this knowledge was
demonstrated early on for the effects of mutations in the
primary sequence of haemoglobin (REF. 38). In this case, the molecular
basis of the clinical effects caused by mutations
could be inferred as soon as the structural information
became available. These pioneering studies recognized
crucial links between the putative functional motifs and
potential effects of mutations on function.
Recently, a number of computational algorithms
have been developed to predict the impact of nucleotide
or amino-acid substitutions on protein structure,
expression and function. Evolutionary conservation of
the amino-acid sequence can be determined by align-
ing amino-acid sequences of related proteins from
unrelated organisms or across gene families. A number
of algorithms have been proposed that use DNA or
amino-acid sequence data to identify potentially func-
tional residues or domains in a comparison with
sequences in the public databases. These include meth-
ods that are based on protein three-dimensional (3D)
structure (REFS 39–44), methods that are based on evolutionary
considerations (REFS 45,46), and machine-learning approaches (REF. 47).
As examples of these approaches, we present two algorithms,
SIFT (‘sorting intolerant from tolerant’; REFS 48–50;
see Further information for website) and PolyPhen
(‘polymorphism phenotyping’; REF. 51; see Further information
for website). Both are based entirely around
amino-acid sequence, and are useful for predicting the
impact of exonic amino-acid substitutions on disease
risk. A third algorithm, known as CODDLE (‘choosing
codons to optimize discovery of deleterious lesions’; see
Further information for website) uses nucleotide
sequence (for example, a PCR amplicon) to identify all
potentially deleterious nucleotide substitutions, including
nonsense, frameshift, missense and splicing mutations.
The SIFT algorithm. This algorithm is based on the
assumption that amino-acid positions that are impor-
tant for the correct biological function of the protein
are conserved across the protein family and/or across
evolutionary history. SIFT uses protein sequence and
ADMIXTURE
Combining two or more
populations into a single group.
Combining two populations has
implications for studies of
genotype–disease associations if
the component populations
have different genotypic
distributions.
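A small numerical sketch in Python may help to show why admixture matters. The counts below are entirely hypothetical: in each of two populations the variant is unrelated to disease (odds ratio of 1), but the populations differ in carrier frequency and in the proportion of cases sampled, so pooling them produces an apparent association.

```python
def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Cross-product odds ratio for a 2x2 case-control table."""
    return (case_carriers * control_noncarriers) / (case_noncarriers * control_carriers)

# Hypothetical counts, in the order:
# (case carriers, case non-carriers, control carriers, control non-carriers).
population_a = (80, 20, 40, 10)   # variant common, mostly cases sampled
population_b = (10, 40, 20, 80)   # variant rare, mostly controls sampled

# Pool the two populations cell by cell.
pooled = tuple(a + b for a, b in zip(population_a, population_b))

or_a = odds_ratio(*population_a)
or_b = odds_ratio(*population_b)
or_pooled = odds_ratio(*pooled)
print(or_a, or_b, or_pooled)
```

Within each population the odds ratio is 1.0, but the pooled table gives 2.25, so the apparent association arises only from combining groups with different genotypic distributions.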
CELL CYCLE CHECKPOINTS
Steps in the normal sequence of
development and division in the
cell. Disruption can lead to
uncontrolled cell growth, and
possibly cancer.
ODDS RATIO
A measure of relative risk that is
usually estimated from
case-control studies.
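For readers less familiar with the calculation, the sketch below shows one common way to estimate an odds ratio and an approximate 95% confidence interval from a 2x2 case-control table. It uses Woolf's logit method, which is a choice made for this example rather than a method prescribed by the studies reviewed here; the function name and the counts are hypothetical.

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% confidence interval (Woolf's logit method).

    a: cases with the variant, b: cases without the variant,
    c: controls with the variant, d: controls without the variant.
    All four counts must be non-zero.
    """
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper

# Made-up counts: 40 of 100 cases and 20 of 100 controls carry the variant.
estimate, lower, upper = odds_ratio_with_ci(40, 60, 20, 80)
print(round(estimate, 2), round(lower, 2), round(upper, 2))
```

An interval that excludes 1 is usually read as evidence of association, although with small true effects the width of the interval depends heavily on sample size.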
known detrimental missense changes in several
domains of BRCA1, showing the utility of this approach
in identifying genetic variants that could be prioritized
in further association studies.
Concerted efforts to further understand the func-
tional role of genomic elements, including SNPs, are
also underway. For example, the National Institutes of
Health has set up a public research consortium known
as ‘Encode’ (‘encyclopaedia of DNA elements’; REF. 53),
with the goal of identifying and categorizing DNA func-
tional elements including transcriptional regulatory
sequences and determinants of chromosome structure
and function. This initiative involves comparing existing
computational and experimental approaches, and
developing new methods for the identification and
characterization of function.
Although these and other approaches can be used to
assist molecular epidemiologists in identification of
genetic variants in specific nucleotides or residues that
are most likely to be functionally significant, they will
probably provide only limited information about the
true function of a genetic variant. The methods
described above generally cannot identify functional
effects in non-coding regions, and cannot evaluate the
effect of other factors on function, such as exposures,
that might be involved in the aetiology of the disease.
Recently, Rogan et al. (REF. 54) proposed a computational
approach to evaluate the putative effect of splicing
mutations. They used information theory-based models
to evaluate the relationship between variants and predicted
splice sites and relevant phenotypes. The results
presented by Rogan et al. (REF. 54) corresponded to known
functions of these genes and serve as a model for additional
studies that evaluate the effect of both coding and
non-coding variation on functionality (see also REF. 55).
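The information-theory idea behind such splice-site models can be illustrated with a small sketch. The Python below is not the implementation used by Rogan et al.; it assumes a hypothetical set of aligned donor-like sites, estimates per-position base frequencies with a pseudocount (an assumption), and scores a site as the sum over positions of 2 + log2 f(b, l) bits, omitting small-sample corrections. Under these assumptions, a variant that lowers the score of a natural site would be flagged as potentially weakening it.

```python
import math
from collections import Counter

BASES = "ACGT"

def position_frequencies(aligned_sites, pseudocount=1.0):
    """Per-position base frequencies from equal-length aligned sites."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    total = len(aligned_sites) + pseudocount * len(BASES)
    freqs = []
    for pos in range(length):
        counts = Counter(site[pos] for site in aligned_sites)
        freqs.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return freqs

def site_information(site, freqs):
    """Sum over positions of 2 + log2 f(base, position), in bits."""
    return sum(2.0 + math.log2(freqs[pos][base]) for pos, base in enumerate(site))

# Hypothetical aligned sites, invented for illustration only.
known_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "GAGGTAAGT", "CAGGTGAGT"]
freqs = position_frequencies(known_sites)

reference = "CAGGTAAGT"
variant = "CAGATAAGT"  # hypothetical G>A change at index 3
delta_bits = site_information(variant, freqs) - site_information(reference, freqs)
print(f"change in information: {delta_bits:.2f} bits")
```

A negative change means the variant sequence resembles the aligned sites less than the reference does; how large a drop matters biologically is not something this sketch can decide.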
Having interpreted the results from SIFT and
related approaches, it might not be possible to identify
whether a single variant within a putative functional
domain is sufficient to disrupt function. Such app-
roaches require sequence data and might therefore be
limited by the sequence information that is available in
public databases. Therefore, inferences of conservation
(or the lack thereof) might simply reflect data limita-
tions. Inconsistencies of inference using different classes
of sequence data might also arise. For example, an
analysis of sequences among human gene families
might provide inferences that are different from across-
species sequence analysis. Although potentially helpful
in uncovering the role of a particular genetic region or
variant, such differences might also obscure inferences
about putative functional significance.
The PolyPhen algorithm. Missense variants might affect
protein folding, binding or interaction sites, as well as
the solubility or stability of the protein. These effects can
be estimated from physical considerations and from the
context of an amino-acid replacement within the family
of homologous proteins. Sunyaev et al. (REF. 56) demonstrated
that a significant fraction of missense variants (nsSNPs)
are likely to affect protein structure or function. Their
approach, implemented in the PolyPhen algorithm, uses
identified 36 of these 38 changes, and an additional 34.
Fleming et al. (REF. 52) hypothesized that SIFT predicted more
changes because substitutions in all taxa are given the
identical weight in SIFT. By contrast, under the assumption
of the phylogenetic method, if substitutions are not
shared by sister taxa, they are more likely to be sequencing
errors. In addition, non-conservative substitutions
maintained at sites in one pair of sister taxa are unlikely
to be functionally significant. Applying this approach,
Fleming et al. (REF. 52) successfully predicted >85% of the
[Figure 2: two scatter plots; axes are log2 ORs and PSIC score (panel a), and log2 ORs and tolerance index (panel b).]
Figure 2 | Relationship of in silico indices of SNP function
and associations (measured by log2-transformed odds
ratios) taken from the literature. a | The relationship
between the position-specific independent count (PSIC) score
difference for studied SNPs from PolyPhen (‘polymorphism
phenotyping’) analyses and log2-transformed odds ratios
(log2 ORs) from published associations. b | The relationship
between the tolerance index for studied SNPs from SIFT
(‘sorting intolerant from tolerant’) analysis and log2 ORs from
associations. The data indicate that SNPs that are inferred as
functional are more likely to be associated with higher odds-ratio
effects. By inference, they are more likely to be causative
of disease. Modified, with permission, from REF. 6 © (2004) The
American Association for Cancer Research.
difference, and between tolerance index and PSIC
score difference. So, the more probable it is that an SNP
is functionally relevant, the higher the OR effect and the
more likely it is that causative associations will be identi-
fied. These results imply that using a method to assess
functional significance, such as PolyPhen, can optimize
the ability to identify meaningful and reproducible
molecular epidemiological associations.
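The kind of comparison summarized here can be sketched in a few lines. The example below is only an illustration and does not reproduce the published analysis: the tolerance-index and odds-ratio values are invented, and Pearson's coefficient on log2-transformed ORs is an assumed choice of statistic. It relies on the SIFT convention that a lower tolerance index indicates a less tolerated substitution.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical (tolerance index, odds ratio) pairs, invented for illustration.
tolerance_index = [0.01, 0.05, 0.20, 0.45, 0.70, 1.00]
odds_ratios = [2.8, 2.1, 1.7, 1.4, 1.2, 1.1]

log2_ors = [math.log2(odds) for odds in odds_ratios]
r = pearson_r(tolerance_index, log2_ors)
print(f"r = {r:.2f}")
```

With real data, a rank-based statistic and a significance test would probably be preferable, because published ORs are noisy and unevenly distributed.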
In addition to identifying important genetic variants
for research prioritization, genotyping efforts could be
reduced by eliminating amino-acid substitutions that
have been deemed ‘neutral’ by algorithms such as
PolyPhen. In some cases, some genes could be removed
from further consideration if all of the variant alleles
were deemed to be non-functional. Because prediction
algorithms provide numerical data, it is feasible to fur-
ther subdivide the variant alleles of a gene into those
that might have a moderate, modest or no impact. In
particular, when a gene is known to have many genetic
variants, prediction algorithms can also reduce the
effective number of alleles that need to be considered in
association studies. Therefore, the use of these app-
roaches should limit the number of genotypes that
need to be considered, thereby limiting the potential for
false positive associations that might result from per-
forming unnecessary hypothesis tests. The drawback of
these approaches is that potential interactions among
combinations of substitutions at a gene might be
ignored, and the dependence of functional impact on
genotype at other genes or on exposure might not be
addressed. Nonetheless, the algorithms for predicting
genotype impact on allele function provide an initial
and important step towards simplifying the seemingly
overwhelming complexity of the genotype data. The
concept of equivalent alleles provides a basis for addi-
tional steps towards reducing the complexity of the
data for molecular epidemiological studies.
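The effect of testing fewer variants on false positives can be made concrete with standard multiple-testing arithmetic. The sketch below assumes independent tests in which every null hypothesis is true, and the variant counts are hypothetical; it is meant only to show the direction and rough size of the effect.

```python
def expected_false_positives(num_tests, alpha=0.05):
    """Expected number of false positives if every null hypothesis is true."""
    return num_tests * alpha

def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests

def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / num_tests

# Hypothetical counts: 40 candidate variants, 8 left after dropping
# those predicted to be neutral.
for label, num_tests in (("all variants", 40), ("prioritized", 8)):
    print(
        f"{label}: expected false positives = {expected_false_positives(num_tests):.2f}, "
        f"family-wise error rate = {familywise_error_rate(num_tests):.3f}, "
        f"Bonferroni threshold = {bonferroni_threshold(num_tests):.5f}"
    )
```

Fewer tests also mean a less stringent corrected threshold, so a true association among the prioritized variants is easier to detect at a given sample size.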
Optimizing molecular epidemiological studies
To improve the probability of obtaining biologically
and clinically meaningful associations between geno-
types and disease outcomes, the design and interpreta-
tion of molecular epidemiological studies should
include an assessment of functional significance. As
outlined above and in TABLE 1, a number of criteria
and/or approaches can be used to assess functional sig-
nificance of a genetic variant. These include genetic
characteristics such as the type of genetic change (mis-
sense, frameshift, nonsense, regulatory, splicing, disrup-
tion of a known functional motif, etc.), population
genetics characteristics and evolutionary conservation
of nucleotide sequences, experimental evidence from in
vitro or animal model studies that consider repro-
ducibility of functional findings, experimental condi-
tions (such as cell/animal system, inducers/repressors)
that are used to evaluate a functional effect, the magni-
tude of the inferred biological effects, and the relevance
of the experimental model to human systems, includ-
ing knowledge of the effect of the variant in the target
tissue (that is, tissue specificity) and consideration of
timing, dose or duration of relevant exposures.
amino-acid sequence, phylogenetic and structural
information to characterize the potential functional
significance of a missense change.
PolyPhen was developed to identify functionally
important SNPs by predicting whether an amino-acid
change is likely to be deleterious for the protein on the
basis of 3D structure and multiple alignment of homologous
sequences (REF. 57). The possibly damaging effect of variants
in an amino-acid sequence can be determined if the
substitution is in an annotated active or binding site,
affects interaction with ligands present in the crystallographic
structure, leads to hydrophobicity or electrostatic
charge change in a buried site, destroys a disulfide bond,
affects the protein’s solubility, inserts proline in an
α-helix, or is incompatible with the profile of amino-acid
substitutions observed at this site in the set of homologous
proteins. Mapping the amino-acid substitution to the
known 3D structure reveals if the change is likely to
destroy the hydrophobic core of a protein, electrostatic
interactions, interactions with ligands, or other features
of a protein.
Briefly, PolyPhen uses protein sequence and variant
data to search homologous sequence data from publicly
available protein databases. Only sequences with 30%
or more identity to the protein of interest are consid-
ered. On the basis of the alignment of these homolo-
gous sequences, profile scores can be computed for
allelic variants. Profile scores, known as position-specific
independent counts (PSICs), are logarithmic ratios of
the likelihood that a given amino acid occurs at a par-
ticular site to the background probability of the amino
acid occurring at random at a given position. Large dif-
ferences in PSIC values for specific genetic variants
might indicate that the substitution of interest is rarely
or never observed in the protein family. Finally,
PolyPhen maps the amino-acid substitution to the
known 3D structure of the protein to examine whether
the substitution might destroy the protein’s hydrophobic
core, electrostatic interactions, interactions with
ligands, or other important features of the protein, on
the basis of several structural and contact parameters.
PolyPhen can therefore indicate whether an nsSNP is
probably damaging, possibly damaging or benign, or
whether its function is unknown.
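The profile-score step described above can be sketched as follows. This is a deliberately simplified stand-in rather than PolyPhen's PSIC calculation: it assumes pre-aligned sequences of equal length, a uniform background of 1/20 per amino acid, a pseudocount of 1 and natural logarithms, and it skips the sequence weighting that the real method applies. The sequences and function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def percent_identity(seq_a, seq_b):
    """Percent of aligned, non-gap positions at which two sequences match."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

def profile_score(column, residue, pseudocount=1.0):
    """ln(observed frequency / uniform background) for one alignment column."""
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    observed = (counts[residue] + pseudocount) / total
    background = 1.0 / len(AMINO_ACIDS)
    return math.log(observed / background)

def score_difference(query, homologs, position, variant, min_identity=30.0):
    """Wild-type minus variant profile score, using homologs above the identity cut-off."""
    kept = [h for h in homologs if percent_identity(query, h) >= min_identity]
    column = [query[position]] + [h[position] for h in kept]
    return profile_score(column, query[position]) - profile_score(column, variant)

# Hypothetical aligned sequences; the last one falls below 30% identity.
query = "MKTLLVAGCD"
homologs = ["MKTLLVAGCD", "MRTLLIAGCE", "MKSLLVAGCD", "WWWWWWWWWW"]
diff = score_difference(query, homologs, position=7, variant="P")
print(f"score difference: {diff:.3f}")
```

A large positive difference means the variant residue is rarely or never seen at that position among the retained homologues, which is the signal the text describes.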
Using this approach, Sunyaev et al. (REF. 57) estimated that
20% of human nsSNPs might affect protein function,
although the proportion of SNPs that are truly deleterious
is probably substantially lower. Sunyaev et al. (REF. 57)
further estimated that there are 20,000–60,000 functionally
relevant nsSNPs in the human genome. By
comparison, Fay et al. (REF. 33) used a different model-based
approach to estimate that 20–45% of nsSNPs are
slightly deleterious and reach population frequencies of
1–10%, although as many as 80% of nsSNPs might
have some functional impact.
Zhu et al. (REF. 6) applied PolyPhen to the 166 molecular
epidemiological studies mentioned above to examine the
correlation between PSIC score and the OR associated
with a particular nsSNP (FIG. 2). They found a significant
inverse correlation between the ORs and PSIC score
The population genetics of CYP3A4*1B has been
well described (REF. 59), but no data have been reported about
deviation of frequencies from Hardy-Weinberg proportions
in cases versus controls, so this criterion provides
little support for or against function.
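A check of this kind could be run separately in cases and controls. The sketch below is a generic one-degree-of-freedom chi-square test for a biallelic locus, not a method taken from the cited studies, and the genotype counts are hypothetical. For small samples or rare alleles an exact test would usually be preferred.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 d.f.) for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper-tail probability of a chi-square variable with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical genotype counts (AA, AB, BB), invented for illustration.
groups = {"controls": (25, 50, 25), "cases": (40, 20, 40)}
for label, counts in groups.items():
    chi2, p_value = hwe_chi_square(*counts)
    print(f"{label}: chi2 = {chi2:.2f}, p = {p_value:.3g}")
```

A deviation seen in cases but not in controls is the pattern of interest; a deviation in both groups more often points to genotyping error or population structure.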
The ‘experimental evidence’ about CYP3A4*1B
function (FIG. 1) has been controversial, but there seems
to be a small but consistent increase in expression associated
with CYP3A4*1B. This relatively small effect
might be insufficient to confer major effects on drug
metabolism, but indicates possible associations with
altered disease risk. We could therefore conclude that
the ‘experimental evidence’ criterion suggests ‘moderate
support’ for functional significance.
It is well known that CYP3A4 metabolizes compounds
that are strongly associated with disease risk, such
as testosterone in prostate cancer aetiology (REF. 11).
Therefore, there is strong support that the product of
this gene metabolizes relevant aetiological exposures,
including steroid hormones and carcinogens (REF. 11).
Finally, there are numerous epidemiological studies
that report an association of CYP3A4*1B with various
phenotypes, thereby providing strong support for function
on the basis of the ‘epidemiological evidence’ criterion.
So, these criteria suggest moderate to strong support
for the hypothesis that CYP3A4*1B is a functionally
meaningful DNA change.
By applying these or similar criteria, molecular
epidemiologists might be able to maximize the
chance that association studies involve functionally
significant genetic variants, and therefore could
reduce the likelihood of
TYPE I ERRORS, increase repro-
ducibility of candidate gene association studies, and
facilitate interpretation of positive associations. Even
in the absence of a definitive function of a genetic
variant, candidate gene association studies should
not be undertaken without some consideration of
function.
Information from previously published epidemiological
investigations indicating the effect of an SNP can
also be considered for studies that attempt to replicate
association findings. For example, an ad hoc measure of
support for the functional significance of a particular
variant could be created by weighting in favour of a
functionally significant effect when the criteria for ‘strong
support’ or ‘moderate support’ are met; such variants
might then be prioritized for association studies. Similarly,
variants for which there is no or neutral functional
information might be ranked lower in priority for
association studies. Variants for which there is evidence
against functional significance should not be considered
in association studies unless the hypothesis and association
study methods consider linkage disequilibrium approaches
to identify candidate regions of interest. In this case, the
study would assume that the variants under investigation are
in linkage disequilibrium with the causative allele(s).
Variants would be chosen for analysis on the basis of
polymorphic content and haplotype or population
genetics structure at that locus.
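One way such an ad hoc measure might look is sketched below. The numerical weights, the names of the levels and the second variant are hypothetical, since the review does not specify them; the levels entered for CYP3A4*1B follow the assessments given in the article's worked example.

```python
# Hypothetical weights; the review does not assign numerical values.
LEVEL_WEIGHTS = {"strong": 2, "moderate": 1, "neutral": 0, "against": -2}

# Criteria names follow the rows of Table 1.
CRITERIA = (
    "nucleotide_sequence",
    "evolutionary_conservation",
    "population_genetics",
    "experimental_evidence",
    "exposures",
    "epidemiological_evidence",
)

def support_score(assessment):
    """Sum the weights over all criteria; unassessed criteria count as neutral."""
    unknown = set(assessment) - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    return sum(LEVEL_WEIGHTS[assessment.get(c, "neutral")] for c in CRITERIA)

def rank_variants(assessments):
    """Variant names ordered from highest to lowest support score."""
    return sorted(assessments, key=lambda name: support_score(assessments[name]), reverse=True)

assessments = {
    "CYP3A4*1B": {
        "nucleotide_sequence": "moderate",
        "evolutionary_conservation": "strong",
        "population_genetics": "neutral",
        "experimental_evidence": "moderate",
        "exposures": "strong",
        "epidemiological_evidence": "strong",
    },
    # Invented comparator with mostly unfavourable evidence.
    "hypothetical_variant": {
        "nucleotide_sequence": "against",
        "evolutionary_conservation": "against",
        "epidemiological_evidence": "moderate",
    },
}

for name in rank_variants(assessments):
    print(name, support_score(assessments[name]))
```

Penalizing evidence against function more heavily than rewarding moderate support is one possible design choice; a simple sum also ignores the fact that the criteria are not independent of one another.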
How would this approach be applied for a specific
candidate SNP? For example, applying the criteria
outlined in TABLE 1 for CYP3A4*1B, we learn first that
CYP3A4*1B disrupts a regulatory motif (NFSE; REF. 12).
Because the function of NFSE is not clear, we might
therefore conclude that there is ‘moderate support’
for function on the basis of the ‘nucleotide sequence’
criterion.
The regulatory region of CYP3A4 is similar, at the
sequence level, to many members of the CYP3A multigene
family, so we might conclude that there is strong
support for function on the basis of the ‘evolutionary
conservation’ criterion. However, because the role of
putative regulatory domains in CYP3A genes (REF. 58) is still
debatable, the assessment of ‘evolutionary conservation’
is not completely clear.
TYPE I ERROR
Incorrectly rejecting a null hypothesis when the null hypothesis is correct. Similarly, the false positive rate.
Table 1 | Criteria for assessing the functional significance of a genetic variant in candidate gene association studies

| Criteria | Strong support for functional significance | Moderate support for functional significance | Evidence against functional significance |
| --- | --- | --- | --- |
| Nucleotide sequence | Variant disrupts a known functional or structural motif | Variant is a missense change or disrupts a putative functional motif; changes to protein structure might occur | Variant disrupts a non-coding region with no known functional or structural motif |
| Evolutionary conservation | Consistent evidence from multiple approaches for conservation across species and multigene families | Evidence for conservation across species or multigene families | Nucleotide or amino-acid residue not conserved |
| Population genetics | In the absence of laboratory error, strong deviations from expected population frequencies in cases and/or controls in a particular ethnicity | In the absence of laboratory error, moderate to small deviations from expected population frequencies in cases and/or controls; effects are not well characterized by ethnicity | Population genetics data indicates no deviations from expected proportions |
| Experimental evidence | Consistent effects from multiple lines of experimental evidence; effect in human context is established; effect in target tissue is known | Some (possibly inconsistent) evidence for function from experimental data; effect in human context or target tissue is unclear | Experimental evidence consistently indicates no functional effect |
| Exposures (for example, genotype–environment interaction studies) | Variant is known to affect the metabolism of the exposure in the relevant target tissue | Variant might affect metabolism of the exposure or one of its components; effect in target tissue might not be known | Variant does not affect metabolism of exposure of interest |
| Epidemiological evidence | Consistent and reproducible reports of moderate-to-large magnitude associations | Reports of association exist; replication studies are not available | Prior studies show no effect of variant |
1. Cargill, M. et al. Characterization of single-nucleotide
polymorphisms in coding regions of human genes. Nature
Genet. 22, 231–238 (1999).
2. Salisbury, B. A. et al. SNP and haplotype variation in the
human genome. Mutat. Res. 526, 53–61 (2003).
3. Schneider, J. A. et al. DNA variability of human genes.
Mech. Ageing Dev. 124, 17–25 (2003).
4. Schork, N. J., Fallin, D. & Lanchbury, J. S. Single nucleotide
polymorphisms and the future of genetic epidemiology.
Clin. Genet. 58, 250–264 (2000).
5. Sachidanandam, R. et al. A map of human genome
sequence variation containing 1.42 million single nucleotide
polymorphisms. Nature 409, 928–933 (2001).
6. Zhu, Y. et al. An evolutionary perspective on SNP screening
in molecular cancer epidemiology. Cancer Res. 64,
2251–2257 (2004).
The first comprehensive evaluation and comparison
of SIFT and PolyPhen algorithms in molecular
epidemiological association studies.
7. Lohmueller, K. E., Pearce, C. L., Pike, M., Lander, E. S. &
Hirschhorn, J. N. Meta-analysis of genetic association
studies supports a contribution of common variants to
susceptibility to common disease. Nature Genet. 33,
177–182 (2003).
A comprehensive evaluation of the consistency of
association studies that demonstrates the need for
functional correlates in achieving consistency in
association study results.
8. Botstein, D. & Risch, N. Discovering genotypes underlying
human phenotypes: past successes for mendelian disease,
future approaches for complex disease. Nature Genet. 33
(Suppl.), 228–237 (2003).
9. Stoilov, P. et al. Defects in pre-mRNA processing as causes
of and predisposition to diseases. DNA Cell Biol. 21,
803–818 (2002).
10. Knight, J. C. Functional implications of genetic variation in
non-coding DNA for disease susceptibility and gene
regulation. Clin. Sci. (Lond.) 104, 493–501 (2003).
11. Li, A. P., Kaminski, D. L. & Rasmussen, A. R. Substrates of
human hepatic cytochrome P450 3A4. Toxicology 104, 1–8
(1995).
12. Hashimoto, H. et al. Gene structure of CYP3A4, an adult-
specific form of cytochrome P450 in human livers and its
transcriptional control. Eur. J. Biochem. 218, 585–595 (1993).
13. Rebbeck, T. R., Jaffe, J. M., Walker, A. H., Wein, A. J. &
Malkowicz, S. B. Modification of clinical presentation of
prostate tumors by a novel genetic variant in CYP3A4.
J. Natl Cancer Inst. 90, 1225–1229 (1998).
14. Paris, P. L. et al. Association between a CYP3A4 genetic
variant and clinical presentation in African-American prostate
cancer patients. Cancer Epidemiol. Biomarkers Prev. 8,
901–905 (1999).
15. Felix, C. A. et al. Association of CYP3A4 genotype with
treatment-related leukemia. Proc. Natl Acad. Sci. USA 95,
13176–13181 (1998).
16. Kadlubar, F. F. et al. The putative high activity variant,
CYP3A4*1B, predicts the onset of puberty in young girls.
Cancer Epidemiol. Biomarkers Prev. 12, 327–331 (2003).
17. Lai, J., Vesprini, D., Chu, W., Jernstrom, H. & Narod, S. A.
CYP gene polymorphisms and early menarche. Mol. Genet.
Metab. 74, 449–457 (2001).
18. Jernstrom, H. et al. Genetic factors related to racial variation
in plasma levels of insulin-like growth factor-1: implications
for premenopausal breast cancer risk. Mol. Genet. Metab.
72, 144–154 (2001).
19. Lamba, J. K. et al. Common allelic variants in cytochrome
P4503A4 and their prevalence in different populations.
Pharmacogenetics 12, 121–132 (2002).
20. Westlind, A., Lofberg, L., Tindberg, N., Andersson, T. B. &
Ingelman-Sundberg, M. Interindividual differences in hepatic
expression of CYP3A4: relationship to genetic
polymorphism in the 5′-upstream regulatory region.
Biochem. Biophys. Res. Commun. 259, 201–205 (1999).
21. Amirimani, B., Walker, A. H., Weber, B. L. & Rebbeck, T. R.
Response: re: modification of clinical presentation of
prostate tumors by a novel genetic variant in CYP3A4.
J. Natl Cancer Inst. 91, 1588–1590 (1999).
22. Ando, Y. et al. Re: modification of clinical presentation of
prostate tumors by a novel genetic variant in CYP3A4.
J. Natl Cancer Inst. 91, 1587–1590 (1999).
23. Spurdle, A. B. et al. The CYP3A4*1B polymorphism has no
functional significance and is not associated with risk of
breast or ovarian cancer. Pharmacogenetics 12, 355–366
(2002).
24. Floyd, M. D. et al. Genotype-phenotype associations for
common CYP3A4 and CYP3A5 variants in the basal and
induced metabolism of midazolam in European- and
African-American men and women. Pharmacogenetics 13,
595–606 (2003).
25. Amirimani, B. et al. Transcriptional activity effects of a
CYP3A4 promoter variant. Environ. Mol. Mutagen. 42,
299–305 (2003).
26. Hamzeiy, H., Bombail, V., Plant, N., Gibson, G. & Goldfarb, P.
Transcriptional regulation of cytochrome P4503A4 gene
expression: effects of inherited mutations in the 5′-flanking
region. Xenobiotica 33, 1085–1095 (2003).
27. Jeon, J. & An, G. Gene tagging in rice: a high throughput
system for functional genomics. Plant Sci. 161, 211–219
(2001).
28. Cecconi, F. & Meyer, B. I. Gene trap: a way to identify novel
genes and unravel their biological function. FEBS Lett. 480,
63–71 (2000).
29. Adams, M. D. ENU mutagenesis for pharma. Drug Discov.
Today 8, 199–200 (2003).
30. Lee, Y. S. & Mrksich, M. Protein chips: from concept to
practice. Trends Biotechnol. 20 (Suppl.), S14–18 (2002).
31. Nikaido, I. et al. EICO (expression-based imprint candidate
organizer): finding disease-related imprinted genes. Nucleic
Acids Res. 32 (database issue), D548–551 (2004).
32. Knight, J. C., Keating, B. J., Rockett, K. A. & Kwiatkowski,
D. P. In vivo characterization of regulatory polymorphisms by
allele-specific quantification of RNA polymerase loading.
Nature Genet. 33, 469–475 (2003).
The authors report a new method and application of
experimental approaches to assessing genotype
function.
33. Fay, J. C., Wyckoff, G. J. & Wu, C. I. Positive and negative
selection on the human genome. Genetics 158, 1227–1234
(2001).
34. Akey, J. M., Zhang, G., Zhang, K., Jin, L. & Shriver, M. D.
Interrogating a high-density SNP map for signatures of
natural selection. Genome Res. 12, 1805–1814 (2002).
35. Feder, J. N. et al. A novel MHC class I-like gene is mutated
in patients with hereditary haemochromatosis. Nature
Genet. 13, 399–408 (1996).
36. Nielsen, D. M., Ehm, M. G. & Weir, B. S. Detecting marker-
disease association by testing for Hardy-Weinberg
disequilibrium at a marker locus. Am. J. Hum. Genet. 63,
1531–1540 (1998).
37. Hoh, J., Wille, A. & Ott, J. Trimming, weighting, and
grouping SNPs in human case-control association studies.
Genome Res. 11, 2115–2119 (2001).
The authors propose a novel approach to association
studies that incorporates both association and
population genetics information in identifying disease
genes, including the possibility of genome-wide
associations.
38. Perutz, M. F. Structure and function of haemoglobin.
I. A tentative atomic model of horse oxyhaemoglobin.
J. Mol. Biol. 13, 646–668 (1965).
39. Wang, Z. & Moult, J. Three-dimensional structural
location and molecular functional effects of missense SNPs
in the T cell receptor Vβ domain. Proteins 53, 748–757
(2003).
40. Wang, Z. & Moult, J. SNPs, protein structure, and disease.
Hum. Mutat. 17, 263–270 (2001).
41. Chasman, D. & Adams, R. M. Predicting the functional
consequences of non-synonymous single nucleotide
polymorphisms: structure-based assessment of amino acid
variation. J. Mol. Biol. 307, 683–706 (2001).
42. Ferrer-Costa, C., Orozco, M. & de la Cruz, X.
Characterization of disease-associated single amino acid
polymorphisms in terms of sequence and structure
properties. J. Mol. Biol. 315, 771–786 (2002).
43. Saunders, C. T. & Baker, D. Evaluation of structural and
evolutionary contributions to deleterious mutation prediction.
J. Mol. Biol. 322, 891–901 (2002).
44. Herrgard, S. et al. Prediction of deleterious functional effects
of amino acid mutations using a library of structure-based
function descriptors. Proteins 53, 806–816 (2003).
45. Miller, M. P. & Kumar, S. Understanding human disease
mutations through the use of interspecific genetic variation.
Hum. Mol. Genet. 10, 2319–2328 (2001).
46. Koref, M. E. S., Gangeswaran, R., Koref, I. P. S., Shanahan, N.
& Hancock, J. M. A phylogenetic approach to assessing the
significance of missense mutations in disease genes. Hum.
Mutat. 22, 51–58 (2003).
47. Krishnan, V. G. & Westhead, D. R. A comparative study of
machine-learning methods to predict the effects of single
nucleotide polymorphisms on protein function.
Bioinformatics 19, 2199–2209 (2003).
48. Ng, P. C. & Henikoff, S. Predicting deleterious amino acid
substitutions. Genome Res. 11, 863–874 (2001).
An outline of the SIFT approach to assessing missense
variant function using evolutionary similarity.
49. Ng, P. C. & Henikoff, S. Accounting for human
polymorphisms predicted to affect protein function. Genome
Res. 12, 436–446 (2002).
50. Ng, P. C. & Henikoff, S. SIFT: predicting amino acid changes
that affect protein function. Nucleic Acids Res. 31,
3812–3814 (2003).
51. Ramensky, V., Bork, P. & Sunyaev, S. Human non-
synonymous SNPs: server and survey. Nucleic Acids Res.
30, 3894–3900 (2002).
An outline of the PolyPhen methodology for using
evolutionary and structure data to assess SNP function.
52. Fleming, M. A., Potter, J. D., Ramirez, C. J., Ostrander, G. K.
& Ostrander, E. A. Understanding missense mutations in the
BRCA1 gene: an evolutionary approach. Proc. Natl Acad.
Sci. USA 100, 1151–1156 (2003).
53. National Institutes of Health. The ENCODE Project:
ENCyclopedia Of DNA Elements [online],
<http://www.genome.gov/10005107> (2003).
54. Rogan, P. K., Svojanovsky, S. & Leeder, J. S. Information
theory-based analysis of CYP2C19, CYP2D6 and CYP3A5
splicing mutations. Pharmacogenetics 13, 207–218 (2003).
55. Pagani, F. & Baralle, F. E. Genomic variants in exons and
introns: identifying the splicing spoilers. Nature Rev. Genet.
5, 389–396 (2004).
56. Sunyaev, S., Ramensky, V. & Bork, P. Towards a structural
basis of human non-synonymous single nucleotide
polymorphisms. Trends Genet. 16, 198–200 (2000).
57. Sunyaev, S. et al. Prediction of deleterious human alleles.
Hum. Mol. Genet. 10, 591–597 (2001).
58. Schuetz, E. G. Lessons from the CYP3A4 promoter. Mol.
Pharmacol. 65, 279–281 (2004).
59. Zeigler-Johnson, C. M. et al. Ethnic differences in the
frequency of prostate cancer susceptibility alleles at SRD5A2
and CYP3A4. Hum. Hered. 54, 13–21 (2002).
Acknowledgements
Some of the work discussed in this review was supported by
grants from the Public Health Service and the University of
Pennsylvania Cancer Center.
Competing interests statement
The authors declare that they have no competing financial interests.
Online links
DATABASES
The following terms in this article are linked online to:
Entrez Gene:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
CYP3A4 | TNF
FURTHER INFORMATION
CODDLE: http://www.proweb.org/coddle
PolyPhen: http://www.bork.embl-heidelberg.de/PolyPhen
SIFT: http://blocks.fhcrc.org/%7Epauline/SIFT
... [9][10][11][12][13] Provisional evidence has also suggested that augmented RS theta waves might be considered a relevant biomarker of SUDs. 14,15 Looking at the genotype level, Prom-Wormley and colleagues 16 summarized the results of genetic associations studies (i.e., candidate gene associations studies [CGASs], genome wide association studies [GWASs]) among community-based and SUD clinical samples. They showed significant associations between genes involved in modulating neurotransmission linked to the core psychopathological features of SUDs (e.g, reward processing, negative affectivity and impulsivity). ...
Article
Background: Alterations in EEG activity have been considered valid endophenotypes of substance use disorders (SUDs). Empirical evidence has supported the association between genetic factors (e.g., genes, single nucleotide polymorphisms [SNPs]) and SUDs, considering both clinical samples and individuals with a positive family history of SUDs [F + SUD]). Nevertheless, the relationship between genetic factors and intermediate phenotypes (i.e., altered EEG activity) among individuals with SUD phenotypes remains unclear. Objective(s): The current study aims at summarizing genetic factors linked to aberrant EEG activity among individuals with SUDs and those with F + SUD. Methods: Sixteen studies (5 [N = 986] + 11 from the Collaborative Studies On Genetics of Alcoholism [COGA] sample [432 ≤ N ≤ 8810]) were included for a qualitative systematic review. Thirteen studies (5 + 8 studies from the COGA sample) were used for multi-level meta-analytic procedures. Results: Qualitative analyses highlighted a multivariate genetic architecture linked to alterations in EEG waves among individuals with SUD phenotypes (i.e., augmented resting-state beta waves; reduced resting-state alpha waves; reduced resting-state and task-dependent theta waves). The most recurrent genetic factors were involved in cellular energy homeostasis, modulation of inhibitory and excitatory neural activity together with neural cell growth. Meta-analytic results showed a moderate association between genetic factors and altered resting-state and task-dependent EEG activity. Meta-analytic results also suggested non-additive genetic effects on altered EEG activity. Conclusions: Complex genetic interactions mediating neural activity and brain development might constitute a causal pathway toward intermediate phenotypes associated with phenotypic features, which in turn are linked to SUDs.
... Rebbeck et al. [127] stated that "the lack of reproducibility of many association studies might reflect the number of studies that involve genetic variants with no functional significance." It is noteworthy that several MTAs detected by this study were also found in previous ones on cacao and functional roles are expected. ...
Preprint
Full-text available
A genome-wide association study was undertaken to unravel marker-trait associations (MTAs) between SNP markers and yield-related traits. It involved a subset of 421 cacao accessions from the large and diverse collection conserved ex situ at the International Cocoa Genebank Trinidad. An average linkage disequilibrium (r 2 ) of 0.10 at 5.2 Mb was found across several chromosomes. Seventeen significant ( P ≤ 8.17 × 10 -5 (–log10 (p) = 4.088)) MTAs of interest, which accounted for 5 to 17% of the explained phenotypic variation, were identified using a Mixed Linear Model in TASSEL version 5.2.50. The most significant MTAs identified were related to seed number and seed length on chromosome 7 and seed number on chromosome 1. Other significant MTAs involved seed length to width ratio on chromosomes 3 and 5 and seed length on chromosomes 4 and 9. It was noteworthy that several yield-related traits, viz ., seed length, seed length to width ratio and seed number were associated with markers on different chromosomes, indicating their polygenic nature. Approximately 40 candidate genes that encode embryo and seed development, protein synthesis, carbohydrate transport and lipid biosynthesis and transport were identified in this study. A significant association of fruit surface anthocyanin intensity co-localised with MYB-related protein 308 on chromosome 4. Testing of a genomic selection approach revealed good predictive value (GEBV) for economic traits such as seed number (GEBV = 0.611), seed length (0.6199), seed width (0.5435), seed length to width ratio (0.5503), seed/cotyledon mass (0.6014) and ovule number (0.6325). The findings of this study could facilitate genomic selection and marker-assisted breeding of cacao thereby expediting improvement in the yield potential of cacao planting material.
... As bases nitrogenadas (A: adenina; G: guanina; C: citosina; T: timina) são divididas em aproximadamente 20-25 mil genes que, após serem transcritos, dão origem a uma proteína específica. Algumas dessas proteínas, em sua transcrição, sofrem mutação, não possuindo o potencial da expressão do gene (Rebbeck, Spitz, & Wu, 2004). A identificação da variação genética juntamente com a rotina de treinamento individualizada para um atleta pode influenciar na performance em várias modalidades esportivas, possibilitando também a identificação de talentos em esportes específicos. ...
Article
Abstract Objective: To analyze the ACTN3 gene in high-performance sport, to examine its relationship with performance in specific sports, and to assess the influence of the ACTN3 gene on the characterization of muscle fibers. Methodology: This is a narrative literature review that analyzed articles published in Portuguese and/or English between 1995 and 2020, found on the Google Scholar (Google Acadêmico) search platform and selected using the keywords, synonyms and descriptors "genética" (genetics), "polimorfismo" (polymorphism), "performance" and "actn3". Studies were included if their full text was freely available online and they provided information on human genetics, the ACTN3 gene polymorphism, the influence of genetics on sports performance, or muscle fiber types and their role in sport, giving a total of 41 articles. Results: The analysis
... Polymorphism arises from genetic variation within or among populations under the influence of environmental stimuli, and can result in a genetic trait or phenotypic alterations (Brookes, 1999; Rebbeck et al., 2004; Hirschhorn and Daly, 2005). A genetic difference at a single position in the DNA sequence among members of a population is referred to as a single nucleotide polymorphism (SNP). ...
Article
Human dihydrofolate reductase (DHFR) is a conserved enzyme that is central to folate metabolism and is widely targeted in pathogenic diseases as well as cancers. Although studies have reported that genetic mutations in DHFR lead to a rare autosomal recessive inborn error of folate metabolism and to drug resistance, there is a lack of extensive analysis of how deleterious non-synonymous SNPs (nsSNPs) disrupt its phenotypic effects. In this study, we aim to discover the structural and functional consequences of nsSNPs in DHFR by employing a combined computational approach consisting of ten recently developed in silico tools for identifying damaging nsSNPs and molecular dynamics (MD) simulation for gaining deeper insight into the magnitude of the damaging effects. Our study revealed 12 most deleterious nsSNPs affecting the native phenotypic effects, three of which (R71T, G118D, Y122D) lie in the co-factor and ligand binding active sites. MD simulations also suggested that these three SNPs, particularly Y122D, alter the overall structural flexibility and dynamics of the native DHFR protein, which can provide further understanding of the crucial roles of these mutants in the loss of DHFR function. Data availability: Any dataset or information used in this current study is available from the corresponding author on reasonable request.
... [3] Genetic diversity, which occurs between and within different breeds, is due to the presence of polymorphisms in their genomes caused by environmental stimuli and mutations [4,5]. A single-nucleotide polymorphism (SNP) is a variation in the genetic code. SNPs are widely used to study genetic differences between individuals and populations. ...
Article
In this study, the effect of genetic variations in four heat shock transcription factor genes (HSF1, HSF2, HSF4, and HSF5) on 3D protein structure and function was studied. We defined the breed-specific genetic variations in pooled DNA of Tali goat that differed from the goat reference sequence (CHI2.0). Disordered regions of HSF proteins were predicted using PONDR. Post-translational changes were studied using several online prediction servers. The structure of the ordered regions of the proteins was then predicted using SWISS-MODEL. Tali goat HSF genes contain a total of 181, 679, 91, and 301 SNPs for HSF1, 2, 4, and 5, respectively. In addition, 5 and 3 variants were identified as nsSNPs in the coding regions of HSF4 and HSF5, respectively. The nsSNPs (r.145A/S), (r.322P/Y), (r.379T/C) in HSF4 and (r.300Q/P), (r.573E/Q) in HSF5 were scored as tolerated with high confidence by SIFT. More than half of these proteins are predicted to be disordered (56, 50, 52, and 50%, respectively, for HSF1, 2, 4, and 5). Phosphorylation, acetylation, glycosylation, and sumoylation sites of HSFs were compared between Tali goat and the reference goat. Three residues of HSF4 in Tali goat, S145, S263, and S322, were phosphorylation sites, and in HSF5 the reference goat has a phosphorylation site at S593.
... Numerous non-synonymous (missense) SNPs of the MMP2 gene may deleteriously affect the primary configuration of the MMP2 protein, and this has not been analyzed until now. One way to handle the very large number of potentially significant SNPs is to prioritize them according to their functional consequences (Bao and Cui, 2005; Rebbeck et al., 2004). ...
Article
Matrix metalloproteinase-2 (MMP2) protein stimulates multiple processes involving the nervous system, vascularization, and metastasis. Mutations in MMP2 are linked to Winchester and Nodulosis-Arthropathy-Osteolysis (NAO) syndromes. In this extensive investigation, we performed a computational analysis of 114 missense single nucleotide polymorphisms (SNPs) of MMP2 protein using various in-silico algorithms. A total of 21 highly deleterious and pathogenic missense SNPs (T86K, R146H, A167T, G216E, R252P, R252L, D326V, D326Y, G346S, G367S, R368L, S396R, A408P, R482C, P517L, A522E, E525K, Y543C, Y552S, K579M, M598T) were identified that could probably alter the structural and functional configuration of the MMP2 gene product. Moreover, conservation analysis, protein stability analysis, TM-score and RMSD calculation, protein structure prediction and superimposition, ligand binding site prediction, protein-protein interaction analysis, and protein-ligand and protein-protein docking studies were carried out. ConSurf analysis showed seventeen missense variants in highly conserved regions, which were predicted as highly deleterious and pathogenic by eight in-silico platforms. Furthermore, G367S and K579M showed a greater impact on stability, structural alterations, and protein-ligand and protein-protein interactions. This study will help in developing target-dependent medication for diseases and could enhance the understanding of the significance of uncharacterized missense SNPs and their relationship with disease. This study also offers computational insight into the effects of high-risk missense SNPs on protein structural and functional configuration.
... Additionally, several studies have been performed to estimate the effect of such nsSNPs, whether they are disease-causing or neutral [Adzhubei et al., 2013; Vijay, 2018; Zhao et al., 2018]. A possible way to study the massive amount of nucleotide variation data obtained from next-generation sequencing (NGS) could be to prioritize the variants according to their structural and functional consequences using an in-silico approach, as it is cost-effective, reliable, easy to perform and less time-consuming [Singh and Mistry, 2016; Bao and Cui, 2005; Rebbeck et al., 2004; Jiang et al., 2007]. Hence, in the present study, efforts have been made to give an overview of the structural and functional aspects of human PTCH1 protein in the context of its possible interaction with its ligand and the switching on of Hh signaling to maintain CSCs in tumors. ...
Article
Background: Human PTCH1 is a negative receptor of Hedgehog (HH) signaling required to sustain cancer stem-like cells (CSCs) in a tumor microenvironment to drive tumor growth. Mutations in the PTCH1 gene result in autocrine signaling and switch the pathway on even in the absence of HH ligand. Further functional reduction in native PTCH1 protein may result in impaired function of the protein. Objective: To identify the functionally deleterious SNPs in human PTCH1 protein. Methods: Various computational tools, including SIFT, PolyPhen2, PROVEAN, SNAP2, SNPs&GO, PhD-SNP, I-Mutant, iPTREE-STAB and MUpro, were used to predict the most deleterious SNPs of PTCH1. ConSurf and the NCBI conserved domain search tool were used to find conserved domains. Post-translational modification sites were predicted using ModPred. SPARKS-X was used to generate the 3D structure of the PTCH1 protein. Results: In the human PTCH1 gene, a total of 60 nonsynonymous SNPs (nsSNPs) were identified as deleterious or non-tolerable by SIFT. Of these 60 nsSNPs, P298L, V907G, R1024H, R134Q, W235R, D259A, and R570Q were predicted to lie in the conserved domain of PTCH1 protein and to be potentially damaging by all the prediction tools. Amongst them, P298, V907 and R570 were predicted to be sites for post-translational modifications (PTM), while R1024 was predicted to be a ligand-binding residue. Conclusion: The current study indicated that the most deleterious nsSNP found is the arginine-to-histidine substitution at position 1024, which may be responsible for disease occurrence; this finding might help in eliminating CSCs in a tumor bulk.
Article
The G protein-coupled receptor 84 (GPR84) is found in immune cells and its expression is increased under inflammatory conditions. Activation of GPR84 by medium-chain fatty acids results in pro-inflammatory responses. Here, we screened available vertebrate genome data and found that GPR84 has been present in vertebrates for more than 500 million years but is absent in birds and is a pseudogene in bats. Cloning and functional characterization of several mammalian GPR84 orthologs, in combination with evolutionary and model-based structural analyses, revealed evidence for positive selection of bear GPR84 orthologs. Naturally occurring human GPR84 variants that cause a loss of function are most frequent in Asian populations. Further, we identified cis- and trans-2-decenoic acid, both known to mediate bacterial communication, as evolutionarily highly conserved ligands. Our integrated set of approaches contributes to a comprehensive understanding of GPR84 in terms of evolutionary and structural aspects, highlighting GPR84 as a conserved immune cell receptor for bacteria-derived molecules.
Chapter
The precision of plant genetic analysis has been revolutionized by molecular markers, which are used in molecular breeding and marker-assisted selection in crops. Over the last three decades, single-nucleotide polymorphisms (SNPs) have become the most popular and efficient marker type because of their abundance in the genome and their amenability to automation. Advances in next-generation sequencing, together with decreased analysis costs and bioinformatics-based computational resources, aid wide-scale SNP discovery in several model and nonmodel plant species. This chapter focuses on computational approaches for SNP discovery in plants through reference-based and de novo strategies. Gel-based and nongel-based SNP genotyping are also discussed, as are various applications of SNP markers in plant breeding and crop improvement.
Article
Cancer is a leading cause of mortality globally. Cytochrome P450 (CYP) enzymes play a pivotal role in the biotransformation of both endogenous and exogenous compounds. Evidence from numerous epidemiological, animal, and clinical studies points to an instrumental role of CYPs in cancer initiation, metastasis, and prevention. Substantial research has found that CYPs are involved in activating different carcinogenic chemicals in the environment, such as polycyclic aromatic hydrocarbons and tobacco-related nitrosamines. Electrophilic intermediates produced from these chemicals can covalently bind to DNA, inducing mutation and cellular transformation that collectively result in cancer development. While bioactivation of procarcinogens and promutagens by CYPs has long been established, the role of CYP-derived endobiotics in carcinogenesis has emerged in recent years. Eicosanoids derived from arachidonic acid via CYP oxidative pathways have been implicated in tumorigenesis, cancer progression and metastasis. The purpose of this review is to provide an update on the current state of knowledge about the molecular mechanisms of cancer involving CYPs, with a focus on the biochemical and biotransformation mechanisms in CYP-mediated carcinogenesis and on the role of CYP-derived reactive metabolites, from both external and endogenous sources, in cancer growth and tumour formation.
Article
The distinction between deleterious, neutral, and adaptive mutations is a fundamental problem in the study of molecular evolution. Two significant quantities are the fraction of DNA variation in natural populations that is deleterious and destined to be eliminated and the fraction of fixed differences between species driven by positive Darwinian selection. We estimate these quantities using the large number of human genes for which there are polymorphism and divergence data. The fraction of amino acid mutations that is neutral is estimated to be 0.20 from the ratio of common amino acid (A) to synonymous (S) single nucleotide polymorphisms (SNPs) at frequencies of ≥ 15%. Among the 80% of amino acid mutations that are deleterious at least 20% of them are only slightly deleterious and often attain frequencies of 1–10%. We estimate that these slightly deleterious mutations comprise at least 3% of amino acid SNPs in the average individual or at least 300 per diploid genome. This estimate is not sensitive to human population history. The A/S ratio of fixed differences is greater than that of common SNPs and suggests that a large fraction of protein divergence is adaptive and driven by positive Darwinian selection.
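The fractions quoted in this abstract can be combined with simple arithmetic. The sketch below only restates the abstract's own numbers; the variable names are illustrative, and the implied total of amino acid SNPs per individual is an inference that assumes the "at least 3%" and "at least 300" bounds refer to the same quantity.

```python
# Figures taken directly from the abstract.
neutral_fraction = 0.20              # fraction of amino acid mutations that is neutral
slightly_share_of_deleterious = 0.20  # "at least 20%" of deleterious mutations
slightly_fraction_of_snps = 0.03     # "at least 3%" of amino acid SNPs per individual
slightly_count_per_genome = 300      # "at least 300" per diploid genome

# Deleterious mutations are the complement of the neutral fraction (80%).
deleterious_fraction = 1 - neutral_fraction

# Lower bound on slightly deleterious mutations as a share of all
# amino acid mutations: 0.80 * 0.20 = 0.16.
slightly_fraction_of_mutations = deleterious_fraction * slightly_share_of_deleterious

# Assumption: if 3% of an individual's amino acid SNPs corresponds to 300
# variants, the individual carries roughly 300 / 0.03 amino acid SNPs.
implied_amino_acid_snps = slightly_count_per_genome / slightly_fraction_of_snps

print(round(slightly_fraction_of_mutations, 2))
print(round(implied_amino_acid_snps))
```

Note that the 16% figure is a share of new amino acid mutations, whereas the 3% figure is a share of the SNPs observed in one individual, so the two are not directly comparable.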
Article
CYP3A4 is involved in the metabolism of endogenous steroids, and an allelic variant, CYP3A4*1B, consisting of an A to G polymorphism within the 5′-flanking region termed the nifedipine-specific response element (NFSE) has been associated with high grade and advanced stage of prostate cancers. Because steroid hormone exposure is known to influence breast and ovarian cancer risk, we conducted case-control studies to assess the relationship between CYP3A4*1B and risk of breast or ovarian cancer. CYP3A4 NFSE genotype was determined in 951 breast cancer cases and 500 controls frequency matched for age and 488 ovarian cancer cases and 276 controls of similar age distribution. Case-control analyses and comparisons of genotype distributions were conducted by unconditional logistic regression. In addition, the functional significance of the CYP3A4*1B polymorphism was assessed by analysis of CYP3A4-reporter gene constructs transiently transfected into liver-derived cell lines and primary cultures of well-differentiated rat hepatocytes. The GG genotype was rare in all groups (0-0.4%). There was no risk of cancer associated with the AG/GG genotypes combined, with an OR (95% CI) of 0.86 (0.54-1.33) for breast cancer (P = 0.5), and 1.51 (0.80-2.89) for ovarian cancer (P = 0.2). Analysis of CYP3A4-luciferase constructs showed that CYP3A4*1B did not consistently affect reporter gene activity. Our data suggest that the CYP3A4*1B polymorphism is not associated with risk of breast or ovarian cancer. In support of this negative finding, in-vitro functional studies indicate that NFSE genotype is not a critical factor in the transcriptional activity of the CYP3A4 5′-flanking region, and is thus unlikely to modulate CYP3A4-mediated metabolism of steroids.
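The odds ratios and 95% confidence intervals in this abstract come from unconditional logistic regression. As a simpler illustration of what such figures mean, the sketch below computes a crude (unadjusted) odds ratio with a Wald interval from a 2x2 table. The function `odds_ratio_ci` and the counts passed to it are hypothetical and are not the study's data or method.

```python
import math


def odds_ratio_ci(exposed_cases, unexposed_cases,
                  exposed_controls, unexposed_controls, z=1.96):
    """Crude odds ratio with a Wald (Woolf) confidence interval.

    z=1.96 gives an approximate 95% interval. All four counts must be > 0.
    """
    odds_ratio = (exposed_cases * unexposed_controls) / (
        unexposed_cases * exposed_controls)
    # Standard error of log(OR) is the root of the summed reciprocal counts.
    se_log_or = math.sqrt(1 / exposed_cases + 1 / unexposed_cases
                          + 1 / exposed_controls + 1 / unexposed_controls)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 20 of 100 cases and 10 of 100 controls carry the variant.
or_est, ci_low, ci_high = odds_ratio_ci(20, 80, 10, 90)
print(f"OR = {or_est:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

An interval that includes 1, as both intervals reported in the abstract do, is consistent with no association at the 5% level.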
Article
Marked interindividual variability in expression of CYP3A4 influences the disposition of many endo- and xenobiotics, including the metabolism of steroids, environmental toxins and therapeutically useful drugs. The present study was designed to determine the genetic basis of CYP3A4 variability. We analysed DNA from 82 individuals with known CYP3A4 phenotype including 53 Caucasians and 21 African-American liver donors, seven individuals who were outliers in CYP3A4 metabolism and five individuals in a family of a poor nifedipine metabolizer. In addition, we analysed DNA from the eight person DNA Polymorphism Discovery Resource subset (Coriell Institute) and 89 individuals representing nine ethnic groups. Five non-synonymous mutations in the coding region of CYP3A4 were observed. CYP3A4*14 (T44C) in exon 1 resulted in an L15P change; CYP3A4*15 (G14387A) in exon 6 resulted in a R162Q substitution; CYP3A4*10 (G14422C) in exon 6 resulted in a D174H substitution; CYP3A4*16 (C15721G) in exon 7 resulted in a T185S amino acid substitution; and CYP3A4*12 (C22002T) in exon 11 resulted in a L373F change in the CYP3A4 protein. An additional six single nucleotide polymorphisms (SNPs) in the 5′-UTR, 13 SNPs in the introns and three SNPs in the 3′-UTR were observed. Extensive population differences were observed in the frequencies of various CYP3A4 alleles. None of the 28 CYP3A4 SNPs identified in CYP3A4 phenotyped persons (most individuals being heterozygous for any CYP3A4 variant) was associated with low hepatic CYP3A4 protein expression or low CYP3A4 activity in vivo.
Article
CYP3A4 is the adult-specific form of cytochrome P450 in human livers [Komori, M., Nishio, K., Kitada, M., Shiramatsu, K., Muroya, K., Soma, M., Nagashima, K. & Kamataki, T. (1990) Biochemistry 29, 4430–4433]. The sequences of three genomic clones for CYP3A4 were analyzed for all exons, exon-intron junctions and the 5′-flanking region from the major transcription site to nucleotide position -1105, and compared with those of the CYP3A7 gene, a fetal-specific form of cytochrome P450 in humans. The results showed that the identity of 5′-flanking sequences between CYP3A4 and CYP3A7 genes was 91%, and that each 5′-flanking region had characteristic sequences termed as NFSE (P450NF-specific element) and HFLaSE (P450HFLa specific element), respectively. A basic transcription element (BTE) also lay in the 5′-flanking region of the CYP3A4 gene as seen in many CYP genes [Yanagida, A., Sogawa, K., Yasumoto, K. & Fujii-Kuriyama, Y. (1990) Mol. Cell. Biol. 10, 1470–1475]. The BTE binding factor (BTEB) was present in both adult and fetal human livers. To examine the transcriptional activity of the CYP3A4 gene, DNA fragments in the 5′-flanking region of the gene were inserted in front of the simian virus 40 promoter and the chloramphenicol acetyltransferase structural gene, and the constructs were transfected in HepG2 cells. The analysis of the chloramphenicol acetyltransferase activity indicated that (a) specific element(s) which could bind with a factor(s) in livers was present in the 5′-flanking region of the CYP3A4 gene to show the transcriptional activity.
Article
In this review, we consider the motivation behind contemporary Single Nucleotide Polymorphism (SNP) initiatives. Many of these initiatives are projected to involve large, population-based surveys. We therefore emphasize the utility of SNPs for genetic epidemiology studies. We start by offering an overview of genetic polymorphism and discuss the historical use of polymorphism in the identification of disease-predisposing genes via meiotic mapping. We next consider some of the unique aspects of SNPs, and their relative advantages and disadvantages in human population-based analyses. In this context, we describe and critique the following six different areas of application for SNP technologies: We focus on key issues within each of these areas in an effort to point out potential problems that might plague the use of SNPs (or other forms of polymorphism) within them. However, we make no claim that our list of considerations is exhaustive. Rather, we believe that it may provide a starting point for further dialog about the ultimate utility of SNP technologies. In addition, although our emphasis is placed on applications of SNPs to the understanding of human phenotypes, we acknowledge that SNP maps and technologies applied to other species (e.g. the mouse genome, pathogen genomes, plant genomes, etc.) are also of tremendous interest.
Article
The oral contraceptive pill is associated with a modest increase in the risk of early-onset breast cancer in the general population, but it is possible that the risk is higher in certain subgroups of women. The relative risk of breast cancer associated with oral contraceptive use has been reported to be higher for African-American women than for white women. African-American women also have a higher incidence of premenopausal breast cancer than white women. Circulating levels of insulin-like growth factor-I (IGF-I) vary between ethnic groups and are positively associated with the risk of premenopausal breast cancer. In general, the plasma level of IGF-I is lower in women who take oral contraceptives than in women who do not. In an attempt to explain the observed ethnic difference in IGF-I levels with oral contraceptive use, we sought to identify polymorphic variants of genes that are associated with IGF-I levels and estrogen metabolism. We measured IGF-I and IGFBP-3 plasma levels in 503 nulligravid women between the ages of 17 and 35. All women filled out a questionnaire that included information about ethnic background and oral contraceptive use. Samples of DNA were used to genotype the women for known polymorphic variants in the IGF1, AIB1, and CYP3A4 genes. Black women had significantly higher mean IGF-I levels than white women (330 ng/ml versus 284 ng/ml; P = 0.001, adjusted for age and oral contraceptive use). IGF-I levels were significantly suppressed by oral contraceptives in white women (301 ng/ml versus 267 ng/ml; P = 0.0003), but not in black women. Among oral contraceptive users, the IGF-I level was positively associated with the absence of the IGF1 19-repeat allele (338 ng/ml versus 265 ng/ml; P = 0.00007), with the presence of the CYP3A4 variant allele (320 ng/ml versus 269 ng/ml; P = 0.01), and with the presence of the AIB1 26-repeat allele (291 ng/ml versus 271 ng/ml; P = 0.08).
After adjusting for genotypes, ethnic group was no longer a significant predictor of the IGF-I level. IGF-I levels are higher among black than white women. Polymorphic variants in the CYP3A4, IGF1, and AIB1 genes are associated with increases in the plasma levels of IGF-I among oral contraceptive users and the variant alleles are much more common in black women than in white women. The high incidence of premenopausal breast cancer among black women may be mediated through genetic modifiers of circulating levels of IGF-I.
Article
The model has been constructed by combining information from the three-dimensional Fourier syntheses of horse oxyhaemoglobin at 5·5 Å and of sperm whale myoglobin at 1·4 Å resolution. Between them, these two sets of experimental data were sufficient to locate the positions of each of the haem groups and amino acid residues in haemoglobin within narrow limits. The nature of the residues was known from the chemical sequence. The accuracy of atomic positions is estimated to be of the order of 1 to 2 Å. The model has not yet been checked by comparison of observed and calculated intensities of X-ray reflexions. The model is sufficiently accurate, however, to show how the different kinds of residues are distributed between the interior and the surface of the four subunits; to locate the residues lying at the interfaces between the subunits and at the contacts between neighbouring molecules in the crystal; to determine the surroundings of the haem groups; and to reveal the existence of an internal cavity which is populated by a variety of polar side-chains and is filled with water. The nature of the intra- and intermolecular forces and the dissociation properties of haemoglobin are discussed in the light of the model. Relations between structure and function are dealt with in subsequent papers.