Copyright 1999 by the Genetics Society of America
Estimation of Pairwise Relatedness With Molecular Markers
Michael Lynch* and Kermit Ritland†
*Department of Biology, University of Oregon, Eugene, Oregon 97403 and †Department of Forest Sciences, University of British Columbia, Vancouver, British Columbia V6T1Z4, Canada
Manuscript received June 26, 1998
Accepted for publication April 19, 1999
ABSTRACT
Applications of quantitative genetics and conservation genetics often require measures of pairwise
relationships between individuals, which, in the absence of known pedigree structure, can be estimated
only by use of molecular markers. Here we introduce methods for the joint estimation of the two-gene
and four-gene coefficients of relationship from data on codominant molecular markers in randomly
mating populations. In a comparison with other published estimators of pairwise relatedness, we find
these new “regression” estimators to be computationally simpler and to yield similar or lower sampling
variances, particularly when many loci are used or when loci are hypervariable. Two examples are given
in which the new estimators are applied to natural populations, one that reveals isolation-by-distance in
an annual plant and the other that suggests a genetic basis for a coat color polymorphism in bears.
Corresponding author: Michael Lynch, Department of Biology, University of Oregon, Eugene, OR 97403.
E-mail: mlynch@oregon.uoregon.edu
Genetics 152: 1753–1766 (August 1999)

COEFFICIENTS of relationship between pairs of individuals play a central role in many areas of genetics and behavioral ecology. For example, in quantitative genetics, the phenotypic resemblance of relatives, which forms the basis for the empirical estimation of components of genetic variance, is a direct function of the probability that individuals have one or two genes identical by descent at a locus. Given such probabilities, causal components of variance (such as the additive and dominance genetic variance) can be estimated from the phenotypic covariance (Falconer and Mackay 1996; Lynch and Walsh 1998). In studies of laboratory or domesticated populations, where investigators can be certain of the degrees of relationship among observed individuals, the application of conventional quantitative-genetic methodology is straightforward. Major uncertainties about the relationships among individuals from natural populations are the primary impediment to extending quantitative-genetic analysis to field studies, but Ritland (1989, 1996a) has suggested how this problem might be overcome by regressing pairwise measures of phenotypic similarity on pairwise estimates of relatedness obtained with molecular markers.

Pairwise measures of relatedness also play a role in the field of conservation genetics. For example, in captive breeding programs, substantial effort is being made to ensure that matings are minimized between close relatives to reduce the loss of genetic variation by random genetic drift. If the potential parents are derived directly from wild-caught stock or are descendants of individuals of unknown relationship, a relative ranking of degrees of relatedness can only be achieved through inferences with molecular markers (Avise 1995).

A third field of inquiry within which pairwise relatedness plays a significant role is the evolution of social behavior. Studies in this area are largely focused around Hamilton’s (1964) theory of kin selection, which states that the evolutionary advantage of an altruistic act depends on whether the cost to the donor exceeds the benefit to the recipient multiplied by the relatedness between the two individuals. Because most such studies involve field populations where parentage is not directly observed, indirect inferences about relatedness must again be made with molecular markers.

In all of the above-mentioned applications of molecular markers, it is an implicit assumption that such markers provide reasonable, if not excellent, estimates of relatedness coefficients. Yet, there are few existing methods for the estimation of pairwise relatedness for which the statistical properties are well understood or well behaved. Several estimators have been developed for pairwise relatedness using the rather specialized data provided by DNA-fingerprint profiles (Lynch 1988; Li et al. 1993; Geyer and Thompson 1995). Following up on earlier work of Pamilo and Crozier (1982), Queller and Goodnight (1989) developed marker-based estimators for within-group relatedness, but these are of somewhat limited applicability in the estimation of pairwise relationship because of their poor behavior with diallelic loci. An efficient method-of-moments estimator, recently developed by Ritland (1996b), provides a basis for the joint estimation of identity-by-descent at both the genic and genotypic levels. Ritland’s approach, which is based on a model involving joint probabilities of the two genotypes of a pair, can be quite complex computationally and is ill-behaved with some gene frequencies. Maximum-likelihood methods have
been developed by Thompson (1975, 1976, 1986) to test for specific types of relationship.

In this article, we introduce a simple method for obtaining unbiased estimates of pairwise relationship coefficients. Its simplicity arises from the use of a regression approach for inferring relationship: one individual of a pair serves as a “reference,” and the probabilities of the locus-specific genotypes in the other “proband” individual are conditioned on those of the reference. Aside from its ease of application and unbiased nature, this method has two very useful features: it generates joint estimates of both the two- and four-gene coefficients of relatedness, and it yields simple expressions for the sampling variance of these coefficients. This latter feature provides a convenient means for optimizing the use of information derived from different loci. Following our derivation of the regression method, we compare its performance against that of other methods and then provide two examples of its application to studies of natural populations.

JOINT ESTIMATION OF TWO-GENE AND FOUR-GENE COEFFICIENTS

Throughout, we focus on the traditional definition of relatedness for individual pairs of diploid individuals, $r_{xy} = 2\Theta_{xy}$, where the coefficient of coancestry, $\Theta_{xy}$, is the probability that, for any autosomal locus, a random gene taken from individual x is identical by descent with a random gene taken from individual y. For monozygotic twins (and clonemates), $r_{xy} = 1$; for parent-offspring and full-sib relationships, $r_{xy} = 0.5$; and for second- and third-order relationships, $r_{xy}$ is equal, respectively, to 0.25 and 0.125.

The relatedness coefficient for two individuals (x and y) is a linear function of two “higher-order” coefficients,

$$r_{xy} = \frac{\phi_{xy}}{2} + \Delta_{xy}. \tag{1}$$

If we consider all four genes possessed by two individuals at a locus, $\phi_{xy}$ is the probability that a single gene in x is identical by descent with one in y, and $\Delta_{xy}$ is the probability that each of the two genes in x is identical by descent with one in y. For parents and offspring, $\phi_{xy} = 1$ and $\Delta_{xy} = 0$; for full sibs, $\phi_{xy} = 0.5$ and $\Delta_{xy} = 0.25$; and for half sibs, $\phi_{xy} = 0.5$ and $\Delta_{xy} = 0$, which by Equation 1 gives the second-order value $r_{xy} = 0.25$. For many applications, such a subdivision of $r_{xy}$ is unnecessary, but in quantitative genetics, a knowledge of the higher-order coefficient $\Delta_{xy}$ is desirable because the expected genetic covariance between individuals is defined to be

$$\sigma_{xy} = r_{xy}\,\sigma^2_A + \Delta_{xy}\,\sigma^2_D,$$

where $\sigma^2_A$ and $\sigma^2_D$ are the additive and dominance components of genetic variance for a quantitative trait. This expression assumes a random-mating population, which we also assume throughout. Inbreeding introduces the need for additional higher-order coefficients of relatedness and genetic components of variance (Cockerham 1971; Jacquard 1974). Higher-order terms must also be added to the previous expression when epistatic sources of genetic variance are present, but provided the population is randomly mating, no relationship coefficients are required beyond $r_{xy}$ and $\Delta_{xy}$ (Kempthorne 1954; Lynch and Walsh 1998).

In the following analyses, we focus on the estimation of $r_{xy}$ and $\Delta_{xy}$, as these are the relationship coefficients that are of primary practical utility. Our computer simulations showed that estimates of $\phi_{xy}$ have much higher sampling variance than those of $r_{xy}$ and $\Delta_{xy}$, enough so that the accurate measurement of $\phi_{xy}$ is beyond reach unless very large numbers of informative loci can be assayed. This large sampling variance does not carry over greatly to estimates of the composite measure $r_{xy}$, because there is also a very large negative sampling covariance between the two component coefficients, $\phi_{xy}$ and $\Delta_{xy}$.

Genotypic probabilities: There are two fundamental ways to set up a model for the genotypic probabilities in a pair of individuals. The first approach, adopted by Ritland (1996b), specifies the joint probability of both genotypes. The second approach, adopted here, specifies the conditional genotypic probability of a proband individual y, given the genotype of the reference individual x. We refer to these two approaches as “correlation” and “regression” methods in the sense that they are symmetrical vs. asymmetrical measures. Both approaches allow the joint estimation of $r_{xy}$, $\phi_{xy}$, and $\Delta_{xy}$, but as we will see, correlation and regression estimators differ substantially in terms of complexity and statistical properties. It is important to note that our use of the terms correlation and regression refers to the underlying statistical model and not to the estimators themselves. The estimators developed here and in Ritland (1996b) are more properly termed “method-of-moments” estimators.

Consider a single locus with n alleles, and let x be the reference individual (with alleles a and b) and y be the proband individual (with alleles c and d). The conditional probabilities for the $n(n+1)/2$ possible genotypes in y can be expressed as a function of $\phi_{xy}$, $\Delta_{xy}$, and the known allele frequencies,

$$P(y = cd \mid x = ab) = P_0(cd)\,(1 - \phi_{xy} - \Delta_{xy}) + P_1(cd \mid ab)\,\phi_{xy} + P_2(cd \mid ab)\,\Delta_{xy}, \tag{2}$$

where $P_0(cd)$ is the Hardy-Weinberg probability of genotype cd, and $P_1(cd \mid ab)$ and $P_2(cd \mid ab)$ denote the probabilities of genotype cd in y given genotype ab in x, the first being conditional on the two individuals having one gene identical by descent and the second being conditional on two genes being identical by descent.

Regression estimators: Equation 2 provides the foundation for the regression-based estimators that we now
explore. To illustrate the general approach, we first derive estimators conditioned on the observation of a homozygote reference genotype. In this straightforward case, two probabilities are informative about x’s relationship with individual y: $P(ii \mid ii)$ and $P(i\cdot \mid ii)$, the conditional probabilities that the two individuals have two vs. one pair of genes identical in state at the locus, with a dot denoting any allele other than i. The probability of no genes identical in state, $P(\cdot\cdot \mid ii)$, provides no additional information, as it simply equals $[1 - P(ii \mid ii) - P(i\cdot \mid ii)]$. Letting $p_i$ be the frequency of the ith allele, from Equation 2,

$$P(ii \mid ii) = p_i^2 + p_i(1 - p_i)\,\phi_{xy} + (1 - p_i^2)\,\Delta_{xy} \tag{3a}$$

$$P(i\cdot \mid ii) = 2p_i(1 - p_i) + (1 - p_i)(1 - 2p_i)\,\phi_{xy} - 2p_i(1 - p_i)\,\Delta_{xy}. \tag{3b}$$

Assuming that we know the allele frequency $p_i$ in advance, these two equations can be rearranged to yield estimators for the two unknown relationship coefficients,

$$\hat{\phi}_{xy} = \frac{(1 + p_i)\,\hat{P}(i\cdot \mid ii) + 2p_i\,\hat{P}(ii \mid ii) - 2p_i}{(1 - p_i)^2} \tag{4a}$$

$$\hat{\Delta}_{xy} = \frac{p_i^2 - p_i\,\hat{P}(i\cdot \mid ii) + (1 - 2p_i)\,\hat{P}(ii \mid ii)}{(1 - p_i)^2}, \tag{4b}$$

and from Equation 1,

$$\hat{r}_{xy} = \frac{\hat{P}(i\cdot \mid ii) + 2\hat{P}(ii \mid ii) - 2p_i}{2(1 - p_i)}. \tag{4c}$$

Throughout, we use a ∧ to distinguish an estimator from its parametric value. For any pair of observed individuals, the two probabilities necessary for the solution of these equations, $\hat{P}(i\cdot \mid ii)$ and $\hat{P}(ii \mid ii)$, are estimated as 0/1 variables, with 1’s being given to observed two-genotype combinations and 0’s being given to unobserved combinations. (Both probabilities are 0 if the proband has no alleles in common with the reference.) Thus, for example, when individual y contains 2, 1, and 0 i alleles, the estimate $\hat{r}_{xy}$ is $1$, $(1 - 2p_i)/[2(1 - p_i)]$, and $-p_i/(1 - p_i)$, respectively.

The appendix provides a parallel set of results for heterozygotes at diallelic and multiallelic loci. Diallelic heterozygous reference individuals introduce no new problems, but with multiallelic loci, there are six classes of conditional probabilities for heterozygous reference individuals. In the latter case then, the number of observed 0/1 variables exceeds the number of unknowns ($\phi$ and $\Delta$). To deal with this situation, we provide a weighted least-squares approximation.

A general estimator, which covers all three cases, is best described by introducing “indicator variables” for the sharing of pairs of alleles (as opposed to more complex patterns of sharing as used earlier). As before, let the reference individual have alleles a and b and the proband individual alleles c and d. If the reference individual is homozygous, $S_{ab} = 1$, while if it is heterozygous, $S_{ab} = 0$. Likewise, if allele a from the reference individual is the same as allele c from the proband, $S_{ac} = 1$, while $S_{ac} = 0$ if it is different. In total, there are six S’s corresponding to the six ways of choosing two objects without replacement from a pool of four objects. Letting $p_a$ and $p_b$ be the frequencies of alleles a and b in the population, the fully general expressions for the two coefficients of primary interest are

$$\hat{r}_{xy} = \frac{p_a(S_{bc} + S_{bd}) + p_b(S_{ac} + S_{ad}) - 4p_a p_b}{(1 + S_{ab})(p_a + p_b) - 4p_a p_b} \tag{5a}$$

$$\hat{\Delta}_{xy} = \frac{2p_a p_b - p_a(S_{bc} + S_{bd}) - p_b(S_{ac} + S_{ad}) + (S_{ac}S_{bd}) + (S_{ad}S_{bc})}{(1 + S_{ab})(1 - p_a - p_b) + 2p_a p_b}. \tag{5b}$$

In actual practice, there is no particular reason to use one member of a pair of individuals as the reference as opposed to the other member. Thus, the reciprocal estimates $\hat{r}_{xy}$ and $\hat{r}_{yx}$, etc., can be arithmetically averaged to further refine the pairwise relationship estimates for the pair of individuals x and y. In all of the following analyses, we rely on such reciprocal estimates, as the arithmetic average of the two reciprocal estimates generally has a lower statistical variance than a single estimate. In principle, the root of the product of the two reciprocal estimates could be used, but this leads to undefined estimates in the event that one is negative.

Multilocus estimates: Estimates of relatedness are usually based on data from multiple loci. Under the assumption that the marker loci are unlinked, the locus-specific estimates are independent. However, any averaging of the locus-specific estimates to obtain overall estimates of $r_{xy}$ and $\Delta_{xy}$ should account for the dramatic among-locus differences of sampling variance that can arise from both differences in reference genotypes (e.g., common homozygote vs. rare heterozygote) and in levels of variation (loci with more alleles being more informative).

Let $w_{r,x}(\ell)$ and $w_{\Delta,x}(\ell)$ denote the weights to be used for the $\ell$th locus in the overall estimates of $r_{xy}$ and $\Delta_{xy}$, and let $W_{r,x}$ and $W_{\Delta,x}$ be the sums of the weights over all L loci. The composite estimates of the relationship coefficients for x and y are then

$$\hat{r}_{xy} = \frac{1}{W_{r,x}} \sum_{\ell=1}^{L} w_{r,x}(\ell)\,\hat{r}_{xy}(\ell) \tag{6a}$$

$$\hat{\Delta}_{xy} = \frac{1}{W_{\Delta,x}} \sum_{\ell=1}^{L} w_{\Delta,x}(\ell)\,\hat{\Delta}_{xy}(\ell). \tag{6b}$$

With statistically independent marker loci, the locus-specific weights that minimize the sampling variance of the overall estimates $\hat{\phi}_{xy}$ and $\hat{\Delta}_{xy}$ are simply the inverses of the sampling variances of the locus-specific estimates. As noted in the appendix, we cannot be very certain of the numerical values of the weights because they are functions of the parameters that we are trying to estimate, but approximations can be obtained by simply assuming that x and y are unrelated. The locus-specific weights are then given by the inverses of the sampling variances of estimates of the relatedness coefficients for nonrelatives conditional on the genotype in x. General expressions for the weights are given by

$$w_{r,x}(\ell) = \frac{1}{\mathrm{Var}[\hat{r}_{xy}(\ell)]} = \frac{(1 + S_{ab})(p_a + p_b) - 4p_a p_b}{2p_a p_b} \tag{7a}$$

$$w_{\Delta,x}(\ell) = \frac{1}{\mathrm{Var}[\hat{\Delta}_{xy}(\ell)]} = \frac{(1 + S_{ab})(1 - p_a - p_b) + 2p_a p_b}{2p_a p_b}, \tag{7b}$$

with $S_{ab}$ equal to 1 when x is homozygous and equal to 0 when x is heterozygous.

Properties of the regression estimators: Extensive computer simulations demonstrated that the regression estimators given above are essentially unbiased, regardless of the numbers of loci or the values of $\phi$ and $\Delta$. Thus, the primary issues of interest are the magnitudes of the sampling variances of the estimators and their sensitivity to the degree of actual relationship and to the allele-frequency distribution.

We obtained estimates of the sampling variances of the regression estimators by Monte Carlo simulation, assuming gene frequencies were known without error and assuming a random mating population with unlinked marker loci. Reference genotypes were drawn randomly according to their Hardy-Weinberg frequencies, and the genotypes of the paired individuals were then obtained from the conditional genotype distributions given the reference genotype and the particular relationship. For multiallelic loci, two types of allele-frequency distributions were considered: uniform distributions, in which the frequencies of each of the n alleles per locus were equal to $1/n$, and “triangular” distributions, in which the frequencies of alleles followed the proportions 1, 2, ..., n. In all of the following figures, we report the single-locus sampling variances of the relationship coefficients. For analyses involving multiple loci with identical allele frequencies, the sampling variance of multilocus estimates can be obtained by dividing the plotted values by the number of loci (L).

A special property of the regression estimator is that the expected single-locus sampling variance declines with increasing numbers of unlinked loci, down to an asymptotic value (Figure 1). This dependence on number of loci arises with the regression estimator because the estimation variances (the weights) differ among alternative reference genotypes at the same locus (for example, a reference genotype having rarer alleles gives estimates with lower variance). By contrast, the correlation estimator of Ritland (1996b) is not conditioned upon observed genotype, and its variance only depends on the distribution of gene frequencies in the population. Although Figure 1 details the influence of the number of loci on the variance of the regression estimator, for the remaining analyses we focus on the situation in which 10 informative loci have been sampled. At that point, the lower asymptotic value of the single-locus sampling variance is closely approximated in most situations, and 10 loci is a good approximation of the sampling scheme employed in many empirical studies, with diallelic loci corresponding to isozymes and multiallelic loci corresponding to microsatellites.

For diallelic loci, the asymptotic sampling variance per locus for $\hat{r}$ is equal to 1 in the case of nonrelatives and somewhat lower for related individuals (even though nonoptimal weights are employed with relatives; Figure 1). With allele frequencies approaching 0.5, the optimal weights of all reference genotypes approach equality regardless of the degree of relationship, because all alleles are then equally informative. Thus, the asymptotic sampling variances near allele frequencies of 0.5 are the best that one could expect to achieve even if the correct weights were used. Because even with close relatives, the sampling variance is never less than about 0.4 per locus, these results imply that with a large number of loci, the expected standard error of $\hat{r}$ is generally on the order $1/\sqrt{L}$ when diallelic loci are assayed, somewhat greater if loci with extreme allele frequencies are included, and slightly less with close relatives.

As in the case of $\hat{r}$, the single-locus sampling variance of $\hat{\Delta}$ depends on the number of loci sampled, but the sensitivity to this is reduced at moderate allele frequencies (Figure 1). For all degrees of relationship, the asymptotic single-locus sampling variance for $\hat{\Delta}$ declines as allele frequencies become more equitable (Figure 1). It can exceed 10 when allele frequencies are extreme and is never much less than 1 with any type of relationship. Thus, as in the case of $\hat{r}$, with diallelic loci, the best that one can ever expect to achieve with the regression estimator is a multilocus standard error of $\hat{\Delta}$ equal to $1/\sqrt{L}$.

In principle, an increase in the number of alleles per locus should reduce the sampling variance of relatedness estimates, because alleles that are identical in state will be more reliable as indicators of identity by descent. For nonrelated individuals, the asymptotic single-locus sampling variance of $\hat{r}$ is very close to $1/(n - 1)$, regardless of the form of the allele-frequency distribution (Figure 2). With parents and offspring, the sampling variance is up to 50% less than this, while with other types of relatives it is somewhat higher when alleles with low frequency are common. Again, with an even allele-frequency distribution, all reference genotypes are equally informative regardless of the degree of relationship, so the results for this case can be viewed as the minimum sampling variance that one can expect to achieve with the regression estimator: except in the case of parents and offspring, a standard error of $\hat{r}$ less than about $1/\sqrt{L(n - 1)}$ is not achievable. Relative to the situation with $\hat{r}$, the rate of reduction in the asymptotic sampling
variance of $\hat{\Delta}$ with increasing n is more rapid (Figure 2). For nonrelatives, the asymptotic single-locus variance closely approximates $2/[n(n - 1)]$ regardless of the form of the allele-frequency distribution.

Figure 1. Single-locus sampling variances for estimates of pairwise $r$ and $\Delta$ for the range of possible gene frequencies at diallelic loci. For each gene frequency (in increments of 0.01) and degree of relationship, random pairs of multilocus genotypes were obtained by Monte Carlo simulation for 32,000 individuals. For each pair of individuals, the two reciprocal weighted estimates were obtained and then averaged to obtain the pairwise estimates. Solid lines, large dashes, medium dashes, and short dashes denote estimates based on 1, 5, 10, and 25 loci, respectively.

COMPARISON WITH OTHER ESTIMATORS

As noted above, for applications in quantitative genetics, there is a need for separate estimates of $r_{xy}$ and $\Delta_{xy}$ because the additive genetic covariance between individuals is a function of the composite measure $r_{xy}$, whereas the dominance genetic covariance is a function of $\Delta_{xy}$. However, for situations in which one can be reasonably certain that the dominance genetic variance for a trait is negligible, or when one can be certain that collateral relatives (e.g., pairs of individuals, such as full sibs and double first cousins, that share paternal and maternal genes) are absent, $\Delta_{xy}$ can be ignored. In addition, in many applications in conservation genetics and behavioral ecology, the composite estimate $r_{xy}$ may provide all the information that is needed. Four additional estimators of $r_{xy}$, all of which are unbiased, have been previously described.

A simple estimator based on the sharing of alleles,
proposed by Lynch (1988) for analyses employing DNA fingerprint profiles, can be generalized to any set of codominant markers. The following expression includes the slight modification suggested by Li et al. (1993). Define the similarity index, $S_{xy}$, to be the average fraction of genes at a locus in a reference individual (here either x or y) for which there is another gene in the proband that is identical in state. Thus, $S_{xy} = 1$ when (x = ii, y = ii) or (x = ij, y = ij), $S_{xy} = 0.75$ when (x = ii, y = ij) or vice versa, $S_{xy} = 0.5$ when (x = ij, y = ik), and $S_{xy} = 0$ when (x = ij, y = kl). A single-locus estimator for $r_{xy}$ is then

$$\hat{r}_{xy} = \frac{S_{xy} - S_0}{1 - S_0}, \tag{8}$$

where $S_0 = \sum_{i=1}^{n} p_i^2 (2 - p_i)$ is the expected value of S at the locus for unrelated individuals in a random-mating population. This simple estimator derives from the principle that if two individuals are related to degree $r_{xy}$, the expected fraction of genes that they have identical in state is the sum of the fractions shared because of identity-by-descent and because of identity-in-state (but not identity-by-descent), $E(S_{xy}) = r_{xy} + (1 - r_{xy})S_0$. Note that unlike the weighted regression estimator described above, Equation 8 does not return estimates of $r_{xy}$ greater than 1. However, like the weighted regression estimator, Equation 8 does generate negative estimates whenever the observed $S_{xy}$ is less than $S_0$ because of sampling error. In the following, Equation 8 is referred to as the similarity-index estimator.

Figure 2. Single-locus sampling variances for $r$ and $\Delta$ as a function of number of alleles at loci with uniform and triangular allele-frequency distributions. Results are given for nonrelatives (NR), half sibs (HS), full sibs (FS), and parents and offspring (PO). The plotted values were obtained from Monte Carlo simulations of 10 loci (all with the same allele-frequency profile) for 32,000 pairs of individuals. Sampling variances of multilocus estimates of $r$ and $\Delta$ are obtained by dividing the plotted values by the number of loci, keeping in mind that somewhat higher values are expected if fewer than 10 loci are observed.

Like Equation 8, Ritland’s (1996b) method-of-moments estimator for $r_{xy}$ considers the joint distribution of both genotypes in a symmetrical way. The differing information provided by alternative alleles is incorporated by considering the incidence of each of the n possible alleles at the locus. The observed data are summarized as an array of n similarities, where the ith element ($S_i$) is equal to 0.0 (at most, one of the individuals contains allele i), 0.25 (both individuals contain a single i allele), 0.5 (one individual contains two and the other individual one i alleles), or 1.0 (both individuals are ii homozygotes). Estimates of $r_{xy}$ derived for each allele are combined into a single estimate for the locus by using weights that assume zero relationship (as with the weighted regression estimators derived above),

$$\hat{r}_{xy} = \frac{2}{n - 1}\left[\left(\sum_{i=1}^{n} \frac{S_i}{p_i}\right) - 1\right]. \tag{9}$$
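The two symmetrical estimators above (Equations 8 and 9) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' software: genotypes are represented as 2-tuples of allele labels, `freqs` is a hypothetical dictionary mapping every allele at the locus to its known frequency, and the function names are invented here. The per-allele similarity in Equation 9 is computed as the product of the two allele counts divided by 4, which reproduces the listed values 0, 0.25, 0.5, and 1.

```python
from collections import Counter


def similarity_index_r(x, y, freqs):
    """Single-locus similarity-index estimate (Equation 8).

    x, y  : genotypes as 2-tuples of allele labels (illustrative encoding)
    freqs : dict of allele label -> known population frequency
    """
    def shared_fraction(ref, other):
        # Fraction of genes in `ref` that have an identical-in-state gene in `other`.
        return sum(allele in other for allele in ref) / 2.0

    # Average over both choices of reference, giving 1, 0.75, 0.5, or 0.
    s_xy = 0.5 * (shared_fraction(x, y) + shared_fraction(y, x))
    # Expected similarity for unrelated individuals under random mating.
    s_0 = sum(p * p * (2.0 - p) for p in freqs.values())
    return (s_xy - s_0) / (1.0 - s_0)


def ritland_r(x, y, freqs):
    """Single-locus correlation estimate (Equation 9)."""
    n = len(freqs)  # number of alleles at the locus
    count_x, count_y = Counter(x), Counter(y)
    # S_i = (copies of allele i in x) * (copies in y) / 4.
    total = sum((count_x[a] * count_y[a] / 4.0) / p for a, p in freqs.items())
    return 2.0 / (n - 1) * (total - 1.0)
```

With two equally frequent alleles, a pair of identical homozygotes gives 1 under Equation 8 but 2 under Equation 9, which illustrates the remark that only the similarity-index estimator is bounded above by 1 at a single locus.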
[Note that the $r_{xy}$ in this article is twice that defined in the Ritland (1996b) article.]
A simpler estimator, also based upon the joint distribution of genotypes, was described by Ritland (1996b) and earlier workers (Li and Horvitz 1953; Weir 1996, Equation 2.28), primarily in relation to estimating inbreeding coefficients. Defining an alternative similarity index such that $S'_{xy} = 1$ when (x = ii, y = ii), $S'_{xy} = 0.5$ when (x = ij, y = ij) or (x = ii, y = ij), $S'_{xy} = 0.25$ when (x = ij, y = ik), and $S'_{xy} = 0$ when (x = ij, y = kl), then

$$\hat{r}_{xy} = \frac{2(S'_{xy} - J_0)}{1 - J_0}, \tag{10}$$

where $J_0 = \sum_{i=1}^{n} p_i^2$ is the expected homozygosity at the locus. Equation 10 is equivalent to an unweighted correlation estimator. Because our analyses showed it to be uniformly worse in terms of sampling variance than all of the estimators presented here, we do not consider it any further.
Finally, we note Queller and Goodnight’s (1989) estimator of $r_{xy}$. Although their index is primarily designed for estimating the average degree of relatedness within groups of individuals, it can be expressed in terms of the same parameters that we employ with our Equations 5a and 5b to obtain a pairwise estimator for individuals x and y,

$$\hat{r}_{xy} = \frac{0.5(S_{ac} + S_{ad} + S_{bc} + S_{bd}) - p_a - p_b}{1 + S_{ab} - p_a - p_b}. \tag{11}$$

This equation has limited utility with diallelic loci: if individual x is a heterozygote, then $S_{ab} = 0$ and Equation 11 is undefined because $p_a + p_b = 1$. Therefore, in the following analyses, we consider Equation 11 only in the context of multiallelic loci.
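Because Equation 11 is written with the same indicator variables as Equation 5a, the two pairwise estimators can be sketched side by side. The Python below is a minimal illustration under stated assumptions: genotypes are 2-tuples of allele labels, `freqs` is a hypothetical dictionary of known allele frequencies, and all function names are invented for this sketch. The reference individual is x in every function. `weight_r` is the nonrelative weight of Equation 7a, and the Queller-Goodnight function returns `None` in the undefined diallelic-heterozygote case described above.

```python
import math


def indicators(x, y):
    """Allele-sharing indicators S_ab, S_ac, S_ad, S_bc, S_bd (each 0 or 1)."""
    a, b = x  # reference alleles
    c, d = y  # proband alleles
    return {
        "ab": int(a == b), "ac": int(a == c), "ad": int(a == d),
        "bc": int(b == c), "bd": int(b == d),
    }


def regression_r(x, y, freqs):
    """Single-locus regression estimate of r with x as reference (Equation 5a)."""
    s = indicators(x, y)
    pa, pb = freqs[x[0]], freqs[x[1]]
    num = pa * (s["bc"] + s["bd"]) + pb * (s["ac"] + s["ad"]) - 4 * pa * pb
    den = (1 + s["ab"]) * (pa + pb) - 4 * pa * pb
    # den is zero for a heterozygous reference at a diallelic locus with
    # p = 0.5; such a locus carries zero weight under Equation 7a.
    return num / den


def weight_r(x, freqs):
    """Nonrelative weight for the r estimate at one locus (Equation 7a)."""
    pa, pb = freqs[x[0]], freqs[x[1]]
    s_ab = int(x[0] == x[1])
    return ((1 + s_ab) * (pa + pb) - 4 * pa * pb) / (2 * pa * pb)


def queller_goodnight_r(x, y, freqs):
    """Pairwise Queller-Goodnight estimate with x as reference (Equation 11)."""
    s = indicators(x, y)
    pa, pb = freqs[x[0]], freqs[x[1]]
    num = 0.5 * (s["ac"] + s["ad"] + s["bc"] + s["bd"]) - pa - pb
    den = 1 + s["ab"] - pa - pb
    if math.isclose(den, 0.0, abs_tol=1e-12):
        return None  # undefined: heterozygous reference at a diallelic locus
    return num / den
```

For a homozygous reference with allele frequency 0.2, `regression_r` returns 1, 0.375, and -0.25 when the proband carries two, one, and zero copies of that allele, matching the values $1$, $(1 - 2p_i)/[2(1 - p_i)]$, and $-p_i/(1 - p_i)$ given for Equation 4c.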
In comparing the performance of these alternative methods for estimating $r_{xy}$ to that of the regression estimator, we evaluated their single-locus sampling variances analytically by considering the joint probabilities of all genotypes of pairs of individuals, conditional on the degree of relationship and the allele-frequency distribution. With these alternative methods, the weights depend only on the allele-frequency distribution in the population, not on the genotypes of the reference and proband individuals. Thus, with multiple marker loci all with the same allele frequencies, the multilocus sampling variances are simply the single-locus values divided by the number of loci. When loci have different allele-frequency distributions, as is usually the case in practice, weighted multilocus estimates can be obtained by weighting the locus-specific estimates by the inverses of their sampling variance.

Figure 3. Single-locus sampling variances for estimates of $r$ derived with the regression method (R), the correlation method (C), and the similarity-index method (S) for diallelic loci. The results for the regression method apply to analyses based on 10 loci and were obtained by Monte Carlo simulations; additional loci yield slightly lower values. The results for the correlation and similarity-index methods are exact solutions based on expected genotype combinations.

For diallelic loci, the correlation estimator yields a sampling variance per locus equal to one in the case of nonrelatives regardless of the allele frequency (Figure 3). As noted above, the regression estimator asymptotically approaches this same level of efficiency for nonrelatives, but the similarity-index method has higher sampling variance. On the other hand, for close relatives, compared to the correlation estimator, the regression and similarity-index methods yield more accurate estimates of $r$ over the full range of allele frequencies at
diallelic loci, with the latter actually outperforming the former in the case of parent-offspring pairs.

Figure 4. Single-locus sampling variances for estimates of $r$ for multiallelic loci, derived with the regression method (R), the correlation method (C), the similarity-index method (S), and the Queller-Goodnight method (Q) for uniform and triangular allele-frequency distributions. The results for the regression method apply to analyses based on 10 loci and were obtained by Monte Carlo simulations; additional loci yield slightly lower values. The results for the correlation and the similarity-index methods are exact solutions based on expected genotype combinations.

A multiallelic perspective yields further insight into the relative efficiencies of the four techniques. With a uniform distribution of three or more alleles per locus, the single-locus sampling variance for $\hat{r}$ is essentially $1/(n - 1)$ with nonrelatives regardless of the method (Figure 4). Thus, because an even allele-frequency distribution provides the greatest power of inference, this seems to be the best that one can expect to achieve with any estimator of distant relationships. For related individuals, the regression and similarity-index methods yield very similar sampling variances of $r$ provided there are at least three alleles per locus, while the correlation and Queller-Goodnight estimators are again less efficient. For the two superior methods, the single-locus sampling variance of estimates of $\hat{r}$ asymptotically approaches 0.14 with increasing allele number with full sibs, and very slowly approaches 0 with parents and offspring.
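The exact single-locus variances quoted for the symmetrical estimators come from summing over all genotype pairs, and that kind of calculation can be sketched for the simplest case of nonrelatives. The Python below is an illustration under stated assumptions, not the authors' procedure: it enumerates Hardy-Weinberg genotype pairs for two unrelated individuals and accumulates the mean and variance of a single-locus estimator. The correlation estimator of Equation 9 is re-implemented so the block is self-contained; `freqs` and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement


def genotype_probs(freqs):
    """Hardy-Weinberg probabilities of all unordered genotypes at one locus."""
    probs = {}
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        probs[(a, b)] = freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]
    return probs


def ritland_r(x, y, freqs):
    """Single-locus correlation estimate (Equation 9)."""
    n = len(freqs)
    count_x, count_y = Counter(x), Counter(y)
    # Per-allele similarity S_i = (copies in x) * (copies in y) / 4.
    total = sum((count_x[a] * count_y[a] / 4.0) / p for a, p in freqs.items())
    return 2.0 / (n - 1) * (total - 1.0)


def exact_moments_unrelated(estimator, freqs):
    """Exact mean and variance of a single-locus estimator for nonrelatives.

    For unrelated individuals the two genotypes are independent draws, so the
    joint probability of a pair is the product of the two Hardy-Weinberg terms.
    """
    probs = genotype_probs(freqs)
    mean = 0.0
    second_moment = 0.0
    for x, px in probs.items():
        for y, py in probs.items():
            value = estimator(x, y, freqs)
            mean += px * py * value
            second_moment += px * py * value * value
    return mean, second_moment - mean * mean
```

Calling `exact_moments_unrelated(ritland_r, {"A": 0.2, "B": 0.8})` gives a mean of 0 and a variance of 1, in line with the statement that the correlation estimator has a per-locus variance of one for nonrelatives at diallelic loci regardless of allele frequency. The same function can be applied to dictionaries with more alleles to explore the $1/(n - 1)$ behavior described above.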
TABLE 1
Sampling variance properties of $\hat{\Delta}$

                                 Number of alleles
Relationship       Method     2       4       6       12

Uniform frequencies
Nonrelatives       R        0.999   0.168   0.067   0.015
                   C        1.000   0.166   0.067   0.017
Half sibs          R        1.011   0.269   0.142   0.056
                   C        1.004   0.272   0.144   0.056
Full sibs          R        0.949   0.423   0.324   0.248
                   C        0.948   0.440   0.336   0.256
Parent-offspring   R        0.989   0.368   0.219   0.096
                   C        1.008   0.376   0.220   0.096

Triangular frequencies
Nonrelatives       R        1.070   0.182   0.074   0.016
                   C        1.000   0.166   0.067   0.017
Half sibs          R        1.276   0.329   0.179   0.074
                   C        1.240   0.360   0.240   0.080
Full sibs          R        1.362   0.605   0.486   0.396
                   C        1.480   1.000   0.960   0.880
Parent-offspring   R        1.471   0.479   0.294   0.136
                   C        1.520   0.640   0.640   0.280

Values are given for the single-locus sampling variances. R and C denote the regression and correlation estimators, respectively. The regression estimates are based on Monte Carlo simulations of 10 loci per pair of individuals.
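As a cross-check on the nonrelative rows of Table 1, the Discussion's lower bound on the standard error of $\hat{\Delta}$, $\sqrt{2/[Ln(n-1)]}$, implies a single-locus variance of $2/[n(n-1)]$ when allele frequencies are uniform. The short sketch below compares that bound with the simulated regression (R) values copied from the table. The names `min_variance_delta` and `TABLE1_NONRELATIVES_UNIFORM_R` are hypothetical labels introduced here.

```python
# Simulated single-locus variances of Delta-hat for nonrelatives with
# uniform allele frequencies (regression estimator, Table 1).
TABLE1_NONRELATIVES_UNIFORM_R = {2: 0.999, 4: 0.168, 6: 0.067, 12: 0.015}


def min_variance_delta(n_alleles: int, n_loci: int = 1) -> float:
    """Lower bound 2 / [L n (n - 1)] on the variance of Delta-hat."""
    if n_alleles < 2 or n_loci < 1:
        raise ValueError("need at least 2 alleles and 1 locus")
    return 2.0 / (n_loci * n_alleles * (n_alleles - 1))


if __name__ == "__main__":
    for n, simulated in TABLE1_NONRELATIVES_UNIFORM_R.items():
        bound = min_variance_delta(n)
        print(f"n={n:2d}  bound={bound:.3f}  simulated={simulated:.3f}")
```

Under these assumptions the bound and the tabulated values agree to within simulation error, which is consistent with the statement that the regression estimator is close to the best achievable for unrelated pairs.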
With a triangular allele-frequency distribution, the regression and correlation methods again yield essentially identical results with nonrelatives, while the similarity-index and Queller-Goodnight methods have somewhat higher sampling variances. However, with related individuals, the similarity-index method is again the superior of the four methods, and the correlation and Queller-Goodnight estimators generally yield the highest sampling variance. By use of either the regression or similarity-index methods, up to a 50% reduction in the standard error of $\hat{r}$ can be achieved.

The only other marker-based method for the estimation of $\Delta$ is the correlation-based estimator of Ritland (1996b), which is quite complex algebraically. Results in Table 1 show that the much simpler regression estimator presented above yields essentially the same asymptotic sampling variances as the correlation method when the allele-frequency distribution is uniform. With triangular allele-frequency distributions, the results are also very similar for nonrelatives, but with related individuals, the regression estimator yields more precise estimates, with the reduction in sampling variance approaching 50% with close relatives.

Thompson (1975, 1986) has extensively investigated the use of maximum likelihood for inferring pairwise relationship. The likelihood method allows one to take an entirely different approach for genealogical inference. For example, Thompson discusses the power of likelihood to distinguish among major types of relationships, defined as family (parent-offspring, full sibs), close (half sibs, uncle, etc.), remote (cousin, etc.), and unrelated. This approach to inferring genealogical "relationship" is fundamentally different from our approach to estimating "relatedness," which is a nondiscrete numerical parameter defined in terms of probabilities of identity-by-descent. Nevertheless, we have considered the possibility of using likelihood methods to estimate "relatedness" under our regression framework. Using notation developed earlier, the likelihood of data from one locus is the probability

$$
\begin{aligned}
P(y = cd \mid x = ab) = {} & p_a p_b (2 - S_{ab})(2 - S_{cd}) \\
& \cdot \big[ (1 - 2\phi_{xy} + \Delta_{xy})\, p_c p_d \\
& \quad + 2(\phi_{xy} - \Delta_{xy}) \cdot \big( (S_{ac} + S_{bc})\, p_d + (S_{ad} + S_{bd})\, p_c \big)/4 \\
& \quad + \Delta_{xy} (S_{ac} S_{bd} + S_{ad} S_{bc})/2 \big]
\end{aligned}
\tag{12}
$$

and the multilocus likelihood is the product of Equation 12 over loci. This expression can be used for estimating relatedness by solving for the values of $r_{xy}$ and $\Delta_{xy}$ that maximize Equation 12, given the data.

Using computer simulations, we examined the behavior of such maximum-likelihood estimation of relatedness by a standard numerical method (Newton-Raphson iteration). Convergence to a maximum was confirmed both by noting that the likelihood increased over iterations and converged and by comparing the iterative solutions to likelihood functions of the same data mapped
by brute force. The results, and those discussed by Ritland (1996b), suggest that the potential for using maximum likelihood for estimating relatedness is limited. The problem is fundamentally due to the fact that the ideal properties of likelihood are asymptotic or apply to "large" sample sizes. The number of loci usually available for pairwise estimation is inherently small, too small for likelihood to avoid substantial problems with bias (usually negative) and extremely large sampling variance. For example, for the case of zero true relatedness, the average estimate of $r_{xy}$ is on the order of −1.0 or less when 40 or fewer loci are sampled, and the sampling variance is two to three orders of magnitude beyond that shown for the alternative estimators in Figures 3 and 4. Interestingly, we found that there is an approximate sample size (number of loci) above which the maximum-likelihood estimators become "stable" or show approximately the predicted asymptotic variance. However, this sample size is large. For the maximum-likelihood estimator of $r_{xy}$, at low true relatedness, stability occurs at ~70 diallelic loci ($p = 0.5$). The maximum-likelihood estimator of $\Delta_{xy}$ exhibits similar behavior, although it begins to stabilize when ~30 loci have been sampled. Thus, while the maximum-likelihood approach may provide a useful means for comparing alternative degrees of relationship by likelihood-ratio tests, its applicability for estimating pairwise relatedness coefficients appears to be limited unless one has the luxury of a very large number of polymorphic markers.

EXAMPLE APPLICATIONS

As examples of how estimators of pairwise relatedness can be used in population studies and how they behave with actual data, we consider two applications. First, as part of a study of isolation-by-distance and field heritabilities in the common monkeyflower (Mimulus guttatus), 300 plants were randomly selected along an 84-m transect through a meadow adjacent to Indian Valley Reservoir in Clear Lake County, California (this was the "meadow" transect of Ritland and Ritland 1996). Extracts were obtained from corollas and assayed for 10 polymorphic isozyme loci. Eight loci were diallelic, 1 was triallelic, and the other had four alleles. Using the regression estimator, relatedness was estimated for pairs of plants separated by up to 4 m (with gene frequencies estimated from the entire sample). The estimates of pairwise relatedness from this dataset show considerable scatter, with some being >+1 and many <0 (Figure 5). Such behavior is in accordance with the results presented above, which highlight the large sampling variance expected for estimates based upon relatively few marker loci. Because of this large variance, significant inferences can be made only from groups of pairwise relatedness estimates or from correlations of these estimates with other quantities such as similarity for a quantitative trait (Ritland 1996a). In this particular application, there is a negative regression of relatedness on distance (Figure 5) as expected under isolation-by-distance. Relatedness decreased ~50% over the span from 0 to 4 m, with the average value for adjacent plants being 0.21, nearly the level of relatedness expected between half sibs (0.25).

Figure 5. Estimates of pairwise relatedness in the common monkeyflower plotted as a function of distance. The estimated slope of the linear regression is −0.037/m (0.005) and the estimated intercept is 0.21 (0.01). The standard errors (in parentheses) were obtained by bootstrapping over individuals, with comparisons between identical individuals being excluded.

A second application of relatedness estimates derives from work (D. Marshall and K. Ritland, unpublished results) with a white-phase (termed Kermodism) of the black bear, which is found in low to moderate (10%) frequency along the north coast of British Columbia and adjacent islands. The genetic basis of the coat color polymorphism is unknown. During late summer 1997, nearly 900 bear hair samples were collected from five islands and the adjacent mainland of northern coastal British Columbia. DNA was extracted from hairs with roots and assayed for 8 highly polymorphic microsatellite loci using the primers developed by Paetkau et al. (1995). The number of alleles per locus ranged from 7 to 17, with a mean of 10.4, and locus-specific heterozygosities ranged from 0.72 to 0.85, with a mean of 0.79. After factoring out the multiple samples for individual bears, a total of 89 distinct genotypes were found in the regions where Kermodism was of significant frequency (17 on Gribbel Island, 13 on Hawksbury Island, 38 on Princess Royal Island, and 21 at Terrace [mainland BC]). Bear hair color was also recorded in these samples. Estimates of pairwise relatedness were found within each of these four regions, using the pooled samples to estimate gene frequencies. All pairs of individuals were then classified into two groups: pairs sharing coat color (both white or both black, of which there were 614 pairings) and pairs not sharing coat colors (one black, one white, involving 156 pairings). A comparison of the frequency distribution of $\hat{r}$ for these two groups
(Figure 6) shows an excess of relatedness among bears sharing coat colors ($\bar{r} = 0.057$ compared to 0.039 for unlike colors), suggesting a genetic basis for the variation in this character. However, bootstrap resampling indicated that this difference of means is not significant (the excess being present in only 88 highly variable microsatellite loci, the statistical error of relatedness is considerably less than that experienced with isozyme markers in the previous study). Further inferences about the mode of inheritance of Kermodism are given in Ritland (1999).

Figure 6. Distributions of estimates of pairwise relatedness among bears not sharing the same coat color and among bears sharing the same coat color.

DISCUSSION

Estimation of relatedness with molecular markers is a statistically demanding enterprise. On the positive side, all of the estimators described above (except maximum-likelihood) are essentially unbiased in the sense that they return estimates that are on average identical to their expected values. Errors in estimates of population allele frequencies, which were not incorporated into our simulations, can introduce bias, but the effects of error in gene-frequency estimation will generally be trivial (of order $1/N$ when $N$ individuals are censused for gene frequency) compared to the additional sampling errors that arise in the estimation of relatedness, provided the number of individuals sampled exceeds 100 or so (Ritland 1996a,b). Moreover, this source of bias can be simply removed by omitting the pair of interest from the estimate of allele frequency (Queller and Goodnight 1989), although pathological behavior will occur in the rare event that marker alleles are unique to particular individuals, as this would lead to population gene-frequency estimates of zero. In addition, the sampling variance of the relationship coefficients owing to uncertain allele frequencies can, in principle, be obtained by resampling procedures.

The high sampling variance of estimates of relatedness arises in part because of variance in identity-by-descent among loci and in part because of variance in identity-in-state for alleles that are not identical by descent. These sources of sampling error are fundamental consequences of Mendelian segregation, and no amount of statistical finesse can eliminate them. In the actual estimation of relatedness, however, further sampling error is introduced by error in inference. With the regression and correlation estimators, for example, large standard errors result because the estimates of relationship coefficients derived from single loci commonly fall outside of the true domain of (0, 1). Although estimators can be designed to ensure that all estimates lie in the range of true possibilities (e.g., Thompson 1976), all such estimators necessarily return biased estimates, and the magnitude of the bias depends on the actual degree of relationship. Thus, while negative single-locus estimates of relationship coefficients may seem to be an undesirable feature, it is precisely this feature that ensures that the estimators proposed above will be unbiased.

Our results suggest that the relative advantages of the alternative estimators of relatedness depend on several factors. These include the number of loci, the allele-frequency distribution, the degree of actual relationship, and the coefficient estimated ($r$ vs. $\Delta$). In general, molecular-marker approaches that yield many alleles and loci tend to favor use of the regression estimators proposed in this article over the correlation estimators presented by Ritland (1996b). With small numbers of diallelic loci with extreme allele frequencies, the correlation method is more efficient than the regression method, but the regression estimators are more efficient in almost all other cases. In addition, the simplicity of the regression estimators lends itself to easier programming and more stability of estimates under uneven allele frequency distributions. The simplicity of the regression-based approach is underscored by our ability to obtain an analytical solution for $\hat{\Delta}$ with this method. By contrast, the correlation approach of Ritland (1996b) requires, for a locus with $n$ alleles, the inversion of a matrix of size $n(n+5)/2$, which is $12 \times 12$ at the minimum with multiallelic loci and beyond analytical solution. Moreover, unlike the correlation estimator for $\Delta$, the regression estimator for this coefficient is well behaved over the full range of allele frequencies.

As noted above, some simple statements can be made concerning the minimum sampling variance that one can expect to achieve in the estimation of relationship coefficients. For pairs of unrelated or distantly related individuals assayed at $L$ loci, each containing $n$ alleles, the standard errors of the estimates of $\phi$ (details leading up to this result are not shown), $\Delta$, and $r$ will be no less than $2\sqrt{(n+4)/[Ln(n-1)]}$, $\sqrt{2/[Ln(n-1)]}$, and $\sqrt{1/[L(n-1)]}$, respectively. For diallelic loci, a common situation with allozymes, these limits take on values of $3.5/\sqrt{L}$, $1/\sqrt{L}$, and $1/\sqrt{L}$. With large numbers of alleles, as can be achieved with microsatellite loci, the limits asymptotically approach $\sqrt{4/Ln}$, $\sqrt{2/Ln^2}$, and $\sqrt{1/Ln}$. Fortunately, the two coefficients with the lowest sampling error, $r$ and $\Delta$, are the ones that have the greatest practical utility.

One of the limitations of both the regression and correlation methods for estimating relatedness is the use of weights that assume zero relationship. The best weights are a function of the actual relationship, but this is an unknown. Nevertheless, the use of approximate but incorrect weights yields more precise estimates than the use of unweighted estimators, because differences in the information content of alleles with different frequencies are at least partially taken into account. One might think that estimates obtained with the null weights could be improved upon by subsequently refining the weights, using the previous estimates of relatedness in the calculation of the weights. These revised weights could then give a second round of weighted estimates, and the whole process could be repeated again until a suitable degree of convergence to final estimates is achieved. However, simulations by us and by Ritland (1996b) indicated that, even with large numbers of loci, this iterative approach has little promise. Bias is introduced, and with the weights being as noisy as they are, the weights themselves are often wildly unrealistic.

Generally speaking, our results show that attempts to estimate relatedness with molecular markers can be greatly improved upon by working with multiallelic loci, with the most dramatic gains in efficiency occurring with loci with relatively even distributions of allele frequencies. Because the sampling variance of $\hat{r}$ is inversely proportional to $Ln$, it is clear that roughly the same amount of efficiency is gained by working with loci with twice the number of alleles as by doubling the number of loci. For $\Delta$, the sampling variance is inversely proportional to $Ln^2$, so a much greater gain can be achieved by increasing numbers of alleles as opposed to numbers of loci. Thus, an early investment in a search for informative loci (those with a large number of alleles, with roughly equal frequencies) can be quite advantageous in the long term. These recommendations assume that at least 10 or so loci are sampled, because with fewer loci, the tradeoff involving $r$ favors more loci over more alleles per locus.

The results presented above indicate that even with fairly large numbers of loci, standard errors of relationship coefficients will rarely be $<0.1/\sqrt{L}$ and often will be somewhat $>1/\sqrt{L}$, so in general one cannot expect to use markers to make precise statements about differences in relatedness between particular pairs of individuals. However, with enough effort applied to the right kinds of loci, it may be possible to reduce the sampling variance to the extent that Ritland's (1996a) quantitative-genetic technique can be applied to natural populations. Ritland's (1996a) method provides a means of estimating the additive and dominance components of genetic variance for quantitative traits (and covariance between traits) in the field by regressing measures of phenotypic similarity on the relatedness coefficients $\hat{r}$ and $\hat{\Delta}$. Aside from the physical labor involved, one of the greatest difficulties with this technique is the need to eliminate the sampling variance from the total observed variance of relatedness to estimate the actual variance in relatedness. The problem is by no means trivial, as can be seen in Ritland and Ritland's (1996) first application of the technique with the monkeyflower (Mimulus). With eight assayed loci, the estimates of $r$ derived by the correlation method ranged from −3 to +5, with approximately a third of all observed values being negative. The actual variance of relatedness was estimated to be on the order of only 0.04. Thus, almost all of the observed variance in $\hat{r}$ was due to sampling error. Such results clearly highlight the practical need for molecular and statistical methodologies for minimizing the sampling variance of relatedness.

We thank John Kelley for helpful comments. This work was supported by National Institutes of Health grant GM-36827 and National Science Foundation grant DEB-9629775 to M.L., and by a National Sciences and Engineering Research Council/Industry Research Chair in population genetics held by K.R.

LITERATURE CITED

Avise, J. C., 1995 Molecular Markers, Natural History and Evolution. Chapman and Hall, New York.
Cockerham, C. C., 1971 Higher order probability functions of identity of alleles by descent. Genetics 69: 235–246.
Falconer, D. S., and T. F. C. Mackay, 1996 Introduction to Quantitative Genetics, Ed. 4. Longman, Harlow, United Kingdom.
Geyer, C. J., and E. A. Thompson, 1995 A new approach to the joint estimation of relationship from DNA fingerprint data, pp. 245–260 in Population Management for Survival and Recovery, edited by J. D. Ballou, M. Gilpin and T. J. Foose. Columbia University Press, New York.
Hamilton, W. D., 1964 The genetical evolution of social behaviour: I and II. J. Theor. Biol. 7: 1–52.
Jacquard, A., 1974 The Genetic Structure of Populations. Springer, Berlin.
Kempthorne, O., 1954 The correlation between relatives in a random mating population. Proc. R. Soc. Lond. Ser. B 143: 103–113.
Li, C. C., and D. G. Horvitz, 1953 Some methods of estimating the inbreeding coefficient. Am. J. Hum. Genet. 5: 107–117.
Li, C. C., D. E. Weeks and A. Chakravarti, 1993 Similarity of DNA fingerprints due to chance and relatedness. Hum. Hered. 43: 45–52.
Lynch, M., 1988 Estimation of relatedness by DNA fingerprinting. Mol. Biol. Evol. 5: 584–599.
Lynch, M., and B. Milligan, 1994 Analysis of population genetic structure with RAPD markers. Mol. Ecol. 3: 91–99.
Lynch, M., and J. B. Walsh, 1998 Genetics and Analysis of Quantitative Traits. Sinauer Associates, Sunderland, MA.
Paetkau, D., W. Calvert, I. Stirling and C. Strobeck, 1995 Microsatellite analysis of population structure in Canadian polar bears. Mol. Ecol. 4: 347–354.
Pamilo, P., and R. H. Crozier, 1982 Measuring genetic relatedness in natural populations: methodology. Theor. Popul. Biol. 21: 171–193.
Queller, D. C., and K. F. Goodnight, 1989 Estimating relatedness using molecular markers. Evolution 43: 258–275.
Ritland, K., 1989 Marker genes and the inference of quantitative genetic parameters in the field, pp. 183–201 in Population Genetics, Plant Breeding and Gene Conservation, edited by A. H. D. Brown, M. T. Clegg, A. L. Kahler and B. S. Weir. Sinauer Associates, Sunderland, MA.
Ritland, K., 1996a A marker-based method for inferences about quantitative inheritance in natural populations. Evolution 50: 1062–1073.
Ritland, K., 1996b Estimators for pairwise relatedness and inbreeding coefficients. Genet. Res. 67: 175–186.
Ritland, K., 1999 Detecting inheritance with inferred relatedness in nature, in Adaptive Genetic Variation in the Wild, edited by T. Mousseau. Oxford University Press, Oxford (in press).
Ritland, K., and C. Ritland, 1996 Inferences about quantitative inheritance based on natural population structure in the yellow monkeyflower, Mimulus guttatus. Evolution 50: 1074–1082.
Thompson, E. A., 1975 The estimation of pairwise relationships. Ann. Hum. Genet. 39: 173–188.
Thompson, E. A., 1976 A restriction on the space of genetic relationships. Ann. Hum. Genet. 40: 201–204.
Thompson, E. A., 1986 Pedigree Analysis in Human Genetics. The Johns Hopkins University Press, Baltimore.
Weir, B. S., 1996 Genetic Data Analysis II. Sinauer Associates, Sunderland, MA.

Communicating editor: A. H. D. Brown

APPENDIX

Provided there are only two alleles at the locus in the population, the approach provided in the text for a homozygous reference genotype can also be applied to the case in which the reference genotype is a heterozygote for alleles $i$ and $j$. The conditional probabilities of observing proband genotypes, given a heterozygous reference genotype, are

$$P(ii \mid ij) = p_i^2 + p_i(0.5 - p_i)\,\phi_{xy} - p_i^2\,\Delta_{xy} \tag{A1a}$$

$$P(jj \mid ij) = p_j^2 + p_j(0.5 - p_j)\,\phi_{xy} - p_j^2\,\Delta_{xy}. \tag{A1b}$$

The third probability, $P(ij \mid ij)$, is omitted, as only two of the three probabilities are needed for a sufficient statistic because the three probabilities sum to unity.

Equating these probabilities to their estimates and rearranging, estimators for the coefficients of relationship are obtained as

$$\hat{\phi}_{xy} = \frac{2\,[\,q^2 \hat{P}(ii \mid ij) - p^2 \hat{P}(jj \mid ij)\,]}{pq(q - p)} \tag{A2a}$$

$$\hat{\Delta}_{xy} = 1 - \frac{\hat{P}(ii \mid ij)}{p} - \frac{\hat{P}(jj \mid ij)}{q}, \tag{A2b}$$

wherein, to emphasize that these equations apply only to diallelic loci, we have dropped the subscript $i$, letting $p = p_i$ and $q = 1 - p$. From Equation 1,

$$\hat{r}_{xy} = 1 + \frac{\hat{P}(ii \mid ij) - \hat{P}(jj \mid ij)}{q - p}. \tag{A2c}$$

When gene frequencies are exactly equal, a reference heterozygote at a diallelic locus yields undefined estimates for $\phi_{xy}$ and $r_{xy}$.

If there are more than two alleles in the population, there are six possible proband genotype categories conditioned on observing the heterozygous reference genotype $ij$. The conditional probabilities include $P(ii \mid ij)$ and $P(jj \mid ij)$ as given in Equations A1a and A1b plus four more:

$$P(ij \mid ij) = 2p_i p_j + [\,0.5(p_i + p_j) - 2p_i p_j\,]\,\phi_{xy} + (1 - 2p_i p_j)\,\Delta_{xy} \tag{A3a}$$

$$P(i\cdot \mid ij) = 2p_i(1 - p_i - p_j) + (1 - p_i - p_j)(0.5 - 2p_i)\,\phi_{xy} - 2p_i(1 - p_i - p_j)\,\Delta_{xy} \tag{A3b}$$

$$P(j\cdot \mid ij) = 2p_j(1 - p_i - p_j) + (1 - p_i - p_j)(0.5 - 2p_j)\,\phi_{xy} - 2p_j(1 - p_i - p_j)\,\Delta_{xy} \tag{A3c}$$

$$P(\cdot\cdot \mid ij) = (1 - p_i - p_j)^2\,(1 - \phi_{xy} - \Delta_{xy}). \tag{A3d}$$

The $\Delta_{xy}$ term of Equation A3a carries a positive sign, as is required for the six probabilities to sum to unity for every value of $\Delta_{xy}$. Thus, with multiallelic loci, heterozygous reference individuals generate the obvious difficulty of there being more equations than unknowns.

Linear regression provides a data-fitting procedure for obtaining estimators for $\phi_{xy}$, $\Delta_{xy}$, and $r_{xy}$ in this case. The six probabilities can be assembled into an array,

$$\mathbf{P} = \begin{pmatrix} P(ii \mid ij) \\ P(jj \mid ij) \\ P(ij \mid ij) \\ P(i\cdot \mid ij) \\ P(j\cdot \mid ij) \\ P(\cdot\cdot \mid ij) \end{pmatrix}.$$

For any pair of individuals, the observed data vector ($\hat{\mathbf{P}}$) will always contain a single one for the observed two-genotype combination with all other elements being equal to zero. The linear model then becomes

$$\hat{\mathbf{P}} = \mathbf{a} + \mathbf{M}_x \begin{pmatrix} \phi_{xy} \\ \Delta_{xy} \end{pmatrix} + \mathbf{e}, \tag{A4}$$

where the matrix $\mathbf{M}_x$ has two columns that contain the coefficients for $\phi_{xy}$ and $\Delta_{xy}$, respectively, $\mathbf{a}$ is a column vector containing the remaining constants (functions only of gene frequencies), and $\mathbf{e}$ is a vector of residuals with expectation zero. The elements of $\mathbf{M}_x$ and $\mathbf{a}$ are obtained directly from Equations A1a and A1b and A3a–A3d.

If the elements of the observation vector $\hat{\mathbf{P}}$ were independent and identical in distribution, ordinary least-squares analysis could be used to obtain estimates of the relationship coefficients with minimum sampling variance. However, because all of the elements of the observation vector are constrained to sum to 1, such conditions are obviously violated. Although the failure to fully account for the structure of the data in the $\mathbf{P}$ vector does not cause the estimates of the coefficients of relationship to be biased, it does elevate the sampling variance. Unfortunately, the variance-covariance structure necessary to generate the optimal weights for a more powerful generalized least-squares framework depends on the unknown parameters $\phi_{xy}$ and $\Delta_{xy}$. To obtain approximate weights, we rely on Ritland's (1996b) argument that, in the absence of prior information on
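the relationship of $x$ and $y$, it is reasonable to start with the assumption that $\phi_{xy} = \Delta_{xy} = 0$.

Using the optimal weights given by Equation 4b of Ritland (1996b), we were able to obtain analytical solutions for the weighted least-squares estimators of $\phi_{xy}$ and $\Delta_{xy}$ using an equation solver program. These are

$$\hat{\phi}_{xy} = \frac{4p_i p_j (2 - p_i - p_j)\,[\,1 - \hat{P}(ij \mid ij)\,] - 2(1 - 2p_i p_j)\,[\,p_j \hat{P}(i \mid ij) + p_i \hat{P}(j \mid ij)\,]}{(1 - p_i - p_j + 2p_i p_j)(4p_i p_j - p_i - p_j)} \tag{A5a}$$

$$\hat{\Delta}_{xy} = \frac{(1 - p_i - p_j)\,\hat{P}(ij \mid ij) - p_j \hat{P}(i \mid ij) - p_i \hat{P}(j \mid ij) + 2p_i p_j}{1 - p_i - p_j + 2p_i p_j}, \tag{A5b}$$

and from Equation 1

$$\hat{r}_{xy} = \frac{p_j \hat{P}(i \mid ij) + p_i \hat{P}(j \mid ij) + (p_i + p_j)\,\hat{P}(ij \mid ij) - 4p_i p_j}{p_i + p_j - 4p_i p_j}, \tag{A5c}$$

where $\hat{P}(i \mid ij) = \hat{P}(i\cdot \mid ij) + 2\hat{P}(ii \mid ij)$ and $\hat{P}(j \mid ij) = \hat{P}(j\cdot \mid ij) + 2\hat{P}(jj \mid ij)$. The leading factor in the numerator of Equation A5a is $(2 - p_i - p_j)$, the form for which $\hat{r}_{xy} = \hat{\phi}_{xy}/2 + \hat{\Delta}_{xy}$ holds. When there are only two alleles, Equations A5a–A5c reduce to the diallelic-locus estimates (A2a–A2c).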
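The Appendix estimators are linear in the observation vector, so substituting the expected probabilities for $\hat{\mathbf{P}}$ should return the true coefficients. The sketch below checks this numerically for a heterozygous reference genotype. It is an illustration under stated assumptions, not the authors' program: the function names `proband_probs` and `regression_estimates` and the dictionary keys are hypothetical. The $\Delta_{xy}$ term of $P(ij \mid ij)$ is entered with a positive sign and the first term of the $\phi_{xy}$ estimator uses the factor $(2 - p_i - p_j)$, these being the forms for which the six probabilities sum to one and the multiallelic estimators reduce to Equations A2a–A2c when $p_i + p_j = 1$.

```python
def proband_probs(p_i, p_j, phi, delta):
    """Six proband-category probabilities given reference heterozygote ij.

    Follows Equations A1a, A1b and A3a-A3d; '.' denotes any allele other
    than i or j. The Delta term of 'ij' is taken as positive (assumption
    stated in the lead-in).
    """
    rest = 1.0 - p_i - p_j  # total frequency of all other alleles
    return {
        "ii": p_i**2 + p_i * (0.5 - p_i) * phi - p_i**2 * delta,
        "jj": p_j**2 + p_j * (0.5 - p_j) * phi - p_j**2 * delta,
        "ij": 2 * p_i * p_j + (0.5 * (p_i + p_j) - 2 * p_i * p_j) * phi
              + (1 - 2 * p_i * p_j) * delta,
        "i.": 2 * p_i * rest + rest * (0.5 - 2 * p_i) * phi - 2 * p_i * rest * delta,
        "j.": 2 * p_j * rest + rest * (0.5 - 2 * p_j) * phi - 2 * p_j * rest * delta,
        "..": rest**2 * (1 - phi - delta),
    }


def regression_estimates(p_i, p_j, P):
    """Weighted least-squares estimates (phi, Delta, r) as in A5a-A5c.

    P maps the six categories to an observed indicator vector or to
    expected probabilities. Undefined when p_i + p_j == 4 * p_i * p_j.
    """
    P_i = P["i."] + 2 * P["ii"]   # P-hat(i | ij)
    P_j = P["j."] + 2 * P["jj"]   # P-hat(j | ij)
    P_ij = P["ij"]
    s, prod = p_i + p_j, p_i * p_j
    shared = p_j * P_i + p_i * P_j
    d1 = 1 - s + 2 * prod
    phi_hat = (4 * prod * (2 - s) * (1 - P_ij)
               - 2 * (1 - 2 * prod) * shared) / (d1 * (4 * prod - s))
    delta_hat = ((1 - s) * P_ij - shared + 2 * prod) / d1
    r_hat = (shared + s * P_ij - 4 * prod) / (s - 4 * prod)
    return phi_hat, delta_hat, r_hat


if __name__ == "__main__":
    # Full sibs: phi = 0.5, Delta = 0.25, so r = phi/2 + Delta = 0.5.
    expected = proband_probs(0.2, 0.3, phi=0.5, delta=0.25)
    print(sum(expected.values()))
    print(regression_estimates(0.2, 0.3, expected))

    # A single observed pair: proband genotype ij, all other categories zero.
    observed = {k: 0.0 for k in expected}
    observed["ij"] = 1.0
    print(regression_estimates(0.2, 0.3, observed))
```

Applied to a single observed pair, the same function returns one-locus estimates that need not lie in (0, 1), which is the behavior the Discussion identifies as the price of unbiasedness.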