
Genetic Mapping in the Presence of Genotyping Errors

September 2007
Genetics 176(4):2521-7


DOI:10.1534/genetics.106.063982


Authors:

Michela Troggio

Fondazione Edmund Mach - Istituto Agrario San Michele All'Adige

Riccardo Velasco

Council for Agricultural Research and Agricultural Economy Analysis



Dustin A. Cartwright,*,†,1 Michela Troggio,† Riccardo Velasco† and Alexander Gutin*

*Myriad Genetics, Salt Lake City, Utah 84108 and †Genetics and Molecular Biology Department, IASMA Research Center, San Michele a/Adige (TN) 38010, Italy

Manuscript received July 27, 2006

Accepted for publication January 18, 2007

ABSTRACT

Genetic maps are built using the genotypes of many related individuals. Genotyping errors in these data sets can distort genetic maps, especially by inflating the distances. We have extended the traditional likelihood model used for genetic mapping to include the possibility of genotyping errors. Each individual marker is assigned an error rate, which is inferred from the data, just as the genetic distances are. We have developed a software package, called TMAP, which uses this model to find maximum-likelihood maps for phase-known pedigrees. We have tested our methods using a data set in Vitis and on simulated data and confirmed that our method dramatically reduces the inflationary effect caused by increasing the number of markers and leads to more accurate orders.

GENETIC mapping uses the genotypes of many related individuals at selected markers to determine the relative locations of these markers. The genotype data allow us to infer where recombinations have occurred, which is directly related to the genetic distance. The purpose of a genetic mapping algorithm is to reconstruct as accurately as possible the order of the markers on the chromosomes and the genetic distances between them.

Genetic mapping algorithms fall into two categories: those that use multipoint-likelihood maximization and those that rely only on two-point statistics. MapMaker (Lander et al. 1987), CRI-MAP (Green et al. 1990), CarthaGène (de Givry et al. 2005), and R/qtl (Broman et al. 2003) fall into the former category, while GMendel (Echt et al. 1992), JoinMap (Stam 1993), and RECORD (van Os et al. 2005a) fall into the latter. Multipoint-likelihood maximization has theoretical advantages, but is slower than two-point methods.

We use multipoint-likelihood maximization, because it is more robust in the presence of missing data. Two-point statistics derive no information when an individual's genotype is missing for one of the markers. However, multipoint analysis uses nearby markers to approximate the missing genotypes, appropriately discounted because of possible recombinations. For the same reason, multipoint analysis is more powerful with markers that are not fully informative. In backcross and intercross pedigrees, this advantage is less apparent, but in outbred pedigrees, the markers will generally have many different segregation types, and two-point analysis between these will not incorporate all the information.

Without accounting for genotyping errors, each error in a nonterminal marker causes two apparent recombinations in the data set. Thus, every 1% of error rate in a marker adds 2 cM of inflated distance to the map. If there is an average of one marker every 2 cM, then an average error rate of 1% will double the size of the map. Markers with very high error rates will have large distances to the adjacent markers. These cases can be detected, either manually or automatically, and the markers removed. However, markers with low error levels will not be detected and, furthermore, may represent too large a portion of the data set to eliminate completely.
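The doubling claimed above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of TMAP: the function name and the 51-marker example are illustrative assumptions, and the calculation assumes distances small enough that recombination fraction and map distance are nearly proportional.

```python
def inflated_map_length_cm(true_length_cm, n_internal_markers, error_rate):
    """Approximate map length when errors are misread as recombinations.

    Each erroneous genotype at a nonterminal marker creates two apparent
    recombinations (one on each side), so an error rate e adds roughly
    2 * e Morgans, i.e. 200 * e cM, per nonterminal marker.
    """
    added_per_marker_cm = 2.0 * error_rate * 100.0
    return true_length_cm + n_internal_markers * added_per_marker_cm


# Hypothetical example: 51 markers spaced 2 cM apart give a 100 cM map
# with 49 nonterminal markers; a 1% error rate nearly doubles its length.
print(inflated_map_length_cm(100.0, 49, 0.01))
```

With terminal markers excluded, the example map grows from 100 cM to just under 200 cM, which matches the rule of thumb in the text.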

Apparent double recombinations may also be due to biological phenomena such as gene conversion or mutation and not laboratory errors. Nevertheless, as with laboratory genotyping errors, these phenomena are not indicative of recombination and treating them as recombinations inflates the map distances (Castiglione et al. 1998). For the purpose of this article, we use the term error to refer to any process that causes changes to single genotypes at a time, as opposed to recombination, which also affects all subsequent genotypes.

Previous work has presented methods for detecting errors in genotype data once the marker order has been decided (Lincoln and Lander 1992; Douglas et al. 2000; van Os et al. 2005b). The suspect genotypes can be checked and corrected if necessary. However, this verification procedure can be time consuming and is not necessarily fully effective, because some combinations of markers and individuals may consistently produce the same erroneous genotypes. Alternatively, the verification step may be skipped and the markers recoded solely on the basis of the error detection algorithm. This method may itself introduce errors, unless the parameters are chosen very conservatively, in which case it may miss errors. Finally, since the map itself has been built using the error-containing data set, those errors may be less apparent with that map.

Corresponding author: Myriad Genetics, 320 Wakara Way, Salt Lake City, UT 84108. E-mail: dcartwri@myriad.com

Genetics 176: 2521–2527 (August 2007)

In contrast, our approach integrates error detection and compensation into the map-building procedure. Furthermore, we use a likelihood model that does not force a dichotomy between correcting or not correcting particular genotypes. Instead, we have a probability distribution over the possible genotypes, which depends on both the observed genotype and the estimated probability of error. Thus, even genotypes that are only possibly erroneous can be correctly utilized in constructing the map.

Previous work modeling errors within the map-ordering process has not incorporated both independent error probabilities for the markers and estimation of the parameters from the data. MapMaker 3.0 includes an optional genotyping error rate for the entire linkage group but has no provisions for estimating this parameter from the data (Lincoln and Lander 1992). R/qtl is a software package that primarily performs QTL analysis, but includes a model for building maps with a fixed, uniform error rate, similar to MapMaker (Broman et al. 2003). Thallman et al. (2001) presented a model with independent error rates for each marker, but without provisions for estimating these from the data. On the other hand, Rosa et al. (2002) presented a method that estimates a global error rate from the data while ordering, but they use Gibbs sampling and not the EM algorithm, and thus their approach requires many more iterations to converge to a solution.

In the context of linkage analysis, the notion of complex-valued recombination fractions has been introduced (Göring and Terwilliger 2000; see also Abkevich et al. 2001). The purpose was to account for errors in the phenotype models. Our approach is similar, except that our errors are in the genotypes, not in the model, and we account for errors at every locus, not just at the disease locus.

We have developed a software package that uses the error-compensating likelihood model to find the maximum-likelihood map under that model. We have named the package TMAP after the tlod statistic of Abkevich et al. (2001). Although this method could apply to any pedigree type, TMAP works only with pedigrees where all parents are completely genotyped and phase known. This includes backcross, intercross, and phase-known outbred pedigrees. For phase-unknown outbred pedigrees, it is possible to determine the phases with sufficiently many offspring, as was done with the Vitis data used in this article (D. A. Cartwright, unpublished results). TMAP is freely available from

http://math.berkeley.edu/



dustin/tmap/.

METHODS

Likelihood model: In our likelihood model, each marker has both an observed genotype, which is specified in a data file, and a true genotype, which is not observed directly and can only be inferred. The relationship between the two genotypes is parameterized by an error rate $\epsilon$. In each haplotype, the true and observed genotypes coincide with probability $1 - \epsilon$. Thus, the overall genotypes coincide with probability $(1 - \epsilon)^2$ and differ only in the maternal haplotype with probability $(1 - \epsilon)\epsilon$, only in the paternal haplotype also with probability $(1 - \epsilon)\epsilon$, and in both haplotypes with probability $\epsilon^2$. This error model is completely analogous to the probability distribution of recombinations between a pair of markers. Of course, the true genotype cannot be known a priori, and in many cases the observed genotypes are not fully known either. Thus when computing the likelihood, we sum over the likelihoods of all possible values for these genotypes.
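As a quick check that the four cases of the error model form a proper probability distribution, the following sketch tabulates them for a given error rate. The function name and dictionary keys are illustrative assumptions, not identifiers from TMAP.

```python
def genotype_error_pattern_probs(eps):
    """Probabilities of the four per-genotype error patterns.

    Errors in the maternal and paternal haplotypes are treated as
    independent events, each occurring with probability eps.
    """
    return {
        "both_correct": (1 - eps) ** 2,
        "maternal_only": (1 - eps) * eps,
        "paternal_only": (1 - eps) * eps,
        "both_wrong": eps ** 2,
    }


probs = genotype_error_pattern_probs(0.05)
for pattern, p in probs.items():
    print(pattern, round(p, 6))
```

The four values always sum to 1, because $(1-\epsilon)^2 + 2\epsilon(1-\epsilon) + \epsilon^2 = 1$.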

Explicitly, the equation is as follows. Let $n$ and $m$ denote the number of individuals and markers, respectively. Let $\theta_i$ denote the recombination rate between markers $i$ and $i + 1$, and let $\epsilon_i$ denote the error rate for marker $i$. Then, the likelihood is a function of these two sets of parameters,

$$L(\theta, \epsilon) = \sum_{g \in G} \sum_{g' \in G'} \prod_{i=1}^{m-1} \ell\bigl(r(g_i, g_{i+1}), \theta_i\bigr) \prod_{i=1}^{m} \ell\bigl(r(g_i, g'_i), \epsilon_i\bigr), \qquad (1)$$

where $G$ is the set of all possible genotypes, $G'$ is the set of all genotypes that are consistent with the observations, each element $g$ consists of the true genotypes $g_i$, each element $g'$ consists of the observed genotypes $g'_i$, $r(g_i, g_j)$ is the number of recombinations between genotypes $g_i$ and $g_j$, and

$$\ell(r, \theta) = \theta^{r} (1 - \theta)^{2n - r}$$

is the likelihood of having exactly $r$ recombinations between two markers separated by a recombination fraction $\theta$ (or equivalently, exactly $r$ errors in a marker with error rate $\theta$).
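Because haplotypes are independent under this model, the double sum in Equation 1 factorizes and can be evaluated one haplotype at a time with a forward recursion over the markers. The sketch below is one way to do that, under stated assumptions: each haplotype is coded as 0 or 1 (grandparental origin) with `None` for a missing observation, and a uniform prior of 0.5 is placed on the first true genotype, which only rescales Equation 1 by a constant. The function names are hypothetical and do not come from TMAP.

```python
import math


def haplotype_likelihood(obs, theta, eps):
    """Likelihood of one observed haplotype under the error model.

    obs   -- list of m values, each 0, 1, or None (missing)
    theta -- list of m - 1 recombination fractions between adjacent markers
    eps   -- list of m per-marker error rates
    """
    def emit(o, s, e):
        # A missing observation is consistent with either observed value,
        # so summing over both contributes a factor of 1.
        if o is None:
            return 1.0
        return 1.0 - e if o == s else e

    # f[s] = probability of the observations so far with true state s.
    f = [0.5 * emit(obs[0], s, eps[0]) for s in (0, 1)]
    for i in range(1, len(obs)):
        t = theta[i - 1]
        f = [
            (f[s] * (1.0 - t) + f[1 - s] * t) * emit(obs[i], s, eps[i])
            for s in (0, 1)
        ]
    return f[0] + f[1]


def log_likelihood(haplotypes, theta, eps):
    """Sum of log likelihoods over independent haplotypes."""
    return sum(math.log(haplotype_likelihood(h, theta, eps)) for h in haplotypes)


# An isolated genotype flip in the middle marker is far better explained
# by a 5% error rate than by two recombinations 1% apart.
flip = [0, 1, 0]
print(haplotype_likelihood(flip, [0.01, 0.01], [0.0, 0.0, 0.0]))
print(haplotype_likelihood(flip, [0.01, 0.01], [0.0, 0.05, 0.0]))
```

The recursion costs time linear in the number of markers per haplotype, whereas a literal evaluation of the sums over $G$ and $G'$ would grow exponentially.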

We can represent this model visually as shown in Figure 1. Each node represents an abstract marker, i.e., genotypes for all individuals in the pedigree. The leaf nodes are the known, observed, possibly erroneous markers, and the internal nodes are the inferred, unobserved, error-free markers. Thus, except for the terminal markers, each physical marker corresponds to two nodes, one error free and one observed. Each arc represents separation between two markers, either because of recombination (vertical) or because of errors (horizontal).

As shown in the graph (Figure 1), there is no point in computing an error rate for the markers at either end. For these markers, errors and recombinations are indistinguishable in the model, so we conservatively assume that all the apparent recombinations are true recombinations and not errors.

Thus, the error rates effectively add $m - 2$ parameters to each linkage group of $m$ markers. The maximum-likelihood values of these additional parameters can be estimated along with the genetic distances using the EM algorithm (Lander and Green 1987). In the notation of Equation 1, we can use approximate values of $\theta_i$ and $\epsilon_i$ to compute the joint probability distribution over $G$ and $G'$ (E step), which can then be used to compute better approximations of $\theta_i$ and $\epsilon_i$ (M step). Iterating these two steps typically converges to the maximum-likelihood solution.
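The E and M steps can be sketched for the simplified case of independent binary haplotypes. This is an illustrative reimplementation under stated assumptions, not TMAP's code: haplotypes are coded 0, 1, or `None` for missing, the E step uses a forward-backward pass to obtain the expected number of recombinations per interval and errors per marker, and the M step divides those expectations by the number of haplotypes (or, for errors, by the number of non-missing observations, which is an assumption of this sketch). Terminal error rates are held at zero, as described in the text. Internal error rates must start above zero, since zero is a fixed point of the update.

```python
def em_step(haplotypes, theta, eps):
    """One EM iteration; returns updated (theta, eps)."""
    m = len(theta) + 1
    rec = [0.0] * (m - 1)  # expected recombinations per interval
    err = [0.0] * m        # expected errors per marker
    seen = [0] * m         # non-missing observations per marker

    def emit(o, s, e):
        if o is None:
            return 1.0
        return 1.0 - e if o == s else e

    for obs in haplotypes:
        # Forward pass: fwd[i][s] = P(obs[0..i], true state at i is s).
        fwd = [[0.5 * emit(obs[0], s, eps[0]) for s in (0, 1)]]
        for i in range(1, m):
            t, prev = theta[i - 1], fwd[-1]
            fwd.append([
                (prev[s] * (1 - t) + prev[1 - s] * t) * emit(obs[i], s, eps[i])
                for s in (0, 1)
            ])
        # Backward pass: bwd[i][s] = P(obs[i+1..] | true state at i is s).
        bwd = [[1.0, 1.0] for _ in range(m)]
        for i in range(m - 2, -1, -1):
            t = theta[i]
            nxt = [bwd[i + 1][s] * emit(obs[i + 1], s, eps[i + 1]) for s in (0, 1)]
            bwd[i] = [nxt[s] * (1 - t) + nxt[1 - s] * t for s in (0, 1)]
        total = fwd[-1][0] + fwd[-1][1]
        # E step: posterior probability of a recombination in each interval.
        for i in range(m - 1):
            p = sum(
                fwd[i][s] * theta[i]
                * emit(obs[i + 1], 1 - s, eps[i + 1]) * bwd[i + 1][1 - s]
                for s in (0, 1)
            )
            rec[i] += p / total
        # E step: posterior probability that each observed genotype is wrong.
        for i in range(m):
            if obs[i] is not None:
                seen[i] += 1
                wrong = 1 - obs[i]
                err[i] += fwd[i][wrong] * bwd[i][wrong] / total

    # M step: expected counts divided by the number of opportunities.
    n = len(haplotypes)
    new_theta = [r / n for r in rec]
    inner = [err[i] / seen[i] if seen[i] else 0.0 for i in range(1, m - 1)]
    return new_theta, [0.0] + inner + [0.0]


# Toy data: 2 of 20 haplotypes carry an isolated flip at the middle marker.
haps = [[0, 0, 0]] * 9 + [[1, 1, 1]] * 9 + [[0, 1, 0], [1, 0, 1]]
theta, eps = [0.1, 0.1], [0.0, 0.1, 0.0]
for _ in range(50):
    theta, eps = em_step(haps, theta, eps)
print(theta, eps)
```

On this toy data set the iteration attributes the two isolated flips to error at the middle marker rather than to pairs of recombinations, so the estimated recombination fractions shrink toward zero while the middle error rate settles near 2/20.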

Finally, the recombination rates are translated into map distances using the Kosambi map function. The Kosambi map function models recombination interference, even though the model assumes that each of the $\theta_i$ is independent of the others, meaning that recombination events separated by markers have independent probabilities.
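For reference, the Kosambi map function and its inverse are short closed-form expressions. The sketch below uses the standard formula $d = \tfrac{1}{4}\ln\frac{1 + 2\theta}{1 - 2\theta}$ (in Morgans); the function names are illustrative and not taken from TMAP.

```python
import math


def kosambi_cm(theta):
    """Kosambi map distance in centimorgans for a recombination
    fraction theta in the range [0, 0.5)."""
    return 25.0 * math.log((1.0 + 2.0 * theta) / (1.0 - 2.0 * theta))


def kosambi_theta(d_cm):
    """Inverse map function: recombination fraction for a Kosambi
    distance given in centimorgans."""
    return 0.5 * math.tanh(d_cm / 50.0)


# For small values the two scales nearly coincide: a 5% recombination
# fraction corresponds to slightly more than 5 cM.
print(round(kosambi_cm(0.05), 3))
print(round(kosambi_theta(5.0), 4))
```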

Since errors are defined in a way that is mathematically equivalent to recombinations, the position at each end of the map is equivalent to the neighboring position in this model. Any pair of maps that differs only by switching the two markers at one end will have the same likelihood. Therefore, any likelihood maximization of the order will leave each of these two pairs in an arbitrary order. These symmetries are analogous to the equivalence of any given order and the reverse order, except that reversing a map is a physical as well as a mathematical symmetry, but reversing the final two markers is not a physical symmetry. For the final map, we can pick the order that minimizes the error, again assuming that recombinations are more likely than errors, all else being equal. However, while building the map, it is useful to explicitly acknowledge these symmetries.

Marker order: We begin building our maps by trying all possible orders of $s$ seed markers. Because of the additional symmetries, there are only $s!/8$ unique orders. Then, we provisionally insert the next marker in all possible positions, keeping the $t$ highest likelihoods. Each additional marker is added in the same way. On the basis of our experiments, we have chosen $s = 6$ and $t = 3$ to provide a good balance between speed and accuracy.
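The count of $s!/8$ follows from three symmetries: reversing the whole map, swapping the first two markers, and swapping the last two markers. A brute-force enumeration confirms it for the seed size used here. This is a sketch for illustration only; the helper names are assumptions and the approach (canonicalizing every permutation) is not necessarily how TMAP enumerates seed orders.

```python
from itertools import permutations


def canonical(order):
    """Smallest representative among the orders that are equivalent
    under reversal and under swapping either terminal pair."""
    def swap_first(o):
        return (o[1], o[0]) + o[2:]

    def swap_last(o):
        return o[:-2] + (o[-1], o[-2])

    variants = set()
    for base in (tuple(order), tuple(reversed(order))):
        for a in (base, swap_first(base)):
            for b in (a, swap_last(a)):
                variants.add(b)
    return min(variants)


def unique_seed_orders(markers):
    """All seed orders that are distinct under the model's symmetries."""
    return sorted({canonical(p) for p in permutations(markers)})


print(len(unique_seed_orders(range(6))))  # 90, i.e. 6!/8
```

The same canonical form shows why, for a map beginning AB, the insertions ACB and CAB count as a single candidate: both reduce to the same representative.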

When inserting a new marker near either end of the map, the symmetries described above complicate the possibilities. When adding a marker C to a map that begins AB ..., there would seem to be three places to add it: ABC ..., ACB ..., CAB .... However, the last two are equivalent orders. Furthermore, the order of A and B was arbitrary, so the orders BAC ..., BCA ..., and CBA ... are just as plausible. In fact, these six orders consist of three pairs of equivalent orders, where each equivalent pair is defined by the marker in the third position. Thus we try each of the three equivalent pairs of orders only once.

After building an initial order, we use a simple Monte Carlo algorithm to find the maximum-likelihood order. At each iteration, a random permutation from the neighborhood is applied to the marker order, and the log likelihood is computed. If the new log likelihood is greater than the old one, the new order is accepted. If the new log likelihood is less than the old one, the new order is nonetheless accepted with probability $e^{-\delta L/T}$, where $\delta L$ is the decrease in log likelihood, and $T$, known as the temperature, is a parameter of the algorithm. This is similar to simulated annealing but with a fixed temperature (Kirkpatrick et al. 1983). We use two phases of Monte Carlo optimization, first with $T = 0.5$ and then with $T = 0.05$.
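A fixed-temperature acceptance rule of this kind can be sketched as follows. The sketch assumes the goal is to maximize the log likelihood, so that improvements are always accepted and a decrease of size $\delta L$ is accepted with probability $e^{-\delta L/T}$. The function names, the `loglik` and `propose` callables, and the iteration count are hypothetical placeholders rather than TMAP's interface.

```python
import math
import random


def accept(old_loglik, new_loglik, temperature, rng=random):
    """Metropolis-style acceptance at a fixed temperature."""
    if new_loglik >= old_loglik:
        return True
    drop = old_loglik - new_loglik  # positive decrease in log likelihood
    return rng.random() < math.exp(-drop / temperature)


def run_phase(order, loglik, propose, temperature, n_iter, rng=random):
    """Run n_iter proposals at one temperature; return the final order."""
    current, current_ll = order, loglik(order)
    for _ in range(n_iter):
        candidate = propose(current)
        candidate_ll = loglik(candidate)
        if accept(current_ll, candidate_ll, temperature, rng):
            current, current_ll = candidate, candidate_ll
    return current


def optimize(order, loglik, propose, n_iter=1000, rng=random):
    """Two phases, first at T = 0.5 and then at T = 0.05."""
    for temperature in (0.5, 0.05):
        order = run_phase(order, loglik, propose, temperature, n_iter, rng)
    return order
```

At $T = 0.05$ a decrease of even half a log-likelihood unit is accepted with probability below $10^{-4}$, so the second phase behaves almost like a greedy search around the order found by the first.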

Figure 1. Graphical representation of the error model. Each node represents an abstract marker, i.e., genotypes for all individuals in the pedigree. The leaf nodes are the known, observed, possibly erroneous markers, and the internal nodes are the inferred, unobserved, error-free markers. Thus, except for the terminal markers, each physical marker corresponds to two nodes, one error free and one observed. Each arc represents separation between two markers, either because of recombination (vertical) or because of errors (horizontal).

We define our neighborhood to have two different kinds of permutations, which we call flips and moves. A flip consists of taking a stretch of the map consisting of two or more markers and reversing its orientation in place, which is equivalent to a 2-change from the theory of the traveling salesman problem (Schiex and Gaspin 1997). A move consists of removing a marker from one location and inserting it in another. These are illustrated in Figure 2. Rather than consider each permutation equally, we bias the neighborhood toward the more local, smaller-scale alterations, which are more likely to have similar likelihoods. Within each family of permutations, each permutation has probability $C\rho^{\ell}$, where $\ell$ represents the size of the subsection in a flip and the length of the move, and $C$ is a constant to make the total probability 1. We use a value of $\rho = 0.9$ for both sets of permutations.
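One way to sample from such a biased neighborhood is to enumerate every flip and every move, weight each by $\rho^{\ell}$, and draw one at random. The sketch below does this for a marker order stored as a Python list. Choosing between the two families with equal probability is an assumption of this sketch, as are the function and variable names; the source does not say how TMAP mixes the families.

```python
import random


def propose(order, rho=0.9, rng=random):
    """Return a new order obtained by one random flip or move,
    with each candidate weighted by rho ** (its length)."""
    m = len(order)
    if rng.random() < 0.5:  # assumption: flips and moves equally likely
        # Flip: reverse a stretch of `length` >= 2 markers at `start`.
        flips = [(s, n) for n in range(2, m + 1) for s in range(m - n + 1)]
        weights = [rho ** n for _, n in flips]
        start, length = rng.choices(flips, weights=weights)[0]
        stretch = order[start:start + length][::-1]
        return order[:start] + stretch + order[start + length:]
    # Move: take the marker at index `src` and reinsert it at index `dst`,
    # so that it travels |dst - src| positions.
    moves = [(a, b) for a in range(m) for b in range(m) if a != b]
    weights = [rho ** abs(a - b) for a, b in moves]
    src, dst = rng.choices(moves, weights=weights)[0]
    rest = order[:src] + order[src + 1:]
    return rest[:dst] + [order[src]] + rest[dst:]


rng = random.Random(1)
print(propose(list("ABCDEFGH"), rng=rng))
```

`random.choices` normalizes the weights itself, which plays the role of the constant $C$. Setting `rho=1.0` recovers the uniform neighborhood.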

Implementation: The core algorithms in TMAP are implemented in C. There is a command-line interface for Unix and a Java graphical interface that has been tested on Solaris, Linux, Windows, and Mac OS X.

Validation: We tested TMAP using data from 94 progeny of a cross in Vitis vinifera, which were genotyped at 1006 markers (Troggio et al. 2007, accompanying article in this issue), as well as simulated data sets. Two facets of the program were assessed: first, the likelihood model for compensating for genotyping errors; second, the Monte Carlo search algorithm for finding optimal solutions.

To test the ability of the error model to counteract the inflationary effect of genotyping errors, we performed the simple experiment of removing every other marker in each linkage group and measuring the change in the linkage group's size. In the presence of uncompensated errors, removing markers will cause the distances to shrink because there will be fewer apparent double recombinations, but not if the errors are properly compensated. First, we used the Monte Carlo algorithm to determine the maximum-likelihood order of each group. Then, we computed the size of each group and the size of each group after removing every other marker. Finally, we modified TMAP to not take errors into account and repeated the last step.

In some cases, we observed that error compensation also improved the ordering. Both with and without compensation, markers with many errors tend to be placed at the ends of the linkage groups, because they do not fit well anywhere in the middle. However, with error compensation, this effect is less pronounced.

To verify this phenomenon, we simulated a backcross pedigree consisting of 19 markers and 94 individuals with a distance of 5 cM between adjacent markers and 5% of the genotypes missing. We added a varying amount of simulated errors to the 10th marker. Then, we ordered the markers using three methods: TMAP, the modified version that did not compensate for errors, and a version that assumed a fixed error rate of 2%, similar to MapMaker and R/qtl (Lincoln and Lander 1992; Broman et al. 2003).
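A data set of this shape can be generated in a few lines. The sketch below is not the authors' simulator: it assumes one informative meiosis per backcross individual, converts the 5 cM spacing to a recombination fraction with the inverse Kosambi function, applies missingness independently to every genotype, and uses a hypothetical function name, default error rate, and seed.

```python
import math
import random


def simulate_backcross(n_markers=19, n_individuals=94, spacing_cm=5.0,
                       missing_rate=0.05, error_marker=9, error_rate=0.10,
                       seed=0):
    """Return a list of per-individual genotype rows (0, 1, or None).

    error_marker is a zero-based index, so 9 is the 10th marker.
    """
    rng = random.Random(seed)
    # Assumption: Kosambi inverse to turn map distance into theta.
    theta = 0.5 * math.tanh(spacing_cm / 50.0)
    data = []
    for _ in range(n_individuals):
        state = rng.randint(0, 1)  # grandparental origin at first marker
        row = []
        for i in range(n_markers):
            if i > 0 and rng.random() < theta:
                state = 1 - state  # recombination in this interval
            obs = state
            if i == error_marker and rng.random() < error_rate:
                obs = 1 - obs      # genotyping error
            if rng.random() < missing_rate:
                obs = None         # missing genotype
            row.append(obs)
        data.append(row)
    return data


data = simulate_backcross(error_rate=0.10, seed=42)
print(len(data), len(data[0]))
```

Varying `error_rate` while keeping the other settings fixed reproduces the design of the experiment described above, in which only the 10th marker carries errors.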

To validate the parameters in the Monte Carlo iterative improvement algorithm we experimented with many variant parameters. First, we used a long run of the improving algorithm to determine the maximum likelihood, or at least a close approximation of it, for each linkage group of the grapevine data. Then, for a variety of parameters, the Monte Carlo improvement algorithm was applied to each linkage group until the log likelihood was within 0.1 of the optimum or until a maximum number of iterations was reached. This operation was repeated 10 times for each set of parameters, and we recorded the average number of iterations required.

RESULTS

Error model: The results of removing every other marker from linkage groups in the Vitis data set are shown in Figure 3. Without error compensation, the linkage groups always decreased in size when markers were removed, and, furthermore, the sizes before and after removal were only weakly correlated, but with error compensation the sizes typically remained very consistent.

Figure 4 shows the proportion of incorrect placements of a marker with a varying error rate. The results show that the error compensation method helps correctly position markers with significant error rates. Furthermore, the plot underestimates the relative accuracy of error compensation, because, with error compensation, many of the incorrect placements were only one or two positions away from the correct position, but without error compensation most of the incorrect placements were at the ends of the group.

Monte Carlo parameters: Figure 5 shows the effect of removing one class of permutations on the time to converge to an optimal solution. Each point represents a single linkage group. On the x-axis is the average number of steps needed to converge using the standard parameter set, and on the y-axis is the average number of steps needed to converge for a variant that had one of the two permutation types (flips or moves) disabled. On some linkage groups, the optimization performed poorly with only one of the permutation types, justifying the inclusion of both. Note that in some of these cases the maximum number of iterations was reached before convergence, so this plot underestimates the difference between the parameter choices.

Figure 2. Illustration of the two types of permutations used in the marker-ordering algorithm: moves (left) and flips (right). Each square represents a single marker.

Similarly, we experimented with varying the parameter $\rho$ for one or both permutation types and the temperature $T$ to arrive at our choices for these parameters, although the differences are less dramatic. In particular, convergence was slower with $\rho = 1$, justifying the nonuniform distribution of permutations.

Error rate distribution: The distribution of the nonzero error rates in the Vitis data set is shown in Figure 6. Among the markers with nonzero errors, most have an error rate of <5%. Without error compensation, the cumulative effect of these markers would be to inflate the map distances, but to remove all of them would significantly reduce the usefulness of the map. Furthermore, an additional 67% of the markers had an estimated error rate of exactly 0%. In these cases, the error-compensating likelihood model reduces to the traditional one, and there is no loss of information. Finally, the distribution clearly shows that the error rate is not the same for all markers, which has been the assumption in all previous models of genotyping errors.

There are a handful of markers with error rates in the range 15–35%. Their presence did not significantly affect the other markers in their linkage groups, so we did not remove them from the map. These markers with high error rates are analogous to phenotypes with incomplete penetrance. The error rate reduces the informativeness of the markers, but it is still possible to localize them to a specific area of the linkage group.

Figure 3. Effect on linkage group size of removing every other marker both with and without compensation for errors. Error compensation leads to more consistent genetic distances.

Figure 4. Simulation of the effect of errors on marker ordering. In a linkage group of 19 markers, the 10th marker was simulated with errors, and the markers were ordered, using three different likelihood models. The first uses TMAP with the error model described in this article. The second uses a version of TMAP that assumes a fixed error rate of 2% for every marker. The third does not model any error at all.

Figure 5. Effect of removing one of the two permutation types on the speed of convergence to the correct order.

Figure 6. Distribution of nonzero error rates in the Vitis data set. In addition, 625 markers (67%) had an estimated error rate of exactly 0%.

We extended the analogy between markers with high error rates and phenotypes in linkage analysis to estimate the accuracy of the positions of these markers. In linkage analysis, the range of positions with log likelihood 1 unit less than the maximum log likelihood measures the uncertainty in a marker's position. For each marker, a similar analysis was performed by holding the rest of the linkage group fixed and computing the log likelihood with the marker positioned every 0.1 cM along the length of the linkage group. The error rate and the size of the 1-unit-down interval for each marker are plotted in Figure 7. In general, markers with higher error rates are localized less precisely in the linkage group. However, even for the markers with the largest error rates, the 1-unit-down interval was never >21 cM.
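Given a log-likelihood profile computed on a grid of positions, the 1-unit-down interval can be read off directly. The sketch below is illustrative: the function name is hypothetical, the profile values are made up for the example, and when the profile has several separate regions within one unit of the maximum it reports the outermost bounds, which is an assumption of this sketch. The base of the logarithm is whatever the caller used to compute the profile.

```python
def one_unit_down_interval(positions_cm, logliks, drop=1.0):
    """Return (low, high) bounds of the positions whose log likelihood
    is within `drop` units of the maximum."""
    best = max(logliks)
    inside = [p for p, ll in zip(positions_cm, logliks) if ll >= best - drop]
    return min(inside), max(inside)


# Made-up profile sampled every 1 cM for illustration.
positions = [0.0, 1.0, 2.0, 3.0, 4.0]
profile = [-5.0, -3.5, -3.0, -3.8, -6.0]
low, high = one_unit_down_interval(positions, profile)
print(low, high, high - low)
```

With a 0.1 cM grid, as used in the text, the interval size is resolved to the nearest 0.1 cM.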

DISCUSSION

We have defined our error model to be the same as the recombination model. This means that we treat the correct genotyping of the haplotype from the mother and of the haplotype from the father as independent events. An alternative error model would be to treat each individual's genotype as a whole as either correct or incorrect. However, a different error model would remove the symmetry between the recombination fraction of a terminal marker and the error of the adjacent marker for many, but not all, segregation types. Thus, the relative position of these two markers would be decided by the likelihoods and not by the error-minimizing rule above. Furthermore, the processes that cause genotyping errors are more likely to produce errors in only one haplotype than in both. For example, it is more likely to misread an AA genotype as AB than as BB.

More complex classes of genotyping errors are not detected by this model. For example, in one linkage group of the Vitis data, there was a pair of markers that each had the same set of errors in their genotype data. Because the genotypes from each marker seemed to confirm the genotypes from the other, the method did not detect the errors. However, there were large gaps on either side of the pair, and removing either one caused the gaps to disappear and be absorbed in the error rate of the remaining marker. This linkage group gave rise to one of the outliers in Figure 3.

CarthaGène and GMendel have both previously applied Monte Carlo techniques to the marker ordering problem. CarthaGène uses a neighborhood consisting of flips and a permutation based on a 3-change that moves whole blocks of markers at a time, but does not bias either permutation toward smaller changes. GMendel only swaps pairs of markers and does include a bias toward nearby markers that is active only during the later phases of the improvement. However, as our results show, both a richer neighborhood and a bias toward small-scale permutations improve convergence.

We have used only two temperatures in our Monte Carlo improving algorithm, rather than the more common steady decrease in temperature used in simulated annealing. Simulated annealing starts with a high initial temperature that effectively randomizes the marker order. Thus, it is not possible to take advantage of the result of the incremental ordering algorithm as a starting point. However, we found that the incremental algorithm can often quickly find good approximate solutions, so we chose a Monte Carlo algorithm that could take advantage of this.

We have shown that genotyping errors can be accommodated by a simple extension to the mapping-likelihood model, which gives a more accurate marker order and, especially, more accurate distances.

This work was supported by the "Grapevine Physical Mapping" and "A.M.I.CA. Vitis" projects funded by the Provincia Autonoma di Trento.

Figure 7. Comparison of the estimated marker error rates and the size of the 1-unit-down intervals. The 1-unit-down intervals are computed by placing the marker at regular steps along the length of the linkage group and computing the interval where the log likelihood is 1 unit less than the maximum. These approximate the 90% confidence intervals for the marker's position.

LITERATURE CITED

Abkevich, V., N. J. Camp, A. Gutin, J. Farnham, L. Cannon-Albright et al., 2001 A robust multipoint linkage statistic (tlod) for mapping complex trait loci. Genet. Epidemiol. 21(Suppl. 1): S492–S497.

Broman, K. W., H. Wu, S. Sen and G. A. Churchill, 2003 R/qtl: QTL mapping in experimental crosses. Bioinformatics 19: 889–890.

Castiglione, P., C. Pozzi, M. Heun, V. Terzi, K. J. Müller et al., 1998 An AFLP-based procedure for the efficient mapping of mutations and DNA probes in barley. Genetics 149: 2039–2056.

de Givry, S., M. Bouchez, P. Chabrier, D. Milan and T. Schiex, 2005 CarthaGène: multipopulation integrated genetic and radiation hybrid mapping. Bioinformatics 21: 1703–1704.

Douglas, J. A., M. Boehnke and K. Lange, 2000 A multipoint method for detecting genotyping errors and mutations in sibling-pair linkage data. Am. J. Hum. Genet. 66: 1287–1297.

Echt, C., S. Knapp and B.-H. Liu, 1992 Genome mapping with non-inbred crosses using GMendel 2.0. Maize Genet. Coop. Newsl. 66: 27–29.

Göring, H. H., and J. D. Terwilliger, 2000 Linkage analysis in the presence of errors I: complex-valued recombination fractions and complex phenotypes. Am. J. Hum. Genet. 66: 1095–1106.

Green, P., K. Falls and S. Crooks, 1990 CRI-MAP Documentation, Version 2.4. Washington University School of Medicine, St. Louis.

Kirkpatrick, S., C. D. Gelatt Jr. and M. P. Vecchi, 1983 Optimization by simulated annealing. Science 220: 671–680.

Lander, E. S., and P. Green, 1987 Construction of multilocus genetic linkage maps in humans. Proc. Natl. Acad. Sci. USA 84: 2363–2367.

Lander, E. S., P. Green, J. Abrahamson, A. Barlow, M. J. Daly et al., 1987 MAPMAKER: an interactive computer package for constructing primary genetic linkage maps of experimental and natural populations. Genomics 1: 174–181.

Lincoln, S. E., and E. S. Lander, 1992 Systematic detection of errors in genetic linkage data. Genomics 14: 604–610.

Rosa, G. J. M., B. S. Yandell and D. Gianola, 2002 A Bayesian approach for constructing genetic maps when markers are miscoded. Genet. Sel. Evol. 34: 353–369.

Schiex, T., and C. Gaspin, 1997 CarthaGène: constructing and joining maximum likelihood genetic maps. Proceedings of Intelligent Systems of Molecular Biology '97, June 1997, Halkidiki, Greece.

Stam, P., 1993 Construction of integrated genetic linkage maps by means of a new computer package: JoinMap. Plant J. 3: 739–744.

Thallman, R. M., G. L. Bennet, J. W. Keele and S. M. Kappes, 2001 Efficient computation of genotype probabilities for loci with many alleles: II. Iterative method for large, complex pedigrees. J. Anim. Sci. 79: 34–44.

Troggio, M., G. Malacarne, G. Coppola, C. Segala, D. A. Cartwright et al., 2007 A dense single-nucleotide polymorphism-based genetic linkage map of grapevine (Vitis vinifera L.) anchoring Pinot noir bacterial artificial chromosome contigs. Genetics 176: 2637–2650.

van Os, H., P. Stam, R. G. F. Visser and H. J. van Eck, 2005a RECORD: a novel method for ordering loci on a genetic linkage map. Theor. Appl. Genet. 112: 30–40.

van Os, H., P. Stam, R. G. F. Visser and H. J. van Eck, 2005b SMOOTH: a statistical method for successful removal of genotyping errors from high-density genetic linkage data. Theor. Appl. Genet. 112: 187–194.

Communicating editor: R. W. Doerge

Ultra-High-Density Genetic Maps of Jatropha curcas × Jatropha integerrima and Anchoring Jatropha curcas Genome Assembly Scaffolds

Article

Full-text available

Sep 2023

Genetic maps facilitate an understanding of genome organization and the mapping of genes and QTLs for traits of interest. Our objective was to develop a high-density genetic map of Jatropha and anchoring scaffolds from genome assemblies. We developed two ultra-high-density genetic linkage maps of Jatropha curcas × Jatropha intergerrima using a backcross (BC1) population using SNP, AFLP and SSR markers. First, SNPs were identified through genotyping-by-sequencing (GBS). The polymorphic SNPs were mapped to 3267 Jat_r4.5 scaffolds and 484 Wu_JatCur_1.0 scaffolds, and then these genomic scaffolds were mapped/anchored to the genetic linkage groups along with the AFLP and SSR markers for each genome assembly separately. We successfully mapped 7284 polymorphic SNPs, and 54 AFLP and SSR markers on 11 linkage groups using the Jat_r4.5 genomic scaffolds, resulting in a genome length of 1088 cM and an average marker interval of 0.71 cM. We mapped 7698 polymorphic SNPs, and 99 AFLP and SSR markers on 11 linkage groups using the Wu_JatCur_1.0 genomic scaffolds, resulting in a genome length of 870 cM and an average marker interval of 1.67 cM. The mapped SNPs were annotated to various regions of the genome, including exon, intron and intergenic regions. We developed two ultra-high-density linkage maps anchoring a high number of genome scaffolds to linkage groups, which provide an important resource for the structural and functional genomics as well as for molecular breeding of Jatropha while also serving as a framework for assembling and ordering whole genome scaffolds.

Smooth Descent: A ploidy-aware algorithm to improve linkage mapping in the presence of genotyping errors

Article

Mar 2023

Linkage mapping is an approach to order markers based on recombination events. Mapping algorithms cannot easily handle genotyping errors, which are common in high-throughput genotyping data. To solve this issue, strategies have been developed, aimed mostly at identifying and eliminating these errors. One such strategy is SMOOTH, an iterative algorithm to detect genotyping errors. Unlike other approaches, SMOOTH can also be used to impute the most probable alternative genotypes, but its application is limited to diploid species and to markers heterozygous in only one of the parents. In this study we adapted SMOOTH to expand its use to any marker type and to autopolyploids with the use of identity-by-descent probabilities, naming the updated algorithm Smooth Descent (SD). We applied SD to real and simulated data, showing that in the presence of genotyping errors this method produces better genetic maps in terms of marker order and map length. SD is particularly useful for error rates between 5% and 20% and when error rates are not homogeneous among markers or individuals. With a starting error rate of 10%, SD reduced it to ∼5% in diploids, ∼7% in tetraploids and ∼8.5% in hexaploids. Conversely, the correlation between true and estimated genetic maps increased by 0.03 in tetraploids and by 0.2 in hexaploids, while worsening slightly in diploids (∼0.0011). We also show that the combination of genotype curation and map re-estimation allowed us to obtain better genetic maps while correcting wrong genotypes. We have implemented this algorithm in the R package Smooth Descent.

Improving precision and accuracy of genetic mapping with genotyping‐by‐sequencing data in outcrossing species

Article

Jun 2024

Genotyping‐by‐sequencing (GBS) is a widely used strategy for obtaining large numbers of genetic markers in model and non‐model organisms. In crop plants, GBS‐derived marker datasets are frequently used to perform quantitative trait locus (QTL) mapping. In some plant species, however, high heterozygosity and complex genome structure mean that researchers must use care in handling GBS data to conduct QTL mapping most effectively. Such outbred crops include most of the perennial grass and tree species used for bioenergy. To identify strategies for increasing accuracy and precision of QTL mapping using GBS data in outbred crops, we conducted an empirical study of SNP‐calling and genetic map‐building pipeline parameters in a Miscanthus sinensis population, and a complementary simulation study to estimate the relationship between genome‐wide error rate, read depth, and marker number. The bioenergy grass Miscanthus is an obligate outcrossing species with a recent (diploidized) whole‐genome duplication. For the study of empirical M. sinensis data, we compared two SNP‐calling methods (one non‐reference‐based and one reference‐based), a series of depth filters (12×, 20×, 30×, and 40×) and two map‐construction methods (i.e., marker ordering: linkage‐only and order‐corrected based on a reference genome). We found that correcting the order of markers on a linkage map by using a high‐quality reference genome improved QTL precision (shorter confidence intervals). For typical GBS datasets of between 1000 and 5000 markers to build a genetic map for biparental populations, a depth filter set at 30× to 40× applied to outbred populations provided a genome‐wide genotype‐calling error rate of less than 1%, improved accuracy of QTL point estimates and minimized type I errors for identifying QTL. Based on these results, we recommend using a reference genome to correct the marker order of genetic maps and a robust genotype depth filter to improve QTL mapping for outbred crops.

Simultaneous estimation of genotype error and uncalled deletion rates in whole genome sequence data

Article

May 2024
PLOS GENET

Effect of genotyping errors on linkage map construction based on repeated chip analysis of two recombinant inbred line populations in wheat (Triticum aestivum L.)

Article

Apr 2024
BMC PLANT BIOL

Linkage maps are essential for genetic mapping of phenotypic traits, gene map-based cloning, and marker-assisted selection in breeding applications. Construction of a high-quality saturated map requires high-quality genotypic data on a large number of molecular markers. Errors in genotyping cannot be completely avoided, no matter what platform is used. When genotyping error reaches a threshold level, it will seriously affect the accuracy of the constructed map and the reliability of consequent genetic studies. In this study, repeated genotyping of two recombinant inbred line (RIL) populations derived from crosses Yangxiaomai × Zhongyou 9507 and Jingshuang 16 × Bainong 64 was used to investigate the effect of genotyping errors on linkage map construction. Inconsistent data points between the two replications were regarded as genotyping errors, which were classified into three types. Genotyping errors were treated as missing values, thereby generating the non-erroneous data set. Firstly, linkage maps were constructed using the two replicates as well as the non-erroneous data set. Secondly, error correction methods implemented in the software packages QTL IciMapping (EC) and Genotype-Corrector (GC) were applied to the two replicates. Linkage maps were then constructed based on the corrected genotypes and compared with those from the non-erroneous data set. A simulation study was performed by considering different levels of genotyping errors to investigate the impact of errors and the accuracy of error correction methods. Results indicated that map length and marker order differed among the two replicates and the non-erroneous data sets in both RIL populations. For both actual and simulated populations, map length expanded as the error rate increased, and the correlation coefficient between linkage and physical maps became lower. Map quality can be improved by repeated genotyping and error correction algorithms. When it is impossible to genotype the whole mapping population repeatedly, a proportion of 30% is recommended for repeated genotyping. The EC method had a much lower false positive rate than did the GC method under different error rates. This study systematically expounded the impact of genotyping errors on linkage analysis, providing potential guidelines for improving the accuracy of linkage maps in the presence of genotyping errors.

Simultaneous estimation of genotype error and uncalled deletion rates in whole genome sequence data

Preprint

Feb 2024

Genotype data include errors that may influence conclusions reached by downstream statistical analyses. Previous studies have estimated genotype error rates from discrepancies in human pedigree data, such as Mendelian inconsistent genotypes or apparent phase violations. However, uncalled deletions, which generally have not been accounted for in these studies, can lead to biased error rate estimates. In this study, we propose a genotype error model that considers both genotype errors and uncalled deletions when calculating the likelihood of the observed genotypes in parent-offspring trios. Using simulations, we show that when there are uncalled deletions, our model produces genotype error rate estimates that are less biased than estimates from a model that does not account for these deletions. We applied our model to SNVs in 77 sequenced White British parent-offspring trios in the UK Biobank. We use the Akaike information criterion to show that our model fits the data better than a model that does not account for uncalled deletions. We estimate the genotype error rate at SNVs with minor allele frequency > 0.001 in these data to be 3.2 × 10⁻⁴ (90% CI: [2.8 × 10⁻⁴, 6.2 × 10⁻⁴]). We estimate that 77% of the genotype errors at these markers are attributable to uncalled deletions (90% CI: [73%, 88%]).
Author summary: A genotype error occurs when the genotype identified through molecular analysis does not match the actual genotype of the individual being analyzed. Because genotype errors can influence downstream statistical results, previous studies have attempted to estimate the rate of genotype errors in a study sample. However, uncalled deletions, which generally have not been accounted for in these studies, can lead to biased error rate estimates. In this study, we formulate a model adjusting for uncalled deletions when estimating genotype error rates. We show that when uncalled deletions are present, this model results in less biased estimates of genotype error rates compared to a model that does not adjust for uncalled deletions. We apply this model to SNVs in 77 sequenced White British parent-offspring trios in the UK Biobank and estimate the genotype error rate and the proportion of genotype errors that are attributable to uncalled deletions at SNVs with minor allele frequency > 0.001.

QTL mapping of the narrow-branch “Pendula” phenotype in Norway spruce (Picea abies L. Karst.)

Article

May 2023
TREE GENET GENOMES

Pendula-phenotyped Norway spruce has a potential forestry interest for high-density plantations. This phenotype is believed to be caused by a dominant single mutation. Despite the availability of RAPD markers linked to the trait, the nature of the mutation is still unknown. We performed a quantitative trait loci (QTL) mapping based on two different progenies of F1 crosses between pendula and normal crowned trees using NGS technologies. Approximately 25% of all gene-bearing scaffolds of Picea abies genome assembly v1.0 were mapped to 12 linkage groups and a single QTL, positioned near the center of LG VI, was found in both crosses. The closest probe markers placed on the maps were positioned 0.82 cM and 0.48 cM away from the Pendula marker in two independent pendula-crowned × normal-crowned wild-type crosses, respectively. We have identified genes close to the QTL region with differential mutations on coding regions and discussed their potential role in changing branch architecture.

Unique Salt-Tolerance-Related QTLs, Evolved in Vigna riukiuensis (Na+ Includer) and V. nakashimae (Na+ Excluder), Shed Light on the Development of Super-Salt-Tolerant Azuki Bean (V. angularis) Cultivars

Article

Apr 2023

Wild relatives of crops have the potential to improve food crops, especially in terms of improving abiotic stress tolerance. Two closely related wild species of the traditional East Asian legume crops, Azuki bean (Vigna angularis), V. riukiuensis “Tojinbaka” and V. nakashimae “Ukushima” were shown to have much higher levels of salt tolerance than azuki beans. To identify the genomic regions responsible for salt tolerance in “Tojinbaka” and “Ukushima”, three interspecific hybrids were developed: (A) azuki bean cultivar “Kyoto Dainagon” × “Tojinbaka”, (B) “Kyoto Dainagon” × “Ukushima” and (C) “Ukushima” × “Tojinbaka”. Linkage maps were developed using SSR or restriction-site-associated DNA markers. There were three QTLs for “percentage of wilt leaves” in populations A, B and C, while populations A and B had three QTLs and population C had two QTLs for “days to wilt”. In population C, four QTLs were detected for Na+ concentration in the primary leaf. Among the F2 individuals in population C, 24% showed higher salt tolerance than both wild parents, suggesting that the salt tolerance of azuki beans can be further improved by combining the QTL alleles of the two wild relatives. The marker information would facilitate the transfer of salt tolerance alleles from “Tojinbaka” and “Ukushima” to azuki beans.

Genetic basis of maize kernel protein content revealed by high-density bin mapping using recombinant inbred lines

Article

Dec 2022

Maize with a high kernel protein content (PC) is desirable for human food and livestock fodder. However, improvements in its PC have been hampered by a lack of desirable molecular markers. To identify quantitative trait loci (QTL) and candidate genes for kernel PC, we employed a genotyping-by-sequencing strategy to construct a high-resolution linkage map with 6,433 bin markers for 275 recombinant inbred lines (RILs) derived from a high-PC female Ji846 and low-PC male Ye3189. The total genetic distance covered by the linkage map was 2180.93 cM, and the average distance between adjacent markers was 0.32 cM, with a physical distance of approximately 0.37 Mb. Using this linkage map, 11 QTLs affecting kernel PC were identified, including qPC7 and qPC2-2, which were identified in at least two environments. For the qPC2-2 locus, a marker named IndelPC2-2 was developed with closely linked polymorphisms in both parents, and when tested in 30 high and 30 low PC inbred lines, it showed significant differences (P = 1.9E-03). To identify the candidate genes for this locus, transcriptome sequencing data and PC best linear unbiased estimates (BLUE) for 348 inbred lines were combined, and the expression levels of the four genes were correlated with PC. Among the four genes, Zm00001d002625, which encodes an S-adenosyl-L-methionine-dependent methyltransferase superfamily protein, showed significantly different expression levels between two RIL parents in the endosperm and is speculated to be a potential candidate gene for qPC2-2. This study will contribute to further research on the mechanisms underlying the regulation of maize PC, while also providing a genetic basis for marker-assisted selection in the future.

Drought stress tolerance in wheat: Recent QTL mapping advances

Chapter

Jan 2023

Wheat belongs to the grass family and is one of the most widely cultivated field crops worldwide. It is considered a major crop for meeting the food and nutrition requirements of a rapidly increasing population, and therefore helps to meet the challenges of global food security. However, climate change is increasing the spells of abiotic stress, of which drought is the most prevalent and damaging stress factor affecting the overall production and nutritional value of wheat globally. To cope with this problem, more resilient and stress-tolerant wheat genotypes are required to fulfill the world's food demand. Advances in molecular breeding technologies provide an efficient way forward to improve wheat. More robust and economical sequencing coupled with quantitative trait loci (QTL) mapping has led to the discovery of novel drought-tolerant alleles/genes. These unique QTLs can be used in breeding programs to develop drought-tolerant wheat genotypes.

A Bayesian approach for constructing genetic maps when markers are miscoded

Article

May 2002

The advent of molecular markers has created opportunities for a better understanding of quantitative inheritance and for developing novel strategies for genetic improvement of agricultural species, using information on quantitative trait loci (QTL). A QTL analysis relies on accurate genetic marker maps. At present, most statistical methods used for map construction ignore the fact that molecular data may be read with error. Often, however, there is ambiguity about some marker genotypes. A Bayesian MCMC approach for inferences about a genetic marker map when random miscoding of genotypes occurs is presented, and simulated and real data sets are analyzed. The results suggest that unless there is strong reason to believe that genotypes are ascertained without error, the proposed approach provides more reliable inference on the genetic map.

CARTHAGENE: Multipopulation integrated genetic and radiation hybrid mapping

Article

May 2005

CarthaGene is an integrated genetic and radiation hybrid (RH) mapping tool which can deal with multiple populations, including mixtures of genetic and RH data. CarthaGene performs multipoint maximum likelihood estimations with accelerated expectation–maximization algorithms for some pedigrees and has sophisticated algorithms for marker ordering. Dedicated heuristics for framework mapping are also included. CarthaGene can be used as a C++ library, through a shell command and a graphical interface. The XML output for companion tools is integrated. Availability: The program is available free of charge from www.inra.fr/bia/T/CarthaGene for Linux, Windows and Solaris machines (with Open Source). Contact: tschiex{at}toulouse.inra.fr

Optimization by Simulated Annealing

Article

Jan 1983

There is a deep and useful connection between statistical mechanics (the behavior of systems with many degrees of freedom in thermal equilibrium at a finite temperature) and multivariate or combinatorial optimization (finding the minimum of a given function depending on many parameters). A detailed analogy with annealing in solids provides a framework for optimization of the properties of very large and complex systems. This connection to statistical mechanics exposes new information and provides an unfamiliar perspective on traditional optimization problems and methods.

Corrigendum to “MAPMAKER: An interactive computer package for constructing primary genetic linkage maps of experimental and natural populations” [Genomics 1 (1987) 174–181]

Article

Apr 2009

Construction of Integrated Genetic-Linkage Maps by Means of a New Computer Package - Joinmap

Article

Feb 2005
PLANT J

P. Stam

A computerized procedure to construct integrated genetic maps is presented. The computer program (JoinMap) can handle raw data from F2s, backcrosses and recombinant inbred lines, as well as listed pair‐wise recombination frequencies. The procedure is useful for combining linkage data that have been collected in different experiments; the result is a mathematical alignment of the distinct genetic maps. Data from single experiments can be dealt with as well. In view of the fast growing amount of linkage information for molecular markers, which is often being generated by different research groups, integrated maps provide useful information on the map position of genes and DNA markers. The procedure performs a sequential build‐up of the map and, at each step, a numerical search for the best fitting order of markers. Weighted least squares is used for the estimation of map distances.

Construction of integrated genetic linkage maps by means of a new computer package: JOINMAP

Article

Nov 1992

P. Stam

A computerized procedure to construct integrated genetic maps is presented. The computer program (JOINMAP) can handle raw data from F2s, backcrosses and recombinant inbred lines, as well as listed pair-wise recombination frequencies. The procedure is useful for combining linkage data that have been collected in different experiments; the result is a mathematical alignment of the distinct genetic maps. Data from single experiments can be dealt with as well. In view of the fast growing amount of linkage information for molecular markers, which is often being generated by different research groups, integrated maps provide useful information on the map position of genes and DNA markers. The procedure performs a sequential build-up of the map and, at each step, a numerical search for the best fitting order of markers. Weighted least squares is used for the estimation of map distances.

Lincoln SE, Lander ES. Systematic detection of errors in genetic linkage data. Genomics 14: 604-610

Article

Dec 1992

Construction of dense genetic linkage maps is hampered, in practice, by the occurrence of laboratory typing errors. Even relatively low error rates cause substantial map expansion and interfere with the determination of correct genetic order. Here, we describe a systematic method for overcoming these difficulties, based on incorporating the possibility of error into the usual likelihood model for linkage analysis. Using this approach, it is possible to construct genetic maps allowing for error and to identify the typings most likely to be in error. The method has been implemented for F2 intercrosses between two inbred strains, a situation relevant to the construction of genetic maps in experimental organisms. Tests involving both simulated and real data are presented, showing that the method detects the vast majority of errors.

Construction of Multilocus Genetic Linkage Maps in Humans

Article

May 1987

Human genetic linkage maps are most accurately constructed by using information from many loci simultaneously. Traditional methods for such multilocus linkage analysis are computationally prohibitive in general, even with supercomputers. The problem has acquired practical importance because of the current international collaboration aimed at constructing a complete human linkage map of DNA markers through the study of three-generation pedigrees. We describe here several alternative algorithms for constructing human linkage maps given a specified gene order. One method allows maximum-likelihood multilocus linkage maps for dozens of DNA markers in such three-generation pedigrees to be constructed in minutes.

Lander, E. S., Green, P., Abrahamson, J., Barlow, A., Daly, M. J., Lincoln, S. E. and Newburg, L. MAPMAKER: An Interactive Computer Package for Constructing Primary Genetic Linkage Maps of Experimental and Natural Populations. Genomics 1: 174-181

Article

Nov 1987

With the advent of RFLPs, genetic linkage maps are now being assembled for a number of organisms including both inbred experimental populations such as maize and outbred natural populations such as humans. Accurate construction of such genetic maps requires multipoint linkage analysis of particular types of pedigrees. We describe here a computer package, called MAPMAKER, designed specifically for this purpose. The program uses an efficient algorithm that allows simultaneous multipoint analysis of any number of loci. MAPMAKER also includes an interactive command language that makes it easy for a geneticist to explore linkage data. MAPMAKER has been applied to the construction of linkage maps in a number of organisms, including the human and several plants, and we outline the mapping strategies that have been used.

CARTHAGENE: constructing and joining maximum likelihood genetic maps

Article

Feb 1997

Genetic mapping is an important step in the study of any organism. An accurate genetic map is extremely valuable for locating genes or more generally either qualitative or quantitative trait loci (QTL). This paper presents a new approach to two important problems in genetic mapping: automatically ordering markers to obtain a multipoint maximum likelihood map and building a multipoint maximum likelihood map using pooled data from several crosses. The approach is embodied in a hybrid algorithm that mixes the statistical optimization algorithm EM with local search techniques which have been developed in the artificial intelligence and operations research communities. An efficient implementation of the EM algorithm provides maximum likelihood recombination fractions, while the local search techniques look for orders that maximize this maximum likelihood. The specificity of the approach lies in the neighborhood structure used in the local search algorithms, which has been inspired by an analogy between the marker ordering problem and the famous traveling salesman problem. The approach has been used to build joined maps for the wasp Trichogramma brassicae and on random pooled data sets. In both cases, it compares quite favorably with existing software insofar as maximum likelihood is considered a significant criterion.
