A layout of five clones in a contig based on their actual genome coordinates 


Source publication
Conference Paper
We study the problem of selecting the minimum tiling path (MTP) from a set of clones arranged in a physical map. We formulate the constraints of the MTP problem in a graph theoretical framework, and we derive an optimization problem that is solved via integer linear programming. Experimental results show that when we compare our algorithm to the...

Contexts in source publication

Context 1
... algorithm is used to partition it by removing the weakest set of edges (i.e., minimum total edge weight) [11]. After a component is partitioned into two subgraphs, we check both subgraphs against the three conditions. If at least one is satisfied, we partition that component again. This iterative process terminates as soon as there are no more components that need to be split.
Elimination of spurious components. In order to introduce the notion of spurious components, let us assume for the time being that the exact locations of the clones in a contig are known. Let U be a connected component of G, and let N(U) be the corresponding connected component cloneset. We call U spurious if the overlapping region for the clones in N(U) is spanned by another clone in the contig. An example is illustrated in the Appendix. This step is crucial to reduce the number of redundant clones in the preliminary MTP produced by MTP-ILP. Recall that MTP-ILP selects the smallest set of clones that cover each connected component in the MFG. If the spurious components are not removed, at least one clone in each spurious component is added to the MTP without contributing to the overall coverage. Since the exact ordering of the clones is not available, we employ a probabilistic method to detect spurious components. More specifically, we compute the probability that a connected component is observed purely by chance. If this probability is high then we mark the component as spurious. We expect the number of spurious components to be low, since missing fragments or extra fragments are unlikely to occur frequently for a specific cloneset. In our model, we assume that fragment sizes are distributed uniformly, and that the probability that two clones share a fragment is P = 2T / gellen [19], where T is the tolerance and gellen is the number of possible fragment size values.
The probability that two clones share f fragments is P^f, and the probability that c clones share f fragments is P^{(c − 1)f}. This suggests that the probability that a connected component U is observed in the MFG is P^{(|U| − 1)f}, where f is the number of times that N(U) forms a connected component. In our algorithm, if (|U| − 1)f is lower than a threshold Q, then we mark U as spurious. Spurious components are removed from the MFG permanently.
Removing clones that are buried into multiple clones. Recall that the goal of the first step in the construction of the MFG is to bury clones into other clones to reduce the problem size. In this step, we reduce the problem size further by removing clones that are buried into multiple clones. For example, in Figure 3 in the Appendix, clone C is buried into B ∪ D, thus C can be removed. For each clone in a contig, we compute the ratio between the number of its fragments in the MFG and the total number of its fragments. If at least B% of its fragments are already in the MFG, then we mark that clone buried. However, there is a complication. It is possible that two clones are mutually buried and therefore we cannot remove both. In order to solve this problem, we first identify all possible candidates that can be buried into multiple clones. Then, we examine each connected component U of G, and save N(U) in a list L if all clones in N(U) are candidates to be buried. All candidates that do not exist in any N(U) ∈ L can be buried.
However, we need to make sure that at least one clone in each N(U) ∈ L is not buried, otherwise the region in the target genome that is represented by U is not covered. So the task at hand is to remove as many candidate clones as possible with the constraint that at least one clone survives from each cloneset N(U) ∈ L. This problem is called minimum hitting set and it is NP-complete (polynomial reduction from the vertex cover problem [8]). Here, we solve it sub-optimally using a greedy approach [8]. At each iteration, we (1) select the clone c that occurs in the maximum number of clone sets in L, (2) remove all clone sets that contain c, (3) save c into minimum hitting set H, and (4) repeat. The iterative process terminates when the list L is empty. Buried clones selected in this way are removed from the MFG permanently. If this removal introduces components with only one node, they are also removed from the MFG permanently. The algorithm is summarized as Algorithm 1 in the Appendix.
Adding mandatory clones to the MFG. Clones that have to be selected as MTP clones are called mandatory clones. More specifically, if at least 50% of the fragments of a clone are not present in the MFG, then that clone is marked as mandatory. When a clone is marked as mandatory, it is immediately stored in the MTP and ignored for further analysis by removing all of its fragments from the MFG.
Recall that the MTP can be computed by selecting the smallest set of clones that cover all connected components of the MFG. This problem is a special case of the minimum hitting set problem. Although this problem is NP-complete on general graphs [8], it can be solved optimally in polynomial time on overlap (or interval) graphs [3]. Here, we solve this problem optimally by expressing it as an integer linear program (ILP), but first we remove any connected component from the MFG that does not affect the solution. We reduce the problem size by removing some of the connected components of G according to the following Lemma (the proof is immediate). This simplification step drastically reduces the execution time required to solve the ILP (about 200-fold for ...
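The greedy hitting-set heuristic described in this context (pick the clone that occurs in the most clone sets, drop the sets it hits, repeat until the list L is empty) can be illustrated with a minimal Python sketch. The function name, the clone labels, and the alphabetical tie-breaking rule are assumptions made for this illustration; they are not taken from the paper or from the FMTP source code.

```python
from collections import Counter


def greedy_hitting_set(clone_sets):
    """Greedy approximation of a minimum hitting set.

    clone_sets: iterable of sets of clone names (the list L in the text).
    Returns a set H containing at least one clone from every clone set.
    """
    remaining = [set(s) for s in clone_sets if s]
    hitting = set()
    while remaining:
        # (1) count in how many remaining clone sets each clone occurs
        counts = Counter(c for s in remaining for c in s)
        # ties are broken alphabetically (an assumption, for determinism)
        best = min(counts, key=lambda c: (-counts[c], c))
        # (3) save the chosen clone into H
        hitting.add(best)
        # (2) remove all clone sets that contain the chosen clone
        remaining = [s for s in remaining if best not in s]
    return hitting


# Hypothetical clone sets N(U) whose members are all burial candidates.
L = [{"A", "B"}, {"B", "C"}, {"C", "D"}, {"D"}]
H = greedy_hitting_set(L)
```

In this toy input the clones in H are the ones that must survive; under the scheme in the text, the remaining candidates (here A and C) could then be buried.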
Context 2
... to our experiments, B should be at least 80 to avoid false positive buried clones. If two clones can be buried into each other, the smaller of the two (i.e., the one with fewer fragments) is buried into the other one. FPC also buries clones during the process of building the physical map, but it does not discard them during MTP computation.
Building the preliminary MFG. First, we align the fingerprint data for each pair of overlapping clones. For each clone pair (c_i, c_j) for which S(c_i, c_j) ≤ C, we build a bipartite graph G_{i,j} = (L_i ∪ R_j, E_{i,j}), where L_i and R_j consist of the fragments of c_i and c_j, respectively, and E_{i,j} = {(u, v) | u ∈ L_i, v ∈ R_j such that |b(u) − b(v)| ≤ T}. In order to align clones c_i and c_j, we search for the maximum bipartite matching in G_{i,j}. The matching of maximum cardinality is found by solving max flow on the corresponding flow network [6]. Let M_{i,j} be the set of matched edges. For all clone pairs c_i and c_j for which S(c_i, c_j) ≤ C, the matching edges in M_{i,j} are used to create the (preliminary) matching fragment graph G. Specifically, for each edge (u, v) ∈ M_{i,j}, nodes u, v and edge (u, v) are added to G (unless they have been already added). The weight of (u, v) is set to be the negative logarithm of the Sulston score between clone c_i and clone c_j. The objective of the bipartite matching is to attempt to group together clone fragments that are located at the same location on the genome. Because of the noise in the fingerprint data, some of the matched fragments might not represent the same region in the target genome. In the following steps, we try to eliminate as many false matches as possible.
MFG pruning. In this step, some of the components of G that might represent more than one unique region in the target genome are split.
Specifically, we examine all the connected components of G and mark the ones that satisfy at least one of the following conditions as candidates.
1. Extra fragment: The connected component contains multiple fragments of a clone.
2. Unmatched fragments: The difference between the length of the second shortest and the second longest fragment in the connected component is more than the tolerance value T. We ignore the shortest and the longest fragments to allow two outliers per component.
3. Weak overlap: The connected component contains at least one pair of clones that are very unlikely to overlap (i.e., have Sulston score of at least 1e-2.5 for MTP-ILP or 1e-1 for MTP-MST).
For each candidate component, a min-cut algorithm is used to partition it by removing the weakest set of edges (i.e., minimum total edge weight) [11]. After a component is partitioned into two subgraphs, we check both subgraphs against the three conditions. If at least one is satisfied, we partition that component again. This iterative process terminates as soon as there are no more components that need to be split. ...
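The fragment alignment step in this context (build a bipartite graph with an edge wherever two fragment sizes differ by at most T, then take a maximum-cardinality matching) can be sketched in Python as follows. The source solves the matching via max flow; this sketch swaps in the simpler augmenting-path method (Kuhn's algorithm), which finds a matching of the same maximum cardinality. The function name and the fragment sizes are hypothetical.

```python
def match_fragments(sizes_i, sizes_j, tolerance):
    """Maximum-cardinality matching between the fragments of two clones.

    sizes_i, sizes_j: fragment sizes b(u) of clones c_i and c_j.
    An edge (u, v) exists when |b(u) - b(v)| <= tolerance (T in the text).
    Returns a sorted list of matched index pairs (u, v).
    """
    # adjacency list of the bipartite graph G_{i,j}
    adj = [
        [v for v, bv in enumerate(sizes_j) if abs(bu - bv) <= tolerance]
        for bu in sizes_i
    ]
    match_of_j = [None] * len(sizes_j)  # fragment of c_i matched to each v

    def try_augment(u, seen):
        # look for an augmenting path that starts at fragment u of c_i
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_of_j[v] is None or try_augment(match_of_j[v], seen):
                match_of_j[v] = u
                return True
        return False

    for u in range(len(sizes_i)):
        try_augment(u, set())
    return sorted((u, v) for v, u in enumerate(match_of_j) if u is not None)


# Hypothetical fragment sizes with T = 7.
pairs = match_fragments([105, 100], [103, 110], tolerance=7)
```

In the example, fragment 105 is first matched to 103, then re-routed to 110 so that fragment 100 can take 103; both fragments end up matched, which a purely first-come assignment would miss.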
Context 3
... The contig-wise coverage is computed for each contig and then an overall score is computed as the weighted average of contig-wise coverage, where the weight is the number of MTP clones in each contig. Although contig-wise coverage appears very similar to global coverage, it may produce different results when a region in the genome is covered by multiple contigs.
Both FPC and FMTP have six parameters. Depending on the fingerprinting method (i.e., agarose or High Information Content Fingerprinting (HICF)), FMTP provides default values for its parameters. Using values for these parameters close to the defaults is crucial to obtain good performance. For example, the cutoff parameter C should be changed only slightly. By default, MTP-ILP uses a low C value (1e-10 for agarose or 1e-40 for HICF). Since MTP-ILP processes the original contigs, which usually contain many clones, a higher value of C would introduce many false positive overlaps. On the other hand, to detect shorter overlaps, MTP-MST uses a high C value (1e-2 for agarose or 1e-10 for HICF).
We have generated a large number of MTPs using both tools with several parameter sets; however, we only recorded the best possible MTP for a given size (i.e., number of clones). If we obtained two MTPs M_i and M_j using different parameter values and the size of M_i is greater than the size of M_j, then the coverage of M_i must be greater than the coverage of M_j; otherwise M_i is disregarded. As a result, in the experimental results the coverage increases monotonically as the size of the MTP increases. In order to have a fair comparison between FMTP and FPC, we used the same B and T values used by FPC when building the physical map (B = 90, T = 7 for rice, and T = 3 for barley). We set the C for the MTP-MST to values between 9e-2 and 5e-4. For all other parameters, we used the default values (Q = 3, B = 80, C = 1e-2.5 for MTP-ILP, 1e-1 for MTP-MST, C = 1e-10 for MTP-ILP).
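The rule for recording MTPs in this context (keep the best MTP per size, and disregard any larger MTP whose coverage is not strictly higher than that of a smaller one) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the (size, coverage) tuple representation, and the numbers are hypothetical and do not come from the paper's experiments.

```python
def monotone_frontier(mtps):
    """Filter MTPs so that coverage strictly increases with size.

    mtps: iterable of (size, coverage) pairs, one per generated MTP.
    Returns the retained pairs sorted by size.
    """
    # keep only the best coverage observed for each MTP size
    best = {}
    for size, coverage in mtps:
        if size not in best or coverage > best[size]:
            best[size] = coverage

    # scan sizes in increasing order; a larger MTP is kept only if it
    # beats the coverage of every smaller MTP that was kept
    kept = []
    top = float("-inf")
    for size in sorted(best):
        if best[size] > top:
            kept.append((size, best[size]))
            top = best[size]
    return kept


# Hypothetical (number of clones, coverage %) results from several runs.
runs = [(100, 80.0), (100, 78.5), (120, 79.0), (150, 85.0)]
frontier = monotone_frontier(runs)
```

Here the 120-clone MTP is dropped because a 100-clone MTP already reaches higher coverage, which is what makes the plotted curves monotone.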
The graphs summarizing the results are shown in Figure 2. Each point in the graphs represents an MTP. Figure 2-LEFT shows the contig-wise coverage as a function of the number of MTP clones; Figure 2-RIGHT illustrates optimal, global, and random coverage of the MTPs as a function of the MTP size. First, observe that when all clones in the rice contigs are selected as MTP clones, the global coverage is only 94.45% of the genome. As shown in Figure 2-LEFT and RIGHT, FMTP produces MTPs with significantly better contig-wise and global coverage than FPC, sometimes even with fewer clones. For instance, the highest possible contig-wise coverage that we were able to obtain by using FPC is 84.96%, whereas FMTP's is 85.11% with about 460 (12%) fewer clones. This would imply a 12% reduction in the sequencing costs. Also, the global coverage of MTPs produced by FMTP converges to the optimum coverage much faster than that of FPC. The numbers of redundant clones and gaps produced by FPC and FMTP are almost identical and very small. The average overlap size between consecutive clones is smaller in FMTP than in FPC (data not shown). This explains the difference in coverages in Figure 2. Another interesting observation can be made by comparing the optimal coverage in Figure 2-RIGHT. The optimal MTP for FMTP has higher coverage and a smaller number of clones than the optimal MTP of FPC. Recall that the optimal coverage is computed by selecting a set of clones for each contig that covers the widest possible region. A higher coverage (even when the number of clones is smaller) suggests that FMTP selects relatively more MTP clones from the "big" contigs than FPC.
We ran FMTP and FPC on the barley physical map generated by our group at the University of California, Riverside and several other institutions [14]. FPC generated MTPs that contain between 11,000 and 21,000 clones. When default values are used, FMTP generated MTPs that contain about 18,000 clones.
In terms of running time, FMTP and FPC are comparable. Both tools compute MTP in a couple of hours.
We presented a set of novel algorithms to compute the MTP of a physical map by using a two-step approach. In the first step, we used a stringent threshold to reduce the problem size by generating a preliminary MTP without compromising the coverage of the contigs. In the second step, we attempted to order the clones in the preliminary MTP by computing the MST of an overlap graph. Then, we ran a shortest path algorithm to compute the MTP. Our experimental results show that our method generates MTPs with significantly higher coverage than the most commonly used software FPC, even using a smaller number of MTP clones. Our experimental results also show that FMTP could substantially reduce the cost of clone-by-clone sequencing projects.
The authors would like to thank Prof. Carol Soderlund, Dr. William Nelson, and the anonymous reviewers for insightful comments that helped improve the manuscript. This project was supported in part by NSF CAREER IIS-0447773 and NSF DBI-0321756.
In Figure 3, five clones in a contig are illustrated based on their real coordinates in the genome. Suppose now that the MFG of this contig contains a connected component that contains fragments only from clones B and D. That component is spurious (i.e., does not represent a region in the target genome) because the overlap between clones B and D is completely covered by clone C. Spurious components arise for two reasons, namely missing fragments (in clone C in the example) or extra fragments (in clone B or D in the example). A sketch of the algorithm that removes clones that are buried into multiple clones is presented as Algorithm 1. A sketch of the algorithm that detects clone overlap is presented as Algorithm 2. At each iteration four conditions are checked to determine if u_i and u_{i+d} are overlapping where u_i , 1 ≤ i ≤ | | , is the ith clone in .
All conditions have to be true to add the edge (u_i, u_{i+d}) to . In lines 7 and 8 of Algorithm 2 we check whether the clone pairs (u_i, u_{i+d−1}) and (u_{i+1}, u_{i+d}) are overlapping. Obviously, if at least one of these pairs does not overlap then u_i and u_{i+d} cannot be overlapping (assuming that no clone is completely contained in another clone). In line 9, we check if clones u_i, u_{i+1}, ..., u_{i+d} have fragments together in at least one connected component of G. If u_i and u_{i+d} are overlapping then u_i, u_{i+1}, ..., u_{i+d} have to be overlapping, and therefore they should share at least one fragment in G. At the end, we check if S(u_i, u_{i+d}) ≤ C ...

Citations

... A. Obtain a clone library for the target individual; B. Fingerprint clones and build a physical map; C. Select a minimum tiling path (MTP) from the physical map [54,55]; D. Pool the MTP clones according to the shifted transversal design (STD) [16]; ...
Article
Owing to rapid advances in the next-generation sequencing technology, the cost of DNA sequencing has been reduced by several orders of magnitude. However, genomic sequencing of individuals at the population scale is still restricted to a few model species due to the huge challenge of constructing libraries for thousands of samples. Meanwhile, pooled sequencing provides a cost-effective alternative to sequencing individuals separately, which could vastly reduce the time and cost for DNA library preparation. Technological improvements, together with the broad range of biological research questions that require large sample sizes, mean that pooled sequencing will continue to complement the sequencing of individual genomes and become increasingly important in the foreseeable future. However, simply mixing samples together for sequencing makes it impossible to identify the reads that belong to each sample. Barcoding technology could help to solve this problem; currently, however, barcoding every sample is costly, especially for large-scale samples. An alternative to barcoding is combinatorial pooled sequencing, which employs pooling patterns rather than short DNA barcodes to encode each sample. In combinatorial pooled sequencing, samples are mixed into a few pools according to a carefully designed pooling strategy, which allows the sequencing data to be decoded to identify the reads that belong to the samples that are unique or rare in the population. In this review, we mainly survey the experiment design and decoding procedure for the combinatorial pooled sequencing applied in rare variant and rare haplotype carriers screening, complex genome assembling and single individual haplotyping.
... The steps in our combinatorial clone-by-clone sequencing method are illustrated in Figure 1 and described next in detail. A. Obtain a BAC library for the target organism B. Select gene-enriched BACs from the library (optional) C. Fingerprint BACs and build a physical map D. Select a minimum tiling path (MTP) from the physical map [13,14] E. Pool the MTP BACs according to the shifted transversal design [15] F. Sequence the DNA in each pool, trim/clean sequenced reads G. Assign reads to BACs (deconvolution) H. Assemble reads BAC-by-BAC using a short-read assembler ...
... The construction of a physical library and the selection of a MTP from a physical map are well-known procedures, and many organisms now have these resources available. More details can be found in, e.g. [14,16,17,18,19], and references therein. Once the set of clones to be sequenced has been identified, they must be pooled according to a scheme that allows the deconvolution of the sequenced reads back to their corresponding BACs. ...
... From this map, we selected only BACs whose sequence could be uniquely mapped to the rice genome. We computed an MTP of this smaller map using our tool FMTP [14]. The resulting MTP contained 3,827 BACs with an average length of ~150 kb, and spanned 91% of the rice genome (which is ~390 Mb). ...
Article
For the vast majority of species, including many economically or ecologically important organisms, progress in biological research is hampered by the lack of a reference genome sequence. Despite recent advances in sequencing technologies, several factors still limit the availability of such a critical resource. At the same time, many research groups and international consortia have already produced BAC libraries and physical maps and now are in a position to proceed with the development of whole-genome sequences organized around a physical map anchored to a genetic map. We propose a BAC-by-BAC sequencing protocol that combines combinatorial pooling design and second-generation sequencing technology to efficiently approach de novo selective genome sequencing. We show that combinatorial pooling is a cost-effective and practical alternative to exhaustive DNA barcoding when preparing sequencing libraries for hundreds or thousands of DNA samples, such as in this case gene-bearing minimum-tiling-path BAC clones. The novelty of the protocol hinges on the computational ability to efficiently compare hundreds of millions of short reads and assign them to the correct BAC clones (deconvolution) so that the assembly can be carried out clone-by-clone. Experimental results on simulated data for the rice genome show that the deconvolution is very accurate, and the resulting BAC assemblies have high quality. Results on real data for a gene-rich subset of the barley genome confirm that the deconvolution is accurate and the BAC assemblies have good quality. While our method cannot provide the level of completeness that one would achieve with a comprehensive whole-genome sequencing project, we show that it is quite successful in reconstructing the gene sequences within BACs. In the case of plants such as barley, this level of sequence knowledge is sufficient to support critical end-point objectives such as map-based cloning and marker-assisted breeding.
Article
The problem of computing the minimum tiling path (MTP) from a set of clones arranged in a physical map is a cornerstone of hierarchical (clone-by-clone) genome sequencing projects. We formulate this problem in a graph theoretical framework, and then solve it by a combination of minimum hitting set and minimum spanning tree algorithms. The tool implementing this strategy, called FMTP, shows improved performance compared to the widely used software FPC. When we execute FMTP and FPC on the same physical map, the MTP produced by FMTP covers a higher portion of the genome, and uses a smaller number of clones. For instance, on the rice genome the MTP produced by our tool would reduce by about 11 percent the cost of a clone-by-clone sequencing project. Source code, benchmark data sets, and documentation of FMTP are freely available at http://code.google.com/p/fingerprint-based-minimal-tiling-path/ under MIT license.
Article
We propose a new sequencing protocol that combines recent advances in combinatorial pooling design and second-generation sequencing technology to efficiently approach de novo selective genome sequencing. We show that combinatorial pooling is a cost-effective and practical alternative to exhaustive DNA barcoding when dealing with hundreds or thousands of DNA samples, such as genome-tiling gene-rich BAC clones. The novelty of the protocol hinges on the computational ability to efficiently compare hundreds of millions of short reads and assign them to the correct BAC clones so that the assembly can be carried out clone-by-clone. Experimental results on simulated data for the rice genome show that the deconvolution is extremely accurate (99.57% of the deconvoluted reads are assigned to the correct BAC), and the resulting BAC assemblies have very high quality (BACs are covered by contigs over about 77% of their length, on average). Experimental results on real data for a gene-rich subset of the barley genome confirm that the deconvolution is accurate (almost 70% of left/right pairs in paired-end reads are assigned to the same BAC, despite being processed independently) and the BAC assemblies have good quality (the average sum of all assembled contigs is about 88% of the estimated BAC length).