Fig 1 - uploaded by Boas Pucker
Genome-wide distribution of sequence variants between Col-0 and Nd-1.

Source publication
Preprint
Full-text available
Once a suitable reference sequence has been generated, intraspecific variation is often assessed by re-sequencing. Variant calling processes can reveal all differences between strains, accessions, genotypes, or individuals. These variants can be enriched with predictions about their functional implications based on available structural annotations,...

Similar publications

Article
Full-text available
The advent of genomic big data and the statistical need for reaching significant results have made genome-wide association studies ravenous for a huge number of genetic markers scattered along the whole genome. Since its very beginning, so-called genotype imputation has served this purpose; this statistical and inferential procedure based on a...

Citations

... These systematic genetic differences can be small sequence variants or presence/absence variants affecting entire genes. Tools like SnpEff (Cingolani et al., 2012) and NAVIP (Baasner et al., 2024) enable a prediction of the functional consequences of a sequence variant. GWAS and MBS have been deployed to study A. thaliana gene functions and to provide insights into crop genes determining important traits (Mascher et al., 2014; Sasaki et al., 2021; Schilbert et al., 2022; Naake et al., 2023; Sielemann et al., 2023). ...
Preprint
Full-text available
This review provides an overview of advancements in plant genomics, emphasizing key stages in genomics projects and addressing associated challenges. Long read sequencing enables the cost-effective sequencing of plant DNA and assembly of highly continuous genome sequences - often even separating haplophases. Incorporating external hints, such as cDNA sequences from RNA-seq or full length cDNA sequencing, enhances the identification of gene models. While these steps enable high-throughput exploration of numerous plant genomes, a significant bottleneck lies in elucidating gene functions. The classical approach based on wet lab methods is impractical when dealing with thousands of genes in a new genome sequence. To overcome this challenge, computational tools harnessing existing information for cross species knowledge transfer are essential for expediting the functional annotation process. In support of researchers entering the field of plant genomics, a collection of recommended tools has been curated and is accessible at https://github.com/bpucker/ToolOverview.
... Mappings were filtered with samtools to remove spurious hits, low quality alignments, and reads that are not properly mapped in pairs (-q 30 -b -F 0x900 -f 0x2). GATK v3.8 [46,47] was applied for the detection of small sequence variants as previously described [48]. Sequence variants were filtered to obtain a reduced set with high confidence. ...
Article
Full-text available
Background Infection by beet cyst nematodes (BCN, Heterodera schachtii) causes a serious disease of sugar beet, and climatic change is expected to improve the conditions for BCN infection. Yield and yield stability under adverse conditions are among the main breeding objectives. Breeding of BCN tolerant sugar beet cultivars offering high yield in the presence of the pathogen is therefore of high relevance. Results To identify causal genes providing tolerance against BCN infection, we combined several experimental and bioinformatic approaches. Relevant genomic regions were detected through mapping-by-sequencing using a segregating F2 population. DNA sequencing of contrasting F2 pools and analyses of allele frequencies for variant positions identified a single genomic region which confers nematode tolerance. The genomic interval was confirmed and narrowed down by genotyping with newly developed molecular markers. To pinpoint the causal genes within the potential nematode tolerance locus, we generated long read-based genome sequence assemblies of the tolerant parental breeding line Strube U2Bv and the susceptible reference line 2320Bv. We analyzed continuous sequences of the potential locus with regard to functional gene annotation and differential gene expression upon BCN infection. A cluster of genes with similarity to the Arabidopsis thaliana gene encoding nodule inception protein-like protein 7 (NLP7) was identified. Gene expression analyses confirmed transcriptional activity and revealed clear differences between susceptible and tolerant genotypes. Conclusions Our findings provide new insights into the genomic basis of plant-nematode interactions that can be used to design and accelerate novel management strategies against BCN.
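The samtools filter quoted in the citation snippet for this article (-q 30 -F 0x900 -f 0x2) can be read as three per-alignment tests. The following is a minimal Python sketch of that logic, not the authors' code: the function name `passes_filter` and the example flag values are hypothetical, while the bit values come from the SAM flag definitions (0x2 properly paired, 0x100 secondary, 0x800 supplementary).

```python
# SAM flag bits (values as defined in the SAM format).
PROPER_PAIR = 0x2
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def passes_filter(flag: int, mapq: int, min_mapq: int = 30,
                  exclude: int = 0x900, require: int = 0x2) -> bool:
    """Hypothetical helper mirroring `-q 30 -F 0x900 -f 0x2` for one alignment."""
    if mapq < min_mapq:          # -q: drop alignments with MAPQ below the threshold
        return False
    if flag & exclude:           # -F: drop if ANY of the excluded bits is set
        return False
    return (flag & require) == require  # -f: keep only if ALL required bits are set


# 0x900 is simply the union of the secondary and supplementary bits.
assert SECONDARY | SUPPLEMENTARY == 0x900

# Toy (flag, MAPQ) pairs, invented for illustration.
examples = [(99, 60), (99, 10), (355, 60), (2147, 60), (65, 60)]
kept = [(flag, mapq) for flag, mapq in examples if passes_filter(flag, mapq)]
print(kept)
```

Only the first toy alignment is kept: the others fail on mapping quality, on the secondary or supplementary bit, or on the missing proper-pair bit. The -b option only selects BAM output and therefore has no counterpart in the sketch.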
... The overall workflow of our benchmarking study is presented in Figure 6. We applied a previously described pipeline to validate sequence variants against the Nd-1 de novo assembly based on PacBio reads [57], which is crucial in order to assess the performance of each variant calling pipeline. This Nd-1 genome sequence assembly is of high quality due to a high PacBio read coverage of about 112-fold and additional polishing with about 120-fold coverage of accurate short reads [58]. ...
Article
Full-text available
High-throughput sequencing technologies have rapidly developed during the past years and have become an essential tool in plant sciences. However, the analysis of genomic data remains challenging and relies mostly on the performance of automatic pipelines. Frequently applied pipelines involve the alignment of sequence reads against a reference sequence and the identification of sequence variants. Since most benchmarking studies of bioinformatics tools for this purpose have been conducted on human datasets, there is a lack of benchmarking studies in plant sciences. In this study, we evaluated the performance of 50 different variant calling pipelines, including five read mappers and ten variant callers, on six real plant datasets of the model organism Arabidopsis thaliana. Sets of variants were evaluated based on various parameters including sensitivity and specificity. We found that all investigated tools are suitable for analysis of NGS data in plant research. When looking at different performance metrics, BWA-MEM and Novoalign were the best mappers and GATK returned the best results in the variant calling step.
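Evaluating a call set against a gold standard, as described in this abstract, comes down to counting agreements and disagreements. The sketch below is an illustration under stated assumptions rather than the benchmarking pipeline itself: variants are represented as (chromosome, position, ref, alt) tuples, `benchmark` is a hypothetical helper, and `n_sites` (the number of assessed positions) is an assumed input needed to derive true negatives for specificity.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def benchmark(called: Set[Variant], gold: Set[Variant], n_sites: int) -> Dict[str, float]:
    """Compare a call set with a gold standard (hypothetical helper).

    Assumes every variant occupies one assessed site, so true negatives
    are the sites that are neither called nor part of the gold standard.
    """
    tp = len(called & gold)   # called and validated
    fp = len(called - gold)   # called but not validated
    fn = len(gold - called)   # validated but missed
    tn = n_sites - tp - fp - fn
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Toy data, invented for illustration.
gold = {("Chr1", 100, "A", "G"), ("Chr1", 250, "C", "T"),
        ("Chr2", 40, "G", "GA"), ("Chr3", 77, "TC", "T")}
called = {("Chr1", 100, "A", "G"), ("Chr1", 250, "C", "T"),
          ("Chr2", 40, "G", "GA"), ("Chr4", 5, "A", "C")}
metrics = benchmark(called, gold, n_sites=1000)
print(metrics)
```

Because non-variant sites vastly outnumber variants in a genome, specificity computed this way tends to sit close to 1, which is why precision is included alongside it.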
... reads (https://github.com/bpucker/variant_calling) [46], which is crucial in order to assess the performance of each variant calling pipeline. A gold standard was generated from all validated variants by combining them into a single VCF file (https://docs.cebitec.uni-bielefeld.de/s/GG4CYJ7PcEwMFAF). ...
... Next, variants were called and saved in VCF files. All variants were subjected to a previously described validation process based on the Nd-1 genome sequence [46] ...
Preprint
Full-text available
High-throughput sequencing technologies have rapidly developed during the past years and have become an essential tool in plant sciences. However, the analysis of genomic data remains challenging and relies mostly on the performance of automatic pipelines. Frequently applied pipelines involve the alignment of sequence reads against a reference sequence and the identification of sequence variants. Since most benchmarking studies of bioinformatics tools for this purpose have been conducted on human datasets, there is a lack of benchmarking studies in plant sciences. In this study, we evaluated the performance of 50 different variant calling pipelines, including five read mappers and ten variant callers, on six real plant datasets of the model organism Arabidopsis thaliana. Sets of variants were evaluated based on various parameters including sensitivity and specificity. We found that all investigated tools are suitable for analysis of NGS data in plant research. When looking at different performance metrics, BWA-MEM and Novoalign were the best mappers and GATK returned the best results in the variant calling step.
... As previous studies revealed that Illumina short reads have a higher resolution for such coverage analysis [42], we focused on the Illumina read data set for these analyses. Sequence variants were detected based on this read mapping as previously described [43]. The number of heterozygous variants per gene was calculated and compared between the groups of putatively phase-separated and merged genes. ...
Article
Full-text available
Trifoliate yam (Dioscorea dumetorum) is one example of an orphan crop, not traded internationally. Post-harvest hardening of the tubers of this species starts within 24 h after harvesting and renders the tubers inedible. Genomic resources are required for D. dumetorum to improve breeding for non-hardening varieties as well as for other traits. We sequenced the D. dumetorum genome and generated the corresponding annotation. The two haplophases of this highly heterozygous genome were separated to a large extent. The assembly represents 485 Mbp of the genome with an N50 of over 3.2 Mbp. A total of 35,269 protein-encoding gene models as well as 9941 non-coding RNA genes were predicted, and functional annotations were assigned.
... Genomic sequencing reads were retrieved from the SRA via fastq-dump, as described above. BWA MEM v.0.7 [66] was applied with the -M parameter for mapping of the reads, and GATK v.3.8 [67,68] was used for variant detection, as described previously [69]. The positions of variants were compared to the positions of splice sites using compare_variation_rates.py [10]. ...
Article
Full-text available
Most protein-encoding genes in eukaryotes contain introns, which are interwoven with exons. Introns need to be removed from initial transcripts in order to generate the final messenger RNA (mRNA), which can be translated into an amino acid sequence. Precise excision of introns by the spliceosome requires conserved dinucleotides, which mark the splice sites. However, there are variations of the highly conserved combination of GT at the 5' end and AG at the 3' end of an intron in the genome. GC-AG and AT-AC are two major non-canonical splice site combinations, which have been known for years. Recently, various minor non-canonical splice site combinations were detected with numerous dinucleotide permutations. Here, we expand systematic investigations of non-canonical splice site combinations in plants across eukaryotes by analyzing fungal and animal genome sequences. Comparisons of splice site combinations between these three kingdoms revealed several differences, such as an apparently increased CT-AC frequency in fungal genome sequences. Canonical GT-AG splice site combinations in antisense transcripts are a likely explanation for this observation, thus indicating annotation errors. In addition, high numbers of GA-AG splice site combinations were observed in Eurytemora affinis and Oikopleura dioica. A variant in one U1 small nuclear RNA (snRNA) isoform might allow the recognition of GA as a 5' splice site. In-depth investigation of splice site usage based on RNA-Seq read mappings indicates a generally higher flexibility of the 3' splice site compared to the 5' splice site across animals, fungi, and plants.
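The link between CT-AC and antisense GT-AG introns mentioned in this abstract follows directly from reverse complementation. The sketch below shows this together with the canonical, major non-canonical, and minor non-canonical categories named above; the function names and the intron sequence are made up for illustration and are not taken from the publication.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def splice_site_combination(intron: str) -> str:
    """First and last dinucleotide of an intron, e.g. 'GT-AG'."""
    intron = intron.upper()
    return f"{intron[:2]}-{intron[-2:]}"


def classify(intron: str) -> str:
    """Assign the category used in the abstract to one intron sequence."""
    combination = splice_site_combination(intron)
    if combination == "GT-AG":
        return "canonical"
    if combination in {"GC-AG", "AT-AC"}:
        return "major non-canonical"
    return "minor non-canonical"


# Invented intron sequence with canonical GT...AG borders.
intron = "GTAAGTTTTCTAACTTTTGCAG"
antisense = reverse_complement(intron)

print(splice_site_combination(intron), classify(intron))
print(splice_site_combination(antisense), classify(antisense))
```

Read on the opposite strand, the canonical intron appears as CT-AC and would be counted as minor non-canonical, which matches the explanation via antisense transcripts given in the abstract.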
... The genome-wide distribution of SNVs and InDels was assessed based on previously developed scripts (Baasner et al. 2019). The length distribution of InDels inside coding sequences was compared to the length distribution of InDels outside coding sequences using a customized Python script (Pucker et al. 2016). ...
Article
Full-text available
Different Musa species, subspecies, and cultivars are currently investigated to reveal their genomic diversity. Here, we compare the genome sequence of one of the commercially most important cultivars, Musa acuminata Dwarf Cavendish, against the Pahang reference genome assembly. Numerous small sequence variants were detected and the ploidy of the cultivar presented here was determined as triploid based on sequence variant frequencies. Illumina sequence data also revealed a duplication of a large segment on the long arm of chromosome 2 in the Dwarf Cavendish genome. Comparison against previously sequenced cultivars provided evidence that this duplication is unique to Dwarf Cavendish. Although no functional relevance of this duplication was identified, this example shows the potential of plants to tolerate such aneuploidies.
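The citation snippet for this article mentions comparing InDel length distributions inside and outside coding sequences. A minimal sketch of such a comparison is given below; it is not the customized script referenced there. The tuple layout, the `length_distributions` helper, the inclusive 1-based CDS intervals, and all toy values are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)
Interval = Tuple[int, int]           # inclusive, 1-based (start, end)


def in_any_interval(pos: int, intervals: List[Interval]) -> bool:
    """True if pos lies inside at least one interval."""
    return any(start <= pos <= end for start, end in intervals)


def length_distributions(variants: Iterable[Variant],
                         cds: Dict[str, List[Interval]]) -> Tuple[Counter, Counter]:
    """Count InDel lengths separately for positions inside and outside CDS."""
    inside, outside = Counter(), Counter()
    for chrom, pos, ref, alt in variants:
        size = abs(len(alt) - len(ref))
        if size == 0:  # same-length substitutions are not InDels
            continue
        target = inside if in_any_interval(pos, cds.get(chrom, [])) else outside
        target[size] += 1
    return inside, outside


# Toy data, invented for illustration.
cds = {"chr1": [(100, 200)]}
variants = [
    ("chr1", 105, "ATCG", "A"),  # 3 bp deletion inside CDS
    ("chr1", 150, "A", "AT"),    # 1 bp insertion inside CDS
    ("chr1", 120, "A", "G"),     # SNV, ignored
    ("chr1", 500, "AG", "A"),    # 1 bp deletion outside CDS
    ("chr2", 10, "A", "ACCC"),   # 3 bp insertion, chromosome without CDS entry
]
inside, outside = length_distributions(variants, cds)
print(dict(inside), dict(outside))
```

The linear scan over intervals keeps the sketch short; for genome-scale annotations, a sorted interval list with binary search would be a more practical design.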
... Illumina sequencing reads of At7 were aligned to the TAIR9 reference sequence of Col-0 via BWA MEM v0.7.13 [25]. Next, GATK v.3.8 [27,28] was applied for the identification of small sequence variants as previously described [29]. In contrast to previous studies, variant positions with multiple different alleles were kept as they are biologically possible. ...
... SnpEff [30] was deployed to assign predictions of the functional impact to all small sequence variants. Previously developed Python scripts [8,29] were customized to investigate the distribution of variants and to check for patterns. SVIM [31] was deployed to identify large variants based on a Minimap2 mapping of all ONT reads against the Col-0 reference sequence. ...
... Alignments against the Col-0 reference sequence revealed 160,348 deletions and 5902 insertions (Figure S1). Previous comparisons of natural A. thaliana accessions revealed equal numbers of insertions and deletions [2,8,29]. ...
Article
Full-text available
Arabidopsis thaliana is one of the best studied plant model organisms. Besides cultivation in greenhouses, cells of this plant can also be propagated in suspension cell culture. At7 is one such cell line that was established about 25 years ago. Here, we report the sequencing and the analysis of the At7 genome. Large scale duplications and deletions compared to the Columbia-0 (Col-0) reference sequence were detected. The number of deletions exceeds the number of insertions, thus indicating that a haploid genome size reduction is ongoing. Patterns of small sequence variants differ from the ones observed between A. thaliana accessions, e.g., the number of single nucleotide variants matches the number of insertions/deletions. RNA-Seq analysis reveals that disrupted alleles are less frequent in the transcriptome than the native ones.
Chapter
Full-text available
Recent progress in sequencing technologies facilitates plant science experiments through the availability of genome and transcriptome sequences. Genome assemblies provide details about genes, transposable elements, and the general genome structure. The availability of a reference genome sequence for a species enables and supports numerous wet lab analyses and comprehensive bioinformatic investigations, e.g., genome-wide investigations of gene families. After generating a genome sequence, gene prediction and the generation of functional annotations are the major challenges. Although these methods have improved substantially over the last years, incorporation of external hints like RNA-Seq reads is beneficial. Once a high-quality sequence and annotation are available for a species, diversity between accessions can be assessed by re-sequencing. This helps in revealing single nucleotide variants, insertions and deletions, and larger structural variants like inversions and transpositions. Identification of these variants requires sophisticated bioinformatic tools, many of which were developed during the past years. Sequence variants can be harnessed for the genetic mapping of traits. Several mapping-by-sequencing approaches were developed to find underlying genes for relevant traits in crops. These genomic approaches are complemented by various transcriptomic methods dominated by the very popular RNA-Seq technology. Transcript abundance is measured via sequencing of the corresponding cDNA molecules. RNA-Seq reads can be subjected to transcriptome assembly or gene expression analysis, e.g., for the identification of differences in transcript abundance between different tissues, conditions, or genotypes.