ArticlePDF Available

Hybracter: enabling scalable, automated, complete and accurate bacterial genome assemblies

Authors:

Abstract and Figures

Improvements in the accuracy and availability of long-read sequencing mean that complete bacterial genomes are now routinely reconstructed using hybrid (i.e. short- and long-reads) assembly approaches. Complete genomes allow a deeper understanding of bacterial evolution and genomic variation beyond single nucleotide variants. They are also crucial for identifying plasmids, which often carry medically significant antimicrobial resistance genes. However, small plasmids are often missed or misassembled by long-read assembly algorithms. Here, we present Hybracter which allows for the fast, automatic and scalable recovery of near-perfect complete bacterial genomes using a long-read first assembly approach. Hybracter can be run either as a hybrid assembler or as a long-read only assembler. We compared Hybracter to existing automated hybrid and long-read only assembly tools using a diverse panel of samples of varying levels of long-read accuracy with manually curated ground truth reference genomes. We demonstrate that Hybracter as a hybrid assembler is more accurate and faster than the existing gold standard automated hybrid assembler Unicycler. We also show that Hybracter with long-reads only is the most accurate long-read only assembler and is comparable to hybrid methods in accurately recovering small plasmids.
Content may be subject to copyright.
1
OPEN
DATA
Hybracter: enabling scalable, automated, complete and accurate
bacterial genomeassemblies
GeorgeBouras1,2,*, GhaisHoutak1,2, Ryan R.Wick3, VijiniMallawaarachchi4, Michael J.Roach4,5, BhavyaPapudeshi4,
Lousie M.Judd3, Anna E.Sheppard6, Robert A.Edwards4 and SarahVreugde1,2
RESEARCH ARTICLE
Bouras etal., Microbial Genomics 2024;10:001244
DOI 10.1099/mgen.0.001244
Received 08 January 2024; Accepted 16 April 2024; Published 08 May 2024
Author aliations: 1Adelaide Medical School, Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, Australia; 2The Department
of Surgery – Otolaryngology Head and Neck Surgery, University of Adelaide and the Basil Hetzel Institute for Translational Health Research, Central
Adelaide Local Health Network, Adelaide, South Australia, Australia; 3Department of Microbiology and Immunology, University of Melbourne at the
Peter Doherty Institute for Infection and Immunity, Melbourne, Australia; 4Flinders Accelerator for Microbiome Exploration, College of Science and
Engineering, Flinders University, Adelaide, Australia; 5Adelaide Centre for Epigenetics and South Australian Immunogenomics Cancer Institute, The
University of Adelaide, Adelaide, Australia; 6School of Biological Sciences, The University of Adelaide, Adelaide, Australia.
*Correspondence: George Bouras, george. bouras@ adelaide. edu. au
Keywords: assembly; long- reads; plasmids.
Abbreviations: CDS, coding sequence; DBG, de Bruijn graph; HPC, high- performace computing; InDel, insertion and deletion; SNV, single nucleotide
variant.
Data statement: Three supplementary figures and 13 supplementary tables are available with the online version of this article.
001244 © 2024 The Authors
This is an open- access article distributed under the terms of the Creative Commons Attribution License. This article was made open access via a Publish and Read agreement between
the Microbiology Society and the corresponding author’s institution.
Abstract
Improvements in the accuracy and availability of long- read sequencing mean that complete bacterial genomes are now rou-
tinely reconstructed using hybrid (i.e. short- and long- reads) assembly approaches. Complete genomes allow a deeper under-
standing of bacterial evolution and genomic variation beyond single nucleotide variants. They are also crucial for identifying
plasmids, which often carry medically significant antimicrobial resistance genes. However, small plasmids are often missed
or misassembled by long- read assembly algorithms. Here, we present Hybracter which allows for the fast, automatic and
scalable recovery of near- perfect complete bacterial genomes using a long- read first assembly approach. Hybracter can be
run either as a hybrid assembler or as a long- read only assembler. We compared Hybracter to existing automated hybrid and
long- read only assembly tools using a diverse panel of samples of varying levels of long- read accuracy with manually curated
ground truth reference genomes. We demonstrate that Hybracter as a hybrid assembler is more accurate and faster than the
existing gold standard automated hybrid assembler Unicycler. We also show that Hybracter with long- reads only is the most
accurate long- read only assembler and is comparable to hybrid methods in accurately recovering small plasmids.
Impact Statement
Complete bacterial genome assembly using hybrid sequencing is a routine and vital part of bacterial genomics, especially for
identification of mobile genetic elements and plasmids. As sequencing becomes cheaper, easier to access and more accurate,
automated assembly methods are crucial. With Hybracter, we present a new long- read first automated assembly tool that is
faster and more accurate than the widely used Unicycler. Hybracter can be used both as a hybrid assembler and with long- reads
only. Additionally, it solves the problems of long- read assemblers struggling with small plasmids, with plasmid recovery from
long- reads only performing on par with hybrid methods. Hybracter can natively exploit the parallelization of high- performance
computing clusters and cloud- based environments, enabling users to assemble hundreds or thousands of genomes with one
line of code. Hybracter is available freely as source code on GitHub, via Bioconda or PyPi.
OPEN
ACCESS
2
Bouras etal., Microbial Genomics 2024;10:001244
DATA SUMMARY
(1) Hybracter is developed using Python and Snakemake as a command- line soware tool for Linux and MacOS systems.
(2)
Hybracter is freely available under an MIT licence on GitHub (https://github.com/gbouras13/hybracter) and the documenta-
tion is available at Read the Docs (https://hybracter.readthedocs.io/en/latest/).
(3)
Hybracter is available to install via PyPI (https://pypi.org/project/hybracter/) and Bioconda (https://anaconda.org/bioconda/
hybracter). A Docker/Singularity container is also available at https://quay.io/repository/gbouras13/hybracter.
(4) All code used to benchmark Hybracter, including the reference genomes, is publicly available on GitHub (https://github.
com/gbouras13/hybracter_benchmarking) with released DOI https://zenodo.org/doi/10.5281/zenodo.10910108 available
at Zenodo.
(5) e subsampled FASTQ les used for benchmarking are publicly available at Zenodo with DOI https://doi.org/10.5281/
zenodo.10906937.
(6)
All super accuracy simplex ATCC FASTQ reads sequenced as a part of this study can be found under BioProject
PRJNA1042815.
(7) All Hall et al. fast accuracy simplex and super accuracy duplex ATCC FASTQ read les can be found in the SRA under
BioProject PRJNA1087001.
(8) All raw Lermaniaux et al. FASTQ read les and genomes can be found in the SRA under BioProject PRJNA1020811.
(9) All Staphylococcus aureus JKD6159 FASTQ read les and genomes can be found under BioProject PRJNA50759.
(10) All Mycobacterium tuberculosis H37R2 FASTQ read les and genomes can be found under BioProject PRJNA836783.
(11) e complete list of BioSample accession numbers for each benchmarked sample can be found in Table S1, available in the
online version of this article.
(12)
e benchmarking assembly output les are publicly available on Zenodo with DOI https://doi.org/10.5281/zenodo.
10906937.
(13) All Pypolca benchmarking outputs and code are publicly available on Zenodo with DOI https://zenodo.org/doi/10.5281/
zenodo.10072192.
INTRODUCTION
Reconstructing complete bacterial genomes using de novo assembly methods had been considered too costly and time-
consuming to be widely recommended in most cases, even as recently as 2015 [1]. is was due to the reliance on short- read
sequencing technologies, which does not allow for reconstructing regions with repeats and extremely high GC content
[2]. However, since then, advances in long- read sequencing technologies have allowed for the automatic construction of
complete genomes using hybrid assembly approaches. Originally, this involved starting with a short- read assembly followed
by scaolding the repetitive and dicult to resolve regions with long- reads [3, 4]. is approach was implemented in the
command- line tool Unicycler, which remains the most popular tool for generating complete bacterial genome assemblies
[5]. As long- read sequencing has improved in accuracy and availability, with the latest Oxford Nanopore Technologies reads
recently reaching Q20 (99 %+) median accuracy, a long- read rst assembly approach supplemented by short- read polishing
has recently been favoured for recovering accurate complete genomes. Long- read rst approaches provide greater accuracy
and contiguity than short- read rst approaches in dicult regions [6–11]. e current gold standard manual assembly tool
Trycycler even allows for the potential recovery of perfect genome assemblies [7]. However, Trycycler requires signicant
microbial bioinformatics expertise and involves manual decisionmaking, creating a signicant barrier to useability, scalability
and automation [12].
Several tools exist that generate automated long- read rst genome assemblies, such as MicroPIPE [13], ASA3P [14], Bactopia
[15] and Dragonye [16]. However, these tools do not consider factors such as genome reorientation [17] and recent polishing
best- practices [18], and oen contain the assembly workow as a sub- module within a more expansive end- to- end pipeline.
Additionally, none of the existing tools consider the targeted recovery of plasmids. As long- read assemblers struggle particularly
with small plasmids, this leads to incorrectly recovered or missing plasmids in bacterial assemblies [19].
We introduce Hybracter, a new command- line tool for automated near- perfect long- read rst complete bacterial genome
assembly. It implements a comprehensive and exible workow allowing for long- read assembly polished with long- and short-
reads (with subcommand ‘hybracter hybrid’ for one or more samples and subcommand ‘hybracter hybrid- single’ for a single
sample) or long- read only assembly polished with long- reads (with subcommand ‘hybracter long’ for one or more samples and
subcommand ‘hybracter long- single’ for a single sample) (Table1). For ease of use and familiarity, Hybracter has been designed
with a command- line interface containing parameters similar to Unicycler. Additionally, thanks to its Snakemake [20] and
Snaketool [21] implementation, Hybracter seamlessly scales from a single isolate to hundreds or thousands of genomes with
high computational eciency and supports deployment on high- performance computing (HPC) clusters and cloud- based
environments.
3
Bouras etal., Microbial Genomics 2024;10:001244
METHODS
Assembly workflow
Hybracter implements a long- read rst automated assembly workow based on current best practices [12]. e main subcom-
mands available in Hybracter can be found in Table1 and the workow is outlined in Fig.1. Hybracter begins with long- reads
for all subcommands, and uses short- reads for polishing for ‘Hybracter hybrid’ and ‘Hybracter hybrid- single’ subcommands.
First, long- read input FASTQs are input and long- read sets are ltered and subsampled to a depth of 100× with Filtlong [22],
which prioritizes the longest and highest quality reads, outperforming random subsampling (see Table S11). e reads also have
adapters trimmed using Porechop_ABI [23], with optional contaminant removal against a host genome using modules from
Trimnami (e.g. if the bacterium has been isolated from a host) [24]. Quality control of short- read input FASTQs is performed
with fastp [25] (Fig.1a). e estimated depth of the short- reads is determined using Seqkit [26].
Long- reads are then assembled with Flye [27]. If at least one contig is recovered above the cut- o ‘-c’ chromosome length specied
by the user for the sample, that sample will be denoted as ‘complete. All such contigs will then be marked as chromosomes and
kept for downstream polishing and reorientation if marked as circular by Flye. If zero contigs are above the cut- o chromosome
length, the assembly will be denoted as ‘incomplete, and all contigs will be kept for downstream polishing (Fig.1b).
For all complete samples, targeted plasmid assembly is then conducted using Plassembler [28] (Fig.1c). All samples (i.e. complete
and incomplete) are then polished once with Medaka [29], which can be turned o using ‘--no_medaka’ (Fig.1d). It is recom-
mended to turn o Medaka using ‘--no_medaka’ for highly accurate Q20+ read sets where Medaka has been shown to introduce
false positive changes [11]. For all complete samples only, chromosome(s) marked as circular by Flye will then be reoriented
to begin with the dnaA chromosomal replication initiator gene using Dnaapler [30]. ese reoriented chromosomes are then
polished for a second time with Medaka to ensure the sequence around the original chromosome breakpoint is polished.
If the user has provided short- reads with Hybracter hybrid, all sample assemblies (complete and incomplete) are then polished
with Polypolish [18] followed by Pypolca [31, 32] (Fig.1f). e exact parameters depend on the depth of short- read sequencing
[31]. If the estimated short- read coverage is below 5×, only Polypolish with ‘--careful’ is run, as Pypolca can rarely introduce false
positive errors at low depths. If the estimated short- read coverage is between 5× and 25×, Polypolish with --careful parameter is
Table 1. Summary of the four primary Hybracter commands
Command Input No. of samples Description Workow elements included
by default (from Fig.1)
Hybracter hybrid Five- column csv sample
sheet specied with ‘--input’
containing:
sample name
long- read FASTQ path
estimated chromosome
length
R1 short- read FASTQ path
R2 short- read FASTQ path
1+ Long- read rst assembly
followed by long- then
short- read polishing for
multiple isolates. Snakemake
implementation ensures
ecient use of available
resources
a, b, c, d, e, f, g, h
Hybracter hybrid- single sample name (- s)
long- read FASTQ path (- l)
estimated chromosome
length (- c)
R1 short- read FASTQ path
(−1)
R2 short- read FASTQ path
(−2)
1 Long- read rst assembly
followed by long- then short-
read polishing for a single
isolate. Similar command line
interface to Unicycler
a, b, c, d, e, f, g, h
Hybracter long ree- column csv sample
sheet specied with ‘--input’
containing:
sample name
long- read FASTQ path
estimated chromosome
length
1+ Long- read rst assembly
followed by long- read
polishing for multiple
isolates. Snakemake
implementation ensures
ecient use of available
resources
a (no fastp), b, c, d, e, g, h
Hybracter long- single sample name (- s)
long- read FASTQ path (- l)
estimated chromosome
length (- c)
1 Long- read rst assembly
followed by long- read
polishing on a single isolate.
a (no fastp), b, c, d, e, g, h
4
Bouras etal., Microbial Genomics 2024;10:001244
run followed by Pypolca with --careful parameter. Above 25× coverage, Polypolish with default parameters followed by Pypolca
with --careful is run. is is because Pypolca --careful has been shown to be the best polisher at depths above 5×, and because
Polypolish is able to x potential errors in repeats Pypolca may miss. By default, the last short- read polishing round is chosen as the
nal assembly. Alternatively, users can choose the highest scoring polishing round according to the reference- free ALE [33] score.
Fig. 1. Outline of the Hybracter workflow.
5
Bouras etal., Microbial Genomics 2024;10:001244
If only long- reads are available (Hybracter long), the mean coding sequence (CDS) length is calculated for each assembly using
Pyrodigal [34, 35], with larger mean CDS lengths indicating a better quality assembly. e polishing round with the highest mean
CDS length is chosen as the nal assembly (Fig.1g).
For each sample, a nal output assembly FASTA le is created, along with per contig and overall summary statistic TSV les,
as well as separate chromosome and plasmid FASTA les for samples denoted as complete (Fig.1h). An overall ‘ hybracter_
summary. tsv’ le is also generated, which summarizes outputs for all samples. All main output les are explained in more
detail in Table2. All the main outputs can be found in the ‘FINAL_OUTPUT’ subdirectory, while all other intermediate
output les are available in other subdirectories for users who would like extra information about their assemblies, including
all assembly assessments, comparisons of all changes introduced by polishing, and Flye and Plassembler output summaries.
A full list of these supplementary outputs can be found in Hybracter’s Documentation (https://hybracter.readthedocs.io/
en/latest/output/).
Tool selection
Tools were selected for inclusion in Hybracter either based on benchmarking from the literature, or they were specically
developed for inclusion in Hybracter. Flye [27] was chosen as the long- read assembler because it is more accurate for bacterial
genome assembly than other long- read assemblers with comparable runtimes, such as Raven [36], Redbean [37] and Miniasm
[38], while being dramatically faster than the comparably accurate Canu [6, 39]. Medaka [29] was chosen as the long- read polisher
because of its ability to improve assembly continuity in addition to accuracy [12, 40]. e benchmarking results of this study
also emphasize that it is particularly good at xing insertion and deletion (InDel) errors, which cause problematic frameshis
and frequently lead to fractured or truncated gene predictions. However, it should be re- iterated that for modern Q20+ datasets,
Medaka may introduce errors [11] and should not be used (using --no_medaka with Hybracter). Polypolish and Pypolca in
various combinations depending on short- read depth were selected as short- read polishers, as these have been shown to achieve
the highest performance with the lowest chance of introducing errors when used in combination [31].
We developed three standalone programs included in Hybracter. ese are Dnaapler [30], Plassembler [28] and Pypolca [31].
Dnaapler was developed to ensure the chromosome(s) identied by Hybracter are reoriented to consistently begin with the dnaA
chromosomal replication initiator gene. Full implementation details can be found in the manuscript, with expanded functionality
beyond this use case [30]. Plassembler was developed to improve the runtime and accuracy when assembling plasmids in bacterial
isolates. Full implementation details can be found in the manuscript for hybrid mode [28]. Hybracter long utilizes Plassembler
containing a post- publication improvement for long- reads only (‘Plassembler long’) released in v1.3. Plassembler long assembles
plasmids from only long- reads by treating long- reads as both short- reads and long- reads. Plassembler long does this by utilizing
Unicycler in its pipeline to create a de Bruijn graph- based assembly, treating the long- reads as unpaired single- end reads, which
are then scaolded with the same long- read set.
Table 2. Description of the primary Hybracter output files
Output le Description
{sample}_nal.fasta Final assembly FASTA le for the sample. Contains all chromosome(s) and plasmids for complete
isolates and all contigs for incomplete isolates
{sample}_chromosome.fasta Final assembly FASTA le for the chromosomes(s) in a complete sample
{sample}_plasmid.fasta Final assembly FASTA le for the plasmids in a complete sample
hybracter_summary.tsv A TSV le combining the {sample}_summary.tsv les for all samples
{sample}_summary.tsv A TSV le containing columns denoting for the sample:
Assembly completeness
Total assembly length
Number of contigs assembled
e polishing round deemed to be most accurate and selected as the nal assembly
e length of the longest contig
e estimated coverage of the longest contig
e number of circular plasmids recovered by Plassembler
{sample}_per_contig_stats.tsv A TSV le containing columns denoting for the sample:
Contig name
Contig type (chromosome or plasmid) (complete samples only)
Contig length
Contig GC%
Contig circularity (complete samples only)
6
Bouras etal., Microbial Genomics 2024;10:001244
e third tool is Pypolca [31, 32]. Pypolca is a Python re- implementation of the POLCA short- read genome polisher, originally
created specically for inclusion in Hybracter and with an almost identical output format and performance. Compared to POLCA,
Pypolca features improved useability with a simplied command line interface, allows the user to specify an output directory
and introduces a ‘--careful’ parameter. e performance of Pypolca, and particularly Pypolca with the --careful parameter, is
described in the manuscript [31].
Benchmarking
To compare Hybracter’s functionality and performance, we benchmarked its performance against other soware tools. We
focused on the most popular state- of- the- art assembly tools for automated hybrid and long only bacterial genome assemblies.
All code to replicate these analyses can be found at the repository (https://github.com/gbouras13/hybracter_benchmarking). All
programs and dependency versions used for benchmarking can be found in Table S4. For the hybrid tools, we chose Unicycler and
Dragonye with both long- read and short- read polishing (denoted ‘Dragonye hybrid’). Dragonye was chosen as it is a popular
long- read rst assembly pipeline [16]. Both tools were run using default parameters. By default, Dragonye conducts a long- read
assembly with Flye that is polished by Racon [41] followed by Polypolish. For the long- read only tool, we chose Dragonye with
long- read Racon- based polishing only (denoted ‘Dragonye long’).
We used 30 samples for benchmarking, representing genomes from a variety of Gram- negative and Gram- positive bacteria. We
chose these samples as they have real hybrid read sets in combination with manually curated genome assemblies produced using
either Trycycler or Bact- builder [42], a consensus- building pipeline based on Trycycler. ese samples came from ve dierent
studies below. We used the published genomes from these studies as representatives of the ‘ground truth’ for these samples. Where
read coverage exceeded 100× samples were subsampled to approximately 100× coverage of the approximate genome size with
Rasusa v0.7.0 [43], as this better reects more realistic read depth of real life isolate sequencing. Nanoq v0.10.0 [44] was used to
generate quality control statistics for the subsampled long- read sets. Four isolates did not have 100× long- read coverage – the
entire long- read set was used instead. A full summary table of the read lengths, quality, Nanopore kit and base- calling models used
in these studies can be found in Table S2. Hybracter v0.7.0 was used to conduct benchmarking. Medaka long- read polishing was
used for all samples except the ve ATCC super- accuracy model basecalled duplex read samples, where ‘--no_medaka’ was used.
ese samples contained varying levels of long- read quality (reecting improvements in Oxford Nanopore Technologies long- read
technology), with the median Q score of long- read sets ranging from 10.6 to 26.8. e ve studies are:
(1) Five ATCC strain isolates (Salmonella enterica ATCC 10708, Vibrio paragaemolyticus ATCC 17802, Escherichia coli ATCC
25922, Campylobacter jejuni ATCC 33560 and Listeria monocytogenes ATCC- BAA- 679) with R10 chemistry super- accuracy
model basecalled simplex long- reads made available as a part of this study.
(2)
e same ve ATCC isolates with R10 chemistry fast model basecalled long- reads, and R10 chemistry super- accuracy model
basecalled duplex long- reads from Hall et al. [45].
(3) Twelve diverse carbapenemase- producing Gram- negative bacteria from Lerminiaux et al. [9].
(4) Staphylococcus aureus JKD6159 sequenced with both R9 and R10 chemistry long- read sets from Wick et al. [46].
(5) Mycobacterium tuberculosis HR37v from Chitale et al. [42].
e full details for each individual isolate used can be found in Tables S1 and S2.
Chromosome accuracy
e assembly accuracy of the chromosomes recovered by each benchmarked tool was compared using Dnadi v1.3 packaged
with MUMmer v3.23 [47]. Comparisons were performed on the largest assembled contig (denoted as the chromosome) by
each method, other than for Vibrio parahaemolyticus ATCC 17802, where the two largest contigs were chosen as it has two
chromosomes.
Plasmid recovery performance and accuracy
Plasmid recovery performance for each tool was compared using the following methodology. Summary statistics are presented
in Table3. See Table S7 for a full sample- by- sample analysis. All samples were analysed using the four- step approach outlined
below using summary length and GC% statistics for all contigs and the output of Dnadi v1.3 comparisons generated for each
sample and tool combination against the reference genome plasmids:
(1) e number of circularized plasmid contigs recovered for each isolate was compared to the reference genome. If the tool
recovered a circularized contig homologous to that in the reference, it was denoted as completely recovered. Specically, a
contig was denoted as completely recovered if it had a genome length within 250 bp of the reference plasmid, a GC% within
0.1 % of the reference plasmid and whether the Total Query Bases covered was within 250 bp of the Total Reference Bases
from Dnadi. For Dragonye assemblies, some plasmids were duplicated or multiplicated due to known issues with the
7
Bouras etal., Microbial Genomics 2024;10:001244
long- read rst assembly approach for small plasmids [6, 19, 48]. Any circularized contigs that were multiplicated compared
to the reference plasmid were therefore denoted as misassembled.
(2) For additional circularized contigs not found in the reference recovered, these were tested for homology against the NCBI
nt database using the web version of blastn [48]. If there was a hit to a plasmid, the Plassembler output within Hybracter
was checked for whether the contig had a Mash hit (i.e. a Mash distance of 0.2 or lower) to plasmids in the PLSDB [49].
If there was a hit, the contig was denoted as an additional recovered plasmid. ere were two in total (see Table S7 and
supplementary data).
(3)
Plasmids with contigs that were either not circularized but homologous to a reference plasmid, or circularized but incomplete
(failing the genome length and Dnadi criteria in 1) were denoted as partially recovered or misassembled.
(4) Reference plasmids without any homologous contigs in the assembly were denoted as missed.
Additional non- circular contigs that had no homology with reference plasmids and were not identied as plasmids in step 2 were
analysed on a contig- by- contig basis and denoted as additional non- plasmid contigs (see Table S7 for contig- by- contig analysis
details).
Runtime performance comparison
To compare the performance of Hybracter, we compared wall- clock runtime consumption on a machine with an Intel Core
i9- 13900 CPU at 5.60 GHz on a machine running Ubuntu 20.04.6 LTS with a total of 32 available threads (24 total cores). We
ran all tools with eight and 16 threads and with 32 GB of memory to provide runtime metrics comparable to commonly available
consumer hardware. Hybracter hybrid and long were run with ‘hybracter hybrid- single’ and ‘hybracter long- single’ for each isolate
to generate a comparable per- sample runtime for comparison with the other tools. e summary results are available in Table4
and the detailed results for each specic tool and thread combination are found in Table S8.
Table 3. The total number of plasmids recovered by each tool; there were 59 total reference plasmids in the 30 samples
Tool Complete plasmids
recovered
Total plasmids partially
recovered or misassembled
Total plasmids missed Additional plasmids
recovered not in
reference
Samples with
additional non-
plasmid contigs
recovered
Hybracter hybrid 65 4 0 2 10
Unicycler 60 6 3 1 2
Dragonye hybrid 44 16 9 1 10
Hybracter long 60 5 4 2 3
Dragonye long 44 16 9 1 10
Table 4. Wall- clock runtime summary statistics for each tool
Tool Typ e 8 reads (h:min:s) 16 reads (h:min:s)
Hybracter hybrid Hybrid Median=00 : 15 : 03
Minimum=00 : 04 : 29
Maximum=00 : 54 : 41
Median=00 : 13 : 44
Minimum=00 : 03 : 27
Maximum=00 : 44 : 36
Dragonye hybrid Hybrid Median=00 : 04 : 34
Minimum=00 : 01 : 32
Maximum=00 : 07 : 27
Median=00 : 03 : 46
Minimum=00 : 01 : 22
Maximum=00 : 06 : 01
Unicycler Hybrid Median=00 : 50 : 25
Minimum=00 : 12 : 04
Maximum=01 : 13 : 32
Median=00 : 34 : 10
Minimum=00 : 08 : 36
Maximum=00 : 48 : 23
Hybracter long Long Median=00 : 11 : 46
Minimum=00 : 03 : 26
Maximum=00 : 36 : 09
Median=00 : 10 : 20
Minimum=00 : 03 : 17
Maximum=00 : 29 : 50
Dragonye long Long Median=00 : 04 : 10
Minimum=00 : 01 : 22
Maximum=00 : 06 : 01
Median=00 : 04 : 34
Minimum=00 : 01 : 32
Maximum=00 : 07 : 27
8
Bouras etal., Microbial Genomics 2024;10:001244
Depth analysis
To assess the eect of long- read depth on assembly accuracy, we chose Lerminiaux Isolate B (Enterobacter cloacae) and subsampled
the long- read depth at each interval of 5× from 10× to 100× estimated genome size. All ve tools were run on these read sets.
Where a complete chromosome was assembled, Dnadi (as described above) was used to compare the chromosome assembly
to the reference.
Sequencing
DNA extraction was performed with the DNeasy Blood and Tissue kit (Qiagen). Illumina library preparation was performed
using Illumina DNA prep (Illumina) according to the manufacturer’s instructions. Short- read whole genome sequencing was
performed n an Illumina MiSeq with a 250 bp paired- end kit. An Oxford Nanopore Technologies library preparation ligation
sequencing library was prepared using the ONT SQK- NBD114- 96 kit and the resultant library was sequenced using an R10.4.1
MinION ow cell (FLO- MIN114) on a MinION Mk1b device. Data were base- called with Super- Accuracy Basecalling (SUP)
using the basecaller model dna_r10.4.1_e8.2_sup@v3.5.1.
Pypolca benchmarking
Pypolca v0.2.0 was benchmarked against POLCA (in MaSuRCA v4.1.0) [32] using the 18 isolates described above. ese were
all 12 Lerminiaux et al. isolates, the R10 JKD6159 isolate [46] and the ve ATCC samples we sequenced as a part of this study.
Benchmarking was conducted on an Intel Core i7- 10700K CPU at 3.80 GHz on a machine running Ubuntu 20.04.6 LTS. All
short- read FASTQs used for benchmarking are identical to those used to benchmark Hybracter. e assemblies used for polishing
were intermediate chromosome assemblies from Flye v2.9.2 [50] generated within Hybracter. e outputs from Pypolca and
POLCA were compared using Dnadi v1.3 packaged with MUMmer v3.23 [47]. Overall, Pypolca and POLCA yielded extremely
similar results. In total, 16/18 assemblies were identical. ATCC 33560 had two SNPs between Pypolca and POLCA and Lerminiaux
Isolate I also had two SNPs.
RESULTS
Chromosome accuracy performance
All tools recovered complete circular contigs for each chromosome. Single nucleotide variants (SNVs), small InDels (<60 bp)
and large InDels (>60 bp) were compared as a measure of assembly accuracy. To account for dierences in genomic size between
isolates, SNVs and small InDel counts were normalized by genome length.
e summary results are presented in Table5 and visualized in Figs2 and S1- 3. e detailed results for each tool and sample are
presented in Table S5. Of the hybrid tools, Dragonye hybrid and Hybracter hybrid produced the fewest SNVs (both with median
Table 5. Small (<60 bp) InDels, SNVs and large (>60 bp) InDels of chromosome assemblies for all benchmarked Isolates
Tool Typ e Small InDels SNVs Small InDels+SNVs Large InDels
Hybracter hybrid Hybrid Median=0
Minimum=0
Maximum=41
Median=0
Minimum=0
Maximum=26
Median=1
Minimum=0
Maximum=67
Total=9
Median=0
Minimum=0
Maximum=2
Dragonye hybrid Hybrid Median=2.5
Minimum=0
Maximum=112
Median=0
Minimum=0
Maximum=64
Median=4.5
Minimum=0
Maximum=154
Total=70
Median=2
Minimum=0
Maximum=12
Unicycler Hybrid Median=11
Minimum=0
Maximum=125
Median=34
Minimum=0
Maximum=165
Median=57.5
Minimum=3
Maximum=290
Total=87
Median=1
Minimum=0
Maximum=16
Hybracter long Long Median=16
Minimum=1
Maximum=743
Median=21.5
Minimum=0
Maximum=156
Median=54
Minimum=1
Maximum=852
Total=11
Median=1
Minimum=0
Maximum=3
Dragonye long Long Median=125
Minimum=2
Maximum=4814
Median=34.5
Minimum=0
Maximum=2172
Median=170.5
Minimum=2
Maximum=6332
Total=68
Median=2
Minimum=0
Maximum=12
9
Bouras etal., Microbial Genomics 2024;10:001244
0) followed by Unicycler (median 34). Hybracter hybrid produced the fewest InDels (median 0), followed by Dragonye hybrid
(median 2.5) and Unicycler (median 11). Hybracter hybrid also produced the fewest InDels plus SNVs (median 1), followed by
Dragonye hybrid (median 4.5) and Unicycler (median 57.5).
Additionally, Hybracter hybrid showed superior performance in terms of large InDels, with a median of 0 and a total of 9 large
InDels across the 30 samples, compared to 2 and 70 for Dragonye hybrid, and 1 and 87 for Unicycler.
Overall, Hybracter hybrid produced the most accurate chromosome assemblies. For 12 isolates, Hybracter assembled a perfect
chromosome (Lerminiaux et al. [9] isolates A, B, C, D, G, H, I, J, L, Staphylococcus aureus JKD6159 with R10 chemistry and L.
monocytogenes ATCC BAA- 679 with simplex and duplex super- accuracy model basecalled reads).
Hybracter hybrid also produced several near- perfect assemblies (dened as<10 total SNVs plus InDels with no large insertions
or deletions), including on some lower quality fast model basecalled reads (Table S5).
Similar results were found in the long- read only tool comparison. Hybracter long produced the fewest SNVs (median 21.5)
compared to Dragonye long (median 34.5). Hybracter long consistently had far fewer small InDels (median 16) and large InDels
(total 11 across 30 samples) compared to Dragonye long (median 125 and total 68 respectively). No perfect chromosomes were
assembled by either long- only tool, though Hybracter long did assemble three near- perfect chromosomes (L. monocytogenes
ATCC BAA- 679 with simplex and duplex super- accuracy model basecalled reads and Salmonella enterica ATCC 10708 with
duplex super- accuracy model basecalled reads) and several chromosomes with fewer than 50 total small InDels plus SNVs and
Fig. 2. Comparison of the counts of single nucleotide variants (SNVs) and small (<60 bp) insertions and deletions (InDels) (a)and the total number of
large (>60 bp) InDels (b) for the hybrid tools benchmarked (Hybracter hybrid in dark blue, Dragonflye hybrid in orange and Unicycler in green). The
counts of SNVs and small InDels (c)and the total number of large InDels (d)for the long tools benchmarked (Hybracter long in light blue, Dragonflye
long in grey) are also shown. All data presented are from the benchmarking output run with eight threads.
10
Bouras etal., Microbial Genomics 2024;10:001244
0 large InDels (Lerminiaux isolates A, G, H, L, J, and Staphylococcus aureus JKD6159 with R10 chemistry, Salmonella enterica
ATCC 10708 with simplex super- accuracy model basecalled reads).
Overall, Hybracter long showed consistently worse performance than the hybrid tools Hybracter hybrid and Dragonye hybrid
tools (though not Unicycler) as measured by SNVs and small InDels. Combined with the lack of perfect assemblies even for
duplex super- accuracy model basecalled read assemblies, this suggests the continuing utility of short- read polishing for the
isolates surveyed.
Plasmid recovery performance and accuracy
Hybracter in both hybrid and long modes was superior at recovering plasmids compared to the other tools in the same class
(Table3). Hybracter hybrid was able to completely recover 65/69 possible plasmids (the other four were partially recovered),
compared to 60/69 for Unicycler and only 44/69 for Dragonye hybrid. Hybracter hybrid did not miss a single plasmid, while
Unicycler missed 3/69 (all in Klebsiella pneumoniae Isolate E from Lerminiaux et al.) and Dragonye hybrid completely missed
9/69. In terms of plasmid accuracy, Hybracter hybrid and Unicycler were similar in terms of SNVs plus small InDels, with
medians of 1.62 and 2.02 per 100 kb respectively (Table S9), while Hybracter hybrid produced fewer large InDels than Unicycler
(44 vs. 63 in total).
Interestingly, Hybracter long showed strong performance at recovering plasmids despite using only long- reads, completely recov-
ering 60/69 plasmids and completely missing only 4/69. is performance was far superior to Dragonye long (44/69 completely
recovered, 9/69 missed). In terms of accuracy, both long tools were similar and unsurprisingly less accurate than the hybrid tools
in terms of SNVs plus small InDels (medians of 8.74 per 100 kb for Hybracter long and 7.66 per 100 kb for Dragonye long).
All ve tools detected an additional 5411 bp plasmid in Lerminiaux Isolate G not found in the reference sequence and Hybracter
in both hybrid and long modes detected a further 2519 bp small plasmid from this genome.
Hybracter hybrid recovers more plasmids than either Unicycler or Dragonye because it uses a dedicated plasmid assembler,
Plassembler. In addition, Hybracter long using only long- reads had an identical complete plasmid recovery rate to Unicycler,
which uses both long- and short- reads (60/69 for both). ese results suggest that Hybracter long, by applying algorithms designed
for short- reads on long- reads, largely solves the existing diculties of recovering small plasmids from long- reads, at least on the
benchmarking dataset of predominantly R10 Nanopore reads [19, 51]. Even on the lower quality fast basecalled ATCC reads,
Hybracter long performed well, with only one sample failing to produce a plasmid assembly similar to higher quality datasets
(Salmonella enterica ATCC 10708 – see Tables S6 and S7).
Another notable result from Hybracter hybrid is that in 10/30 samples, it assembled additional non- plasmid contigs, which
occurred in only 2/30 isolates for Unicycler. is is a limitation of Hybracter hybrid, as the extra sensitivity to recover plasmids
comes with the cost of more false positive non- plasmid contigs that may be low- depth artefacts of sequencing. Hybracter has a
depth_lter’ parameter (defaulting to 0.25× of the chromosome depth) that lters out all non- circular putative plasmid contigs
below this value.
It should be noted, however, that these contigs are not always an assembly artefact and can provide additional information
regarding the quality control and similarity of short- and long- read sets. In Plassembler implemented within Hybracter hybrid,
the existence of such contigs is oen indicative of mismatches between long- and short- read sets [28], suggesting that there may
be some heterogeneity between long- and short- reads in those samples.
Runtime performance comparison
As shown in Table4 and Fig.3, median wall- clock times with eight threads for Dragonye hybrid (4 min 34 s) were smaller than
Hybracter hybrid (15 min 03 s), which were in turn smaller than Unicycler (50 min 25 s). For the long- only tools, Dragonye
long (4 min 10 s) was faster than Hybracter long (11 min 46 s). Hybracter long was consistently slightly faster than Hybracter
hybrid (Table4).
e dierence in runtime performance between Hybracter and Dragonye is predominantly the result of the included targeted
plasmid assembly and the reorientation and assessment steps in Hybracter that are not included in Dragonye. Additionally, the
results suggest limited benets to running Hybracter with more than eight threads. As explained in the following section, if a
user has multiple isolates to assemble, a superior approach is to modify the conguration le specifying more ecient resource
requirements for each job in Hybracter.
Parallelization allows for improved eciency
Hybracter allows users to specify and customize a conguration le to maximize resource usage and runtime eciency. Users
can modify the desired threads, memory and time requirements for each type of job that is run within Hybracter to suit their
11
Bouras etal., Microbial Genomics 2024;10:001244
computational resources. So that resources are not idle for most users on single sample assemblies, large jobs such as the Flye
and Plassembler assembly steps default to 16 threads and 32 GB of memory.
To emphasize the eciency benets of parallelization, the 12 Lerminiaux et al. isolates were also assembled using ‘hybracter
hybrid’ with a customized conguration le designed to improve eciency on the machine used for benchmarking. Specically,
the conguration was changed to specify eight threads and 16 GB of memory allocated to large jobs (assembly, polishing and
assessment) and four threads and 8 GB of memory allocated to medium jobs (reorientation). More details on changing Hybracter’s
conguration le to suit specic systems can be found in the documentation (https://hybracter.readthedocs.io/en/latest/congura-
tion/). We limited the overall ‘hybracter hybrid’ run with 32 GB of memory and 16 threads to provide a fair comparison. e
overall ‘hybracter hybrid’ run was then compared to the sum of the 12 ‘hybracter hybrid- single’ runs. Overall, the 12 isolates took
01 h 48 min 57 s in the combined run, as opposed to 04 h 38 min 45 s from the sum of the 12 ‘hybracter hybrid- single’ and 07 h
04 min 04 s from the sum of the 12 Unicycler runs. is inbuilt parallelization of Hybracter provides signicant eciency benets
if multiple samples are assembled simultaneously. e performance benet of Hybracter aorded by Snakemake integration in
parallel computing systems may be variable over dierent architectures, but this provides an example case of potential eciency
and convenience benets.
Long-read depth does not aect hybrid assembly accuracy if a complete chromosome is assembled
Finally, we tested the eect of long- read depth on the accuracy of assemblies with all ve tools at an estimated long- read depth
from 10× to 100× at every interval of 5× for an example isolate (Lerminiaux Isolate B, Enterobacter cloacae) with super- accuracy
model basecalled simplex reads (Fig.4 and Table S12). At 10× and 15× sequencing depth, only Unicycler was able to assemble a
complete chromosome. From 20× and above, all ve tools were able to assemble complete chromosomes. For the hybrid tools,
once a complete chromosome was assembled, increasing long- read depth had a negligible impact on accuracy results (Fig.4).
Notably, Hybracter hybrid was able to produce perfect assemblies from as low as 20× long- read depth. For long- read only tools,
increasing long- read depth did aect accuracy. Increasing depth improved SNV accuracy for both Hybracter long and Dragonye
long (Fig.4c). For small InDels, Hybracter long improved with extra depth, while Dragonye long actually performed worse
(Fig.4a). Depth had minimal impact on large InDels (Fig.4b).
Fig. 3. Comparison of wall- clock runtime (in seconds) of Hybracter hybrid, Dragonflye hybrid, Unicycler, Hybracter long and Dragonflye long when run
with eight and 16 threads.
12
Bouras etal., Microbial Genomics 2024;10:001244
DISCUSSION
As long- read sequencing has improved in accuracy with reduced costs, it is now routine to use a combination of long- and
short- reads to generate complete bacterial genomes [3, 5]. Recent advances in assembly algorithms and accuracy improvements
mean that a long- read rst hybrid assembly should be favoured with short- reads being used aer assembly for polishing [12],
as opposed to the short- read rst assembly approach (where long- reads are only used for scaolding a short- read assembly)
utilized by the current gold standard automated assembler Unicycler. e Unicycler approach is more prone to larger scale InDel
errors as well as smaller scale errors such as those caused by homopolymers or methylation motifs [6, 11, 52, 53]. Additionally,
it should be noted that it is already possible (while perhaps not routine) to generate perfect hybrid bacterial genome assemblies
using manual consensus approaches requiring human intervention, such as Trycycler [7, 54] . While manual approaches such as
Trycycler generally yield superior results to automated approaches, manually assembling many complete genomes is challenging
as considerable time, resources and bioinformatics expertise are required.
e results of this study emphasize that the long- read rst hybrid approach consistently yields superior assemblies than the
short- read rst hybrid approach and should therefore be preferred going forward. e only exception where a short- read rst
approach is to be preferred is where a limited depth of long- read sequencing data is available (<20× depth). In this instance,
long- read rst hybrid approaches may struggle to assemble a complete chromosome, while short- read rst approaches like
Unicycler may be able to (Fig.4).
Interestingly, in the course of conducting benchmarking for this study, we found a large number of discrepancies between older
short- read rst assembled ‘reference genomes’ for Staphylococcus aureus JKD6159 [55] and the ve ATCC genomes benchmarked
Fig. 4. Comparison of the counts of small (<60 bp) (a) and large (>60 bp) (b) insertions and deletions (InDels) and SNVs (c) for Hybracter hybrid,
Dragonflye hybrid, Unicycler, Hybracter long and Dragonflye long chromosome assemblies of Lerminiaux Isolate B (Enterobacter cloacae) at 5×
intervals of sequencing depth from 10× to 100×.
13
Bouras etal., Microbial Genomics 2024;10:001244
compared to updated Trycycler long- read rst references (see Table S13). e number of discrepancies ranged from 44 to 8255
across the six genomes. erefore, we recommend that older short- read rst reference genomes be updated if possible using a
long- read assembly approach (such as with Trycycler).
is study also shows that automated perfect hybrid genome assemblies are already possible with Hybracter. is study and
others [9, 54] also conrm that a long- read rst hybrid approach remains preferable to long- read only assembly with Nanopore
reads, as short- reads continue to provide accuracy improvements in polishing steps. However, it is foreseeable that short- reads
will soon provide little or no accuracy improvements and will not be needed to polish long- read only assemblies to perfection.
Already, perfect long- read only assemblies are possible, at least with manual intervention using Trycycler [7]. Accordingly,
automated perfect bacterial genome assemblies may soon become possible from long- reads only. Hybracter also allows users to
turn long- read polishing o altogether. It is already established that long- read polishing can introduce errors and make long- read
only assemblies worse with highly accurate Nanopore and PacBio reads [11, 31]. erefore, this feature may become increasingly
useful as long- read sequencing continues to improve in accuracy and we recommend its use for highly accurate Q20+ long- reads.
Hybracter was created to bridge the gap from the present to the future of automated perfect hybrid and long- read only bacterial
genome assemblies. e results of this study show that Hybracter in hybrid mode is both faster and more accurate than the current
gold standard tool for hybrid assembly, Unicycler, and is more accurate than Dragonye in both modes. It should be noted that
if users want fast chromosome only assemblies where accuracy is not essential (for applications such as species identication or
sequence typing), Dragonye remains a good option due to its speed.
Hybracter especially excels in recovering complete plasmid genomes compared to other tools. By incorporating Plassembler,
Hybracter recovers more complete plasmid genomes than Unicycler in hybrid mode. Further, Hybracter long is comparable to
Unicycler and Hybracter hybrid when using long- reads only for plasmid recovery.
e high error rates of long- read sequencing technologies have prevented the application of assembly approaches designed for
highly accurate short- reads, such as constructing de Bruijn graphs (DBGs) based on strings of a particular length k (k- mers)
[56–58]. is resulted in bioinformaticians initially utilizing less ecient algorithms designed with long- reads in mind, such
as utilizing overlap graphs in place of DBGs [27, 37, 39, 59, 60]. While DBGs have been used for long- read assembly in some
applications [61–63], adoption, especially in microbial genomics, has been limited.
Although long- read rst assembly methods enable complete chromosome and large plasmid reconstruction, it is well established
that long- read only assemblers struggle to assemble small (<20 kbp) plasmids accurately, oen leading to missing or multiplicated
assemblies [6, 51]. ese errors may be exacerbated if ligation chemistry- based sequencing kits are used [51]. erefore, hybrid
DBG based short- read rst assemblies are traditionally recommended for plasmid recovery [12].
Implemented in our post- publication changes to Plassembler described in this study, Hybracter solves the problem of small
plasmid recovery using long- reads. It achieves this by implementing a DBG- based assembly approach with Unicycler. e same
read set is used twice, rst as unpaired pseudo ‘short’-reads and then as long- reads; the long- read set scaolds a DBG- based
assembly based on the same read set. is study demonstrates that current long- read technologies, such as R10 Nanopore reads, are
now accurate enough that some short- read algorithms are applicable. Our results also suggest that similar DBG- based algorithmic
approaches could be used to enhance the recovery of small replicons in long- read datasets beyond the use case presented here
of plasmids in bacterial isolate assemblies. is could potentially enhance the recovery of replicons such as bacteriophages [64]
or other small contigs from metagenomes using only long- reads [10, 50].
Finally, consistent and resource- ecient assemblies that are as accurate as possible in recovering both plasmids and chromosomes
are crucial, particularly for larger studies investigating plasmid epidemiology and evolution. Antimicrobial resistance genes carried
on plasmids can have complicated patterns of transmission involving horizontal transfer between dierent bacterial species and
lineages, transfer between dierent plasmid backbones, and integration into and excision from the bacterial chromosome [65–67].
Accurate plasmid assemblies are crucial in genomic epidemiology studies investigating transmission of antimicrobial- resistant
bacteria within outbreak settings, as well as in a broader One Health context, where hundreds or even thousands of assemblies
may be analysed [68–71]. Hybracter will facilitate the expansion of such studies, allowing for faster and more accurate automated
complete genome assemblies than existing tools. Additionally, by utilizing Snakemake [20] with a Snaketool [21] command line
interface, Hybracter is easily and eciently parallelized to optimize available resources over various large- scale computing archi-
tectures. Individual jobs (such as each assembly, reorientation, polishing or assessment step) within Hybracter are automatically
sent to dierent resources on an HPC cluster using the HPC’s job scheduling system like Slurm [72]. Hybracter can natively use
any Snakemake- supported cloud- based deployments such as Kubernetes, Google Cloud Life Sciences, Tibanna and Azure Batch.
CONCLUSION
Hybracter is substantially faster than the current gold standard automated tool Unicycler, assembles chromosomes more accurately
than existing methods and is superior at recovering complete plasmid genomes. By applying DBG- based algorithms designed
14
Bouras etal., Microbial Genomics 2024;10:001244
for short- reads on current generation long- reads, Hybracter long also solves the problem of long- read only assemblers entirely
missing or duplicating small circular elements such as plasmids. Hybracter is resource ecient and natively supports deployment
on HPC clusters and cloud environments for massively parallel analyses. We believe Hybracter will prove to be an extremely
useful tool for the automated recovery of complete bacterial genomes from hybrid and long- read only sequencing data suitable
for massive datasets.
Funding information
G.H. was supported by The University of Adelaide International Scholarships and a THRF Postgraduate Top- up Scholarship. A.E.S. was supported by a
University of Adelaide Barbara Kidman Women’s Fellowship. R.A.E. was supported by an award from the NIH NIDDK RC2DK116713 and an award from
the Australian Research Council DP220102915. S.V. was supported by a Passe and Williams Foundation senior fellowship.
Acknowledgements
This work was supported with supercomputing resources provided by the Phoenix HPC service at the University of Adelaide. We would particularly like
to thank Fabien Voisin for his integral role in maintaining and running Phoenix. We would also like to thank Brad Hart for useful comments in testing
Hybracter and Simone Pignotti, Yu Wan and Oliver Schwengers for providing helpful comments and GitHub pull requests.
Conflicts of interest
The authors declare that there are no conflicts of interest.
References
1. Land M, Hauser L, Jun S- R, Nookaew I, Leuze MR, et al. Insights
from 20 years of bacterial genome sequencing. Funct Integr
Genomics 2015;15:141–161.
2. Goldstein S, Beka L, Graf J, Klassen JL. Evaluation of strategies
for the assembly of diverse bacterial genomes using MinION long-
read sequencing. BMC Genomics 2019;20:23.
3. De Maio N, Shaw LP, Hubbard A, George S, Sanderson ND,
et al. Comparison of long- read sequencing technologies in the
hybrid assembly of complex bacterial genomes. Microb Genom
2019;5:e000294.
4. Wick RR, Judd LM, Gorrie CL, Holt KEY. Completing bacterial
genome assemblies with multiplex MinION sequencing. Microb
Genom 2017;3:e000132.
5. Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: resolving bacte-
rial genome assemblies from short and long sequencing reads.
PLoS Comput Biol 2017;13:e1005595.
6. Wick RR, Holt KE. Benchmarking of long- read assemblers for
prokaryote whole genome sequencing. F1000Res 2019;8:2138.
7. Wick RR, Judd LM, Cerdeira LT, Hawkey J, Méric G, etal. Trycycler:
consensus long- read assemblies for bacterial genomes. Genome
Biol 2021;22:266.
8. Wick R. ONT- only accuracy with R10.4.1. Ryan Wick’s bioinfor-
matics blog; 2023. https://rrwick.github.io/2023/05/05/ont-
only-accuracy-with-r10.4.1.html https://doi.org/10.5281/zenodo.
7898220
9. Lerminiaux N, Fakharuddin K, Mulvey MR, Mataseje L. Do we still
need Illumina sequencing data? Evaluating Oxford Nanopore Tech-
nologies R10.4.1 flow cells and the Rapid v14 library prep kit for
Gram negative bacteria whole genome assemblies. Can J Microbiol
2024.
10. Sereika M, Kirkegaard RH, Karst SM, Michaelsen TY, Sørensen EA,
et al. Oxford Nanopore R10.4 long- read sequencing enables the
generation of near- finished bacterial genomes from pure cultures
and metagenomes without short- read or reference polishing. Nat
Methods 2022;19:823–826.
11. Wick R. ONT- only accuracy: 5 kHz and Dorado. Ryan Wick’s
bioinformatics blog; 2023. https://rrwick.github.io/2023/10/24/
ont-only-accuracy-update.html https://doi.org/10.5281/zenodo.
10038672
12. Wick RR, Judd LM, Holt KE. Assembling the perfect bacterial
genome using Oxford Nanopore and Illumina sequencing. PLoS
Comput Biol 2023;19:e1010905.
13. Murigneux V, Roberts LW, Forde BM, Phan M- D, Nhu NTK,
et al. MicroPIPE: validating an end- to- end workflow for high-
quality complete bacterial genome construction. BMC Genomics
2021;22:474.
14. Schwengers O, Hoek A, Fritzenwanker M, Falgenhauer L, Hain T,
etal. ASA3P: an automatic and scalable pipeline for the assembly,
annotation and higher- level analysis of closely related bacterial
isolates. PLoS Comput Biol 2020;16:e1007134.
15. Petit RA, Read TD. Bactopia: a flexible pipeline for complete anal-
ysis of bacterial genomes. mSystems 2020;5:e00190- 20.
16. Petit III RA. Dragonflye: Assemble Bacterial Isolate Genomes from
Nanopore Reads.
17. Hunt M, Silva ND, Otto TD, Parkhill J, Keane JA, etal. Circlator:
automated circularization of genome assemblies using long
sequencing reads. Genome Biol 2015;16:294.
18. Wick RR, Holt KE. Polypolish: short- read polishing of long-
read bacterial genome assemblies. PLoS Comput Biol
2022;18:e1009802.
19. Johnson J, Soehnlen M, Blankenship HM. Long read genome
assemblers struggle with small plasmids. Microb Genom
2023;9:001024.
20. Mölder F, Jablonski KP, Letcher B, Hall MB, Tomkins- Tinch CH, etal.
Sustainable data analysis with Snakemake. F1000Res 2021;10:33.
21. Roach MJ, Pierce- Ward NT, Suchecki R, Mallawaarachchi V,
Papudeshi B, et al. Ten simple rules and a template for creating
workflows- as- applications. PLoS Comput Biol 2022;18:e1010705.
22. Wick RR. Filtlong; 2018
23. Bonenfant Q, Noé L, Touzet H. Porechop_ABI: discovering unknown
adapters in Oxford Nanopore Technology sequencing reads for
downstream trimming. Bioinform Adv 2023;3:vbac085.
24. Roach MJ. Trimnami: Trim Lots of Metagenomics Samples All at
Once. 2023.
25. Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra- fast all- in- one FASTQ
preprocessor. Bioinformatics 2018;34:i884–i890.
26. Shen W, Le S, Li Y, Hu F. SeqKit: a cross- platform and
ultrafast toolkit for FASTA/Q file manipulation. PLoS One
2016;11:e0163962.
27. Kolmogorov M, Yuan J, Lin Y, Pevzner PA. Assembly of long, error-
prone reads using repeat graphs. Nat Biotechnol 2019;37:540–546.
28. Bouras G, Sheppard AE, Mallawaarachchi V, Vreugde S. Plassem-
bler: an automated bacterial plasmid assembly tool. Bioinformatics
2023;39:btad409.
29. medaka: Sequence correction provided by ONT Research; (n.d.)
30. Bouras G, Grigson SR, Papudeshi B, Mallawaarachchi V, Roach MJ.
Dnaapler: a tool to reorient circular microbial genomes. JOSS
2024;9:5968.
31. Bouras G, Judd LM, Edwards RA, Vreugde S, Stinear TP, etal. How
low can you go? Short- read polishing of Oxford Nanopore bacterial
genome assemblies. Bioinformatics 2024.
15
Bouras etal., Microbial Genomics 2024;10:001244
32. Zimin AV, Salzberg SL. The genome polishing tool POLCA makes
fast and accurate corrections in genome assemblies. PLoS Comput
Biol 2020;16:e1007981.
33. Clark SC, Egan R, Frazier PI, Wang Z. ALE: a generic assembly like-
lihood evaluation framework for assessing the accuracy of genome
and metagenome assemblies. Bioinformatics 2013;29:435–443.
34. Larralde M. Pyrodigal: python bindings and interface to prodigal,
an ecient method for gene prediction in prokaryotes. JOSS
2022;7:4296.
35. Hyatt D, Chen G- L, Locascio PF, Land ML, Larimer FW, etal. Prod-
igal: prokaryotic gene recognition and translation initiation site
identification. BMC Bioinformatics 2010;11:119.
36. Vaser R, Šikić M. Time- and memory- ecient genome assembly
with Raven. Nat Comput Sci 2021;1:332–336.
37. Ruan J, Li H. Fast and accurate long- read assembly with wtdbg2.
Nat Methods 2020;17:155–158.
38. Li H. Minimap and miniasm: fast mapping and de novo assembly
for noisy long sequences. Bioinformatics 2016;32:2103–2110.
39. Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, etal. Canu:
scalable and accurate long- read assembly via adaptive k- mer
weighting and repeat separation. Genome Res 2017;27:722–736.
40. Zhang X, Liu C- G, Yang S- H, Wang X, Bai F- W, etal. Benchmarking
of long- read sequencing, assemblers and polishers for yeast
genome. Brief Bioinformatics 2022;23:bbac146.
41. Vaser R, Sović I, Nagarajan N, Šikić M. Fast and accurate de novo
genome assembly from long uncorrected reads. Genome Res
2017;27:737–746.
42. Chitale P, Lemenze AD, Fogarty EC, Shah A, Grady C, et al. A
comprehensive update to the Mycobacterium tuberculosis H37Rv
reference genome. Nat Commun 2022;13:7068.
43. Hall MB. Rasusa: randomly subsample sequencing reads to a
specified coverage. JOSS 2022;7:3941.
44. Steinig E, Coin L. Nanoq: ultra- fast quality control for nanopore
reads. JOSS 2022;7:2991.
45. Hall MB, Wick RR, Judd LM, Nguyen ANT, Steinig EJ, et al. Bench-
marking reveals superiority of deep learning variant callers on
bacterial nanopore sequence data. Bioinformatics 2024.
46. Wick RR, Judd LM, Monk IR, Seemann T, Stinear TP. Improved
genome sequence of Australian methicillin- resistant Staphy-
lococcus aureus strain JKD6159. Microbiol Resour Announc
2023;12:e0112922.
47. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, etal. Versa-
tile and open software for comparing large genomes. Genome Biol
2004;5:R12.
48. Sayers EW, Bolton EE, Brister JR, Canese K, Chan J, etal. Database
resources of the national center for biotechnology information.
Nucleic Acids Res 2022;50:D20–D26.
49. Galata V, Fehlmann T, Backes C, Keller A. PLSDB: a resource of
complete bacterial plasmids. Nucleic Acids Res 2019;47:D195–D202.
50. Kolmogorov M, Bickhart DM, Behsaz B, Gurevich A, Rayko M, etal.
metaFlye: scalable long- read metagenome assembly using repeat
graphs. Nat Methods 2020;17:1103–1110.
51. Wick RR, Judd LM, Wyres KL, Holt KEY. Recovery of small plasmid
sequences via Oxford Nanopore sequencing. Microb Genom
2021;7:000631.
52. Marinus MG, Løbner- Olesen A. DNA Methylation. EcoSal Plus
2014;6:10.
53. Wick RR, Judd LM, Holt KE. Performance of neural network
basecalling tools for Oxford Nanopore sequencing. Genome Biol
2019;20:129.
54. Sanderson ND, Kapel N, Rodger G, Webster H, Lipworth S, et al.
Comparison of R9.4.1/Kit10 and R10/Kit12 Oxford Nanopore
flowcells and chemistries in bacterial genome reconstruction.
Microb Genom 2023;9:000910.
55. Chua K, Seemann T, Harrison PF, Davies JK, Coutts SJ, etal. Complete
genome sequence of Staphylococcus aureus strain JKD6159,
a unique Australian clone of ST93- IV community methicillin-
resistant Staphylococcus aureus. J Bacteriol 2010;192:5556–5557.
56. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, et al.
SPAdes: a new genome assembly algorithm and its applications to
single- cell sequencing. J Comput Biol 2012;19:455–477.
57. Li D, Liu C- M, Luo R, Sadakane K, Lam T- W. MEGAHIT: an ultra-
fast single- node solution for large and complex metagen-
omics assembly via succinct de Bruijn graph. Bioinformatics
2015;31:1674–1676.
58. Compeau PEC, Pevzner PA, Tesler G. How to apply de Bruijn graphs
to genome assembly. Nat Biotechnol 2011;29:987–991.
59. Wong J, Coombe L, Nikolić V, Zhang E, Nip KM, et al. Linear time
complexity de novo long read genome assembly with GoldRush.
Nat Commun 2023;14:2906.
60. Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, etal. Oppor-
tunities and challenges in long- read sequencing data analysis.
Genome Biol 2020;21:30.
61. Ekim B, Berger B, Chikhi R. Minimizer- space de Bruijn graphs:
whole- genome assembly of long reads in minutes on a personal
computer. Cell Syst 2021;12:958–968.
62. Bankevich A, Bzikadze AV, Kolmogorov M, Antipov D, Pevzner PA.
Multiplex de Bruijn graphs enable genome assembly from long,
high- fidelity reads. Nat Biotechnol 2022;40:1075–1081.
63. Lin Y, Yuan J, Kolmogorov M, Shen MW, Chaisson M, etal. Assembly
of long error- prone reads using de Bruijn graphs. Proc Natl Acad
Sci U S A 2016;113:E8396–E8405.
64. Mallawaarachchi V, Roach MJ, Decewicz P, Papudeshi B, Giles SK,
etal. Phables: from fragmented assemblies to high- quality bacte-
riophage genomes. Bioinformatics 2023;39:btad586.
65. Mathers AJ, Stoesser N, Chai W, Carroll J, Barry K, et al. Chro-
mosomal integration of the Klebsiella pneumoniae carbapenemase
gene, blaKPC, in Klebsiella species is elusive but not rare. Antimicrob
Agents Chemother 2017;61:e01823- 16.
66. Houtak G, Bouras G, Nepal R, Shaghayegh G, Cooksley C, etal. The
intra- host evolutionary landscape and pathoadaptation of persis-
tent Staphylococcus aureus in chronic rhinosinusitis. Microb Genom
2023;9:001128.
67. Sheppard AE, Stoesser N, Wilson DJ, Sebra R, Kasarskis A, et al.
Nested Russian doll- like genetic mobility drives rapid dissemina-
tion of the carbapenem resistance gene blaKPC. Antimicrob Agents
Chemother 2016;60:3767–3778.
68. Hawkey J, Wyres KL, Judd LM, Harshegyi T, Blakeway L, et al.
ESBL plasmids in Klebsiella pneumoniae: diversity, transmission
and contribution to infection burden in the hospital setting. Genome
Med 2022;14:97.
69. Matlock W, Lipworth S, Chau KK, AbuOun M, Barker L, etal. Entero-
bacterales plasmid sharing amongst human bloodstream infec-
tions, livestock, wastewater, and waterway niches in Oxfordshire,
UK. Elife 2023;12:e85302.
70. Roberts LW, Enoch DA, Khokhar F, Blackwell GA, Wilson H, etal.
Long- read sequencing reveals genomic diversity and associated
plasmid movement of carbapenemase- producing bacteria in a UK
hospital over 6 years. Microb Genom 2023;9:001048.
71. Lerminiaux N. Plasmid genomic epidemiology of blaKPC
carbapenemase- producing Enterobacterales in Canada, 2010–
2021. Antimicrob Agents Chemother2023:e00860- 23.
72. Yoo AB, Jette MA, Grondona M. (eds). SLURM: Simple Linux Utility
for Resource Management. in Job Scheduling Strategies for Parallel
Processing. Berlin, Heidelberg: Springer, 2003. pp. 44–60.
Preprint
Full-text available
While conducting genomic surveillance of carbapenemase-producing Enterobacteraecae (CPEs) from patient colonisation and clinical infections at Birmingham's Queen Elizabeth Hospital (QE), we identified an N-type plasmid lineage, pQEB1, carrying several antibiotic resistance genes including the carbapenemase gene blaKPC-2. The pQEB1 lineage is concerning due to its conferral of multi-drug resistance, its host range and apparent transmissibility, and its potential for acquiring further resistance genes. Representatives of pQEB1 were found in three sequence types (STs) of Citrobacter freundii, two STs of Enterobacter cloacae and three species of Klebsiella. Hosts of pQEB1 were isolated from 11 different patients who stayed in various wards throughout the hospital complex over a 13-month period from January 2023 to February 2024. At present, the only representatives of the pQEB1 lineage in GenBank were carried by an Enterobacter hormaechei isolated from a blood sample at the QE in 2016 and a Klebsiella pneumoniae isolated from a urine sample at University Hospitals Coventry Warwickshire (UHCW) in May 2023. The UHCW patient had been treated at the QE. Long-read whole-genome sequencing was performed on Oxford Nanopore R10.4.1 flow cells, facilitating comparison of complete plasmid sequences. We identified structural variants of pQEB1 and defined the molecular events responsible for them. These have included IS26-mediated inversions and acquisitions of multiple insertion sequences and transposons, including carriers of mercury and arsenic resistance genes. We found that a particular inversion variant of pQEB1 was strongly associated with the QE Liver specialty after appearing in November 2023, but was found in different specialties and wards in January/February 2024. That variant has so far been seen in five different bacterial hosts from six patients, consistent with recent and ongoing inter-host and inter-patient transmission of pQEB1 in this hospital setting.
Article
We report four Chitinophaga sp. strains isolated from wastewater collected onboard the International Space Station. Here, we present three finished and one draft genome. Taxonomic ranks established by genome-based analysis indicate that these Chitinophaga sp. strains represent candidates for a new species.
Preprint
Full-text available
Variant calling is fundamental in bacterial genomics, underpinning the identification of disease transmission clusters, the construction of phylogenetic trees, and antimicrobial resistance prediction. This study presents a comprehensive benchmarking of SNP and indel variant calling accuracy across 14 diverse bacterial species using Oxford Nanopore Technologies (ONT) and Illumina sequencing. We generate gold standard reference genomes and project variations from closely-related strains onto them, creating biologically realistic distributions of SNPs and indels. Our results demonstrate that ONT variant calls from deep learning-based tools delivered higher SNP and indel accuracy than traditional methods and Illumina, with Clair3 providing the most accurate results overall. We investigate the causes of missed and false calls, highlighting the limitations inherent in short reads and discover that ONT’s traditional limitations with homopolymer-induced indel errors are absent with high-accuracy basecalling models and deep learning-based variant calls. Furthermore, our findings on the impact of read depth on variant calling offer valuable insights for sequencing projects with limited resources, showing that 10x depth is sufficient to achieve variant calls that match or exceed Illumina. In conclusion, our research highlights the superior accuracy of deep learning tools in SNP and indel detection with ONT sequencing, challenging the primacy of short-read sequencing. The reduction of systematic errors and the ability to attain high accuracy at lower read depths enhance the viability of ONT for widespread use in clinical and public health bacterial genomics.
Preprint
Full-text available
It is now possible to assemble near-perfect bacterial genomes using Oxford Nanopore Technologies (ONT) long reads, but short-read polishing is still required for perfection. However, the effect of short-read depth on polishing performance is not well understood. Here, we introduce Pypolca (with default and careful parameters) and Polypolish v0.6.0 (with a new careful parameter). We then show that: (1) all polishers other than Pypolca-careful, Polypolish-default and Polypolish-careful commonly introduce false-positive errors at low depth; (2) most of the benefit of short-read polishing occurs by 25× depth; (3) Polypolish-careful never introduces false-positive errors at any depth; and (4) Pypolca-careful is the single most effective polisher. Overall, we recommend the following polishing strategies: Polypolish-careful alone when depth is very low (<5×), Polypolish-careful and Pypolca-careful when depth is low (5–25×), and Polypolish-default and Pypolca-careful when depth is sufficient (>25×). Data Summary Pypolca is open-source and freely available on Bioconda, PyPI, and GitHub ( github.com/gbouras13/pypolca ). Polypolish is open-source and freely available on Bioconda and GitHub ( github.com/rrwick/Polypolish ). All code and data required to reproduce analyses and figures are available at github.com/gbouras13/depth_vs_polishing_analysis . All FASTQ sequencing reads are available at BioProject PRJNA1042815 . A detailed list of accessions can be found in Table S1.
Article
Full-text available
The best whole genome assemblies are currently built from a combination of highly accurate short-read sequencing data and long-read sequencing data that can bridge repetitive and problematic regions. Oxford Nanopore Technologies (ONT) produce long-read sequencing platforms and they are continually improving their technology to obtain higher quality read data that is approaching the quality obtained from short-read platforms such as Illumina. As these innovations continue, we evaluated how much ONT read coverage produced by the Rapid Barcoding Kit v14 (SQK-RBK114) is necessary to generate high-quality hybrid and long-read-only genome assemblies for a panel of carbapenemase-producing Enterobacterales bacterial isolates. We found that 30× long-read coverage is sufficient if Illumina data are available, and that more (at least 100× long-read coverage is recommended for long-read-only assemblies. Illumina polishing is still improving single nucleotide variants (SNVs) and INDELs in long-read-only assemblies. We also examined if antimicrobial resistance genes could be accurately identified in long-read-only data, and found that Flye assemblies regardless of ONT coverage detected >96% of resistance genes at 100% identity and length. Overall, the Rapid Barcoding Kit v14 and long-read-only assemblies can be an optimal sequencing strategy (i.e., plasmid characterization and AMR detection) but finer-scale analyses (i.e., SNV) still benefit from short-read data.
Article
Full-text available
Chronic rhinosinusitis (CRS) is a common chronic sinonasal mucosal inflammation associated with Staphylococcus aureus biofilm and relapsing infections. This study aimed to determine rates of S. aureus persistence and pathoadaptation in CRS patients by investigating the genomic relatedness and antibiotic resistance/tolerance in longitudinally collected S. aureus clinical isolates. A total of 68 S . aureus paired isolates (34 pairs) were sourced from 34 CRS patients at least 6 months apart. Isolates were grown into 48 h biofilms and tested for tolerance to antibiotics. A hybrid sequencing strategy was used to obtain high-quality reference-grade assemblies of all isolates. Single nucleotide variants (SNV) divergence in the core genome and sequence type clustering were used to analyse the relatedness of the isolate pairs. Single nucleotide and structural genome variations, plasmid similarity, and plasmid copy numbers between pairs were examined. Our analysis revealed that 41 % (14/34 pairs) of S. aureus isolates were persistent, while 59 % (20/34 pairs) were non-persistent. Persistent isolates showed episode-specific mutational changes over time with a bias towards events in genes involved in adhesion to the host and mobile genetic elements such as plasmids, prophages, and insertion sequences. Furthermore, a significant increase in the copy number of conserved plasmids of persistent strains was observed. This was accompanied by a significant increase in biofilm tolerance against all tested antibiotics, which was linked to a significant increase in biofilm biomass over time, indicating a potential biofilm pathoadaptive process in persistent isolates. In conclusion, our study provides important insights into the mutational changes during S. aureus persistence in CRS patients highlighting potential pathoadaptive mechanisms in S. aureus persistent isolates culminating in increased biofilm biomass.
Article
Full-text available
Carbapenems are considered last-resort antibiotics for the treatment of infections caused by multidrug-resistant Enterobacterales , but carbapenem resistance due to acquisition of carbapenemase genes is a growing threat that has been reported worldwide. Klebsiella pneumoniae carbapenemase ( bla KPC ) is the most common type of carbapenemase in Canada and elsewhere; it can hydrolyze penicillins, cephalosporins, aztreonam, and carbapenems and is frequently found on mobile plasmids in the Tn 4401 transposon. This means that alongside clonal expansion, bla KPC can disseminate through plasmid- and transposon-mediated horizontal gene transfer. We applied whole genome sequencing to characterize the molecular epidemiology of 829 bla KPC carbapenemase-producing isolates collected by the Canadian Nosocomial Infection Surveillance Program from 2010 to 2021. Using a combination of short-read and long-read sequencing, we obtained 202 complete and circular bla KPC -encoding plasmids. Using MOB-suite, 10 major plasmid clusters were identified from this data set which represented 87% (175/202) of the Canadian bla KPC -encoding plasmids. We further estimated the genomic location of incomplete bla KPC -encoding contigs and predicted a plasmid cluster for 95% (603/635) of these. We identified different patterns of carbapenemase mobilization across Canada related to different plasmid clusters, including clonal transmission of IncF-type plasmids (108/829, 13%) in K. pneumoniae clonal complex 258 and novel repE(pEh60-7) plasmids (44/829, 5%) in Enterobacter hormaechei ST316, and horizontal transmission of IncL/M (142/829, 17%) and IncN-type plasmids (149/829, 18%) across multiple genera. Our findings highlight the diversity of bla KPC genomic loci and indicate that multiple, distinct plasmid clusters have contributed to bla KPC spread and persistence in Canada.
Article
Full-text available
Motivation Microbial communities have a profound impact on both human health and various environments. Viruses infecting bacteria, known as bacteriophages or phages, play a key role in modulating bacterial communities within environments. High-quality phage genome sequences are essential for advancing our understanding of phage biology, enabling comparative genomics studies and developing phage-based diagnostic tools. Most available viral identification tools consider individual sequences to determine whether they are of viral origin. As a result of challenges in viral assembly, fragmentation of genomes can occur, and existing tools may recover incomplete genome fragments. Therefore, the identification and characterisation of novel phage genomes remains a challenge, leading to the need of improved approaches for phage genome recovery. Results We introduce Phables, a new computational method to resolve phage genomes from fragmented viral metagenome assemblies. Phables identifies phage-like components in the assembly graph, models each component as a flow network, and uses graph algorithms and flow decomposition techniques to identify genomic paths. Experimental results of viral metagenomic samples obtained from different environments show that Phables recovers on average over 49% more high-quality phage genomes compared to existing viral identification tools. Furthermore, Phables can resolve variant phage genomes with over 99% average nucleotide identity, a distinction that existing tools are unable to make. Availability and Implementation Phables is available on GitHub at https://github.com/Vini2/phables.
Article
Full-text available
Healthcare-associated infections (HCAIs) affect the most vulnerable people in society and are increasingly difficult to treat in the face of mounting antimicrobial resistance (AMR). Routine surveillance represents an effective way of understanding the circulation and burden of bacterial resistance and transmission in hospital settings. Here, we used whole-genome sequencing (WGS) to retrospectively analyse carbapenemase-producing Gram-negative bacteria from a single hospital in the UK over 6 years (n=165). We found that the vast majority of isolates were either hospital-onset (HAI) or HCAI. Most carbapenemase-producing organisms were carriage isolates, with 71 % isolated from screening (rectal) swabs. Using WGS, we identified 15 species, the most common being Escherichia coli and Klebsiella pneumoniae. Only one significant clonal outbreak occurred during the study period and involved a sequence type (ST)78 K. pneumoniae carrying bla NDM-1 on an IncFIB/IncHI1B plasmid. Contextualization with public data revealed little evidence of this ST outside of the study hospital, warranting ongoing surveillance. Carbapenemase genes were found on plasmids in 86 % of isolates, the most common types being bla NDM- and bla OXA-type alleles. Using long-read sequencing, we determined that approximately 30 % of isolates with carbapenemase genes on plasmids had acquired them via horizontal transmission. Overall, a national framework to collate more contextual genomic data, particularly for plasmids and resistant bacteria in the community, is needed to better understand how carbapenemase genes are transmitted in the UK.
Article
Full-text available
With recent advances in sequencing technologies, it is now possible to obtain near-perfect complete bacterial chromosome assemblies cheaply and efficiently by combining a long-read-first assembly approach with short-read polishing. However, existing methods for assembling bacterial plasmids from long-read-first assemblies often misassemble or even miss bacterial plasmids entirely and accordingly require manual curation. Plassembler was developed to provide a tool that automatically assembles and outputs bacterial plasmids using a hybrid assembly approach. It achieves increased accuracy and computational efficiency compared to the existing gold standard tool Unicycler by removing chromosomal reads from the input read sets using a mapping approach. Availability: Plassembler is implemented in Python and is installable as a bioconda package using 'conda install -c bioconda plassembler'. The source code is available on GitHub at https://github.com/gbouras13/plassembler. The full benchmarking pipeline can be found at https://github.com/gbouras13/plassembler_simulation_benchmarking, while the benchmarking input FASTQ and output files can be found at https://doi.org/10.5281/zenodo.7996690. Supplementary information: Supplementary data are available at Bioinformatics online.