SomaticSeq: An Ensemble and Machine Learning Method to Detect Somatic Mutations

Authors:
  • Bina Technologies, part of Roche Sequencing

Chapter 4
SomaticSeq: An Ensemble and Machine Learning Method
to Detect Somatic Mutations
Li Tai Fang
Abstract
A standard strategy to discover somatic mutations in a cancer genome is to use next-generation sequencing
(NGS) technologies to sequence the tumor tissue and its matched normal (commonly blood or adjacent
normal tissue) for side-by-side comparison. However, when interrogating entire genomes (or even just the
coding regions), the number of sequencing errors easily outnumbers the number of real somatic mutations
by orders of magnitude. Here, we describe SomaticSeq, which incorporates multiple somatic mutation
detection algorithms and then uses machine learning to vastly improve the accuracy of the somatic mutation
call sets.
Key words Somatic mutations, Sequencing, Bioinformatics, Machine learning, Ensemble method
1 Introduction
To discover somatic mutations in a cancer genome, the tumor and
its matched normal tissues (commonly blood or adjacent normal
tissue) are typically sequenced side-by-side, with the normal acting
as a control to filter out germline variants. However, the whole
human genome consists of over 3 billion base pairs, and the coding
regions alone make up over 30 million base pairs. There is a
plethora of modern algorithms developed by different research
groups to detect somatic mutations in such data sets [1–11]. However,
they usually produce more false positive calls than actual
somatic mutations.
SomaticSeq achieves higher accuracy through the following steps [12]:
1. Combine the call sets from the somatic mutation callers that
have been incorporated into the SomaticSeq workflow.
2. For each somatic variant call (see Note 1 for SomaticSeq’s
definition of a unique variant call), extract genomic and
sequencing features from the tumor and normal BAM files.
3. Deploy an adaptive boosting [13] machine learning classifier to
separate the false positives from the true mutations. The classifiers
can be trained from semisimulated data sets, which we will
describe in Subheading 3.4.
Sebastian Boegel (ed.), Bioinformatics for Cancer Immunotherapy: Methods and Protocols, Methods in Molecular Biology,
vol. 2120, https://doi.org/10.1007/978-1-0716-0327-7_4, © Springer Science+Business Media, LLC, part of Springer Nature 2020
2 Materials
SomaticSeq is freely available under the BSD 2-Clause open source
license. The source code is located at
https://github.com/bioinform/somaticseq. SomaticSeq Docker images can be found at
https://hub.docker.com/r/lethalfang/somaticseq.
2.1 Software
To run SomaticSeq, the following tools and packages must be
installed in a Linux or Unix environment:
1. SomaticSeq was developed in Python 3 in a Linux environment.
We recommend using Python 3.5 or newer. In addition, the
Python libraries NumPy (v1.13 or newer), SciPy (v1.0 or
newer), and pysam (v0.13 or newer) are required. Currently,
SomaticSeq’s latest stable version is v3.3.0, and the
protocols in this book represent the version 3 branch of
SomaticSeq. There have been substantial improvements in features
and stability since SomaticSeq was first published in 2015.
2. SomaticSeq also uses BEDTools [14] (v2 or newer) to manipulate
BED file inputs, that is, regions to include and/or exclude in
the workflow.
3. R (v3.2 or newer) and its ada library (v2.0.5 or newer) provide
the machine learning algorithm.
4. At its core, SomaticSeq combines and then filters the results of
multiple somatic mutation detection algorithms based on many
sequencing features. Generally speaking, at least one compati-
ble caller needs to be run to generate a list of mutation candi-
dates for SomaticSeq to evaluate. It is compatible with the
following callers: the original MuTect/Indelocator as well as
GATK4’s Mutect2 [1], VarScan2 [2], JointSNVMix2 [3],
SomaticSniper [4], VarDict [5], MuSE [6], LoFreq [7], Scalpel
[8], Strelka2 [9], TNscope [10], and Platypus [11].
5. Docker [http://www.docker.com] is a container technology
that can be used to package software and its dependencies
into portable Docker images, which can be used to execute a
workflow reproducibly across different platforms and environments.
SomaticSeq does not require Docker per se, but we have
created Docker images of it, along with a number of compatible
somatic mutation callers, to make life easier for new users.
The advantage of using container technologies like Docker is
that one does not necessarily have to create the right software
environment with the correct dependencies for every piece of software
in a workflow. For example, once a Docker image has been created for
MuTect2, users can simply use that Docker image for
MuTect2 tasks. Otherwise, they must make sure to have the
correct Java version and other dependencies to run MuTect2,
and that those dependencies do not conflict with other software
that may need different Java versions.
2.2 Download and Install SomaticSeq
The source code of SomaticSeq is available via its GitHub repository:
https://github.com/bioinform/somaticseq. The latest source
code can be cloned with the following git command:
git clone https://github.com/bioinform/somaticseq.git
Alternatively, a fixed version may be downloaded and
unpacked, for example,
wget https://github.com/bioinform/somaticseq/archive/v3.3.0.tar.gz
tar -xvf v3.3.0.tar.gz
Installation is not required to run SomaticSeq. You may specify
the full path of all the SomaticSeq scripts. Nevertheless, the sim-
plest way to place SomaticSeq executables in your $PATH is to
install it:
cd somaticseq
./setup.py install
To have a fully functional SomaticSeq, you must also install
BEDTools and R (with the ada library) as specified in Subheading 2.1
and place them in your execution $PATH.
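Before running the workflow, it can help to confirm that the requirements above are visible from the current environment. The following is a minimal sketch, not part of SomaticSeq: the function name missing_requirements is hypothetical, and the executable names bedtools and Rscript are assumptions about how BEDTools and R are typically installed. It does not check whether the ada library is present in R.

```python
import shutil
import sys


def missing_requirements(executables=("bedtools", "Rscript"),
                         min_python=(3, 5),
                         which=shutil.which,
                         python_version=sys.version_info):
    """Return a list of requirements that could not be found.

    executables: program names expected on $PATH (assumed names).
    min_python: minimum (major, minor) Python version, 3.5 per Subheading 2.1.
    which / python_version: injectable for testing; default to the real system.
    """
    missing = []
    if tuple(python_version[:2]) < tuple(min_python):
        missing.append("python>=%d.%d" % tuple(min_python))
    for exe in executables:
        # shutil.which returns None when the program is not on $PATH
        if which(exe) is None:
            missing.append(exe)
    return missing


if __name__ == "__main__":
    problems = missing_requirements()
    print("Missing:", ", ".join(problems) if problems else "none")
```

An empty list means that the Python version and both executables were found; anything listed should be installed or added to $PATH before proceeding.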
3 Methods
3.1 Running SomaticSeq in Tumor-Normal Paired Mode
To see global parameters for either tumor-normal paired or tumor-only
single mode,
somaticseq_parallel.py --help
To see input parameters specific for tumor-normal paired
mode,
somaticseq_parallel.py paired --help
Fig. 1 The three modes for SomaticSeq: (a) majority-vote consensus mode, (b) prediction mode with a trained SomaticSeq classifier, and (c) training mode, in which mutations are spiked into normal replicates to provide the ground truth for the AdaBoost machine learning classifier. In every mode, the somatic mutation call sets are combined and features are extracted (caller classification, mapping metrics, alignment metrics, dbSNP membership, and strand bias).
The following is an example command to run SomaticSeq’s
core algorithm after completing some (or all) of the compatible
somatic mutation caller(s). This command will invoke the default
consensus mode (Fig. 1a). Keep in mind that not all the input VCF
files from all the somatic callers are required. Which somatic mutation
callers you run is your choice, based on how well they
work on your data sets and on the cost and availability of compute
resources:
somaticseq_parallel.py \
--output-directory OUTPUT_DIR \
--genome-reference GRCh38.fa \
--inclusion-region genome.bed \
--exclusion-region blacklist.bed \
--threads 12 \
paired \
--tumor-bam-file tumor.bam \
--normal-bam-file matched_normal.bam \
--mutect2-vcf MuTect2.vcf \
--varscan-snv VarScan2.snp.vcf \
--varscan-indel VarScan2.indel.vcf \
--jsm-vcf JointSNVMix2.vcf \
--somaticsniper-vcf SomaticSniper.vcf \
--vardict-vcf VarDict.vcf \
--muse-vcf MuSE.vcf \
--lofreq-snv LoFreq.somatic_final.snvs.vcf.gz \
--lofreq-indel LoFreq.somatic_final.indels.vcf.gz \
--scalpel-vcf Scalpel.vcf \
--strelka-snv Strelka/results/variants/somatic.snvs.vcf.gz \
--strelka-indel Strelka/results/variants/somatic.indels.vcf.gz \
--tnscope-vcf TNscope.filtered.vcf.gz \
--platypus-vcf Platypus.vcf
If you have SomaticSeq classifiers that you want to use to
evaluate/score/classify the mutation candidates (Fig. 1b), point
to them before the paired option, that is,
--classifier-snv Ensemble.sSNV.tsv.ntChange.Classifier.RData \
--classifier-indel Ensemble.sINDEL.tsv.ntChange.Classifier.RData \
If this is a training data set for which you want to create
SomaticSeq classifiers (Fig. 1c), provide VCF files
containing only the true variants, one for SNVs and one for indels,
and place them before the paired option along with the --somaticseq-train
flag. Every variant call in the inclusion region but not in the
truth set will be considered a false positive, that is,
--truth-snv TruePositive.snv.vcf \
--truth-indel TruePositive.indel.vcf \
--somaticseq-train \
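The labeling rule used in training mode (calls found in the truth set are true positives; every other call inside the inclusion region is a false positive) can be illustrated with a small sketch. This is not SomaticSeq's implementation: the function names are hypothetical, a variant is assumed to be identified by (chromosome, position, ref, alt), positions are assumed to be 1-based as in VCF, and regions are assumed to be 0-based half-open intervals as in BED.

```python
def in_regions(chrom, pos, regions):
    """True if the 1-based position falls inside any 0-based half-open BED interval."""
    return any(c == chrom and start < pos <= end for c, start, end in regions)


def label_calls(calls, truth, inclusion=None, exclusion=None):
    """Label each call as TruePositive or FalsePositive.

    calls, truth: iterables of (chrom, pos, ref, alt) tuples (assumed variant key).
    inclusion, exclusion: optional lists of (chrom, start, end) BED intervals.
    Calls outside the inclusion regions or inside the exclusion regions are skipped.
    """
    truth_set = set(truth)
    labels = {}
    for call in calls:
        chrom, pos = call[0], call[1]
        if inclusion is not None and not in_regions(chrom, pos, inclusion):
            continue
        if exclusion is not None and in_regions(chrom, pos, exclusion):
            continue
        labels[call] = "TruePositive" if call in truth_set else "FalsePositive"
    return labels


if __name__ == "__main__":
    calls = [("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"), ("chr2", 50, "C", "A")]
    truth = [("chr1", 100, "A", "T")]
    for call, label in label_calls(calls, truth, inclusion=[("chr1", 0, 1000)]).items():
        print(call, label)
```

In this toy input, the chr2 call is outside the inclusion region and is therefore not labeled at all, which mirrors how calls outside the inclusion regions are not considered.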
3.1.1 Inputs and Parameters
The paired argument puts SomaticSeq into tumor-normal paired
mode. The caller outputs (VCF files) and the tumor-normal BAM
files are placed after the paired argument. Everything else that is
agnostic of paired or single sample mode goes before the paired
argument, for example, genome reference, ground truth VCF files,
inclusion/exclusion regions, and resource files such as dbSNP,
COSMIC, and so on.
Arguments placed before the paired argument
• --output-directory: the path to the output directory. Default is the
current directory.
• --genome-reference: genome reference file in FASTA format (typically
with a .fa or .fasta extension). Always required, along with
its index file (.fa.fai) and its dict file (.dict).
• --truth-snv: a VCF file containing true positive SNVs. When
included, every SNV call in this VCF file will be labeled a true
positive, and everything else a false positive. This is required in
training mode. (If an inclusion region file is specified, then only
calls within the inclusion regions will be considered. If there is
an exclusion region file, then calls inside the exclusion regions
will be ignored.)
• --truth-indel: same as above, but for indels.
• --somaticseq-train: flag to invoke training mode. When invoked,
SomaticSeq will create classifiers if truth-snv and/or truth-indel
files are specified.
• --classifier-snv: the trained SomaticSeq SNV classifier (.RData).
When this file is specified, it will automatically invoke
prediction mode.
• --classifier-indel: same as above, but for indels.
• --pass-threshold: in prediction mode, this is the threshold
(between 0 and 1) above which a call will be labeled PASS.
Default is 0.5.
• --lowqual-threshold: in prediction mode, this is the threshold
above which (but below the PASS threshold) a call will be
labeled LowQual. Default is 0.1.
• --homozygous-threshold: a variant allele frequency (VAF) threshold
above which the GT field will be labeled 1/1.
Default = 0.85.
• --heterozygous-threshold: a VAF above which (but below the
homozygous threshold) the GT will be labeled 0/1.
Default = 0.01.
• --minimum-mapping-quality: the minimum mapping quality
score to count the reads. Default = 1.
• --minimum-base-quality: the minimum base-call quality score to
count the base. Default = 5.
• --minimum-num-callers: the minimum number of callers that must
call a variant candidate for it to be included in the combined call set.
Default is 0.5, which means it will include some calls that are
only called LowQual by a caller.
• --dbsnp-vcf: dbSNP VCF file. If included, dbSNP membership can be
used as a feature in machine learning.
• --cosmic-vcf: COSMIC VCF file. If included, it is only used for
annotation of the variants. COSMIC is not a SomaticSeq feature.
• --inclusion-region: if included, then only variant calls in these
regions will be considered.
• --exclusion-region: if included, then variants in these regions will
be excluded. If a call is in both inclusion and exclusion regions, it
will be excluded.
• --threads: number of threads. It will split the job into equal-sized
regions (based on the inclusion BED file or the .fa.fai file), one
for each thread. Default = 1.
• --keep-intermediates: a flag to tell SomaticSeq not to delete any
intermediate files, for debugging purposes.
Parallel processing is achieved by splitting the inclusion BED
file into a number of temporary BED files of equal sizes (i.e., same
number of base pairs per BED file), named 1.th.input.bed,
2.th.input.bed, ..., n.th.input.bed. Then, each process will be run using
one temporary BED file as its inclusion BED file. If there is no
inclusion BED file in the command argument, it will split the reference
genome’s index file (e.g., GRCh38.fa.fai) instead.
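The splitting described above can be sketched as follows. This is an illustration of the idea only, not the code SomaticSeq uses: split_regions and chunk_filenames are hypothetical names, regions are assumed to be (chrom, start, end) tuples in BED coordinates, and an interval is cut wherever a chunk reaches its share of base pairs.

```python
def split_regions(regions, n):
    """Split BED intervals into n chunks holding (nearly) equal numbers of base pairs.

    regions: list of (chrom, start, end) tuples, 0-based half-open.
    Returns a list of n lists of (chrom, start, end) tuples.
    """
    total = sum(end - start for _, start, end in regions)
    base, extra = divmod(total, n)
    # The first `extra` chunks receive one additional base pair each.
    targets = [base + (1 if i < extra else 0) for i in range(n)]
    chunks = [[] for _ in range(n)]
    i = 0
    remaining = targets[0]
    for chrom, start, end in regions:
        while start < end:
            # Move on to the next chunk once the current one is full.
            while remaining == 0 and i < n - 1:
                i += 1
                remaining = targets[i]
            take = min(remaining, end - start)
            chunks[i].append((chrom, start, start + take))
            start += take
            remaining -= take
    return chunks


def chunk_filenames(n):
    """Temporary BED file names following the 1.th.input.bed naming pattern."""
    return ["{}.th.input.bed".format(i) for i in range(1, n + 1)]


if __name__ == "__main__":
    regions = [("chr1", 0, 100), ("chr2", 0, 50)]
    for name, chunk in zip(chunk_filenames(3), split_regions(regions, 3)):
        print(name, chunk)
```

When no inclusion BED file is available, the same function could be applied to one (chrom, 0, length) interval per contig taken from the .fa.fai file.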
Arguments placed after the paired argument
• --tumor-bam-file: sorted and indexed tumor BAM file. Required.
• --normal-bam-file: sorted and indexed normal BAM file. Required.
• --tumor-sample: tumor sample name to place in the header of the
output VCF file. Default is TUMOR.
• --normal-sample: normal sample name to place in the header of
the output VCF file. Default is NORMAL.
• --mutect-vcf: VCF file output from the original MuTect v1.
(Optional)
• --indelocator-vcf: VCF file output from Indelocator. (Optional)
• --mutect2-vcf: VCF file output from GATK4’s Mutect2 and then
FilterMutectCalls. (Optional)
• --varscan-snv: the SNP VCF file output from VarScan2. (Optional)
• --varscan-indel: the indel VCF file output from VarScan2. (Optional)
• --jsm-vcf: JointSNVMix2’s output modified into a VCF file by
SomaticSeq. (See Subheading 3.3; Optional)
• --somaticsniper-vcf: VCF file output from SomaticSniper.
(Optional)
• --vardict-vcf: VCF file output from VarDict. (Optional)
• --muse-vcf: VCF file output from MuSE. (Optional)
• --lofreq-snv: somatic SNV VCF file output from LoFreq.
(Optional)
• --lofreq-indel: somatic indel VCF file output from LoFreq.
(Optional)
• --scalpel-vcf: VCF file output from Scalpel. (Optional)
• --strelka-snv: somatic SNV VCF file output from Strelka2.
(Optional)
• --strelka-indel: somatic indel VCF file output from Strelka2.
(Optional)
• --tnscope-vcf: VCF file output from the Sentieon TNscope caller.
(Optional)
• --platypus-vcf: VCF file output from Platypus. (Optional)
SomaticSeq supports any combination of the somatic mutation
callers we have incorporated into the workflow. SomaticSeq will run
based on the output VCFs you have provided. It will train to create
SNV and/or indel classifiers if you provide the truePositives.snv.vcf
and/or truePositives.indel.vcf file(s) and invoke the
--somaticseq-train option. Otherwise, it will fall back to the simple caller
consensus mode.
3.2 Running SomaticSeq in Tumor-Only Single Mode
SomaticSeq also supports tumor-only mode, in which case the
paired option is replaced with the single option. It has the same set of
global input parameters as the paired mode, but a different set of
arguments/options to be placed after the single option. To see what
they are, you may run the following:
somaticseq_parallel.py single --help
3.2.1 Inputs and Parameters for Tumor-Only Mode
Arguments placed before the single option are the same as the ones
described in Subheading 3.1.1. The following options are placed
after the single option:
• --bam-file: the BAM file for the tumor sample. Required.
• --sample-name: sample name to place into the VCF file. Default
is TUMOR. (Optional)
• --mutect-vcf: VCF file output from the original MuTect v1.
(Optional)
• --mutect2-vcf: VCF file output from GATK4’s Mutect2 and then
FilterMutectCalls. (Optional)
• --varscan-vcf: VCF file with both SNVs and indels from VarScan2.
(Optional)
• --vardict-vcf: VCF file from VarDict. (Optional)
• --lofreq-vcf: VCF file with both SNVs and indels from LoFreq.
(Optional)
• --scalpel-vcf: VCF file from Scalpel. (Optional)
• --strelka-vcf: VCF file with both SNVs and indels from Strelka.
(Optional)
3.2.2 Interpreting the Output Files
SomaticSeq will output a number of TSV and VCF files as results.
The SNVs and indels are in separate files. The TSV files contain all
the genomic and sequencing features extracted from the BAM files
and/or some fields of the individual callers’ output. If the truth
VCF files are supplied, then each variant candidate will also be
labeled true positive or false positive. The header of the TSV file
describes each column. Missing values, for example, tBAM_NM_Diff
(difference in average edit distances between variant-supporting
and reference-supporting reads) when there is no
reference-supporting read, will be “nan.”
The VCF files and the TSV files contain the same variants,
though the VCF files only record a subset of the SomaticSeq
features recorded in the TSV files. The descriptions are in the
VCF headers, but we will describe them here. The QUAL column
is a standard column in VCF format. This column contains the
Phred-scaled SomaticSeq score (i.e., probability) if the VCF file
was generated in prediction mode. By default, variants with
probability ≥ 0.5 (Phred-scaled QUAL ≥ 3.01) are labeled PASS.
Variants with 0.1 ≤ probability < 0.5 (0.458 ≤ QUAL < 3.01) are
labeled LowQual, and those under that threshold are labeled
REJECT. In consensus mode, however, the QUAL column will
always be zero, and the PASS/LowQual/REJECT labels are determined
by the consensus of the callers you have incorporated into
each workflow. If the majority of the callers (i.e., >50%) consider a
variant a high-confidence somatic mutation, the variant will be
labeled PASS. If at least 1/3 of the callers (i.e., ≥33%) do so, the variant
will be labeled LowQual. Otherwise, the variant will be labeled
REJECT. As an example, if five SNV callers are used, then
a variant must be called by at least three callers to be labeled PASS
and by two callers to be labeled LowQual. If it is called by only one
caller, it will be labeled REJECT.
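The thresholds above can be checked with a short calculation. The sketch below assumes that the Phred scaling is QUAL = -10 log10(1 - probability), which reproduces the quoted values of 3.01 and 0.458; the function names are hypothetical, and whether a value exactly at a threshold is included is an assumption made here for illustration.

```python
import math


def phred_qual(probability):
    """Phred-scale a classifier probability, assuming QUAL = -10 * log10(1 - p)."""
    if probability >= 1.0:
        return float("inf")
    return -10.0 * math.log10(1.0 - probability)


def prediction_label(probability, pass_threshold=0.5, lowqual_threshold=0.1):
    """Label a call in prediction mode from its SomaticSeq probability."""
    if probability >= pass_threshold:
        return "PASS"
    if probability >= lowqual_threshold:
        return "LowQual"
    return "REJECT"


def consensus_label(num_positive, num_callers):
    """Label a call in consensus mode from the number of callers that called it."""
    if 2 * num_positive > num_callers:      # more than 50% of the callers
        return "PASS"
    if 3 * num_positive >= num_callers:     # at least 1/3 of the callers
        return "LowQual"
    return "REJECT"


if __name__ == "__main__":
    for p in (0.99, 0.40, 0.01):
        print(p, round(phred_qual(p), 3), prediction_label(p))
    for k in (3, 2, 1):
        print(k, "of 5 callers:", consensus_label(k, 5))
```

With five callers, the consensus function gives PASS for three positive callers, LowQual for two, and REJECT for one, matching the example in the text.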
In the INFO column of the VCF file, there is a string (e.g.,
MDUK=1,1,0,1) that tells you each caller’s binary classification
of the variant (1 for positive classification and 0 otherwise). The
string varies depending on the callers incorporated in the workflow,
that is, M = MuTect/Indelocator/MuTect2, V = VarScan2,
J = JointSNVMix2, S = SomaticSniper, D = VarDict, U = MuSE,
L = LoFreq, P = Scalpel, K = Strelka, T = TNscope, and Y = Platypus.
NUM_TOOLS tells you the number of callers where the
variant is classified as a somatic mutation.
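Decoding this caller string is straightforward. The following sketch, using the letter codes listed above, turns an entry such as MDUK=1,1,0,1 into a per-caller mapping; the function name decode_caller_string is hypothetical and the parsing assumes the entry has exactly the key=value form shown in the example.

```python
CALLER_CODES = {
    "M": "MuTect/Indelocator/MuTect2",
    "V": "VarScan2",
    "J": "JointSNVMix2",
    "S": "SomaticSniper",
    "D": "VarDict",
    "U": "MuSE",
    "L": "LoFreq",
    "P": "Scalpel",
    "K": "Strelka",
    "T": "TNscope",
    "Y": "Platypus",
}


def decode_caller_string(entry):
    """Map each caller in an INFO entry like 'MDUK=1,1,0,1' to True or False."""
    key, _, values = entry.partition("=")
    flags = [int(v) for v in values.split(",")]
    if len(key) != len(flags):
        raise ValueError("number of caller letters and flags differ: " + entry)
    return {CALLER_CODES[letter]: bool(flag) for letter, flag in zip(key, flags)}


if __name__ == "__main__":
    decoded = decode_caller_string("MDUK=1,1,0,1")
    for caller, called in decoded.items():
        print(caller, called)
    print("positive callers:", sum(decoded.values()))
```

For the example entry, three of the four callers made a positive classification, which is the value one would expect to see reported as the number of tools for that variant.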
The following metrics are in the sample columns, for the tumor
and normal samples separately:
• GT: genotyping. By default, it will be 1/1 if VAF ≥ 85%, 0/1 if
between 1% and 85%, and 0/0 if VAF is under 1%.
• DP4: four numbers representing the number of forward
reference-supporting reads, reverse reference-supporting reads,
forward variant-supporting reads, and reverse variant-supporting
reads.
• CD4: four numbers representing the number of concordant
reference-supporting reads, discordant reference-supporting
reads, concordant variant-supporting reads, and discordant
variant-supporting reads.
• refMQ: average mapping quality score for reference-supporting
reads.
• altMQ: average mapping quality score for variant-supporting
reads.
• refBQ: average base-call quality score for reference-supporting
bases.
• altBQ: average base-call quality score for variant-supporting
bases.
• refNM: average edit distance between reference-supporting
reads and the genome reference.
• altNM: average edit distance between variant-supporting reads
and the genome reference.
• fetSB: Phred-scaled Fisher’s exact test score for DP4 to measure
strand bias in variant-supporting versus reference-supporting
reads.
• fetCD: Phred-scaled Fisher’s exact test score for CD4 to measure
bias in concordant versus discordant reads in variant-supporting
versus reference-supporting reads.
• zMQ: z-score for the mapping qualities of variant-supporting
versus reference-supporting reads. The value will
be positive if variant-supporting reads have higher MQs.
• zBQ: z-score for the base-call qualities of variant-supporting
versus reference-supporting bases. The value will
be positive if variant-supporting reads have higher BQs.
• MQ0: number of mapping quality 0 reads (multiply mapped
reads) covering the variant position.
• VAF: variant allele frequency.
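As a worked illustration of how GT relates to VAF under the default thresholds, the sketch below derives a VAF from the four DP4 counts and assigns a genotype. Computing VAF as variant-supporting reads over the sum of the DP4 counts is an assumption made for this example (SomaticSeq may count reads differently), and both function names are hypothetical.

```python
def vaf_from_dp4(dp4):
    """Estimate VAF from DP4 = (ref forward, ref reverse, alt forward, alt reverse).

    Assumption: VAF = variant-supporting reads / all reads counted in DP4.
    """
    ref_fwd, ref_rev, alt_fwd, alt_rev = dp4
    total = ref_fwd + ref_rev + alt_fwd + alt_rev
    return (alt_fwd + alt_rev) / total if total else float("nan")


def genotype(vaf, homozygous_threshold=0.85, heterozygous_threshold=0.01):
    """Assign GT from VAF using the default thresholds described above."""
    if vaf >= homozygous_threshold:
        return "1/1"
    if vaf >= heterozygous_threshold:
        return "0/1"
    return "0/0"


if __name__ == "__main__":
    dp4 = (20, 20, 5, 5)
    vaf = vaf_from_dp4(dp4)
    print("VAF:", vaf, "GT:", genotype(vaf))
```

The two threshold arguments correspond to the --homozygous-threshold and --heterozygous-threshold options, so changing those options shifts the boundaries used here.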
In the default consensus mode, the following files will be
generated:
• Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv.
• Consensus.sSNV.vcf and Consensus.sINDEL.vcf.
In training mode, the same files will also be generated, and each
variant will be annotated as a true positive or false positive (i.e.,
TruePositive or FalsePositive in the VCF’s ID column, and correspondingly
1 or 0 in the TSV’s TrueVariant_or_False column). In addition, two
classifiers will be created: Ensemble.sSNV.tsv.ntChange.Classifier.RData
and Ensemble.sINDEL.tsv.ntChange.Classifier.RData.
If classifiers are supplied, the following files will be generated in
prediction mode:
• Ensemble.sSNV.tsv, Ensemble.sINDEL.tsv, SSeq.Classified.sSNV.tsv,
and SSeq.Classified.sINDEL.tsv.
• SSeq.Classified.sSNV.vcf and SSeq.Classified.sINDEL.vcf.
The difference between SSeq.Classified files and their consensus
counterparts is that the former are scored by SomaticSeq classifiers.
3.3 Running Compatible Somatic Mutation Callers
To make it easy for new users to get things started, we have dock-
erized a number of commonly used somatic mutation callers that
you may use before running SomaticSeq. SomaticSeq includes the
makeSomaticScripts.py module that creates run scripts for those
dockerized callers. Both tumor-normal paired runs and tumor-
only jobs are supported.
To see the full options, you may run either of the following
commands, one for tumor-normal (paired) mode and the other for
tumor-only (single) mode.
makeSomaticScripts.py paired -h
makeSomaticScripts.py single -h
Here is an example to generate run scripts for the individual
callers and SomaticSeq that combines the results of these callers.
Do keep in mind that this module is not a core SomaticSeq algorithm.
It simply calls a number of third-party software tools that
we have dockerized. The run scripts we generate for these tools are
not extensively optimized. They may not run the latest version, and
we cannot guarantee that they are fully compatible with your
compute environment. If you run into problems with one of these tools,
you need to consult the authors of that tool.
makeSomaticScripts.py paired \
--output-directory /ABSOLUTE/PATH/TO/SomaticOutput \
--tumor-bam /ABSOLUTE/PATH/TO/tumor.bam \
--normal-bam /ABSOLUTE/PATH/TO/normal.bam \
--genome-reference /ABSOLUTE/PATH/TO/GRCh38.fa \
--inclusion-region /ABSOLUTE/PATH/TO/inclusion_region.bed \
--exclusion-region /ABSOLUTE/PATH/TO/blacklist.bed \
--dbsnp-vcf /ABSOLUTE/PATH/TO/dbSNP.vcf \
--cosmic-vcf /ABSOLUTE/PATH/TO/COSMIC.vcf \
--threads 12 \
--run-mutect2 --run-vardict --run-muse --run-lofreq --run-strelka2 \
--run-somaticseq
The --threads 12 input tells the program to create 12 subdirectories
named 1, 2, ..., 12 in the output directory. In each
subdirectory, a BED file is created to represent 1/12 of the total
base pairs in the inclusion region BED file. If a BED file is not
supplied, those sub-BED files will be based on the index file for the
genome reference, that is, the .fa.fai file, although we recommend
supplying a BED file even for whole genome sequencing (see
Note 2).
In each subdirectory, a run script (ending in .cmd) for
each somatic mutation caller invoked by a --run-XXX flag is created
in “logs.” You will need to execute these scripts. If you invoke
--action qsub in the command, then these scripts will be submitted to
the compute management system via the qsub command. You may also
include extra arguments there, for example, --action 'qsub
-l h=node01'. In addition, in each of the 12 subdirectories, a
SomaticSeq directory will be created as well. The run scripts
SomaticSeq/logs/somaticSeq.timestamp.cmd will need to be executed
or submitted with qsub manually after all the callers (in this thread) have
completed successfully.
Furthermore, when more than one thread is invoked, there will
be another script named SomaticOutput/logs/mergeResults.timestamp.cmd,
which you may qsub or execute to merge the result files
from all the threads together. SomaticSeq training will also be
executed at this step if more than one thread is specified.
3.3.1 Inputs and Parameters
• --output-directory: absolute path to output all the results.
Default is the current directory.
• --somaticseq-directory: name of the SomaticSeq directory inside
the output directory. Default is SomaticSeq.
• --tumor-bam: absolute path to the indexed tumor BAM file. This
along with its .bai index file is required.
• --normal-bam: absolute path to the indexed normal BAM file.
This along with its .bai index file is required.
• --tumor-sample-name: sample name to place in the VCF file as
the tumor. Default is TUMOR.
• --normal-sample-name: sample name to place in the VCF file as
the normal. Default is NORMAL.
• --genome-reference: genome reference file in FASTA format (typically
with a .fa or .fasta extension). Always required, along with
its index file (.fa.fai) and its dict file (.dict).
• --inclusion-region: if supplied, only calls within these regions will
be considered.
• --exclusion-region: if supplied, calls inside these regions will be
excluded.
• --dbsnp-vcf: dbSNP VCF file. This is required because some
tools ask for it. Also required are the .vcf.gz file and the
.vcf.gz.idx file if MuSE is invoked.
• --cosmic-vcf: COSMIC VCF file, purely for annotation purposes.
Optional.
• --run-mutect2: a flag to create a script to run GATK4’s Mutect2.
• --run-varscan2: a flag to create a script to run VarScan2.
• --run-jointsnvmix2: a flag to create a script to run JointSNVMix2
(cannot be parallelized).
• --run-somaticsniper: a flag to create a script to run SomaticSniper
(cannot be parallelized).
• --run-vardict: a flag to create a script to run VarDictJava.
• --run-muse: a flag to create a script to run MuSE.
• --run-lofreq: a flag to create a script to run LoFreq.
• --run-scalpel: a flag to create a script to run Scalpel.
• --run-strelka2: a flag to create a script to run Strelka2.
• --run-somaticseq: a flag to create a script to run SomaticSeq.
• --action: the command to apply to the generated caller scripts. Default is
“echo,” such that the paths of the scripts will be printed onto the
command line terminal, but nothing will be done with them. A
common choice would be “qsub” if you want to submit those
scripts into your compute queue system. You may also include
arguments for the command by using single quotes such as --action
'qsub -l h="node01|node02"'.
• --somaticseq-action: same as above, but for the SomaticSeq
script. Keep in mind that SomaticSeq cannot be executed until
all the individual caller jobs have completed. Default is echo.
• --snv-classifier: absolute path to the SomaticSeq SNV classifier,
which will invoke prediction mode.
• --indel-classifier: absolute path to the SomaticSeq indel classifier,
which will invoke prediction mode.
• --truth-snv: a VCF file containing true positive SNVs. When
included, every SNV call in this VCF file will be labeled a true
positive, and everything else a false positive. This is required in
training mode. (If an exclusion region BED file is included, the
calls inside the exclusion regions will be ignored.)
• --truth-indel: same as above, but for indels.
• --train-somaticseq: a flag to invoke training mode in the SomaticSeq
script if truth SNV and/or indel VCF files are supplied. Default
is False.
• --minimum-VAF: minimum variant allele frequency to be
passed on to the VarScan2 and VarDict callers. If not supplied, it
will be the default 0.10 for VarScan2 and the recommended
0.05 for VarDict.
• --threads: number of threads. It will split the job into equal-sized
regions (based on the inclusion BED file or the .fa.fai file), one
for each thread. Default = 1.
• --exome-setting: a flag to invoke the exome setting in MuSE and
Strelka2.
• --mutect2-arguments: extra arguments to pass on to GATK4’s
Mutect2 command. Use single quotes to include multiple
words, for example, --mutect2-arguments '--min-base-quality-score
20 --tumor-lod-to-emit 10'.
• --mutect2-filter-arguments: extra arguments to pass on to
GATK4’s FilterMutectCalls command. Use single quotes to
include multiple words, see --mutect2-arguments for example.
• --varscan-pileup-arguments: extra arguments to pass on to samtools
mpileup prior to VarScan2. Use single quotes to include
multiple words, see --mutect2-arguments for example.
• --varscan-arguments: extra arguments to pass on to VarScan2.
Use single quotes to include multiple words, see
--mutect2-arguments for example.
• --jsm-train-arguments: extra arguments to pass on to JointSNVMix2’s
train step. Use single quotes to include multiple
words, see --mutect2-arguments for example.
• --jsm-classify-arguments: extra arguments to pass on to JointSNVMix2’s
classify step. Use single quotes to include multiple
words, see --mutect2-arguments for example.
• --somaticsniper-arguments: extra arguments to pass on to SomaticSniper.
Use single quotes to include multiple words, see
--mutect2-arguments for example.
• --vardict-arguments: extra arguments to pass on to the vardict
command. Use single quotes to include multiple words, see
--mutect2-arguments for example.
• --muse-arguments: extra arguments to pass on to MuSE. Use
single quotes to include multiple words, see
--mutect2-arguments for example.
• --lofreq-arguments: extra arguments to pass on to LoFreq. Use
single quotes to include multiple words, see
--mutect2-arguments for example.
• --scalpel-discovery-arguments: extra arguments to pass on to Scalpel’s
discovery step. Use single quotes to include multiple
words, see --mutect2-arguments for example.
• --scalpel-export-arguments: extra arguments to pass on to
Scalpel’s export step. Use single quotes to include multiple
words, see --mutect2-arguments for example.
• --scalpel-two-pass: a flag to invoke the two-pass option for
Scalpel.
• --strelka-config-arguments: extra arguments to pass on to
Strelka2’s config step. Use single quotes to include multiple
words, see --mutect2-arguments for example.
• --strelka-run-arguments: extra arguments to pass on to Strelka2’s
run step. Use single quotes to include multiple words, see
--mutect2-arguments for example.
• --somaticseq-arguments: extra arguments to pass on to SomaticSeq.
Use single quotes to include multiple words, see
--mutect2-arguments for example.
Keep in mind that makeSomaticScripts.py is not a core SomaticSeq algorithm; it is a helper to get things started. The run scripts generated by the makeSomaticScripts module pull the docker images we have created. The docker containers access the system files by mounting the root directory to /mnt inside the container, and then access the files through /mnt/PATH/TO/files. Thus, when pointing to paths and files in the makeSomaticScripts.py command, it is imperative to use absolute physical paths.
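The path mapping described above can be sketched in a few lines of Python. The helper name to_container_path is hypothetical (it is not part of SomaticSeq); it only shows how an absolute physical path on the host corresponds to a path under /mnt in the container, assuming a POSIX file system.

```python
import os


def to_container_path(host_path, mount_point="/mnt"):
    """Map a host path to the path seen inside a container that mounts / at mount_point."""
    # realpath makes the path absolute and resolves symbolic links,
    # which yields the "absolute physical path" that the container can find.
    physical = os.path.realpath(host_path)
    return mount_point + physical


print(to_container_path("/ABSOLUTE/PATH/TO/files"))
# A relative path is resolved against the current working directory first.
print(to_container_path("relative/file.bam"))
```

A path that goes through a symbolic link may not resolve inside the container, which is one reason to resolve links before passing paths to the command.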
3.4 Create Training Data Sets
SomaticSeq employs supervised machine learning that relies on good training data sets to create accurate classifiers. An ideal training data set for SomaticSeq would be a pair of real tumor-normal NGS data sets, with every true somatic mutation and false positive accurately labeled. However, no such data set exists right now. Nevertheless, an excellent approach is to use two germline sequencing replicates as background, and then create in silico mutations in one of them as the designated tumor [15]. This way, only the in silico mutations we have created are true mutations, and everything else represents replicate-to-replicate noise that makes up the false positives. This approach describes the SomaticSeq training mode depicted in Fig. 1c. There are a number of well-characterized germline samples that have been repeatedly sequenced at multiple sequencing centers that you may try out [16,17]. To make it easy to get things started, we have included a number of scripts in
SomaticSeq to run dockerized BAMSurgeon workflows to create
training data on which SomaticSeq classifiers can be built.
3.4.1 Have Sequencing Replicates for the Normal
Here, we present an example of the entire workflow for creating SomaticSeq classifiers using two replicates of the NA12878 genome made public by the Garvan Institute [17].
First, download the .fastq.gz files for both replicates, NA12878D and NA12878J, and align them into BAM files. If you are new to alignment, the following command generates a run script based on GATK's best practices to create the BAM files [18]. Run it once for replicate D and once for replicate J, assuming you have bwa-indexed reference files [19].
somaticseq/utilities/dockered_pipelines/alignments/fastq2bam_pipeline.sh \
--output-dir /ABSOLUTE/PATH/TO/replicateD \
--tumor-fq1 /ABSOLUTE/PATH/TO/NA12878D_HiSeqX_R1.fastq.gz \
--tumor-fq2 /ABSOLUTE/PATH/TO/NA12878D_HiSeqX_R2.fastq.gz \
--tumor-bam-header '@RG\tID:XTenD\tPL:illumina\tLB:X10\tSM:NA12878D' \
--tumor-out-bam NA12878D.bam \
--genome-reference /ABSOLUTE/PATH/TO/GRCh38.fa \
--threads 12 \
--bwa --pre-realign-markdup
In addition to the genome reference GRCh38.fa, there must also be GRCh38.fa.bwt, GRCh38.fa.sa, GRCh38.fa.pac, GRCh38.fa.ann, and GRCh38.fa.amb, which can be created with the “bwa index GRCh38.fa” command [19].
Once you run the command above, a script /ABSOLUTE/PATH/TO/replicateD/logs/fastq2bam.timestamp.cmd will be created, which you may execute to align the reads into BAM files with BWA-MEM. Make sure to use different SM and ID tags in the BAM headers of the two BAM files (designated tumor and normal) because MuTect2 requires it.
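A quick way to catch a clash in SM or ID tags before running the aligner is to compare the two read-group strings. The sketch below is illustrative only: parse_read_group is a hypothetical helper, and the replicate J header is assumed by analogy with the replicate D header given above.

```python
def parse_read_group(header):
    """Parse an '@RG' header string into a dict mapping tag to value."""
    # On the command line the separators are a literal backslash followed by "t";
    # convert them to real tabs before splitting.
    fields = header.replace("\\t", "\t").split("\t")
    return dict(field.split(":", 1) for field in fields[1:])


rg_d = parse_read_group(r"@RG\tID:XTenD\tPL:illumina\tLB:X10\tSM:NA12878D")
# Assumed header for replicate J; it is not given in the text.
rg_j = parse_read_group(r"@RG\tID:XTenJ\tPL:illumina\tLB:X10\tSM:NA12878J")

for tag in ("ID", "SM"):
    if rg_d[tag] == rg_j[tag]:
        raise ValueError(f"{tag} tag must differ between the two BAM files")

print(rg_d["SM"], rg_j["SM"])
```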
Once NA12878D.bam and NA12878J.bam are created, the command below will designate NA12878D.bam as the tumor and introduce up to 20,000 SNVs and 8000 indels into NA12878D.bam. BAMSurgeon creates in silico mutations by changing the base(s) in a subset of the reads covering the genomic position, and then realigns the reads after they are synthetically mutated. A common question concerns the size of the training data, for example, how many samples or how many variants are needed. A rule of thumb is >500 true positive variants plus a larger number of false positives (see Note 3). The somatic mutation rate in the training data should not be vastly different from reality. If your data come from small targeted panels, you may need more than one tumor-normal pair to create a training set large enough (see Note 4). The resulting semisynthetic tumor-normal
BAM files will be named syntheticTumor.bam and syntheticNormal.bam.
somaticseq/utilities/dockered_pipelines/bamSimulator/BamSimulator_multiThreads.sh \
--output-dir /ABSOLUTE/PATH/TO/trainingSet \
--genome-reference /ABSOLUTE/PATH/TO/GRCh38.fa \
--tumor-bam-in /ABSOLUTE/PATH/TO/NA12878D.bam \
--normal-bam-in /ABSOLUTE/PATH/TO/NA12878J.bam \
--tumor-bam-out syntheticTumor.bam \
--normal-bam-out syntheticNormal.bam \
--num-snvs 20000 \
--num-indels 8000 \
--min-vaf 0.0 \
--max-vaf 1.0 \
--left-beta 2 \
--right-beta 5 \
--min-variant-reads 2 \
--threads 12 \
--action qsub \
--merge-output-bams
Again, in addition to GRCh38.fa, the bwa index files GRCh38.fa.bwt, GRCh38.fa.sa, GRCh38.fa.pac, GRCh38.fa.ann, and GRCh38.fa.amb are also required.
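Because a missing index file only surfaces once the jobs are running, it may be worth checking for the five files up front. The following is a minimal sketch; the function name missing_bwa_index_files is hypothetical, and the suffixes are the ones listed above.

```python
import os

# Index files that must sit next to the reference FASTA, as listed in the text.
BWA_INDEX_SUFFIXES = (".bwt", ".sa", ".pac", ".ann", ".amb")


def missing_bwa_index_files(reference_fasta):
    """Return the expected bwa index files that do not exist on disk."""
    expected = [reference_fasta + suffix for suffix in BWA_INDEX_SUFFIXES]
    return [path for path in expected if not os.path.isfile(path)]


missing = missing_bwa_index_files("/ABSOLUTE/PATH/TO/GRCh38.fa")
if missing:
    print("Missing index files; run 'bwa index' on the reference first:")
    for path in missing:
        print("  " + path)
```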
The four parameters in the command above, that is, --min-vaf, --max-vaf, --left-beta, and --right-beta, determine the VAF distribution of the in silico mutations. The following Python script will display the VAF distribution of the settings in the previous command:
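As a minimal sketch (not the author's script), assume the VAF is drawn from a Beta(left-beta, right-beta) distribution rescaled to the interval between --min-vaf and --max-vaf; that mapping is an assumption made here for illustration. The code uses only the standard library and prints a text rendering of the density.

```python
import math


def scaled_beta_pdf(vaf, left_beta, right_beta, min_vaf=0.0, max_vaf=1.0):
    """Density at vaf of min_vaf + (max_vaf - min_vaf) * Beta(left_beta, right_beta)."""
    width = max_vaf - min_vaf
    x = (vaf - min_vaf) / width
    if x < 0.0 or x > 1.0:
        return 0.0
    # Normalizing constant 1 / B(a, b), written with gamma functions.
    norm = math.gamma(left_beta + right_beta) / (
        math.gamma(left_beta) * math.gamma(right_beta)
    )
    return norm * x ** (left_beta - 1) * (1.0 - x) ** (right_beta - 1) / width


# Settings taken from the previous command.
left_beta, right_beta, min_vaf, max_vaf = 2, 5, 0.0, 1.0

# One row per 5% VAF step; the bar length is proportional to the density.
for i in range(21):
    vaf = i / 20
    density = scaled_beta_pdf(vaf, left_beta, right_beta, min_vaf, max_vaf)
    print(f"{vaf:4.2f} {'#' * round(density * 10)}")
```

Under this assumption, a left beta of 2 and a right beta of 5 give a density that peaks at a VAF of 0.2 with a mean of 2/7 (about 0.29), so most in silico mutations are placed at low VAF.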
Once all the threads have completed successfully, you may execute /ABSOLUTE/PATH/TO/trainingSet/logs/mergeFiles.timestamp.cmd to merge all the BAM files and ground truth VCF files (in silico mutations) into the output directory. These files may be used to train SomaticSeq classifiers, for example:
makeSomaticScripts.py paired \
--normal-bam /ABSOLUTE/PATH/TO/trainingSet/syntheticNormal.bam \
--tumor-bam /ABSOLUTE/PATH/TO/trainingSet/syntheticTumor.bam \
--truth-snv /ABSOLUTE/PATH/TO/trainingSet/synthetic_snvs.vcf \
--truth-indel /ABSOLUTE/PATH/TO/trainingSet/synthetic_indels.leftAlign.vcf \
--genome-reference /ABSOLUTE/PATH/TO/GRCh38.fa \
--output-directory /ABSOLUTE/PATH/TO/trainingSet/somaticMutations \
--dbsnp-vcf /ABSOLUTE/PATH/TO/dbSNP.hg38.vcf \
--inclusion-region /ABSOLUTE/PATH/TO/genome.bed \
--action qsub \
--threads 12 \
--run-mutect2 --run-vardict --run-muse --run-strelka2 --run-somaticseq \
--train-somaticseq
The --action qsub option submits the somatic mutation caller jobs with the qsub command. Otherwise, you may execute them yourself. Once all those jobs are complete, you may submit or execute all the SomaticSeq scripts created in /ABSOLUTE/PATH/TO/trainingSet/somaticMutations/1,2,3,.../SomaticSeq/logs/somaticSeq.timestamp.cmd.
After all the SomaticSeq threads are complete, you can finally submit or execute the script /ABSOLUTE/PATH/TO/trainingSet/somaticMutations/logs/mergeResults.timestamp.cmd to combine the results from the different threads. It will also train SomaticSeq classifiers on the labeled data sets, producing Ensemble.sSNV.tsv.ntChange.Classifier.RData and Ensemble.sINDEL.tsv.ntChange.Classifier.RData in the output directory. These classifiers may then be used to classify your own mutation calls.
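The genome.bed file passed to --inclusion-region in the command above can be derived from the reference index (.fa.fai), keeping only the major chromosomes as recommended in Note 2. The sketch below is one possible way to do it; the function name is hypothetical, the contig names assume the "chr" prefix, and the toy index lines are made up.

```python
def major_chromosome_bed(fai_lines, chromosomes):
    """Build BED lines (start 0, end = contig length) for the chosen contigs."""
    wanted = set(chromosomes)
    bed = []
    for line in fai_lines:
        # The first two columns of a .fai index are contig name and contig length.
        name, length = line.split("\t")[:2]
        if name in wanted:
            bed.append(f"{name}\t0\t{int(length)}")
    return bed


major = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]

# Toy index lines for illustration; the lengths and offsets are made up.
toy_fai = [
    "chr1\t1000\t6\t60\t61",
    "chrM\t200\t1100\t60\t61",
    "chrUn_decoy\t500\t1400\t60\t61",
    "chrX\t800\t2000\t60\t61",
]
print("\n".join(major_chromosome_bed(toy_fai, major)))
```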
3.4.2 Split a Normal Data Set Into Designated Tumor and Normal
Another way to create a training data set is to split a (relatively high coverage) germline sequencing data set into two halves, with one designated as normal and the other designated as tumor (Fig. 2a). You may not get run-to-run biases from this method, but the data will still have plenty of false positives deriving from sequencing errors, sampling error of germline variants, and so on. The following example command splits HighCoverageGenome.bam 50–50 into designated tumor and normal, where --split-proportion sets the fraction of reads assigned to the designated normal.
somaticseq/utilities/dockered_pipelines/bamSimulator/BamSimulator_multiThreads.sh \
--output-dir /ABSOLUTE/PATH/TO/trainingSet \
--genome-reference /ABSOLUTE/PATH/TO/GRCh38.fa \
--tumor-bam-in /ABSOLUTE/PATH/TO/HighCoverageGenome.bam \
--tumor-bam-out syntheticTumor.bam \
--normal-bam-out syntheticNormal.bam \
--split-proportion 0.5 \
--min-variant-reads 2 \
--threads 12 \
--action qsub \
--num-snvs 10000 --num-indels 8000 --num-svs 1500 \
--min-vaf 0.0 --max-vaf 1.0 --left-beta 2 --right-beta 5 \
--split-bam --merge-output-bams
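To make the idea of a split proportion concrete, the sketch below assigns read pairs to the designated normal by hashing the read name, so that both mates of a pair always end up in the same file. This is an illustrative technique only and is not claimed to be how BamSimulator performs the split; the function name and the seed parameter are assumptions.

```python
import hashlib


def goes_to_normal(read_name, split_proportion=0.5, seed=0):
    """Decide whether a read pair belongs to the designated normal."""
    # Hash the read name (shared by both mates) so a pair is never separated.
    digest = hashlib.sha256(f"{seed}:{read_name}".encode()).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return fraction < split_proportion


read_names = [f"read{i}" for i in range(10000)]
normal = [name for name in read_names if goes_to_normal(name, 0.5)]
print(f"{len(normal) / len(read_names):.3f} of read pairs go to the designated normal")
```

Because the decision depends only on the read name and the seed, rerunning the split gives the same partition.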
3.4.3 Merge Tumor-Normal and Then Random Split Them
Another approach to creating a training data set is to first merge the tumor and normal BAM files into a single BAM file, and then randomly split it into the designated tumor and normal (Fig. 2b). This way, the real somatic mutations in the original tumor BAM will be split equally (on average) between the designated tumor and normal, and will effectively be germline variants that are labeled as false positives if called by a caller. In silico mutations are then created in the designated tumor. In the absence of normal sequencing replicates or germline sequencing data with high enough coverage, this may be the next best option to create classifiers. Use the --merge-bam flag to invoke this option:
somaticseq/utilities/dockered_pipelines/bamSimulator/BamSimulator_multiThreads.sh \
--output-dir /ABSOLUTE/PATH/TO/trainingSet \
--genome-reference /ABSOLUTE/PATH/TO/GRCh38.fa \
--tumor-bam-in /ABSOLUTE/PATH/TO/Tumor_Sample.bam \
--normal-bam-in /ABSOLUTE/PATH/TO/Normal_Sample.bam \
--tumor-bam-out syntheticTumor.bam \
--normal-bam-out syntheticNormal.bam \
--split-proportion 0.5 \
--min-variant-reads 2 \
--threads 12 \
--num-snvs 30000 --num-indels 10000 --num-svs 1500 \
--min-vaf 0.0 --max-vaf 1.0 --left-beta 2 --right-beta 5 \
--merge-bam --split-bam --merge-output-bams
Fig. 2 Two additional scenarios to create synthetic tumor-normal pairs
3.4.4 Inputs and Parameters
The following are all the options available for the BamSimulator_multiThreads.sh script:
• --output-dir: absolute path to the output directory. Required.
• --genome-reference: absolute path to the genome reference, assuming the index files for the aligner (i.e., BWA) are available as well. Required.
• --selector: if provided, in silico mutations will be created only in these regions.
• --tumor-bam-out: file name for the synthetic tumor BAM output. Default is syntheticTumor.bam.
• --tumor-bam-in: absolute path to the input tumor BAM file. In the scenario with two germline sequencing replicates, this is the sample in which in silico mutations will be created. In the scenario where a (relatively) high-coverage germline sample is split, this is the path to that germline BAM file. In the final scenario where two BAM files are merged, this can point to either of the two BAM files. Required.
• --normal-bam-out: file name for the synthetic normal BAM output. Default is syntheticNormal.bam.
• --normal-bam-in: absolute path to the input BAM file to be the designated normal. Required.
• --split-proportion: the fraction of total reads to be split into the designated normal. Not needed in the first scenario with two germline sequencing replicates. Otherwise the default is 0.5.
• --down-sample: downsamples the BAM files when creating synthetic tumor and normal BAM files. Default is 1, that is, no downsampling.
• --num-snvs: number of in silico SNVs to attempt. The actual number of SNVs will usually be lower because an attempt is skipped when certain conditions are not met (e.g., depth too low).
• --num-indels: number of in silico indels to attempt.
• --num-svs: number of in silico SVs to attempt. Default is 0.
• --min-vaf: minimum variant allele frequency to create.
• --max-vaf: maximum variant allele frequency to create.
• --left-beta: left beta for the beta distribution of the VAF.
• --right-beta: right beta for the beta distribution of the VAF.
• --min-depth: minimum depth to attempt a mutation.
• --max-depth: maximum depth to attempt a mutation.
• --min-variant-reads: minimum number of variant reads to create for each in silico mutation.
• --aligner: which aligner to use to remap a read after the read is mutated in silico. Default is bwa mem.
• --seed: choose a random number generator seed for reproducibility purposes.
• --action: what to do with the workflow scripts generated. Default is echo.
• --threads: number of threads. You would have to merge the results from each thread after all the threads are completed successfully.
• --merge-bam: flag to merge the input tumor and normal BAM files.
• --split-bam: flag to split the BAM files into designated tumor and normal.
• --clean-bam: flag to clean up the input BAM files if there are more than two reads with the same read name, by simply removing them.
• --indel-realign: flag to perform GATK's joint indel realignment for the designated tumor and normal BAM files.
• --merge-output-bams: flag to merge the output BAM and VCF files from different threads. You should use this for multithreaded jobs.
• --keep-intermediates: keep all the intermediate files for debugging purposes.
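As an illustration of what the --clean-bam description implies, the sketch below drops every read whose name occurs more than twice. It is a toy model under stated assumptions, not the tool's implementation: reads are represented as (name, sequence) tuples, and "removing them" is interpreted as removing all reads that share the over-represented name.

```python
from collections import Counter


def drop_overrepresented_names(reads, max_per_name=2):
    """Remove every read whose name occurs more than max_per_name times."""
    counts = Counter(name for name, _ in reads)
    return [(name, seq) for name, seq in reads if counts[name] <= max_per_name]


# Toy (read name, sequence) records; a real BAM record carries many more fields.
reads = [
    ("r1", "ACGT"), ("r1", "TTGA"),                   # a normal read pair
    ("r2", "GGCA"), ("r2", "GGCA"), ("r2", "CCAT"),   # three reads share one name
    ("r3", "ATAT"),                                   # a single read
]
print(drop_overrepresented_names(reads))
```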
4 Notes
1. In SomaticSeq algorithms, each variant is defined by its genomic start position, reference base(s), and variant base(s), that is, the following four fields in a VCF file: CHROM, POS, REF, and ALT. Different mutations at the same genomic position are considered different variant calls and will have different features extracted.
2. SomaticSeq allows the input of an inclusion region. Without it, SomaticSeq will assume the whole genome, using the index file of the genome reference (typically .fa.fai). However, even for whole genome data, we recommend that you create a BED file for the whole genome that only includes the major chromosomes (i.e., chr1, chr2, ..., chrX, chrY), because the human reference files often include alternate contigs, viral contigs, decoy contigs, chrM, and so on. Reads aligned to those contigs tend to be poorly aligned but often have very high apparent coverage, so some mutation callers spend a disproportionate amount of compute time attempting local assembly to resolve variant calls in these contigs, with little benefit. A BED file can be created to simply exclude those regions.
3. Generally speaking, machine learning works increasingly better
with larger data sets. In our original publication, we measured
the accuracy of SomaticSeq cross validation with increasing
data size and have found that generally, the accuracy plateaus
when there are >500 true mutations in the training data
(assuming there are more false positives than true positives;
Fig. 3).
4. There are cases where users need to combine samples to create a larger training set; for example, mutation call sets from a targeted panel or even whole exome sequencing may not be large enough to create a reliable classifier. To do so, you first need to merge the Ensemble.sSNV.tsv files and the Ensemble.sINDEL.tsv files, keeping only one header. Make sure those files were created with the same version and parameters of SomaticSeq, so that each column means the same thing across files. The files can be combined like this:
Fig. 3 Figure adapted from [12]. Y-axis (0.4 to 1) represents the F1 score of cross validation, and X-axis (0 to 1000) represents the number of true somatic mutations in a training data set. The four different plots represent four different data sets (DC3A, DC3D, N0T50, and N2.5T15)
cat /PATH/sampleA/SomaticSeq/Ensemble.sSNV.tsv /PATH/sampleB/SomaticSeq/Ensemble.sSNV.tsv | awk 'NR==1 || $1 !~ /^CHROM/' > Combined.sSNV.tsv
cat /PATH/sampleA/SomaticSeq/Ensemble.sINDEL.tsv /PATH/sampleB/SomaticSeq/Ensemble.sINDEL.tsv | awk 'NR==1 || $1 !~ /^CHROM/' > Combined.sINDEL.tsv
Then, you can invoke the machine learning training scripts in R directly:
r_scripts/ada_model_builder_ntChange.R Combined.sINDEL.tsv Consistent_Mates Inconsistent_Mates Strelka_QSS Strelka_TQSS
r_scripts/ada_model_builder_ntChange.R Combined.sSNV.tsv Consistent_Mates Inconsistent_Mates
The values after the TSV files are features to be excluded from training. These features have not yet been shown to improve accuracy, so they are excluded by default.
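The header-preserving merge in Note 4 can also be sketched in Python. Unlike the awk one-liner, this version additionally refuses to merge tables whose header lines differ, which is one way to catch files produced with different versions or parameters. The function name and the toy columns are assumptions for illustration; real Ensemble TSV files have many more columns.

```python
def merge_tsv_lines(*tables):
    """Concatenate tables (lists of lines), keeping a single copy of the header."""
    header = tables[0][0]
    merged = [header]
    for table in tables:
        # Identical headers are a cheap check that the columns mean the same thing.
        if table[0] != header:
            raise ValueError("column headers differ; the files are not comparable")
        merged.extend(table[1:])
    return merged


# Toy tables with made-up rows.
sample_a = ["CHROM\tPOS\tREF\tALT", "chr1\t100\tA\tT"]
sample_b = ["CHROM\tPOS\tREF\tALT", "chr2\t200\tG\tC"]
print("\n".join(merge_tsv_lines(sample_a, sample_b)))
```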
References
1. Cibulskis K, Lawrence MS, Carter SL et al (2013) Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 31(3):213–219
2. Koboldt DC, Zhang Q, Larson DE et al (2012) VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 22(3):568–576
3. Roth A, Ding J, Morin R et al (2012) JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics 28(7):907–913
4. Larson DE, Harris CC, Chen K et al (2012) SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28(3):311–317
5. Lai Z, Markovets A, Ahdesmaki M et al (2016) VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res 44(11):e108
6. Fan Y, Xi L, Hughes DST et al (2016) MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data. Genome Biol 17(1):178
7. Wilm A, Aw PPK, Bertrand D et al (2012) LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res 40(22):11189–11201
8. Narzisi G, O’Rawe JA, Iossifov I et al (2014) Accurate de novo and transmitted indel detection in exome-capture data using microassembly. Nat Methods 11(10):1033–1036
9. Kim S, Scheffler K, Halpern AL et al (2018) Strelka2: fast and accurate calling of germline and somatic variants. Nat Methods 15(8):591–594
10. Freed D, Pan R, Aldana R (2018) Tnscope: accurate detection of somatic mutations with haplotype-based variant candidate detection and machine learning filtering. bioRxiv
11. Thorvaldsdottir H, Robinson JT, Mesirov JP (2013) Integrative genomics viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 14(2):178–192
12. Fang LT, Afshar PT, Chhibber A et al (2015) An ensemble approach to accurately detect somatic mutations using somaticseq. Genome Biol 16(1):197
13. Johnson K, Culp M, Michailides G (2006) ada: an R package for stochastic boosting. J Stat Softw 17(2)
14. Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6):841–842
15. Ewing AD, Houlahan KE, Hu Y et al (2015) Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nat Methods 12(7):623–630
16. Genome in a bottle. https://www.nist.gov/programs-projects/genome-bottle
17. First publicly available XTen genome. http://allseq.com/knowledge-bank/1000-genome/get-your-1000-genome-test-data-set/
18. Roberts ND, Daniel Kortschak R, Parker WT et al (2013) A comparative analysis of algorithms for somatic snv detection in cancer. Bioinformatics 29(18):2223–2230
19. Li H (2013) Aligning sequence reads, clone sequences and assembly contigs with bwa-mem