
SomaticSeq: An Ensemble and Machine Learning Method to Detect Somatic Mutations

March 2020
Methods in molecular biology (Clifton, N.J.) 2120:47-70


DOI:10.1007/978-1-0716-0327-7_4

In book: Bioinformatics for Cancer Immunotherapy (pp.47-70)

Authors:

Li Tai Fang

Bina Technologies, part of Roche Sequencing

A standard strategy to discover somatic mutations in a cancer genome is to use next-generation sequencing (NGS) technologies to sequence the tumor tissue and its matched normal (commonly blood or adjacent normal tissue) for side-by-side comparison. However, when interrogating entire genomes (or even just the coding regions), the number of sequencing errors easily outnumbers the number of real somatic mutations by orders of magnitude. Here, we describe SomaticSeq, which incorporates multiple somatic mutation detection algorithms and then uses machine learning to vastly improve the accuracy of the somatic mutation call sets.

[Figure preview: Two additional scenarios to create synthetic tumor-normal pairs.]

[Figure preview, adapted from [12]: the Y-axis represents the F1 score of cross validation, and the X-axis represents the number of true somatic mutations in a training data set. The four plots represent four different data sets.]

Chapter 4


Key words: Somatic mutations, Sequencing, Bioinformatics, Machine learning, Ensemble method

1 Introduction

To discover somatic mutations in a cancer genome, the tumor and its matched normal tissues (commonly blood or adjacent normal tissue) are typically sequenced side-by-side, with the normal acting as a control to filter out germline variants. However, the whole human genome consists of over 3 billion base pairs, and the coding regions alone make up over 30 million base pairs. There is a plethora of modern algorithms developed by different research groups to detect somatic mutations in such data sets [1–11]. However, they usually produce more false positive calls than actual somatic mutations.

SomaticSeq achieves higher accuracy through the following steps [12]:

1. Combine the call sets from the somatic mutation callers that have been incorporated into the SomaticSeq workflow.

2. For each somatic variant call (see Note 1 for SomaticSeq’s definition of a unique variant call), extract genomic and sequencing features from the tumor and normal BAM files.

3. Deploy an adaptive boosting [13] machine learning classifier to separate the false positives from the true mutations. The classifiers can be trained on semisimulated data sets, which we describe in Subheading 3.4.

Sebastian Boegel (ed.), Bioinformatics for Cancer Immunotherapy: Methods and Protocols, Methods in Molecular Biology, vol. 2120, https://doi.org/10.1007/978-1-0716-0327-7_4, © Springer Science+Business Media, LLC, part of Springer Nature 2020

2 Materials

SomaticSeq is freely available under the BSD 2-Clause open source license. The source code is located at https://github.com/bioinform/somaticseq. SomaticSeq Docker images can be found at https://hub.docker.com/r/lethalfang/somaticseq.

2.1 Software

To run SomaticSeq, the following tools and packages must be installed in a Linux or Unix environment:

1. SomaticSeq was developed in Python 3 in a Linux environment. We recommend using Python 3.5 or newer. The Python libraries NumPy (v1.13 or newer), SciPy (v1.0 or newer), and pysam (v0.13 or newer) are also required. Currently, SomaticSeq’s latest stable version is v3.3.0, and the protocols in this book represent the version 3 branch of SomaticSeq. There have been substantial improvements in features and stability since SomaticSeq was first published in 2015.

2. SomaticSeq also uses BEDTools [14] (v2 or newer) to manipulate BED file inputs, that is, regions to include and/or exclude in the workflow.

3. R (v3.2 or newer) and its ada library (v2.0.5 or newer) provide the machine learning algorithm.

4. At its core, SomaticSeq combines and then filters the results of multiple somatic mutation detection algorithms based on many sequencing features. Generally speaking, at least one compatible caller needs to be run to generate a list of mutation candidates for SomaticSeq to evaluate. It is compatible with the following callers: the original MuTect/Indelocator as well as GATK4’s Mutect2 [1], VarScan2 [2], JointSNVMix2 [3], SomaticSniper [4], VarDict [5], MuSE [6], LoFreq [7], Scalpel [8], Strelka2 [9], TNscope [10], and Platypus [11].

5. Docker (http://www.docker.com) is a container technology that packages software and its dependencies into portable Docker images, which can be used to execute a workflow reproducibly across different platforms and environments. SomaticSeq does not require Docker per se, but we have created Docker images of it, along with a number of compatible somatic mutation callers, to make life easier for new users. The advantage of container technologies like Docker is that one does not have to create the right software environment with the correct dependencies for every tool in a workflow. For example, once a Docker image for MuTect2 exists, users can simply use that image for MuTect2 tasks. Otherwise, they must make sure to have the correct Java version and other dependencies to run MuTect2, and that those dependencies do not conflict with other software that may need different Java versions.

2.2 Download and Install SomaticSeq

The source code of SomaticSeq is available from its GitHub repository: https://github.com/bioinform/somaticseq. The latest source code can be cloned with the following git command:

git clone https://github.com/bioinform/somaticseq.git

Alternatively, a fixed version may be downloaded and unpacked, for example,

wget https://github.com/bioinform/somaticseq/archive/v3.3.0.tar.gz
tar -xvf v3.3.0.tar.gz

Installation is not required to run SomaticSeq. You may specify the full path of all the SomaticSeq scripts. Nevertheless, the simplest way to place SomaticSeq executables in your $PATH is to install it:

cd somaticseq
./setup.py install

To have a fully functional SomaticSeq, you must also install BEDTools and R (and the ada library) as specified in Subheading 2.1 and place them in your execution $PATH.
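Before running anything, it can help to confirm that the prerequisites listed in Subheading 2.1 are visible from your environment. The following is a minimal sketch and not part of SomaticSeq: the function name check_prerequisites is hypothetical, it assumes the executables are called bedtools and Rscript, and it does not verify version numbers or the presence of the R ada library.

```python
import importlib.util
import shutil


def check_prerequisites(modules=("numpy", "scipy", "pysam"),
                        executables=("bedtools", "Rscript")):
    """Return a dict mapping each prerequisite name to True or False.

    A Python module counts as available if the interpreter can locate it;
    an executable counts as available if it is found on the current $PATH.
    """
    status = {}
    for name in modules:
        status[name] = importlib.util.find_spec(name) is not None
    for name in executables:
        status[name] = shutil.which(name) is not None
    return status


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Anything reported as missing should be installed, or added to $PATH, before moving on to Subheading 3.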

3 Methods

3.1 Running SomaticSeq in Tumor-Normal Paired Mode

To see the global parameters shared by the tumor-normal paired mode and the tumor-only single mode,

somaticseq_parallel.py --help

To see the input parameters specific to the tumor-normal paired mode,

somaticseq_parallel.py paired --help

Fig. 1 The three modes for SomaticSeq. [Figure not reproduced. In every mode, the tumor and normal BAM files are processed by the callers (MuTect2, VarScan2, JointSNVMix2, SomaticSniper, VarDict, MuSE, LoFreq, Scalpel, Strelka2, TNscope, Platypus), the somatic mutation call sets are combined, and features are extracted (caller classification, mapping metrics, alignment metrics, dbSNP membership, strand bias). One panel shows majority-vote consensus, with mutation calls labeled PASS, LowQ, or REJECT alongside the number of supporting callers; one shows training of an AdaBoost machine learning classifier from two normal replicates with mutations spiked in as ground truth; one shows a trained SomaticSeq classifier labeling mutation calls with probability scores.]

The following is an example command to run SomaticSeq’s core algorithm after completing some (or all) of the compatible somatic mutation caller(s). This command will invoke the default consensus mode (Fig. 1a). Keep in mind that not all the input VCF files from all the somatic callers are required. Which somatic mutation callers to run is your choice, based on how well they work on your data sets and on the cost and availability of compute resources:

somaticseq_parallel.py \

--output-directory OUTPUT_DIR \

--genome-reference GRCh38.fa \

--inclusion-region genome.bed \

--exclusion-region blacklist.bed \

--threads 12 \

paired \

--tumor-bam-file tumor.bam \

--normal-bam-file matched_normal.bam \

--mutect2-vcf MuTect2.vcf \

--varscan-snv VarScan2.snp.vcf \

--varscan-indel VarScan2.indel.vcf \

--jsm-vcf JointSNVMix2.vcf \

--somaticsniper-vcf SomaticSniper.vcf \

--vardict-vcf VarDict.vcf \

--muse-vcf MuSE.vcf \

--lofreq-snv LoFreq.somatic_final.snvs.vcf.gz \

--lofreq-indel LoFreq.somatic_final.indels.vcf.gz \

--scalpel-vcf Scalpel.vcf \

--strelka-snv Strelka/results/variants/somatic.snvs.vcf.gz \

--strelka-indel Strelka/results/variants/somatic.indels.vcf.gz \

--tnscope-vcf TNscope.filtered.vcf.gz \

--platypus-vcf Platypus.vcf
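Because no individual caller is mandatory, the command above can also be assembled programmatically from whichever caller outputs exist. The sketch below is illustrative only: build_paired_command is a hypothetical helper, not part of SomaticSeq, and the only flags it emits are the ones shown in the example command.

```python
import os


def build_paired_command(output_dir, reference, tumor_bam, normal_bam,
                         caller_vcfs, threads=1, exists=os.path.exists):
    """Assemble a somaticseq_parallel.py paired command as an argument list.

    caller_vcfs maps a caller flag from the example above (for instance
    "--mutect2-vcf") to a file path. Entries whose file is missing are
    skipped. `exists` is the predicate used to test for each file.
    """
    # Global arguments go before "paired"; sample-specific ones go after it.
    cmd = ["somaticseq_parallel.py",
           "--output-directory", output_dir,
           "--genome-reference", reference,
           "--threads", str(threads),
           "paired",
           "--tumor-bam-file", tumor_bam,
           "--normal-bam-file", normal_bam]
    used = 0
    for flag, path in caller_vcfs.items():
        if exists(path):
            cmd += [flag, path]
            used += 1
    if used == 0:
        raise ValueError("at least one caller VCF is needed")
    return cmd
```

The returned list can be handed to subprocess.run, or joined into a single line for a job script.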

If you have SomaticSeq classifiers that you want to use to evaluate/score/classify the mutation candidates (Fig. 1b), point to them before the paired option, that is,

--classifier-snv Ensemble.sSNV.tsv.ntChange.Classifier.RData \
--classifier-indel Ensemble.sINDEL.tsv.ntChange.Classifier.RData \

If this is a training data set from which you want to create SomaticSeq classifiers (Fig. 1c), prepare two VCF files containing only the true variants, one for SNVs and one for indels, and place them before the paired option along with the --somaticseq-train flag. Every variant call in the inclusion region but not in the truth set will be considered a false positive, that is,

--truth-snv TruePositive.snv.vcf \

--truth-indel TruePositive.indel.vcf \

--somaticseq-train \


3.1.1 Inputs and Parameters

The paired argument puts SomaticSeq into tumor-normal paired mode. The caller outputs (VCF files) and the tumor-normal BAM files are placed after the paired argument. Everything else that is agnostic of paired or single sample mode goes before the paired argument, for example, genome reference, ground truth VCF files, inclusion/exclusion regions, and resource files such as dbSNP, COSMIC, and so on.

Arguments placed before the paired argument

• --output-directory: the path to the output directory. Default is the current directory.
• --genome-reference: genome reference file in FASTA format (typically with a .fa or .fasta extension). Always required; its index file (.fa.fai) and dict file (.dict) must also exist.
• --truth-snv: a VCF file containing true positive SNVs. When included, every SNV call in this VCF file will be labeled a true positive, and everything else a false positive. This is required in training mode. (If an inclusion region file is specified, only calls within the inclusion regions are considered. If there is an exclusion region file, calls inside the exclusion regions are ignored.)
• --truth-indel: same as above, but for indels.
• --somaticseq-train: flag to invoke training mode. When invoked, SomaticSeq classifiers will be created if truth-snv and/or truth-indel files are specified.
• --classifier-snv: the trained SomaticSeq SNV classifier (.RData). When this file is specified, prediction mode is invoked automatically.
• --classifier-indel: same as above, but for indels.
• --pass-threshold: in prediction mode, the threshold (between 0 and 1) above which a call will be labeled PASS. Default is 0.5.
• --lowqual-threshold: in prediction mode, the threshold above which (but below the PASS threshold) a call will be labeled LowQual. Default is 0.1.
• --homozygous-threshold: a variant allele frequency (VAF) threshold, above which the GT field will be labeled 1/1. Default = 0.85.
• --heterozygous-threshold: a VAF above which (but below the homozygous threshold) the GT will be labeled 0/1. Default = 0.01.
• --minimum-mapping-quality: the minimum mapping quality score for a read to be counted. Default = 1.
• --minimum-base-quality: the minimum base-call quality score for a base to be counted. Default = 5.
• --minimum-num-callers: the minimum number of callers that must call a variant candidate for it to be included in the combined call set. Default is 0.5, which means that some calls labeled only LowQual by a single caller will be included.
• --dbsnp-vcf: dbSNP VCF file. If included, dbSNP membership can be used as a feature in machine learning.
• --cosmic-vcf: COSMIC VCF file. If included, it is only used for annotation of the variants. COSMIC is not a SomaticSeq feature.
• --inclusion-region: if included, only variant calls in these regions will be considered.
• --exclusion-region: if included, variants in these regions will be excluded. If a call is in both inclusion and exclusion regions, it will be excluded.
• --threads: number of threads. The job is split into equal-sized regions (based on the inclusion BED file or the .fa.fai file), one per thread. Default = 1.
• --keep-intermediates: a flag to tell SomaticSeq not to delete any intermediate files, for debugging purposes.

Parallel processing is achieved by splitting the inclusion BED file into a number of temporary BED files of equal size (i.e., the same number of base pairs per BED file), named 1.th.input.bed, 2.th.input.bed, ..., n.th.input.bed. Each process is then run with one temporary BED file as its inclusion BED file. If there is no inclusion BED file in the command arguments, the reference genome’s index file (e.g., GRCh38.fa.fai) is split instead.
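The equal-size split described above can be sketched as follows. This is an illustration of the idea rather than SomaticSeq’s actual implementation: split_regions is a hypothetical function, and it assumes BED-style 0-based, half-open intervals that do not overlap.

```python
def split_regions(regions, n):
    """Split BED-style regions into n chunks of nearly equal total length.

    regions is a list of (chrom, start, end) tuples. Regions are cut where
    necessary, so chunk sizes differ from each other by at most one base pair.
    """
    total = sum(end - start for _, start, end in regions)
    # Chunk i receives total*(i+1)//n - total*i//n base pairs; these sum to total.
    sizes = [total * (i + 1) // n - total * i // n for i in range(n)]
    chunks = [[] for _ in range(n)]
    i = 0              # index of the chunk currently being filled
    room = sizes[0]    # base pairs still missing from chunk i
    for chrom, start, end in regions:
        while start < end:
            while room == 0:       # current chunk is full, move to the next one
                i += 1
                room = sizes[i]
            take = min(end - start, room)
            chunks[i].append((chrom, start, start + take))
            start += take
            room -= take
    return chunks
```

Each returned chunk corresponds to one temporary BED file, for example the first chunk to 1.th.input.bed.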

Arguments placed after the paired argument

• --tumor-bam: sorted and indexed tumor BAM file. Required.
• --normal-bam: sorted and indexed normal BAM file. Required.
• --tumor-sample: tumor sample name to place in the header of the output VCF file. Default is TUMOR.
• --normal-sample: normal sample name to place in the header of the output VCF file. Default is NORMAL.
• --mutect-vcf: VCF file output from the original MuTect v1. (Optional)
• --indelocator-vcf: VCF file output from Indelocator. (Optional)
• --mutect2-vcf: VCF file output from GATK4’s Mutect2 followed by FilterMutectCalls. (Optional)
• --varscan-snv: the SNP VCF file output from VarScan2. (Optional)
• --varscan-indel: the indel VCF file output from VarScan2. (Optional)
• --jsm-vcf: JointSNVMix2’s output converted into a VCF file by SomaticSeq. (See Subheading 3.3; Optional)
• --somaticsniper-vcf: VCF file output from SomaticSniper. (Optional)
• --vardict-vcf: VCF file output from VarDict. (Optional)
• --muse-vcf: VCF file output from MuSE. (Optional)
• --lofreq-snv: somatic SNV VCF file output from LoFreq. (Optional)
• --lofreq-indel: somatic indel VCF file output from LoFreq. (Optional)
• --scalpel-vcf: VCF file output from Scalpel. (Optional)
• --strelka-snv: somatic SNV VCF file output from Strelka2. (Optional)
• --strelka-indel: somatic indel VCF file output from Strelka2. (Optional)
• --tnscope-vcf: VCF file output from the Sentieon TNscope caller. (Optional)
• --platypus-vcf: VCF file output from Platypus. (Optional)

SomaticSeq supports any combination of the somatic mutation callers we have incorporated into the workflow. SomaticSeq will run based on the output VCFs you have provided. It will train to create SNV and/or indel classifiers if you provide the truePositives.snv.vcf and/or truePositives.indel.vcf file(s) and invoke the --somaticseq-train option. If you provide trained classifiers instead, it will run in prediction mode. Otherwise, it will fall back to the simple caller consensus mode.

3.2 Running SomaticSeq in Tumor-Only Single Mode

SomaticSeq also supports tumor-only mode, in which case the paired option is replaced with the single option. It has the same set of global input parameters as the paired mode, but a different set of arguments/options to be placed after the single option. To see what they are, you may run the following:

somaticseq_parallel.py single --help

3.2.1 Inputs and Parameters for Tumor-Only Mode

Arguments placed before the single option are the same as the ones described in Subheading 3.1.1. The following options are placed after the single option:

• --bam-file: the BAM file for the tumor sample. Required.
• --sample-name: sample name to place into the VCF file. Default is TUMOR. (Optional)
• --mutect-vcf: VCF file output from the original MuTect v1. (Optional)
• --mutect2-vcf: VCF file output from GATK4’s Mutect2 followed by FilterMutectCalls. (Optional)
• --varscan-vcf: VCF file with both SNVs and indels from VarScan2. (Optional)
• --vardict-vcf: VCF file from VarDict. (Optional)
• --lofreq-vcf: VCF file with both SNVs and indels from LoFreq. (Optional)
• --scalpel-vcf: VCF file from Scalpel. (Optional)
• --strelka-vcf: VCF file with both SNVs and indels from Strelka. (Optional)

3.2.2 Interpreting the Output Files

SomaticSeq will output a number of TSV and VCF files as results. The SNVs and indels are in separate files. The TSV files contain all the genomic and sequencing features extracted from the BAM files and/or some fields of the individual callers’ output. If the truth VCF files are supplied, each variant candidate will also be labeled true positive or false positive. The header of the TSV file describes each column. Missing values will be “nan”; for example, tBAM_NM_Diff (the difference in average edit distances between variant-supporting and reference-supporting reads) is “nan” when there is no reference-supporting read.
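When post-processing the TSV files, the “nan” entries need to be handled explicitly. The following sketch uses only the Python standard library; read_feature_rows is a hypothetical helper, and the column names in the usage note are examples rather than a complete description of the SomaticSeq header.

```python
import csv
import math


def read_feature_rows(lines, keep_as_text=()):
    """Yield one dict per variant from a tab-separated feature table.

    lines is any iterable of text lines, such as an open file. Values that
    parse as numbers become floats; the string "nan" becomes float("nan"),
    which math.isnan can detect. Columns named in keep_as_text, and any
    value that is not numeric, are left as strings.
    """
    for row in csv.DictReader(lines, delimiter="\t"):
        parsed = {}
        for column, value in row.items():
            if column in keep_as_text:
                parsed[column] = value
                continue
            try:
                parsed[column] = float(value)
            except (TypeError, ValueError):
                parsed[column] = value
        yield parsed
```

For instance, passing keep_as_text=("CHROM",) would keep numeric chromosome names such as “1” as text, and math.isnan(row["tBAM_NM_Diff"]) would flag variants without reference-supporting reads.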

The VCF files and the TSV files contain the same variants, though the VCF files only record a subset of the SomaticSeq features recorded in the TSV files. The descriptions are in the VCF headers, but we will describe them here. The QUAL column is a standard column in VCF format. This column contains the Phred-scaled SomaticSeq score (i.e., probability) if the VCF file was generated in prediction mode. By default, variants with probability ≥ 0.5 (Phred-scaled QUAL ≥ 3.01) are labeled PASS. Variants with 0.1 ≤ probability < 0.5 (0.458 ≤ QUAL < 3.01) are labeled LowQual, and those under that threshold are labeled REJECT. In consensus mode, however, the QUAL column will always be zero, and the PASS/LowQual/REJECT labels are determined by the consensus of the callers you have incorporated into each workflow. If the majority of the callers (i.e., >50%) consider a variant a high-confidence somatic mutation, the variant will be labeled PASS. If at least 1/3 of the callers (i.e., ≥33%) do so, the variant will be labeled LowQual. Otherwise, the variant will be labeled REJECT. As an example, if five SNV callers are used, a variant needs at least three callers to be labeled PASS and two callers to be labeled LowQual. If it is called by only one caller, it will be labeled REJECT.
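The labeling rules above can be written down compactly. The sketch below is not SomaticSeq code; the function names are hypothetical, the Phred scaling is assumed to be -10·log10(1 - p) because that reproduces the 3.01 and 0.458 values quoted in the text, and the behavior exactly at a threshold is an assumption.

```python
import math


def phred_qual(probability):
    """Phred-scale the probability p of a true mutation as -10*log10(1 - p)."""
    if probability >= 1:
        return float("inf")
    return -10 * math.log10(1 - probability)


def label_by_score(probability, pass_threshold=0.5, lowqual_threshold=0.1):
    """Prediction mode: label a call from its classifier probability."""
    if probability >= pass_threshold:
        return "PASS"
    if probability >= lowqual_threshold:
        return "LowQual"
    return "REJECT"


def label_by_consensus(num_positive, num_callers):
    """Consensus mode: label a call from the number of callers supporting it."""
    if 2 * num_positive > num_callers:      # strict majority, i.e. > 50%
        return "PASS"
    if 3 * num_positive >= num_callers:     # at least one third of the callers
        return "LowQual"
    return "REJECT"
```

With five callers, label_by_consensus returns PASS for three supporting callers, LowQual for two, and REJECT for one, which matches the example in the text.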

In the INFO column of the VCF file, there is a string (e.g., MDUK=1,1,0,1) that tells you each caller’s binary classification of the variant (1 for positive classification and 0 otherwise). The string varies depending on the callers incorporated in the workflow, that is, M = MuTect/Indelocator/MuTect2, V = VarScan2, J = JointSNVMix2, S = SomaticSniper, D = VarDict, U = MuSE, L = LoFreq, P = Scalpel, K = Strelka, T = TNscope, and Y = Platypus. NUM_TOOLS tells you the number of callers that classified the variant as a somatic mutation.
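A caller string of this kind is straightforward to decode. The following sketch is illustrative: parse_caller_flags is a hypothetical helper, and the letter-to-caller table simply restates the mapping given above.

```python
CALLER_CODES = {
    "M": "MuTect/Indelocator/MuTect2", "V": "VarScan2", "J": "JointSNVMix2",
    "S": "SomaticSniper", "D": "VarDict", "U": "MuSE", "L": "LoFreq",
    "P": "Scalpel", "K": "Strelka", "T": "TNscope", "Y": "Platypus",
}


def parse_caller_flags(info_entry):
    """Turn an INFO entry such as "MDUK=1,1,0,1" into {caller name: bool}."""
    key, _, value = info_entry.partition("=")
    flags = value.split(",")
    if len(key) != len(flags):
        raise ValueError(f"{len(key)} caller codes but {len(flags)} flags")
    # Pair each one-letter code with its 0/1 flag, in order.
    return {CALLER_CODES[code]: flag == "1" for code, flag in zip(key, flags)}
```

Summing the values of the returned dictionary gives the number of callers with a positive classification, which should agree with NUM_TOOLS.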

The following metrics are in the sample columns, for the tumor

and normal samples separately:

• GT: genotyping. By default, it will be 1/1 if VAF ≥ 85%, 0/1 if between 1% and 85%, and 0/0 if VAF is under 1%.
• DP4: four numbers representing the number of forward reference-supporting reads, reverse reference-supporting reads, forward variant-supporting reads, and reverse variant-supporting reads.
• CD4: four numbers representing the number of concordant reference-supporting reads, discordant reference-supporting reads, concordant variant-supporting reads, and discordant variant-supporting reads.
• refMQ: average mapping quality score for reference-supporting reads.
• altMQ: average mapping quality score for variant-supporting reads.
• refBQ: average base-call quality score for reference-supporting bases.
• altBQ: average base-call quality score for variant-supporting bases.
• refNM: average edit distance between reference-supporting reads and the genome reference.
• altNM: average edit distance between variant-supporting reads and the genome reference.
• fetSB: Phred-scaled Fisher’s Exact Test score for DP4 to measure strand bias in variant-supporting versus reference-supporting reads.
• fetCD: Phred-scaled Fisher’s Exact Test score for CD4 to measure bias in concordant versus discordant reads in variant-supporting versus reference-supporting reads.
• zMQ: z-score for the mapping qualities of variant-supporting versus reference-supporting reads. The value will be positive if variant-supporting reads have higher MQs.
• zBQ: z-score for the base-call qualities of variant-supporting versus reference-supporting bases. The value will be positive if variant-supporting reads have higher BQs.
• MQ0: number of mapping quality 0 reads (multiply mapped reads) covering the variant position.
• VAF: variant allele frequency.
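To make the relationship between DP4, VAF, and GT concrete, here is a small sketch. It is not taken from SomaticSeq: both function names are hypothetical, the VAF is approximated from the DP4 counts alone (reads supporting any other allele are ignored, which is an assumption), and the thresholds are the defaults listed in Subheading 3.1.1.

```python
def vaf_from_dp4(dp4):
    """Approximate the variant allele frequency from the four DP4 counts.

    dp4 = (forward ref, reverse ref, forward variant, reverse variant).
    """
    ref_fwd, ref_rev, alt_fwd, alt_rev = dp4
    depth = ref_fwd + ref_rev + alt_fwd + alt_rev
    return (alt_fwd + alt_rev) / depth if depth else 0.0


def genotype(vaf, homozygous_threshold=0.85, heterozygous_threshold=0.01):
    """Assign the GT label from a VAF using the two thresholds."""
    if vaf >= homozygous_threshold:
        return "1/1"
    if vaf >= heterozygous_threshold:
        return "0/1"
    return "0/0"
```

For example, DP4 counts of 40, 40, 10, and 10 give a VAF of 0.2 and therefore a GT of 0/1.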


In the default consensus mode, the following files will be generated:

• Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv.
• Consensus.sSNV.vcf and Consensus.sINDEL.vcf.

In training mode, the same files will also be generated, and each variant will be annotated as a true positive or false positive (i.e., TruePositive or FalsePositive in the VCF’s ID column, and 1 or 0, respectively, in the TSV’s TrueVariant_or_False column). In addition, two classifiers will be created: Ensemble.sSNV.tsv.ntChange.Classifier.RData and Ensemble.sINDEL.tsv.ntChange.Classifier.RData.

If classifiers are supplied, the following files will be generated in prediction mode:

• Ensemble.sSNV.tsv, Ensemble.sINDEL.tsv, SSeq.Classified.sSNV.tsv, and SSeq.Classified.sINDEL.tsv.
• SSeq.Classified.sSNV.vcf and SSeq.Classified.sINDEL.vcf.

The difference between the SSeq.Classified files and their consensus counterparts is that the former are scored by SomaticSeq classifiers.

3.3 Running Compatible Somatic Mutation Callers

To make it easy for new users to get started, we have dockerized a number of commonly used somatic mutation callers that you may use before running SomaticSeq. SomaticSeq includes the makeSomaticScripts.py module that creates run scripts for those dockerized callers. Both tumor-normal paired runs and tumor-only jobs are supported.

To see the full options, you may run either of the following commands, one for tumor-normal (paired) mode and the other for tumor-only (single) mode.

makeSomaticScripts.py paired -h
makeSomaticScripts.py single -h

Here is an example that generates run scripts for the individual callers and for SomaticSeq, which combines the results of these callers. Do keep in mind that this module is not a core SomaticSeq algorithm. It simply calls a number of third-party software tools that we have dockerized. The run scripts we generate for these tools are not extensively optimized. They may not run the latest version, and we cannot guarantee that they are fully compatible with your compute environment. If that is the case, you need to consult the authors of those tools.

makeSomaticScripts.py paired \
--output-directory /ABSOLUTE/PATH/TO/SomaticOutput \
--tumor-bam /ABSOLUTE/PATH/TO/tumor.bam \
--normal-bam /ABSOLUTE/PATH/TO/normal.bam \
--genome-reference /ABSOLUTE/PATH/TO/GRCh38.fa \
--inclusion-region /ABSOLUTE/PATH/TO/inclusion_region.bed \
--exclusion-region /ABSOLUTE/PATH/TO/blacklist.bed \
--dbsnp-vcf /ABSOLUTE/PATH/TO/dbSNP.vcf \
--cosmic-vcf /ABSOLUTE/PATH/TO/COSMIC.vcf \
--threads 12 \
--run-mutect2 --run-vardict --run-muse --run-lofreq --run-strelka2 \
--run-somaticseq

The --threads 12 input tells the program to create 12 subdirectories named 1, 2, ..., 12 in the output directory. In each subdirectory, a BED file is created to represent 1/12 of the total base pairs in the inclusion region BED file. If a BED file is not supplied, those sub-BED files will be based on the index file for the genome reference, that is, the .fa.fai file, although we recommend supplying a BED file even for whole genome sequencing (see Note 2).

In each subdirectory, a run script (ending in .cmd) for each somatic mutation caller invoked by a --run-XXX flag is created in “logs.” You will need to execute these scripts. If you include --action qsub in the command, these scripts will be submitted to the compute management system via the qsub command. You may also include extra arguments there, for example, --action 'qsub -l h=node01'. In addition, a SomaticSeq directory is created in each of the 12 subdirectories. The run scripts SomaticSeq/logs/somaticSeq.timestamp.cmd must be executed or submitted with qsub manually after all the callers (in that thread) have completed successfully.

Furthermore, when more than one thread is invoked, there will be another script named SomaticOutput/logs/mergeResults.timestamp.cmd, which you may qsub or execute to merge the result files from all the threads together. SomaticSeq training will also be executed at this step if more than one thread is specified.
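The three stages described above (caller scripts, then SomaticSeq scripts, then the merge script) must run in that order. The sketch below groups script paths accordingly. It is an illustration under stated assumptions: order_run_scripts is a hypothetical helper, it expects paths relative to the output directory, and it assumes the directory layout described in the text with the default SomaticSeq directory name.

```python
from pathlib import PurePosixPath


def order_run_scripts(relative_paths):
    """Group .cmd scripts into the three stages in which they must run.

    Returns (caller scripts, SomaticSeq scripts, merge scripts), with the
    first two lists sorted by thread number.
    """
    callers, somaticseq, merge = [], [], []
    for raw in relative_paths:
        parts = PurePosixPath(raw).parts
        if len(parts) == 2 and parts[0] == "logs" and parts[1].startswith("mergeResults."):
            merge.append(raw)                      # logs/mergeResults.<timestamp>.cmd
        elif len(parts) == 4 and parts[0].isdigit() and parts[1:3] == ("SomaticSeq", "logs"):
            somaticseq.append(raw)                 # <thread>/SomaticSeq/logs/*.cmd
        elif len(parts) == 3 and parts[0].isdigit() and parts[1] == "logs":
            callers.append(raw)                    # <thread>/logs/<caller>.*.cmd
    by_thread = lambda p: (int(PurePosixPath(p).parts[0]), p)
    return sorted(callers, key=by_thread), sorted(somaticseq, key=by_thread), sorted(merge)
```

The relative paths can be collected, for example, with pathlib’s rglob("*.cmd") on the output directory. Each stage should finish successfully before the next one is started.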

3.3.1 Inputs and Parameters

• --output-directory: absolute path to output all the results. Default is the current directory.
• --somaticseq-directory: name of the SomaticSeq directory inside the output directory. Default is SomaticSeq.
• --tumor-bam: absolute path to the indexed tumor BAM file. This along with its .bai index file is required.
• --normal-bam: absolute path to the indexed normal BAM file. This along with its .bai index file is required.
• --tumor-sample-name: sample name to place in the VCF file as the tumor. Default is TUMOR.
• --normal-sample-name: sample name to place in the VCF file as the normal. Default is NORMAL.
• --genome-reference: genome reference file in FASTA format (typically with a .fa or .fasta extension). Always required; its index file (.fa.fai) and dict file (.dict) must also exist.
• --inclusion-region: if supplied, only calls within these regions will be considered.
• --exclusion-region: if supplied, calls inside these regions will be excluded.
• --dbsnp-vcf: dbSNP VCF file. This is required because some tools ask for it. Also required are the .vcf.gz file and the .vcf.gz.idx file if MuSE is invoked.
• --cosmic-vcf: COSMIC VCF file, purely for annotation purposes. Optional.
• --run-mutect2: a flag to create a script to run GATK4’s MuTect2.
• --run-varscan2: a flag to create a script to run VarScan2.
• --run-jointsnvmix2: a flag to create a script to run JointSNVMix2 (cannot be parallelized).
• --run-somaticsniper: a flag to create a script to run SomaticSniper (cannot be parallelized).
• --run-vardict: a flag to create a script to run VarDictJava.
• --run-muse: a flag to create a script to run MuSE.
• --run-lofreq: a flag to create a script to run LoFreq.
• --run-scalpel: a flag to create a script to run Scalpel.
• --run-strelka2: a flag to create a script to run Strelka2.
• --run-somaticseq: a flag to create a script to run SomaticSeq.
• --action: the command applied to the generated caller scripts. Default is “echo,” such that the paths of the scripts will be printed to the terminal, but nothing will be done with them. A common choice would be “qsub” if you want to submit those scripts to your compute queue system. You may also include arguments for the command by using single quotes, such as --action 'qsub -l h="node01|node02"'.
• --somaticseq-action: same as above, but for the SomaticSeq script. Keep in mind that SomaticSeq cannot be executed until all the individual caller jobs have completed. Default is echo.
• --snv-classifier: absolute path to the SomaticSeq SNV classifier, which will invoke prediction mode.
• --indel-classifier: absolute path to the SomaticSeq indel classifier, which will invoke prediction mode.

• --truth-snv: a VCF file containing true positive SNVs. When included, every SNV call in this VCF file will be labeled a true positive, and everything else a false positive. This is required in training mode. (If an exclusion region BED file is included, the calls inside the exclusion regions will be ignored.)
• --truth-indel: same as above, but for indels.
• --train-somaticseq: a flag to invoke training mode in the SomaticSeq script if truth SNV and/or indel VCF files are supplied. Default is False.
• --minimum-VAF: minimum variant allele frequency to be passed to the VarScan2 and VarDict callers. If not supplied, it will be the default 0.10 for VarScan2 and the recommended 0.05 for VarDict.
• --threads: number of threads. The job is split into equal-sized regions (based on the inclusion BED file or the .fa.fai file), one per thread. Default = 1.
• --exome-setting: a flag to invoke the exome setting in MuSE and Strelka2.
• --mutect2-arguments: extra arguments to pass to GATK4’s Mutect2 command. Use single quotes to include multiple words, for example, --mutect2-arguments '--min-base-quality-score 20 --tumor-lod-to-emit 10'.
• --mutect2-filter-arguments: extra arguments to pass to GATK4’s FilterMutectCalls command. Use single quotes to include multiple words, see --mutect2-arguments for example.
• --varscan-pileup-arguments: extra arguments to pass to samtools mpileup prior to VarScan2. Use single quotes to include multiple words, see --mutect2-arguments for example.
• --varscan-arguments: extra arguments to pass to VarScan2. Use single quotes to include multiple words, see --mutect2-arguments for example.
• --jsm-train-arguments: extra arguments to pass to JointSNVMix2’s train step. Use single quotes to include multiple words, see --mutect2-arguments for example.
• --jsm-classify-arguments: extra arguments to pass to JointSNVMix2’s classify step. Use single quotes to include multiple words, see --mutect2-arguments for example.
• --somaticsniper-arguments: extra arguments to pass to SomaticSniper. Use single quotes to include multiple words, see --mutect2-arguments for example.
• --vardict-arguments: extra arguments to pass to the vardict command. Use single quotes to include multiple words, see --mutect2-arguments for example.
• --muse-arguments: extra arguments to pass to MuSE. Use single quotes to include multiple words, see --mutect2-arguments for example.

60 Li Tai Fang

• --lofreq-arguments: extra arguments to pass on to LoFreq. Use single quotes to include multiple words; see --mutect2-arguments for an example.

• --scalpel-discovery-arguments: extra arguments to pass on to Scalpel's discovery step. Use single quotes to include multiple words; see --mutect2-arguments for an example.

• --scalpel-export-arguments: extra arguments to pass on to Scalpel's export step. Use single quotes to include multiple words; see --mutect2-arguments for an example.

• --scalpel-two-pass: a flag to invoke the two-pass option for Scalpel.

• --strelka-config-arguments: extra arguments to pass on to Strelka2's config step. Use single quotes to include multiple words; see --mutect2-arguments for an example.

• --strelka-run-arguments: extra arguments to pass on to Strelka2's run step. Use single quotes to include multiple words; see --mutect2-arguments for an example.

• --somaticseq-arguments: extra arguments to pass on to SomaticSeq. Use single quotes to include multiple words; see --mutect2-arguments for an example.
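To see why the single quotes matter, the following minimal Python sketch uses the standard shlex module to split a command line the way a POSIX shell does. The option name and its value are taken from the --mutect2-arguments example in the list above; the helper name split_like_shell is an assumption of this illustration and is not part of SomaticSeq.

```python
import shlex

# Option name and value copied from the --mutect2-arguments example in the text.
command_line = (
    "makeSomaticScripts.py paired "
    "--mutect2-arguments '--min-base-quality-score 20 --tumor-lod-to-emit 10'"
)


def split_like_shell(line):
    """Split a command line into tokens the way a POSIX shell would."""
    return shlex.split(line)


# With the quotes, the whole quoted string stays together as ONE token,
# so it is handed to --mutect2-arguments as a single value.
tokens = split_like_shell(command_line)
print(tokens)

# Without the quotes, every inner word becomes its own token and would be
# read as separate options of the outer command.
unquoted = split_like_shell(command_line.replace("'", ""))
print(len(tokens), len(unquoted))
```

The same reasoning applies to every other pass-through option in the list: the quoted string is forwarded as one value and only interpreted by the downstream tool.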

Keep in mind that makeSomaticScripts.py is not part of the core SomaticSeq algorithm; it is a helper to get things started. The run scripts generated by the makeSomaticScripts module pull the docker images we have created. The docker containers access the system files by mounting the root directory to /mnt inside the container, and then access the files through /mnt/PATH/TO/files. Thus, when pointing to paths and files in the makeSomaticScripts.py command, it is imperative to use absolute physical paths.
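As an illustration of what an absolute physical path is, the sketch below resolves a relative path with the Python standard library and shows where the file would appear inside a container that mounts the host root directory at /mnt. The function name to_container_path and the file name tumor.bam are hypothetical; the sketch assumes a POSIX file system and does not call docker itself.

```python
import os


def to_container_path(path, mount_point="/mnt"):
    """Return (host_path, container_path) for a file on the host.

    host_path is absolute with symbolic links resolved ("physical").
    container_path is where that file is seen when the host root
    directory is mounted at mount_point, as described in the text.
    """
    host_path = os.path.realpath(path)
    container_path = mount_point.rstrip("/") + host_path
    return host_path, container_path


# A relative path is resolved against the current working directory.
host, inside = to_container_path("./tumor.bam")
print(host)
print(inside)
```

Passing the resolved host path on the command line avoids surprises from relative paths and symbolic links, neither of which resolves the same way inside the container.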

3.4 Create Training Data Sets

SomaticSeq employs supervised machine learning that relies on good training data sets to create accurate classifiers. An ideal training data set for SomaticSeq would be a pair of real tumor-normal NGS data sets, with every true somatic mutation and false positive accurately labeled. However, no such data set exists right now. Nevertheless, an excellent approach is to use two germline sequencing replicates as background, and then create in silico mutations in one of them as the designated tumor [15]. This way, only the in silico mutations we have created are true mutations, and everything else represents replicate-to-replicate noise that makes up the false positives. This approach is the SomaticSeq training mode depicted in Fig. 1c. There are a number of well-characterized germline samples that have been repeatedly sequenced at multiple sequencing centers that you may try out [16, 17]. To make it easy to get started, we have included a number of scripts in SomaticSeq to run dockerized BAMSurgeon workflows to create training data on which SomaticSeq classifiers can be built.

3.4.1 Have Sequencing Replicates for the Normal

Here, we present an example of the entire workflow for creating SomaticSeq classifiers using two replicates of the NA12878 genome made public by the Garvan Institute [17].

First, download the .fastq.gz files for both replicates, NA12878D and NA12878J, and align them into BAM files. Inexperienced users may run the following command to generate a run script based on GATK's best practices to create the BAM files [18]. Run the command for both replicates D and J, assuming you have bwa-indexed reference files [19].

somaticseq/utilities/dockered_pipelines/alignments/fastq2bam_pipeline.sh \
    --output-dir /ABSOLUTE/PATH/TO/replicateD \
    --tumor-fq1 /ABSOLUTE/PATH/TO/NA12878D_HiSeqX_R1.fastq.gz \
    --tumor-fq2 /ABSOLUTE/PATH/TO/NA12878D_HiSeqX_R2.fastq.gz \
    --tumor-bam-header '@RG\tID:XTenD\tPL:illumina\tLB:X10\tSM:NA12878D' \
    --tumor-out-bam NA12878D.bam \
    --genome-reference /ABSOLUTE/PATH/TO/GRCh38.fa \
    --threads 12 \
    --bwa --pre-realign-markdup

In addition to the genome reference GRCh38.fa, there must also be GRCh38.fa.bwt, GRCh38.fa.sa, GRCh38.fa.pac, GRCh38.fa.ann, and GRCh38.fa.amb, which can be created with the "bwa index GRCh38.fa" command [19].
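Before launching a long alignment job, it can be convenient to confirm that all five index files are in place. The following is a minimal sketch using only the Python standard library; the suffixes are the ones listed above, while the function name missing_bwa_index_files and the temporary demonstration files are assumptions of this illustration.

```python
import os
import tempfile

# Suffixes listed in the text for the files written by "bwa index".
BWA_INDEX_SUFFIXES = (".bwt", ".sa", ".pac", ".ann", ".amb")


def missing_bwa_index_files(reference_fasta):
    """Return the expected bwa index files that do not exist on disk."""
    expected = [reference_fasta + suffix for suffix in BWA_INDEX_SUFFIXES]
    return [path for path in expected if not os.path.isfile(path)]


# Demonstration in a throwaway directory holding an incomplete index:
# only the FASTA, .bwt, .sa, and .pac files are created (as empty files).
with tempfile.TemporaryDirectory() as workdir:
    fasta = os.path.join(workdir, "GRCh38.fa")
    for suffix in ("", ".bwt", ".sa", ".pac"):
        open(fasta + suffix, "w").close()
    missing = [os.path.basename(p) for p in missing_bwa_index_files(fasta)]
    print(missing)
```

An empty list means the index is complete; otherwise, rerunning bwa index on the reference regenerates the missing files.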

Once you run the fastq2bam_pipeline.sh command above, a script /ABSOLUTE/PATH/TO/replicateD/logs/fastq2bam.timestamp.cmd will be created, which you may execute to align the reads into BAM files with BWA-MEM. Make sure to use different SM and ID tags in the BAM headers of the two BAM files (designated tumor and normal) because MuTect2 requires it.
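The requirement on the SM and ID tags can be checked before alignment with a few lines of Python. This is a minimal sketch under stated assumptions: the replicate D header is the one used in the command above, whereas the replicate J header is a hypothetical counterpart invented for this illustration, and the function names are not part of SomaticSeq.

```python
def parse_read_group(header_line):
    """Parse an @RG header line into a dict of tag -> value."""
    # Accept real tab characters or the literal two-character sequence \t,
    # which is how the header is written on the command line.
    fields = header_line.replace("\\t", "\t").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not a read-group line: " + header_line)
    return dict(field.split(":", 1) for field in fields[1:])


def read_groups_are_distinct(tumor_rg, normal_rg):
    """Return True if both the ID and the SM tags differ between samples."""
    tumor, normal = parse_read_group(tumor_rg), parse_read_group(normal_rg)
    return tumor["ID"] != normal["ID"] and tumor["SM"] != normal["SM"]


# Replicate D header from the text; replicate J header is hypothetical.
rg_d = r"@RG\tID:XTenD\tPL:illumina\tLB:X10\tSM:NA12878D"
rg_j = r"@RG\tID:XTenJ\tPL:illumina\tLB:X10\tSM:NA12878J"

distinct = read_groups_are_distinct(rg_d, rg_j)
same = read_groups_are_distinct(rg_d, rg_d)
print(distinct, same)
```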

Once NA12878D.bam and NA12878J.bam are created, the command below will designate NA12878D.bam as the tumor and create up to 20,000 SNVs and 8000 indels in NA12878D.bam. BAMSurgeon creates in silico mutations by changing the base(s) in a subset of the reads covering the genomic position, and then realigns the reads after they are synthetically mutated. A common question concerns the size of the training data, for example, how many samples or how many variants are needed. A rule of thumb is >500 true positive variants plus a larger number of false positives (see Note 3). The somatic mutation rate in the training data should not be vastly different from reality. If your data come from small targeted panels, you may need more than one tumor-normal pair to create a large enough training set (see Note 4). The resulting semisynthetic tumor-normal BAM files will be named syntheticTumor.bam and syntheticNormal.bam.

somaticseq/utilities/dockered_pipelines/bamSimulator/BamSimulator_multiThreads.sh \
    --output-dir /ABSOLUTE/PATH/TO/trainingSet \
    --genome-reference /ABSOLUTE/PATH/TO/GRCh38.fa \
    --tumor-bam-in /ABSOLUTE/PATH/TO/NA12878D.bam \
    --normal-bam-in /ABSOLUTE/PATH/TO/NA12878J.bam \
    --tumor-bam-out syntheticTumor.bam \
    --normal-bam-out syntheticNormal.bam \
    --num-snvs 20000 \
    --num-indels 8000 \
    --min-vaf 0.0 \
    --max-vaf 1.0 \
    --left-beta 2 \
    --right-beta 5 \
    --min-variant-reads 2 \
    --threads 12 \
    --action qsub \
    --merge-output-bams

Again, in addition to GRCh38.fa, the bwa index files GRCh38.fa.bwt, GRCh38.fa.sa, GRCh38.fa.pac, GRCh38.fa.ann, and GRCh38.fa.amb are also required.

The four parameters in the command above, that is, --min-vaf, --max-vaf, --left-beta, and --right-beta, determine the VAF distribution of the in silico mutations. The following Python script will display the VAF distribution of the settings in the previous command:
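A minimal sketch of such a script is given here. It assumes that the VAF is drawn from a beta distribution whose two shape parameters are --left-beta and --right-beta, rescaled linearly onto the interval from --min-vaf to --max-vaf; that reading of the options, and the helper names beta_pdf, vaf_pdf, and sample_vafs, are assumptions of this illustration rather than code taken from SomaticSeq or BAMSurgeon. It uses only the Python standard library and prints a text-mode density plot.

```python
import math
import random


def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution on (0, 1).

    The end points are reported as 0 for simplicity.
    """
    if x <= 0.0 or x >= 1.0:
        return 0.0
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)


def vaf_pdf(vaf, min_vaf=0.0, max_vaf=1.0, left_beta=2, right_beta=5):
    """VAF density, ASSUMING Beta(left_beta, right_beta) is stretched
    linearly from [0, 1] onto [min_vaf, max_vaf]."""
    width = max_vaf - min_vaf
    return beta_pdf((vaf - min_vaf) / width, left_beta, right_beta) / width


def sample_vafs(n, min_vaf=0.0, max_vaf=1.0, left_beta=2, right_beta=5, seed=0):
    """Draw n VAF values under the same assumption, reproducibly."""
    rng = random.Random(seed)
    width = max_vaf - min_vaf
    return [min_vaf + width * rng.betavariate(left_beta, right_beta)
            for _ in range(n)]


# Text-mode plot of the density for the settings in the command above
# (--min-vaf 0.0 --max-vaf 1.0 --left-beta 2 --right-beta 5).
for i in range(21):
    vaf = i / 20
    print(f"{vaf:4.2f} " + "#" * round(10 * vaf_pdf(vaf)))
```

Under this assumption the density for the example settings is 30x(1-x)^4, which peaks at a VAF of 0.2 and has a mean of 2/7 (about 0.29), so most simulated mutations would have a low VAF.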

Once all the threads are completed successfully, you may execute /ABSOLUTE/PATH/TO/trainingSet/logs/mergeFiles.timestamp.cmd to merge all the BAM files and ground truth VCF files (in silico mutations) into the output directory. These files may be used to train SomaticSeq classifiers, for example,

makeSomaticScripts.py paired \
    --normal-bam /ABSOLUTE/PATH/TO/trainingSet/syntheticNormal.bam \
    --tumor-bam /ABSOLUTE/PATH/TO/trainingSet/syntheticTumor.bam \
    --truth-snv /ABSOLUTE/PATH/TO/trainingSet/synthetic_snvs.vcf \
    --truth-indel /ABSOLUTE/PATH/TO/trainingSet/synthetic_indels.leftAlign.vcf \
    --genome-reference /ABSOLUTE/PATH/TO/GRCh38.fa \
    --output-directory /ABSOLUTE/PATH/TO/trainingSet/somaticMutations \
    --dbsnp-vcf /ABSOLUTE/PATH/TO/dbSNP.hg38.vcf \
    --inclusion-region /ABSOLUTE/PATH/TO/genome.bed \
    --action qsub \
    --threads 12 \
    --run-mutect2 --run-vardict --run-muse --run-strelka2 --run-somaticseq \
    --train-somaticseq

The --action qsub option submits the somatic mutation caller jobs via qsub. Otherwise, you may execute them yourself. Once all those jobs are complete, you may submit or execute all the SomaticSeq scripts created in /ABSOLUTE/PATH/TO/trainingSet/somaticMutations/1,2,3,.../SomaticSeq/logs/somaticSeq.timestamp.cmd.

After all the SomaticSeq threads are complete, you can finally submit or execute the script /ABSOLUTE/PATH/TO/trainingSet/somaticMutations/logs/mergeResults.timestamp.cmd to combine the results from the different threads. It will also train the labeled data sets into the SomaticSeq classifiers Ensemble.sSNV.tsv.ntChange.Classifier.RData and Ensemble.sINDEL.tsv.ntChange.Classifier.RData in the output directory. These classifiers may then be used to classify your own mutation calls.
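When many threads are used, collecting the per-thread SomaticSeq scripts by hand is error-prone. The sketch below lists them in thread order with the Python standard library. It assumes the directory layout described above (one numbered subdirectory per thread, each with SomaticSeq/logs inside); the function name list_thread_scripts is hypothetical, and because the format of the timestamp in the file name is not specified here, it is matched with a wildcard.

```python
import glob
import os
import tempfile


def thread_of(path, output_dir):
    """Name of the per-thread subdirectory that a script path belongs to."""
    return os.path.relpath(path, output_dir).split(os.sep)[0]


def list_thread_scripts(output_dir):
    """Return the per-thread SomaticSeq scripts, ordered by thread number."""
    pattern = os.path.join(
        output_dir, "*", "SomaticSeq", "logs", "somaticSeq.*.cmd")
    scripts = [p for p in glob.glob(pattern)
               if thread_of(p, output_dir).isdigit()]
    return sorted(scripts, key=lambda p: int(thread_of(p, output_dir)))


# Demonstration on a throwaway directory tree with three thread folders.
with tempfile.TemporaryDirectory() as out_dir:
    for thread in (10, 2, 1):
        logs = os.path.join(out_dir, str(thread), "SomaticSeq", "logs")
        os.makedirs(logs)
        # "TIMESTAMP" is a placeholder for whatever timestamp is generated.
        open(os.path.join(logs, "somaticSeq.TIMESTAMP.cmd"), "w").close()
    order = [thread_of(p, out_dir) for p in list_thread_scripts(out_dir)]
    print(order)
```

Sorting by the numeric value of the thread directory keeps thread 10 after thread 2, which a plain alphabetical sort would not.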

3.4.2 Split a Normal Data Set Into Designated Tumor and Normal

Another way to create a training data set is to split a (relatively high-coverage) germline sequencing data set into two halves, with one designated as normal and the other designated as tumor (Fig. 2a). You may not get run-to-run biases from this method, but the data will still have plenty of false positives deriving from sequencing errors, sampling error of germline variants, and so on. The following example command splits HighCoverageGenome.bam 50–50 into designated tumor and normal; --split-proportion sets the fraction of reads directed to the designated normal.

somaticseq/utilities/dockered_pipelines/bamSimulator/BamSimulator_multiThreads.sh \
    --output-dir /ABSOLUTE/PATH/TO/trainingSet \
    --genome-reference /ABSOLUTE/PATH/TO/GRCh38.fa \
    --tumor-bam-in /ABSOLUTE/PATH/TO/HighCoverageGenome.bam \
    --tumor-bam-out syntheticTumor.bam \
    --normal-bam-out syntheticNormal.bam \
    --split-proportion 0.5 \
    --min-variant-reads 2 \
    --threads 12 \
    --action qsub \
    --num-snvs 10000 --num-indels 8000 --num-svs 1500 \
    --min-vaf 0.0 --max-vaf 1.0 --left-beta 2 --right-beta 5 \
    --split-bam --merge-output-bams


3.4.3 Merge Tumor-Normal and Then Randomly Split Them

Another approach to create a training data set is to first merge the tumor and normal BAM files into a single BAM file, and then randomly split it into the designated tumor and normal (Fig. 2b). This way, the real somatic mutations in the original tumor BAM will be split equally (on average) between the designated tumor and normal, and would effectively be germline variants that will be labeled as false positives if called by a caller. In silico mutations are then created in the designated tumor. In the absence of normal sequencing replicates or germline sequencing data with high enough coverage, this may be the next best option to create classifiers. Use the --merge-bam flag to invoke this option:

somaticseq/utilities/dockered_pipelines/bamSimulator/BamSimulator_multiThreads.sh \
    --output-dir /ABSOLUTE/PATH/TO/trainingSet \
    --genome-reference /ABSOLUTE/PATH/TO/GRCh38.fa \
    --tumor-bam-in /ABSOLUTE/PATH/TO/Tumor_Sample.bam \
    --normal-bam-in /ABSOLUTE/PATH/TO/Normal_Sample.bam \
    --tumor-bam-out syntheticTumor.bam \
    --normal-bam-out syntheticNormal.bam \
    --split-proportion 0.5 \
    --min-variant-reads 2 \
    --threads 12 \
    --num-snvs 30000 --num-indels 10000 --num-svs 1500 \
    --min-vaf 0.0 --max-vaf 1.0 --left-beta 2 --right-beta 5 \
    --merge-bam --split-bam --merge-output-bams

Fig. 2 Two additional scenarios to create synthetic tumor-normal pairs
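The effect of merging and then randomly splitting can be illustrated with a toy simulation of a single genomic site. This is a sketch under stated assumptions: each read is assigned independently, going to the designated normal with probability equal to the split proportion; read pairs, alignment, and the actual BamSimulator implementation are not modeled; and the read counts are exaggerated to keep sampling noise small. The function names merge_and_split and vaf are hypothetical.

```python
import random


def merge_and_split(tumor_reads, normal_reads, split_proportion=0.5, seed=1):
    """Pool the reads of both samples, then send each read to the designated
    normal with probability split_proportion (otherwise to the tumor)."""
    rng = random.Random(seed)
    designated_tumor, designated_normal = [], []
    for read in tumor_reads + normal_reads:
        if rng.random() < split_proportion:
            designated_normal.append(read)
        else:
            designated_tumor.append(read)
    return designated_tumor, designated_normal


def vaf(reads):
    """Fraction of reads that carry the variant allele ("ALT")."""
    return reads.count("ALT") / len(reads) if reads else 0.0


# Toy site: a real somatic mutation at 40% VAF in the tumor, absent from
# the normal.
tumor = ["ALT"] * 2000 + ["REF"] * 3000
normal = ["REF"] * 5000

new_tumor, new_normal = merge_and_split(tumor, normal)
print("before:", vaf(tumor), vaf(normal))
print("after: ", round(vaf(new_tumor), 3), round(vaf(new_normal), 3))
```

After the split, both designated samples carry the variant at roughly the pooled frequency (about 20% in this toy case), so it looks like a shared, germline-like variant rather than a somatic one, which is why such calls are labeled as false positives.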

3.4.4 Inputs and Parameters

The following are all the options available for the BamSimulator_multiThreads.sh script:

• --output-dir: absolute path to the output directory. Required.

• --genome-reference: absolute path to the genome reference, assuming the index files for the aligner (i.e., BWA) are available as well. Required.

• --selector: if provided, in silico mutations will be created only in these regions.

• --tumor-bam-out: file name for the synthetic tumor BAM output. Default is syntheticTumor.bam.

• --tumor-bam-in: absolute path to the input tumor BAM file. In the scenario with two germline sequencing replicates, this is the sample in which in silico mutations will be created. In the scenario where a (relatively) high-coverage germline sample is split, this is the path to that germline BAM file. In the final scenario where two BAM files are merged, this can point to either of the two BAM files. Required.

• --normal-bam-out: file name for the synthetic normal BAM output. Default is syntheticNormal.bam.

• --normal-bam-in: absolute path to the input BAM file to be the designated normal. Required, except when a single BAM file is split (the example in Subheading 3.4.2 omits it).

• --split-proportion: the fraction of total reads to be assigned to the designated normal. Not needed in the first scenario with two germline sequencing replicates. Otherwise, the default is 0.5.

• --down-sample: downsamples the BAM files when creating the synthetic tumor and normal BAM files. Default is 1, that is, no downsampling.

• --num-snvs: number of in silico SNVs to attempt. The actual number of SNVs will usually be lower because an attempt is skipped when certain conditions are not met (e.g., the depth is too low).

• --num-indels: number of in silico indels to attempt.

• --num-svs: number of in silico SVs to attempt. Default is 0.

• --min-vaf: minimum variant allele frequency to create.


• --max-vaf: maximum variant allele frequency to create.

• --left-beta: left beta of the beta distribution for VAF.

• --right-beta: right beta of the beta distribution for VAF.

• --min-depth: minimum depth at which to attempt a mutation.

• --max-depth: maximum depth at which to attempt a mutation.

• --min-variant-reads: minimum number of variant reads to create for each in silico mutation.

• --aligner: which aligner to use to remap a read after it is mutated in silico. Default is bwa mem.

• --seed: a random number generator seed, for reproducibility purposes.

• --action: what to do with the workflow scripts generated. Default is echo.

• --threads: number of threads. You have to merge the results from each thread after all the threads are completed successfully.

• --merge-bam: flag to merge the input tumor and normal BAM files.

• --split-bam: flag to split the BAM files into designated tumor and normal.

• --clean-bam: flag to clean up the input BAM files by simply removing reads when more than two reads share the same read name.

• --indel-realign: flag to perform GATK's joint indel realignment on the designated tumor and normal BAM files.

• --merge-output-bams: flag to merge the output BAM and VCF files from the different threads. You should use this for multithreaded jobs.

• --keep-intermediates: keep all the intermediate files for debugging purposes.
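With this many options, it can help to keep the settings in one place and generate the command line from them. The following minimal sketch does so in Python for the scenario in which a single high-coverage germline BAM file is split. All option and flag names come from the list above; the function name build_command and the seed value are arbitrary choices made for this illustration.

```python
import shlex

SCRIPT = ("somaticseq/utilities/dockered_pipelines/bamSimulator/"
          "BamSimulator_multiThreads.sh")


def build_command(options, flags):
    """Turn {option: value} pairs and a list of flags into an argv list."""
    argv = [SCRIPT]
    for name, value in options.items():
        argv += ["--" + name, str(value)]
    argv += ["--" + flag for flag in flags]
    return argv


# Scenario: split one high-coverage germline BAM into tumor and normal.
options = {
    "output-dir": "/ABSOLUTE/PATH/TO/trainingSet",
    "genome-reference": "/ABSOLUTE/PATH/TO/GRCh38.fa",
    "tumor-bam-in": "/ABSOLUTE/PATH/TO/HighCoverageGenome.bam",
    "split-proportion": 0.5,
    "num-snvs": 10000,
    "num-indels": 8000,
    "seed": 2020,  # arbitrary value, chosen only for this illustration
    "threads": 12,
}
flags = ["split-bam", "merge-output-bams"]

argv = build_command(options, flags)
# Print as a copy-and-paste friendly, shell-quoted command.
print(" \\\n    ".join(shlex.quote(arg) for arg in argv))
```

Keeping the settings in a dictionary makes it easy to record exactly which parameters produced a given training set, including the seed.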

4 Notes

1. In SomaticSeq algorithms, each variant is defined by its genomic start position, reference base(s), and variant base(s), that is, the following four fields in a VCF file: CHROM, POS, REF, and ALT. Different mutations at the same genomic position are considered different variant calls and will have different features extracted.

2. SomaticSeq allows the input of an inclusion region. Without it, SomaticSeq will assume the whole genome as defined by the index file of the genome reference (typically .fa.fai). However, even for whole-genome data, we recommend that you create a BED file that includes only the major chromosomes (i.e., chr1, chr2, ..., chrX, chrY), because human reference files often include alternate contigs, viral contigs, decoy contigs, chrM, and so on. Reads aligned to those contigs tend to be poorly aligned but often have very high apparent coverage, so some mutation callers waste a disproportionate amount of compute time attempting local assembly to resolve variant calls in them. A BED file can be created simply to exclude those regions.

3. Generally speaking, machine learning works better with larger data sets. In our original publication [12], we measured the accuracy of SomaticSeq cross validation with increasing data size and found that the accuracy generally plateaus when there are >500 true mutations in the training data (assuming there are more false positives than true positives; Fig. 3).

Fig. 3 Figure adapted from [12]. The y-axis represents the F1 score of cross validation, and the x-axis represents the number of true somatic mutations in a training data set. The four plots represent four different data sets (DC3A, DC3D, N0T50, and N2.5T15)

4. There are cases where users need to combine samples to create a larger training set; for example, mutation call sets from a targeted panel or even from whole-exome sequencing may not be large enough to create a reliable classifier. To do so, first merge the Ensemble.sSNV.tsv files and the Ensemble.sINDEL.tsv files, keeping only one header. Make sure those files were created with the same version and parameters of SomaticSeq, so that each column means the same thing in every file. The files can be combined like this:

cat /PATH/sampleA/SomaticSeq/Ensemble.sSNV.tsv /PATH/sampleB/SomaticSeq/Ensemble.sSNV.tsv | awk 'NR==1 || $1 !~ /^CHROM/' > Combined.sSNV.tsv

cat /PATH/sampleA/SomaticSeq/Ensemble.sINDEL.tsv /PATH/sampleB/SomaticSeq/Ensemble.sINDEL.tsv | awk 'NR==1 || $1 !~ /^CHROM/' > Combined.sINDEL.tsv

Then, you can invoke the machine learning training scripts in R directly:

r_scripts/ada_model_builder_ntChange.R Combined.sINDEL.tsv Consistent_Mates Inconsistent_Mates Strelka_QSS Strelka_TQSS

r_scripts/ada_model_builder_ntChange.R Combined.sSNV.tsv Consistent_Mates Inconsistent_Mates

The values after the TSV files are features to be excluded from training. Those features have not yet been shown to improve accuracy, so they are excluded by default.
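The merging step in Note 4 can also be sketched in Python, which makes it possible to refuse files whose headers differ instead of silently concatenating them. This is a minimal illustration rather than a SomaticSeq utility: the function name merge_tsv_keep_one_header is hypothetical, and the tiny four-column files in the demonstration are made up (real Ensemble TSV files have many more columns).

```python
import os
import tempfile


def merge_tsv_keep_one_header(input_paths, output_path):
    """Concatenate TSV files, keeping the header of the first file only.

    Raises ValueError if a later file has a different header line, because
    the columns would then not mean the same thing in every file.
    Returns the number of data rows written.
    """
    header = None
    rows_written = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as handle:
                this_header = handle.readline()
                if header is None:
                    header = this_header
                    out.write(header)
                elif this_header != header:
                    raise ValueError("header mismatch in " + path)
                for line in handle:
                    out.write(line)
                    rows_written += 1
    return rows_written


# Demonstration with two made-up files in a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    columns = "CHROM\tPOS\tREF\tALT\n"
    sample_a = os.path.join(workdir, "sampleA.tsv")
    sample_b = os.path.join(workdir, "sampleB.tsv")
    combined = os.path.join(workdir, "Combined.sSNV.tsv")
    with open(sample_a, "w") as f:
        f.write(columns + "chr1\t100\tA\tT\n")
    with open(sample_b, "w") as f:
        f.write(columns + "chr2\t200\tG\tC\nchr2\t300\tC\tA\n")
    n_rows = merge_tsv_keep_one_header([sample_a, sample_b], combined)
    with open(combined) as f:
        merged_lines = f.read().splitlines()
    print(n_rows, len(merged_lines))
```

The header comparison is a cheap guard against mixing files produced by different SomaticSeq versions or parameters, which is the precaution the note asks for.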

References

1. Cibulskis K, Lawrence MS, Carter SL et al (2013) Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 31(3):213–219

2. Koboldt DC, Zhang Q, Larson DE et al (2012) VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 22(3):568–576

3. Roth A, Ding J, Morin R et al (2012) JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics 28(7):907–913

4. Larson DE, Harris CC, Chen K et al (2012) SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28(3):311–317

5. Lai Z, Markovets A, Ahdesmaki M et al (2016) VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res 44(11):e108

6. Fan Y, Xi L, Hughes DST et al (2016) MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data. Genome Biol 17(1):178

7. Wilm A, Aw PPK, Bertrand D et al (2012) LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res 40(22):11189–11201

8. Narzisi G, O'Rawe JA, Iossifov I et al (2014) Accurate de novo and transmitted indel detection in exome-capture data using microassembly. Nat Methods 11(10):1033–1036

9. Kim S, Scheffler K, Halpern AL et al (2018) Strelka2: fast and accurate calling of germline and somatic variants. Nat Methods 15(8):591–594

10. Freed D, Pan R, Aldana R (2018) TNscope: accurate detection of somatic mutations with haplotype-based variant candidate detection and machine learning filtering. bioRxiv

11. Thorvaldsdottir H, Robinson JT, Mesirov JP (2013) Integrative genomics viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 14(2):178–192

12. Fang LT, Afshar PT, Chhibber A et al (2015) An ensemble approach to accurately detect somatic mutations using SomaticSeq. Genome Biol 16(1):197

13. Johnson K, Culp M, Michailides G (2006) ada: an R package for stochastic boosting. J Stat Softw 17(2)

14. Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6):841–842

15. Ewing AD, Houlahan KE, Hu Y et al (2015) Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nat Methods 12(7):623–630

16. Genome in a bottle. https://www.nist.gov/programs-projects/genome-bottle

17. First publicly available XTen genome. http://allseq.com/knowledge-bank/1000-genome/get-your-1000-genome-test-data-set/

18. Roberts ND, Daniel Kortschak R, Parker WT et al (2013) A comparative analysis of algorithms for somatic SNV detection in cancer. Bioinformatics 29(18):2223–2230

19. Li H (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM

70 Li Tai Fang

Extend the benchmarking indel set by manual review using the individual cell line sequencing data from the Sequencing Quality Control 2 (SEQC2) project

Article

Full-text available

Mar 2024

Accurate indel calling plays an important role in precision medicine. A benchmarking indel set is essential for thoroughly evaluating the indel calling performance of bioinformatics pipelines. A reference sample with a set of known-positive variants was developed in the FDA-led Sequencing Quality Control Phase 2 (SEQC2) project, but the known indels in the known-positive set were limited. This project sought to provide an enriched set of known indels that would be more translationally relevant by focusing on additional cancer related regions. A thorough manual review process completed by 42 reviewers, two advisors, and a judging panel of three researchers significantly enriched the known indel set by an additional 516 indels. The extended benchmarking indel set has a large range of variant allele frequencies (VAFs), with 87% of them having a VAF below 20% in reference Sample A. The reference Sample A and the indel set can be used for comprehensive benchmarking of indel calling across a wider range of VAF values in the lower range. Indel length was also variable, but the majority were under 10 base pairs (bps). Most of the indels were within coding regions, with the remainder in the gene regulatory regions. Although high confidence can be derived from the robust study design and meticulous human review, this extensive indel set has not undergone orthogonal validation. The extended benchmarking indel set, along with the indels in the previously published known-positive set, was the truth set used to benchmark indel calling pipelines in a community challenge hosted on the precisionFDA platform. This benchmarking indel set and reference samples can be utilized for a comprehensive evaluation of indel calling pipelines. Additionally, the insights and solutions obtained during the manual review process can aid in improving the performance of these pipelines.

Simple combination of multiple somatic variant callers to increase accuracy

Article

Full-text available

May 2023

Publications comparing variant caller algorithms present discordant results with contradictory rankings. Caller performances are inconsistent and wide ranging, and dependent upon input data, application, parameter settings, and evaluation metric. With no single variant caller emerging as a superior standard, combinations or ensembles of variant callers have appeared in the literature. In this study, a whole genome somatic reference standard was used to derive principles to guide strategies for combining variant calls. Then, manually annotated variants called from the whole exome sequencing of a tumor were used to corroborate these general principles. Finally, we examined the ability of these principles to reduce noise in targeted sequencing.

TNscope: Accurate Detection of Somatic Mutations with Haplotype-based Variant Candidate Detection and Machine Learning Filtering

Preprint

Full-text available

Jan 2018

Detection of somatic mutations in tumor samples is important in the clinic, where treatment decisions are increasingly based upon molecular diagnostics. However, accurate detection of these mutations is difficult, due in part to intra-tumor heterogeneity, contamination of the tumor sample with normal tissue and pervasive structural variation. Here, we describe Sentieon TNscope, a haplotype-based somatic variant caller with increased accuracy relative to existing methods. An early engineering version of TNscope was used in our submission to the most recent ICGC-DREAM Somatic Mutation calling challenge. In that challenge, TNscope is the leader in accuracy for SNVs, indels and SVs. To further improve variant calling accuracy, we combined the improvements in the variant caller with machine learning. We benchmarked TNscope using in-silico mixtures of well-characterized Genome in a Bottle (GIAB) samples. TNscope displays higher accuracy than the other benchmarked tools and the accuracy is substantially improved by the machine learning model.

Strelka2: fast and accurate calling of germline and somatic variants

Article

Full-text available

Aug 2018
Br J Pharmacol

We describe Strelka2 ( https://github.com/Illumina/strelka ), an open-source small-variant-calling method for research and clinical germline and somatic sequencing applications. Strelka2 introduces a novel mixture-model-based estimation of insertion/deletion error parameters from each sample, an efficient tiered haplotype-modeling strategy, and a normal sample contamination model to improve liquid tumor analysis. For both germline and somatic calling, Strelka2 substantially outperformed the current leading tools in terms of both variant-calling accuracy and computing cost.

MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data

Article

Full-text available

Aug 2016
GENOME BIOL

Subclonal mutations reveal important features of the genetic architecture of tumors. However, accurate detection of mutations in genetically heterogeneous tumor cell populations using next-generation sequencing remains challenging. We develop MuSE (http://bioinformatics.mdanderson.org/main/MuSE), Mutation calling using a Markov Substitution model for Evolution, a novel approach for modeling the evolution of the allelic composition of the tumor and normal tissue at each reference base. MuSE adopts a sample-specific error model that reflects the underlying tumor heterogeneity to greatly improve the overall accuracy. We demonstrate the accuracy of MuSE in calling subclonal mutations in the context of large-scale tumor sequencing projects using whole exome and whole genome sequencing. Electronic supplementary material The online version of this article (doi:10.1186/s13059-016-1029-6) contains supplementary material, which is available to authorized users.

VarDict: A novel and versatile variant caller for next-generation sequencing in cancer research

Article

Full-text available

Apr 2016

Accurate variant calling in next generation sequencing (NGS) is critical to understand cancer genomes better. Here we present VarDict, a novel and versatile variant caller for both DNA- and RNA-sequencing data. VarDict simultaneously calls SNV, MNV, InDels, complex and structural variants, expanding the detected genetic driver landscape of tumors. It performs local realignments on the fly for more accurate allele frequency estimation. VarDict performance scales linearly to sequencing depth, enabling ultra-deep sequencing used to explore tumor evolution or detect tumor DNA circulating in blood. In addition, VarDict performs amplicon aware variant calling for polymerase chain reaction (PCR)-based targeted sequencing often used in diagnostic settings, and is able to detect PCR artifacts. Finally, VarDict also detects differences in somatic and loss of heterozygosity variants between paired samples. VarDict reprocessing of The Cancer Genome Atlas (TCGA) Lung Adenocarcinoma dataset called known driver mutations in KRAS, EGFR, BRAF, PIK3CA and MET in 16% more patients than previously published variant calls. We believe VarDict will greatly facilitate application of NGS in clinical cancer research.

An ensemble approach to accurately detect somatic mutations using SomaticSeq

Article

Full-text available

Sep 2015
GENOME BIOL

SomaticSeq is an accurate somatic mutation detection pipeline implementing a stochastic boosting algorithm to produce highly accurate somatic mutation calls for both single nucleotide variants and small insertions and deletions. The workflow currently incorporates five state-of-the-art somatic mutation callers, and extracts over 70 individual genomic and sequencing features for each candidate site. A training set is provided to an adaptively boosted decision tree learner to create a classifier for predicting mutation statuses. We validate our results with both synthetic and real data. We report that SomaticSeq is able to achieve better overall accuracy than any individual tool incorporated. Electronic supplementary material The online version of this article (doi:10.1186/s13059-015-0758-2) contains supplementary material, which is available to authorized users.

Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection

Article

Full-text available

May 2015
Br J Pharmacol

The detection of somatic mutations from cancer genome sequences is key to understanding the genetic basis of disease progression, patient survival and response to therapy. Benchmarking is needed for tool assessment and improvement but is complicated by a lack of gold standards, by extensive resource requirements and by difficulties in sharing personal genomic information. To resolve these issues, we launched the ICGC-TCGA DREAM Somatic Mutation Calling Challenge, a crowdsourced benchmark of somatic mutation detection algorithms. Here we report the BAMSurgeon tool for simulating cancer genomes and the results of 248 analyses of three in silico tumors created with it. Different algorithms exhibit characteristic error profiles, and, intriguingly, false positives show a trinucleotide profile very similar to one found in human tumors. Although the three simulated tumors differ in sequence contamination (deviation from normal cell sequence) and in subclonality, an ensemble of pipelines outperforms the best individual pipeline in all cases. BAMSurgeon is available at https://github.com/adamewing/bamsurgeon/.

Accurate de novo and transmitted indel detection in exome-capture data using microassembly

Article

Full-text available

Aug 2014
Br J Pharmacol

We present an open-source algorithm, Scalpel (http://scalpel.sourceforge.net/), which combines mapping and assembly for sensitive and specific discovery of insertions and deletions (indels) in exome-capture data. A detailed repeat analysis coupled with a self-tuning k-mer strategy allows Scalpel to outperform other state-of-the-art approaches for indel discovery, particularly in regions containing near-perfect repeats. We analyzed 593 families from the Simons Simplex Collection and demonstrated Scalpel's power to detect long (≥30 bp) transmitted events and enrichment for de novo likely gene-disrupting indels in autistic children.

A Comparative Analysis of Algorithms for Somatic SNV Detection in Cancer

Article

Full-text available

Jul 2013
BIOINFORMATICS

With the advent of relatively affordable, high-throughput technologies, DNA sequencing of cancers is now common practice in cancer research projects and will be increasingly used in clinical practice to inform diagnosis and treatment. Somatic (cancer-only) single nucleotide variants (SNVs) are the simplest class of mutation, yet their identification in DNA sequencing data is confounded by germline polymorphisms, tumour heterogeneity, and sequencing and analysis errors. Four recently published algorithms for the detection of somatic SNV sites in matched cancer-normal sequencing datasets are VarScan, SomaticSniper, JointSNVMix and Strelka. In this analysis, we apply these four SNV calling algorithms to cancer-normal Illumina exome sequencing of a chronic myeloid leukaemia (CML) patient. The candidate SNV sites returned by each algorithm are filtered to remove likely false positives, then characterised and compared to investigate the strengths and weaknesses of each SNV calling algorithm. Comparing the candidate SNV sets returned by VarScan, SomaticSniper, JointSNVMix2 and Strelka revealed substantial differences with respect to: the number and character of sites returned; the somatic probability scores assigned to the same sites; their susceptibility to various sources of noise; and their sensitivities to low-allelic-fraction candidates. Data accession number SRA081939, code at http://code.google.com/p/snv-caller-review/ CONTACT: david.adelson@adelaide.edu.au.

Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM

Article

Mar 2013

Heng Li

BWA-MEM is a new alignment algorithm for aligning sequence reads or long query sequences against a large reference genome such as human. It automatically chooses between local and end-to-end alignments, supports paired-end reads and performs chimeric alignment. The algorithm is robust to sequencing errors and applicable to a wide range of sequence lengths from 70 bp to a few megabases. For mapping 100 bp sequences, BWA-MEM shows better performance than several state-of-the-art read aligners to date. Availability and implementation: BWA-MEM is implemented as a component of BWA, which is available at http://github.com/lh3/bwa. Contact: hengli@broadinstitute.org

Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples

Article

Feb 2013
NAT BIOTECHNOL

Detection of somatic point substitutions is a key step in characterizing the cancer genome. However, existing methods typically miss low-allelic-fraction mutations that occur in only a subset of the sequenced cells owing to either tumor heterogeneity or contamination by normal cells. Here we present MuTect, a method that applies a Bayesian classifier to detect somatic mutations with very low allele fractions, requiring only a few supporting reads, followed by carefully tuned filters that ensure high specificity. We also describe benchmarking approaches that use real, rather than simulated, sequencing data to evaluate the sensitivity and specificity as a function of sequencing depth, base quality and allelic fraction. Compared with other methods, MuTect has higher sensitivity with similar specificity, especially for mutations with allelic fractions as low as 0.1 and below, making MuTect particularly useful for studying cancer subclones and their evolution in standard exome and genome sequencing data.
