Sensitivity Analysis and Machine Learning Techniques Applied to Mosquito Gene Drive Emulators

Authors:

Abstract

Genetic control of insect populations, for purposes such as reducing the impact of mosquito-vectored diseases, became a reality with the advent of CRISPR/Cas9 gene editing techniques. Since then, the field has exploded in complexity, as researchers and regulators demand ever more realistic simulations to predict the ecological and epidemiological impact of novel genetic control mechanisms. However, the tools needed to analyze complex simulation data lag behind in their ability to disentangle interactions between variables and explain their impact on experimental outcomes. Questions such as “What is the most efficient release methodology to maximally reduce disease transmission in this population?” or “What are the most important characteristics of a genetic construct for population suppression/replacement?” highlight the need for reliable and scalable analysis pipelines. Such pipelines should be able to: run simulations for parameter exploration, process results and calculate summary statistics, deploy advanced regression and ML techniques for more comprehensive analysis, and generate simplified models for gathering feedback from collaborators, regulators, or the wider community. In this work, we present such a workflow, developed in tandem with MGDrivE (Mosquito Gene Drive Explorer). We augment previous pipelines with machine learning regression models that serve as computationally simple emulators, and with statistical sensitivity analysis that ranks the relative importance of simulation parameters for their effect on mosquito populations. This improved pipeline is applied to the suppression of Aedes aegypti populations using pgSIT (precision-guided sterile insect technique).
Héctor M. Sánchez C., Jared B. Bennett, John M. Marshall
Pipeline Steps
1. Set Up Sampling Scheme
2. Define Constants & Run Simulations
3. Calculate Summary Statistics
4. Train MLR and Run SSA Independently
5. Inspect MLR and Compare with SSA
6. Package with Docker & Deploy to Heroku

1. Set Up Sampling Scheme
To get started, we determine the experimental framework by answering questions like:
-Which gene drive characteristics do we want to study?
-Which release characteristics are we trying to optimize?
In doing this, we determine the independent variables, their ranges, and the sampling scheme for the parameter exploration.
In our pgSIT application, we want to optimize the GM mosquito release size (res), number of releases (ren), and time interval between releases (rei), while checking how much impact the cutting rate (pct), maternal deposition (pmd), male fertility (mfr), mating fitness (mtf), and female viability (fvb) have on the success of the mosquito suppression intervention.
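The poster does not state which sampling scheme or parameter ranges were used. As an illustration only, the sketch below draws a Latin hypercube sample over the eight pgSIT parameters named above. The bounds in `PARAM_RANGES`, the units in the comments, and the function name `latin_hypercube` are all hypothetical.

```python
import random

# Hypothetical bounds: the poster names these parameters but not their ranges.
PARAM_RANGES = {
    "res": (0.0, 50.0),   # release size (assumed units)
    "ren": (1.0, 24.0),   # number of releases
    "rei": (1.0, 14.0),   # time interval between releases (assumed days)
    "pct": (0.75, 1.0),   # cutting rate
    "pmd": (0.75, 1.0),   # maternal deposition
    "mfr": (0.0, 0.5),    # male fertility
    "mtf": (0.5, 1.0),    # mating fitness
    "fvb": (0.0, 0.5),    # female viability
}


def latin_hypercube(ranges, n_samples, seed=0):
    """Return n_samples parameter dicts.

    Each parameter's range is split into n_samples equal-width strata,
    and every stratum is sampled exactly once.
    """
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in ranges.items():
        # One uniform draw inside each stratum of the unit interval.
        unit = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so strata are paired at random across parameters.
        rng.shuffle(unit)
        columns[name] = [low + u * (high - low) for u in unit]
    return [{name: columns[name][i] for name in ranges}
            for i in range(n_samples)]


samples = latin_hypercube(PARAM_RANGES, n_samples=10)
```

Each entry of `samples` is one parameter combination to simulate. Count-valued parameters such as `ren` would need rounding before being passed to a simulator.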
3. Calculate Summary Statistics
In this stage, we calculate relevant summary statistics for mosquito genetic constructs, such as:
-WOP: amount of time the non-transgenic population is kept below a given population fraction threshold
-CPT: area under the curve of the non-transgenic population divided by the total simulated time
-POE: probability of eliminating the non-transgenic population with the construct
-TTI: time it takes for the wild population to drop below a given population fraction threshold
-TTO: time it takes for the wild population to bounce back above a given population fraction threshold
As we are interested in mosquito suppression/elimination, we focus on CPT and POE in this use case.
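The definitions above can be sketched on toy data as follows. This is a minimal illustration, not the pipeline's actual code. It makes several assumptions: each trace is a list of wild-population fractions (relative to baseline) at unit time steps, TTI and TTO are measured from the start of the simulation, and POE is estimated as the share of stochastic repetitions that end with no wild mosquitoes. The function names and both traces are hypothetical.

```python
def wop(trace, threshold):
    """Number of time steps the wild fraction spends below the threshold."""
    return sum(1 for f in trace if f < threshold)


def cpt(trace):
    """Trapezoidal area under the wild-fraction curve over simulated time."""
    area = sum((a + b) / 2 for a, b in zip(trace, trace[1:]))
    return area / (len(trace) - 1)


def tti(trace, threshold):
    """First time step at which the wild fraction is below the threshold."""
    return next((t for t, f in enumerate(trace) if f < threshold), None)


def tto(trace, threshold):
    """First time step, after TTI, at which the wild fraction has recovered."""
    start = tti(trace, threshold)
    if start is None:
        return None
    return next((t for t in range(start, len(trace))
                 if trace[t] >= threshold), None)


def poe(traces):
    """Share of repetitions whose final wild fraction is zero."""
    return sum(1 for tr in traces if tr[-1] == 0) / len(traces)


# Two made-up repetitions: one rebounds, one is eliminated.
suppressed = [1.0, 0.8, 0.4, 0.2, 0.4, 0.8, 1.0]
eliminated = [1.0, 0.6, 0.2, 0.0, 0.0, 0.0, 0.0]
threshold = 0.5

stats = {
    "WOP": wop(suppressed, threshold),
    "CPT": cpt(suppressed),
    "TTI": tti(suppressed, threshold),
    "TTO": tto(suppressed, threshold),
    "POE": poe([suppressed, eliminated]),
}
```

For a trace that never recovers, such as `eliminated`, `tto` returns `None` rather than a time.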
4. Train MLR and Run SSA Independently
To add redundancy and interpretability, we train a neural network regression model (MLR) and run a statistical sensitivity analysis (SSA) on the data independently. In each case, we use the following analyses to rank feature importances:
-Statistical SA: Delta-Moments, FAST, HDMR
-Machine-Learning SA: Permutation Importance, SHAP Values, PDP/ICE Plots
For our study, small neural networks (3 hidden layers with 16 neurons each) were enough to provide parsimonious predictions from the provided inputs while avoiding overfitting (cross-validation r-squared values >0.91 for both summary statistics).
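Of the machine-learning analyses listed above, permutation importance is the simplest to show in a few lines. The sketch below is a generic standard-library version, not the poster's implementation, and the libraries actually used are not stated in the poster. `toy_emulator` is a hypothetical stand-in for a trained regression model, and the data are random numbers, not simulation output.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of the model over a dataset."""
    return sum((model(r) - y) ** 2 for r, y in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_repeats=10, seed=0):
    """Mean increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with feature j replaced by a shuffled value.
            shuffled = [r[:j] + (v,) + r[j + 1:]
                        for r, v in zip(rows, column)]
            increases.append(mse(model, shuffled, targets) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_emulator(row):
    """Hypothetical stand-in for a fitted model; the third input is ignored."""
    res, pct, mfr = row
    return 0.7 * res + 0.3 * pct + 0.0 * mfr


data_rng = random.Random(1)
rows = [(data_rng.random(), data_rng.random(), data_rng.random())
        for _ in range(200)]
targets = [toy_emulator(r) for r in rows]
importances = permutation_importance(toy_emulator, rows, targets)
```

Shuffling a column breaks its link to the target, so the error grows most for the inputs the model relies on most. An input the model ignores scores zero.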
2. Define Constants & Run Simulations
We now move to setting up the main simulation in our mechanistic stochastic model: MGDrivE. This involves defining the landscape, species, and gene drive parameters. With this in place, we let the model run for a couple of days and wait patiently.
For this demo, we set up a panmictic population with Aedes aegypti parameters and ran thirty repetitions of each stochastic trace for five years. The whole process generated around 100 GB of CSV files.
5. Inspect MLR and Compare with SSA
We now take our regression models and probe them in ways that allow us to "peek" inside their black-box structures. At this stage, we also compare the normalized feature importance values from the ML models with those from the sensitivity analysis of the raw data, to check for discrepancies that might indicate errors in our regressions.
With pgSIT, as it is a self-limiting sterile-insect technique, the probability of elimination (POE) and cumulative potential for transmission (CPT) respond differently to changes in the inputs. Male fertility (mfr) and female viability (fvb), for example, are not critical for CPT, but they are quite relevant to the construct's design if elimination is the goal.
In terms of feature importance, there is strong agreement between the MLR models and the SSA, so we can be more confident in the reliability of our machine learning emulators.
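One way to quantify the agreement described in this step is to normalize both sets of importances so they sum to one, then compare them by rank correlation and by largest per-feature gap. The sketch below assumes this approach; the poster does not say which agreement measure was used. The importance values are made up for illustration and are not results from the study.

```python
def normalize(importances):
    """Scale importance scores so they sum to one."""
    total = sum(importances.values())
    return {name: value / total for name, value in importances.items()}


def ranks(importances):
    """Rank features from most (1) to least important; assumes no ties."""
    order = sorted(importances, key=importances.get, reverse=True)
    return {name: position for position, name in enumerate(order, start=1)}


def spearman(a, b):
    """Spearman rank correlation between two importance dicts (no ties)."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    d2 = sum((ra[name] - rb[name]) ** 2 for name in ra)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical raw scores on different scales.
mlr_raw = {"res": 8.0, "ren": 5.0, "rei": 3.0,
           "pct": 2.0, "mfr": 1.2, "fvb": 0.8}
ssa_raw = {"res": 0.38, "ren": 0.27, "rei": 0.12,
           "pct": 0.14, "mfr": 0.05, "fvb": 0.04}

mlr = normalize(mlr_raw)
ssa = normalize(ssa_raw)

rho = spearman(mlr, ssa)
gaps = {name: abs(mlr[name] - ssa[name]) for name in mlr}
worst_feature = max(gaps, key=gaps.get)
```

A `rho` close to 1 with small `gaps` supports trusting the emulator. A large gap on a single feature points to where the regression and the sensitivity analysis disagree.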
6. Package with Docker & Deploy to Heroku
Finally, if there is enough agreement between the SA and the MLR, and if the regression's inspection metrics are good, we end up with a usable set of regression models that can be shipped to collaborators as emulators of the simulated scenarios. This can be done by creating a user interface with Dash and packaging it with Docker. Once the image has been created and tested, it can be pushed to DockerHub for portability, or deployed to Heroku for online interactivity.
Quick Summary
We present a data-analysis pipeline that combines statistical sensitivity analysis with machine learning emulators to increase the reliability and interpretability of our gene drive models for mosquito-borne disease elimination. We do so in an effort to make the results of our mechanistic MGDrivE model more accessible to the wider scientific community, and as a step towards laying the foundation for operational-range analyses and target product profiles for mosquito genetic constructs.
[Figure panels: POE, CPT]