Sensitivity Analysis and Machine Learning Techniques Applied to Mosquito Gene Drive Emulators

March 2023

DOI:10.13140/RG.2.2.26044.72326

Conference: Bay Area Ecology & Evolution of Infectious Diseases Conference
At: San Francisco State University
Affiliation: University of California, Berkeley

Authors:

Héctor Manuel Sánchez Castellanos

California Department of Public Health

Jared Bennett

University of California, Berkeley

John M Marshall

University of California, Berkeley

Genetic control of insect populations, such as reducing the impact of mosquito-vectored diseases, became reality with the advent of CRISPR/Cas9 gene editing techniques. Since then, the field has exploded in complexity, as researchers and regulators demand ever-more realistic simulations to predict the ecological and epidemiological impact of novel genetic control mechanisms. However, the tools necessary to analyze complex simulation data lag behind in their ability to disentangle interactions between variables and understand their impact on experimental outcomes. Questions such as “What is the most efficient release methodology to maximally reduce disease transmission in this population?” or “What are the most important characteristics of a genetic construct for population suppression/replacement?” highlight the need for reliable and scalable analysis pipelines. Expected pipelines should be able to: run simulations for parameter exploration, process results and calculate summary statistics, deploy advanced regression and ML techniques for more comprehensive analysis, and generate simplified models garnering feedback from collaborators, regulators, or the wider community. In this work, we present such a workflow developed in tandem with MGDrivE (Mosquito Gene Drive Explorer). We augment previous pipelines with the addition of machine learning regression models to generate computationally simple emulators, and statistical sensitivity analysis to rank the relative importance of simulation parameters for their effect on mosquito populations. This improved pipeline is applied to the suppression of Aedes aegypti populations using pgSIT (precision guided sterile insect technique).

Sensitivity Analysis and Machine Learning Techniques Applied to Mosquito Gene Drive Emulators

Héctor M. Sánchez C., Jared B. Bennett, John M. Marshall

Pipeline steps:

1. Setup Sampling Scheme
2. Define Constants & Run Simulations
3. Calculate Summary Statistics
4. Train MLR and Run SSA Independently
5. Inspect MLR and Compare with SSA
6. Package with Docker & Deploy to Heroku

1. Setup Sampling Scheme

To get started, we determine the experimental framework and answer questions like:

-Which gene drive characteristics do we want to study?
-Which release characteristics are we trying to optimize?

In doing this, we determine the independent variables, their ranges, and the sampling scheme for the parameter exploration.

In our pgSIT application we want to optimize the GM mosquito release size (res), number of releases (ren), and time interval between releases (rei), while checking how much of an impact cutting rate (pct), maternal deposition (pmd), male fertility (mfr), mating fitness (mtf), and female viability (fvb) have on the success of the mosquito suppression intervention.
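The poster does not say which sampling scheme was used, so the following is only a minimal sketch of one common choice, Latin hypercube sampling, written with the Python standard library. The parameter names come from the text above; every range in `PARAM_RANGES` is a made-up placeholder, and `latin_hypercube` is a hypothetical helper, not part of MGDrivE.

```python
import random

# Hypothetical ranges for illustration only; the poster does not list the
# actual bounds explored for each parameter.
PARAM_RANGES = {
    "res": (0.0, 50.0),  # release size
    "ren": (1.0, 24.0),  # number of releases
    "rei": (1.0, 14.0),  # time interval between releases
    "pct": (0.75, 1.0),  # cutting rate
    "pmd": (0.75, 1.0),  # maternal deposition
    "mfr": (0.0, 0.5),   # male fertility
    "mtf": (0.5, 1.0),   # mating fitness
    "fvb": (0.0, 0.5),   # female viability
}


def latin_hypercube(ranges, n_samples, seed=0):
    """Split each parameter's range into n_samples equal strata and use
    every stratum exactly once, in an independently shuffled order."""
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in ranges.items():
        # One uniform draw inside each stratum of the unit interval.
        unit = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(unit)
        columns[name] = [low + u * (high - low) for u in unit]
    # Transpose the per-parameter columns into one dict per experiment.
    return [{name: columns[name][i] for name in ranges}
            for i in range(n_samples)]


samples = latin_hypercube(PARAM_RANGES, n_samples=1000)

# Integer-valued inputs are rounded after sampling (an assumption).
for sample in samples:
    sample["ren"] = round(sample["ren"])
    sample["rei"] = round(sample["rei"])
```

Each entry of `samples` is one parameter combination that would be handed to the simulator. Stratifying every axis covers the ranges more evenly than plain uniform draws for the same number of runs, which matters when each run is expensive.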

3. Calculate Summary Statistics

In this stage, we calculate relevant statistics for mosquito genetic constructs, such as:

-WOP: amount of time the non-transgenic population is kept below a given population fraction threshold
-CPT: area under the curve of the non-transgenic population divided by the total simulated time
-POE: the probability of eliminating the non-transgenic population with the construct
-TTI: time it takes for the wild population to drop below a certain fraction threshold
-TTO: time it takes for the wild population to bounce back above a certain population fraction threshold

As we are interested in mosquito suppression/elimination, we will focus on CPT and POE in this use case.
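The definitions above can be sketched in a few lines of Python. This is one possible reading of them, not the pipeline's actual code. It assumes a trace is a list of non-transgenic counts at equally spaced time steps, that fractions are taken relative to a pre-release `baseline`, and that "elimination" means a repetition ends with zero wild mosquitoes. All function names are hypothetical.

```python
def fraction_trace(wild_counts, baseline):
    """Non-transgenic population as a fraction of its pre-release size."""
    return [count / baseline for count in wild_counts]


def tti(fractions, threshold):
    """First time step at which the population is below the threshold."""
    for t, f in enumerate(fractions):
        if f < threshold:
            return t
    return None  # never dropped below the threshold


def tto(fractions, threshold):
    """First time step after TTI at which the population is back above."""
    start = tti(fractions, threshold)
    if start is None:
        return None
    for t in range(start, len(fractions)):
        if fractions[t] >= threshold:
            return t
    return None  # never recovered within the simulated window


def wop(fractions, threshold):
    """Number of time steps spent below the threshold."""
    return sum(1 for f in fractions if f < threshold)


def cpt(fractions):
    """Trapezoidal area under the curve divided by total simulated time."""
    area = sum((a + b) / 2 for a, b in zip(fractions, fractions[1:]))
    return area / (len(fractions) - 1)


def poe(repetitions):
    """Share of stochastic repetitions that end with zero wild mosquitoes."""
    eliminated = sum(1 for trace in repetitions if trace[-1] == 0)
    return eliminated / len(repetitions)


# Toy trace, invented for illustration.
trace = [100, 80, 40, 10, 0, 0, 20, 60]
fractions = fraction_trace(trace, baseline=100)
summary = {
    "TTI": tti(fractions, 0.5),
    "TTO": tto(fractions, 0.5),
    "WOP": wop(fractions, 0.5),
    "CPT": cpt(fractions),
}
```

For the toy trace with a 0.5 threshold, the population first drops below the threshold at step 2, recovers at step 7, and spends 5 steps below it.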

4. Train MLR and Run SSA Independently

To add redundancy and interpretability, we train a neural network regression model (MLR) and run a statistical sensitivity analysis (SSA) on the data independently. In each case, we do the following analyses to rank feature importances:

-Statistical SA: Delta-Moments, FAST, HDMR
-Machine-Learning SA: Permutation Importance, SHAP Values, PDP/ICE Plots

For our study, small neural networks (3 internal layers with 16 neurons each) were enough to provide parsimonious predictions from the provided inputs while avoiding overfitting (cross-validation r-squared values >0.91 for both summary statistics).
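Of the machine-learning analyses listed, permutation importance is the simplest to illustrate. The sketch below implements it in plain Python against any prediction function: shuffle one input column, measure how much the error grows, and repeat. `toy_emulator` is a hypothetical stand-in for a trained network, not the poster's model, and the data are random. A real pipeline would more likely rely on a library routine such as scikit-learn's `permutation_importance`.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, targets, n_repeats=10, seed=0):
    """Mean increase in MSE when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(targets, [predict(r) for r in rows])
    importances = {}
    for name in rows[0]:
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[name] for r in rows]
            rng.shuffle(shuffled)
            # Copy every row, replacing only the shuffled feature.
            permuted = [dict(r, **{name: v}) for r, v in zip(rows, shuffled)]
            score = mse(targets, [predict(r) for r in permuted])
            increases.append(score - baseline)
        importances[name] = sum(increases) / n_repeats
    return importances


def toy_emulator(row):
    """Stand-in model: strong effect of pct, weak of mtf, none of fvb."""
    return 3.0 * row["pct"] + 0.5 * row["mtf"]


data_rng = random.Random(1)
rows = [{k: data_rng.random() for k in ("pct", "mtf", "fvb")}
        for _ in range(200)]
targets = [toy_emulator(r) for r in rows]

scores = permutation_importance(toy_emulator, rows, targets)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because the stand-in model ignores `fvb`, shuffling that column leaves the error unchanged and its score is zero, while `pct` dominates the ranking.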

2. Define Constants & Run Simulations

We now move to setting up the main simulation in our mechanistic stochastic model: MGDrivE. This involves defining the landscape, species, and gene drive parameters. With this in place, we let the model run for a couple of days and wait patiently.

For this demo, we set up a panmictic population with Aedes aegypti parameters, and ran thirty repetitions of each stochastic trace for five simulated years. The whole process generated around 100 GB of CSV files.


5. Inspect MLR and Compare with SSA

We now take our regression models and probe them in ways that allow us to "peek" inside their black-box structures. At this stage we also compare the normalized importance values of the features between the ML models and the raw data's sensitivity analysis to check for potential discrepancies that might indicate errors in our regressions.

With pgSIT, as it is a self-limiting sterile-insect technique, the probability of elimination (POE) and cumulative potential for transmission (CPT) show different responses to changes in the inputs. Male fertility (mfr) and female viability (fvb), for example, are not critical for CPT, but if elimination is the goal, they are quite relevant for the construct's design.

In terms of feature importance, we can see that there is strong agreement between the MLR models and the SSA, so we can be more confident in the reliability of our machine learning emulators.
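The poster does not state how agreement between the two sets of importances was quantified. As a minimal sketch under that caveat, the snippet below normalizes each set of scores to sum to one, then reports the largest per-feature gap and the Spearman rank correlation. The scores are invented for illustration and are not results from this study; `normalize`, `ranks`, and `spearman` are hypothetical helpers.

```python
def normalize(importances):
    """Scale non-negative importance scores so that they sum to one."""
    total = sum(importances.values())
    return {name: value / total for name, value in importances.items()}


def ranks(importances):
    """Rank features from most (1) to least important; ties not handled."""
    ordered = sorted(importances, key=importances.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman(a, b):
    """Spearman rank correlation for two score dicts without ties."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    d2 = sum((ra[name] - rb[name]) ** 2 for name in ra)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Made-up scores on different scales, for illustration only.
mlr_scores = {"res": 4.0, "ren": 3.0, "rei": 1.0, "pct": 2.0}
ssa_scores = {"res": 0.40, "ren": 0.35, "rei": 0.05, "pct": 0.20}

mlr_norm = normalize(mlr_scores)
ssa_norm = normalize(ssa_scores)

# Largest disagreement on any single feature, and overall rank agreement.
largest_gap = max(abs(mlr_norm[k] - ssa_norm[k]) for k in mlr_norm)
agreement = spearman(mlr_norm, ssa_norm)
```

A rank correlation near one with small per-feature gaps would support trusting the emulator. A feature ranked very differently by the two methods is a prompt to re-inspect the regression.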

6. Package with Docker & Deploy to Heroku

Finally, if there is enough agreement between the SA and the MLR, and if the regression's inspection metrics are good, we end up with a usable set of regression models, which can be shipped to collaborators as emulators of simulated scenarios. This can be done by creating a user interface with Dash and packaging it with Docker. Once the image has been created and tested, it can be pushed to DockerHub for portability, or deployed to Heroku for online interactivity.

Quick Summary

We present a data-analysis pipeline in which we combine statistical sensitivity analysis with machine learning emulators to increase the reliability and interpretability of our gene drive models for mosquito-borne disease elimination. We do so in an effort to make the results of our mechanistic MGDrivE model more accessible to the wider scientific community, and as a step towards laying the foundation for operational-range analysis and target product profiles for mosquito genetic constructs.

[Figure panels: POE, CPT]
