An Optimal Bayesian Intervention Policy in Response
to Unknown Dynamic Cell Stimuli
Seyed Hamid Hosseini^a, Mahdi Imani^a
^a Northeastern University, 360 Huntington Ave, Boston, MA 02115, U.S.
Abstract
Interventions in gene regulatory networks (GRNs) aim to restore normal functions
of cells experiencing abnormal behavior, such as uncontrolled cell proliferation.
The dynamic, uncertain, and complex nature of cellular processes poses signifi-
cant challenges in determining the best interventions. Most existing intervention
methods assume that cells are unresponsive to therapies, resulting in stationary
and deterministic intervention solutions. However, cells in unhealthy conditions
can dynamically respond to therapies through internal stimuli, leading to the re-
currence of undesirable conditions. This paper proposes a Bayesian intervention
policy that adaptively responds to cell dynamic responses according to the latest
available information. The GRNs are modeled using a Boolean network with per-
turbation (BNp), and the fight between the cell and intervention is modeled as a
two-player zero-sum game. Assuming an incomplete knowledge of cell stimuli,
a recursive approach is developed to keep track of the posterior distribution of
cell responses. The proposed Bayesian intervention policy takes action accord-
ing to the posterior distribution and a set of Nash equilibrium policies associated
with all possible cell responses. Analytical results demonstrate the superiority of
the proposed intervention policy against several existing intervention techniques.
Meanwhile, the performance of the proposed policy is investigated through com-
prehensive numerical experiments using the p53-MDM2 negative feedback loop
regulatory network and melanoma network. The results demonstrate the empirical
convergence of the proposed policy to the optimal Nash equilibrium policy.
Keywords: Gene Regulatory Networks, Two-Player Zero-Sum Game, Bayesian
intervention, Boolean networks, Nash Equilibrium.
1. Introduction
Recent genomics advances have deepened our understanding of complex bi-
ological systems, particularly gene regulatory networks (GRNs) [1, 2, 3, 4, 5, 6].
GRNs consist of several interacting genes whose activities control cellular pro-
cesses, including DNA repair, stress response, and complex diseases like can-
cer [7]. In genomics intervention, the objective is to design effective intervention
strategies that can alter the undesirable behavior of unhealthy cells (e.g., those
associated with chronic diseases) and shift them into desirable ones.
Boolean networks have emerged as a powerful class of models for character-
izing the temporal dynamics of GRNs [8, 9, 10, 11, 12, 13]. Several interven-
tion strategies have been developed for Boolean network models in recent years.
These include structural interventions, which aim to make a single-time, long-
lasting change in the interaction between two or more genes [14, 15, 16, 17, 18],
and dynamic interventions that perturb (e.g., overexpress or suppress) the activity
of targeted genes over time [14, 15, 16, 17]. The most well-known method is the
optimal stationary intervention derived in [19], which is later extended to include
constraints [20, 21] and asynchronicity of the GRNs [22, 13]. Meanwhile, several
intervention approaches are developed for GRNs with states observed indirectly
through gene-expression data [23, 24, 25, 26, 27, 28], including robust interven-
tion methods for domains with partially-known dynamics and costs [29, 30].
Most existing intervention methods are built on the assumption that cells are
isolated and non-responsive to therapies. However, the dynamic and intelligent
responses of cells to therapies, triggered by internal stimuli, often result in the
short-term success of interventions at early stages and the recurrence of the un-
healthy condition afterward. This paper models GRNs using Boolean networks
with perturbation (BNp) [31, 32], and models the cell dynamic responses to inter-
ventions through a two-player zero-sum game [33, 34, 35]. There are two players
in the game: the cell and the intervention, each with opposing goals. The cell
aims to maintain the cell condition in unhealthy states using its internal stimuli,
while the intervention’s objective is to deviate the system from unhealthy condi-
tions through therapies. Assuming incomplete information about the possible cell
responses to interventions, this paper develops a recursive method for computing
the posterior distribution of the cell responses. Given the quantified uncertainty
in cell responses, we develop a Bayesian intervention policy. The proposed pol-
icy utilizes the combination of the Nash equilibrium policies for different cell
responses and the posterior associated with them. The policy is fully adaptive;
as new data appears, the posterior distribution of cell responses and the proposed
intervention policy are updated.
The main contributions of this paper are as follows:
• Modeling the aggressive and dynamic responses of unhealthy cells during the intervention process, which enables deriving intervention solutions by accounting for and predicting possible cell responses to therapies.
• Developing an adaptive Bayesian intervention policy that can probabilistically reason about cell responses and incorporate such knowledge to make better intervention decisions.
• Analytically demonstrating the superiority of the proposed policy compared to existing intervention methods, along with numerical results indicating the empirical convergence of the proposed policy to the optimal Nash policy.
We analyze the performance of the proposed intervention policy using the
p53-MDM2 and melanoma networks. The p53-MDM2 network is a crucial reg-
ulatory system that responds to cellular stresses such as DNA damage [36, 37].
The melanoma regulatory network also plays a crucial role in the development
and progression of melanoma, a highly aggressive form of skin cancer [21, 38].
Through a comprehensive set of numerical experiments using these two networks,
we compare the performance of the proposed policy with state-of-the-art interven-
tion methods.
The article is organized as follows: The GRN model is briefly described in
Section 2. Section 3 includes formulating the intervention process as a two-player
zero-sum game, followed by the optimal Nash equilibrium policy for a two-player
zero-sum game. The proposed Bayesian intervention policy and its matrix-form
implementation are presented in Sections 4 and 5, respectively. The analytical and
numerical results are presented in Section 6 and Section 7, respectively. Finally,
Section 8 contains the concluding remarks.
2. Background
In this paper, a Boolean network with perturbation model [32, 39] is used
to capture the dynamics of gene regulatory networks. The BNp model effec-
tively incorporates the stochastic nature of GRNs and accounts for the uncertainty
coming from unmodeled parts of the systems. Consider a GRN consisting of d
components. The state process can be represented as {x_k; k = 0, 1, ...}, where x_k ∈ {0,1}^d denotes the activation or inactivation state of the genes at time k. The genes' state is influenced by a series of internal and external inputs/stimuli. At each discrete time point, the state of the genes evolves according to the following Boolean signal model [40]:

x_k = f(x_{k−1}) ⊕ a_{k−1} ⊕ u_{k−1} ⊕ n_k,   k = 1, 2, ...,   (1)

where {a_k; k = 0, 1, ...} refers to a set of external interventions/therapies, {u_k; k = 0, 1, ...} represents internal inputs regulated by the cell, n_k ∈ {0,1}^d represents Boolean transition noise at time k, "⊕" denotes component-wise modulo-2 addition, and f is the network function. The noise value n_k(j) = 1 alters the state of the jth gene at time step k, whereas for n_k(j) = 0, the jth state follows the value predicted by the network function. The noise process n_k is assumed to have independent components modeled by a Bernoulli distribution with parameter p > 0. The Bernoulli parameter p represents the noise intensity, with higher values representing more chaotic systems and smaller values indicating nearly deterministic models. Note that the rest of the paper is applicable to a general class of Boolean network models of the form f(x_{k−1}, a_{k−1}, u_{k−1}, n_k).
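For concreteness, the following minimal Python sketch simulates one BNp transition of the form (1); the toy network function, the dimension, and the noise level are illustrative placeholders rather than quantities taken from this paper's experiments.

```python
import numpy as np

def bnp_step(x, a, u, net_fn, p, rng):
    """One BNp transition of Eq. (1):
    x_k = f(x_{k-1}) XOR a_{k-1} XOR u_{k-1} XOR n_k, with n_k ~ Bernoulli(p)^d."""
    noise = rng.random(x.size) < p          # independent Bernoulli(p) gene flips
    return net_fn(x) ^ a ^ u ^ noise        # component-wise modulo-2 addition

# Toy 3-gene network function (hypothetical): each gene copies its left neighbor.
net_fn = lambda x: np.roll(x, 1)
rng = np.random.default_rng(0)
x = np.array([1, 0, 1], dtype=bool)
a = np.zeros(3, dtype=bool)                 # no intervention
u = np.zeros(3, dtype=bool)                 # no internal cell stimulus
x_next = bnp_step(x, a, u, net_fn, p=0.05, rng=rng)
```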
The network function in GRNs is often represented through a Boolean logic
model or a pathway diagram model [41, 40]. The Boolean logic model captures
the genes’ activities and interactions using logical operators such as AND, OR,
XOR, and NOT, while the pathway diagram model parameterizes suppressive and
activating interactions among genes to capture their dynamics. These models have
shown success in capturing the temporal changes in gene activities and causal
interactions among genes.
3. Battle of Cell and Intervention
3.1. Two-Player Zero-Sum Game
We represent the battle between the cell and intervention as a two-player zero-
sum game [42, 33, 34, 35]. This can be characterized by a tuple ⟨X, A, U, R^a, T⟩, where X = {0,1}^d is the state space, A is the intervention space, U is the cell control space, R^a is the intervention reward function, and T is the state transition probability function. T : X × A × U × X → [0, 1] is such that p(x′ | x, a, u) represents the probability of moving to state x′ according to the external and internal inputs a and u in state x. Also, R^a(x, a, u, x′) denotes the immediate intervention reward gained if the system moves from state x to state x′ according to the joint intervention and cell actions (a, u).
3.2. Optimal Nash Intervention Policy under Known Cell Responses
The diagram representing the fight between cell and intervention is shown
in Fig. 1. For cells in cancerous conditions, the intervention objective is to de-
crease cell proliferation, whereas cells aim to increase such proliferation by fight-
ing against interventions. The opposite objectives of the intervention and cell can be expressed by the cell reward R^u taking the negative of the intervention reward, i.e., R^u(x, a, u, x′) = −R^a(x, a, u, x′).
Figure 1: The fight between intervention and the cell dynamic response according to its
internal stimuli.
This paper focuses on stationary Markov Nash equilibria in GRNs modeled by
the infinite-horizon discounted Markov game. Let U contain a finite set of stimuli/actions that the cell could perform during the intervention process against therapies. Let also A be the set of actions/therapies available during the intervention process. We define the intervention policy π^a(a | x), representing the probability of taking action a ∈ A in any given state x ∈ X. Similarly, the cell policy π^u(u | x) specifies the probability of selecting input u ∈ U in state x ∈ X. For the joint stochastic policy (π^a, π^u), the expected value functions of the intervention and the cell can be defined as:

V^a_{π^a, π^u}(x) = E[ Σ_{t ≥ 0} γ^t R^a(x_t, a_t, u_t, x_{t+1}) | a_{0:∞} ∼ π^a, u_{0:∞} ∼ π^u, x_0 = x ],
V^u_{π^a, π^u}(x) = E[ Σ_{t ≥ 0} γ^t R^u(x_t, a_t, u_t, x_{t+1}) | a_{0:∞} ∼ π^a, u_{0:∞} ∼ π^u, x_0 = x ],   (2)

for x ∈ X; where 0 < γ < 1 is a discount factor that prioritizes early-stage rewards compared to future ones. Given that the cell and intervention reward functions are negatives of each other, we have V^a_{π^a, π^u}(x) = −V^u_{π^a, π^u}(x), for any x ∈ X. Due to the interplay between state values for the cell and intervention, this problem differs from a Markov decision process (MDP). The optimal solution for a two-player zero-sum game can be expressed through the Markov game. This is expressed as the optimal Nash equilibrium policy π* = (π^{a*}, π^{u*}), which for any joint policy π = (π^a, π^u) and x ∈ X satisfies [33]:

V^a_{π^{a*}, π^{u*}}(x) ≥ V^a_{π^a, π^{u*}}(x)   and   V^u_{π^{a*}, π^{u*}}(x) ≥ V^u_{π^{a*}, π^u}(x).   (3)
The optimal Nash equilibrium policy is the policy from which neither the cell nor the intervention has any motivation to deviate. This policy can be expressed according to the min-max theorem as [43]:

(π^{a*}, π^{u*}) = argmax_{π^a} argmin_{π^u} V^a_{π^a, π^u}(x) = argmin_{π^u} argmax_{π^a} V^a_{π^a, π^u}(x),   for all x ∈ X.   (4)

Based on equation (2), any pair (π^a, π^u) that achieves the supremum and infimum values in equation (4) forms an optimal Nash equilibrium.
4. Bayesian Intervention Policy under Unknown Cell Responses
4.1. Intervention Challenges of Unknown Cell Space
If the cell space U, representing the internal cell stimuli, is fully known, then
the optimal Nash policy could be achieved as a solution for the optimization in
(4). However, in practice, the cell’s internal stimuli are often unknown, preventing
the computation of the optimal Nash policy. Therefore, this paper aims to de-
rive an effective intervention policy that can be implemented despite incomplete
knowledge about cell space. We present a systematic approach to probabilistically
reason about the possible cell responses using the latest available data and use this
knowledge for effective intervention selection.
Let U_1, ..., U_M be the set of all possible cell spaces. This set depends on the size of the regulatory network and the prior biological knowledge regarding the cell responses. Given a regulatory network consisting of d genes, there are 2^d possible cell actions. In this case, there are \binom{2^d}{1} cell spaces containing 1 cell action, \binom{2^d}{2} sets with 2 cell actions, and \binom{2^d}{m} sets containing m cell actions. This set can be large for large regulatory networks, but as described in the following paragraph, the posteriors of many models approach zero as more data are observed.
If U_i is the true cell space, the optimal space-specific Nash policy can be expressed as (π^{a*,U_i}, π^{u*,U_i}), where this policy can be computed using the optimization problem in (4) corresponding to the cell space U_i. The Nash policy obtained under cell space U_i might significantly differ from the one obtained under U_j ≠ U_i. Thus, given limited or no knowledge about the true cell space, the space-specific intervention policies are not directly implementable. In fact, executing a wrong (non-optimal) intervention policy corresponding to U_j ≠ U* could lead to poor intervention performance and the dominance of the cell.
4.2. Probability Model over Cell Spaces
This paper constructs a probabilistic model over the cell spaces. Let p0(i)be
the prior probability of the ith cell space Ui. The prior information about the set
of cell spaces can be represented in a single vector as:
p_0 = [P(U_1), ..., P(U_M)]^T.   (5)

If no prior biological knowledge about the cell space is available, a uniform prior can be considered over the cell spaces, i.e., p_0 = [1/M, ..., 1/M]^T.
Let p_{k−1} = [p_{k−1}(1), ..., p_{k−1}(M)] be the posterior probability over the cell spaces obtained according to the sequence of observed states x_{0:k−1} obtained upon taking interventions a_{0:k−2}. If intervention a_{k−1} is taken at time step k−1 and the state x_k is observed at time step k, the posterior probability of the cell spaces at time step k can be expressed as:

p_k(i) = P(U = U_i | a_{0:k−1}, x_{0:k})
       = p(x_k, U_i | a_{0:k−1}, x_{0:k−1}) / p(x_k | a_{0:k−1}, x_{0:k−1})
       = P(x_k | a_{0:k−1}, x_{0:k−1}, U_i) P(U = U_i | a_{0:k−2}, x_{0:k−1}) / [ Σ_{j=1}^{M} P(x_k | a_{0:k−1}, x_{0:k−1}, U_j) P(U = U_j | a_{0:k−2}, x_{0:k−1}) ]
       = p(x_k | a_{0:k−1}, x_{0:k−1}, U_i) p_{k−1}(i) / [ Σ_{j=1}^{M} p(x_k | a_{0:k−1}, x_{0:k−1}, U_j) p_{k−1}(j) ],   (6)
for i = 1, ..., M. The numerator term in (6) specifies the probability of observing the next state x_k given the sequence of interventions and states and the cell space U_i. Further simplification of this term through marginalization of the joint distribution of the state x_k and the unobserved cell action u_{k−1} at time step k leads to:

p(x_k | U_i, a_{0:k−1}, x_{0:k−1}) = Σ_{u ∈ U_i} p(x_k, u_{k−1} = u | U_i, a_{0:k−1}, x_{0:k−1})
    = Σ_{u ∈ U_i} p(x_k | u_{k−1} = u, a_{k−1}, x_{k−1}) p(u_{k−1} = u | U_i, x_{k−1})
    = Σ_{u ∈ U_i} (p / (1 − p))^{||f(x_{k−1}) ⊕ a_{k−1} ⊕ u ⊕ x_k||_1} (1 − p)^d  π^{u*,U_i}(u | x_{k−1}),   (7)

where π^{u*,U_i}(u_{k−1} = u | x_{k−1}) = p(u_{k−1} = u | U = U_i, x_{k−1}) is the probability that the cell takes action u_{k−1} = u at state x_{k−1} if the true cell action space is U_i. The first line in the last expression in (7) is obtained using the Markovian properties of the state transition and the Bernoulli process noise. Replacing (7) into (6), the posterior probability of the cell spaces can be recursively computed using the last taken intervention and the last observed state.
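The recursion in (6)-(7) can be sketched compactly in Python. The helper below assumes the space-specific Nash cell policies π^{u*,U_i} have already been computed and stored as arrays, and `state_index` is a hypothetical function mapping a Boolean state vector to its row index.

```python
import numpy as np

def posterior_update(p_prev, x_prev, a_prev, x_new, cell_spaces, cell_policies,
                     net_fn, p, state_index):
    """Posterior over candidate cell spaces, Eqs. (6)-(7).
    cell_spaces[i]  : list of candidate cell actions (Boolean arrays) in U_i
    cell_policies[i]: array of shape (2^d, |U_i|) holding pi^{u*,U_i}(u | x)."""
    d = x_prev.size
    lik = np.zeros(len(cell_spaces))
    for i, (U_i, pi_u) in enumerate(zip(cell_spaces, cell_policies)):
        for j, u in enumerate(U_i):
            flips = int(np.sum(net_fn(x_prev) ^ a_prev ^ u ^ x_new))  # ||f(x) + a + u + x'||_1
            lik[i] += p**flips * (1 - p)**(d - flips) * pi_u[state_index(x_prev), j]
    post = lik * p_prev                      # multiply by the previous posterior
    return post / post.sum()                 # normalize, Eq. (6)
```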
4.3. Bayesian Intervention Policy
Let p_k be the posterior probability over the cell spaces obtained according to the states x_{0:k} and the sequence of interventions a_{0:k−1}. The proposed Bayesian intervention policy at time step k can be expressed as:

μ^{a,B}_k(a | x_k) := p(a_k = a | a_{0:k−1}, x_{0:k})
    = Σ_{i=1}^{M} p(a_k = a, U = U_i | a_{0:k−1}, x_{0:k})
    = Σ_{i=1}^{M} p(a_k = a | U_i, a_{0:k−1}, x_{0:k}) p(U = U_i | a_{0:k−1}, x_{0:k})
    = Σ_{i=1}^{M} p(a_k = a | U_i, a_{0:k−1}, x_{0:k}) p_k(i)
    = Σ_{i=1}^{M} π^{a*,U_i}(a | x_k) p_k(i),   (8)

for a ∈ A; where the cell space is augmented and marginalized out in the second line. One can see that if the uncertainty over the cell spaces goes to zero, the Bayesian policy μ^{a,B}(. | x_k) becomes the optimal Nash equilibrium policy under the known cell space, π^{a*,U*}(. | x_k).
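As a sketch, the mixture in (8) is a posterior-weighted average of the precomputed space-specific Nash intervention policies; the array layout and helper names below are illustrative assumptions.

```python
import numpy as np

def bayesian_policy(nash_policies, posterior, state_idx):
    """Eq. (8): mu^{a,B}_k(. | x_k) = sum_i pi^{a*,U_i}(. | x_k) p_k(i).
    nash_policies: array of shape (M, 2^d, |A|); posterior: length-M vector."""
    mix = np.einsum('i,ia->a', posterior, nash_policies[:, state_idx, :])
    return mix / mix.sum()                     # guard against round-off

def sample_intervention(mix, rng):
    """Draw an action index a_k ~ mu^{a,B}_k(. | x_k)."""
    return rng.choice(len(mix), p=mix)
```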
The Bayesian policy in (8) is stochastic and provides the best intervention solution given the available data. Let {u_1, ..., u_N} be all unique cell actions in the set of cell spaces, i.e., {u_1, ..., u_N} = U_1 ∪ ... ∪ U_M ⊆ {0,1}^d. The Bayesian modeling of the cell defense policy at time step k can be expressed as:

μ^{u,B}_k(u | x_k) = p(u_k = u | a_{0:k−1}, x_{0:k})
    = Σ_{i=1}^{M} p(u_k = u, U = U_i | a_{0:k−1}, x_{0:k})
    = Σ_{i=1}^{M} p(u_k = u | U_i, a_{0:k−1}, x_{0:k}) p(U = U_i | a_{0:k−1}, x_{0:k})
    = Σ_{i=1}^{M} p(u_k = u | U_i, a_{0:k−1}, x_{0:k}) p_k(i)
    = Σ_{i=1}^{M} π^{u*,U_i}(u | x_k) p_k(i),   (9)

for u ∈ {u_1, ..., u_N}. Note that the cell defense response in (9) represents the intervention's belief about the cell policy, since the cell performs the optimal Nash policy corresponding to the true cell space.
The Bayesian policy in (8) yields optimality with respect to the posterior distribution of the cell spaces. The schematic diagram of the proposed Bayesian intervention policy is shown in Fig. 2. As the next intervention is performed and the next state is observed, the posterior distribution over the cell spaces is updated, and the optimal Bayesian policy can then be recomputed according to the new posterior and the newly observed state. The analysis of the proposed Bayesian policy and its comparison with state-of-the-art intervention policies are described in Section 6.
5. Matrix-Form Formulation of the Proposed Bayesian Intervention Policy
This section provides an efficient and recursive computation of the proposed
Bayesian intervention policy. The process is divided into offline and online steps.
The offline step consists of computing the space-specific optimal Nash policies
associated with all cell spaces. Upon termination of the offline step, the online
step computes the posterior distribution of all cell spaces given the last observed
state, followed by the calculation of the Bayesian intervention policy. The details
of these two steps are outlined below.
Figure 2: The schematic diagram of the proposed Bayesian intervention policy.
5.1. Offline Step Computation
The offline step computes the space-specific optimal Nash equilibrium policy for all cell spaces, i.e., {U_1, ..., U_M}. This is achieved according to the value iteration method for a two-player zero-sum game [33]. For the ith cell space U_i, we define the state joint-action value function for any state value function V : X → R as:

Q^{a,U_i}_V(x, a, u) = E_{x′ ∼ P(. | x, a, u)}[ R^a(x, a, u, x′) + γ V(x′) ],   (10)

for x ∈ X, a ∈ A, and u ∈ U_i. Q^{a,U_i}_V(x, ., .) can be seen as a matrix in R^{|A| × |U_i|}, with elements representing the expected discounted accumulated reward for the intervention when the joint actions (a, u) are performed at state x and the policy associated with the state value function V is followed.
We define the joint-action transition matrix associated with (a, u) in R^{2^d × 2^d} as:

(M(a, u))_{lj} = P(x_k = x^j | x_{k−1} = x^l, a_{k−1} = a, u_{k−1} = u)
             = p^{||f(x^l) ⊕ a ⊕ u ⊕ x^j||_1} (1 − p)^{d − ||f(x^l) ⊕ a ⊕ u ⊕ x^j||_1},   (11)

for l, j = 1, ..., 2^d, a ∈ A, and u ∈ U_i, where ||.||_1 is the absolute L-1 norm of a vector. Under zero noise and stochasticity, f(x^l) ⊕ a ⊕ u represents the state of the genes in the next time step. Thus, ||f(x^l) ⊕ a ⊕ u ⊕ x^j||_1 counts the number of flips caused by the noise once the system moves from state x^l to state x^j. The transition probability in (11) is computed based on the noise characteristics of each variable, modeled as independent Bernoulli variables with parameter p.
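A direct, if brute-force, sketch of (11) enumerates all 2^d states and fills the matrix entry by entry; the function below is illustrative and assumes Boolean vectors for the actions.

```python
import numpy as np
from itertools import product

def transition_matrix(a, u, net_fn, p, d):
    """Joint-action transition matrix M(a,u) of Eq. (11):
    (M)_{lj} = p^m (1-p)^(d-m), where m is the number of noise flips from x^l to x^j."""
    states = np.array(list(product([0, 1], repeat=d)), dtype=bool)   # all 2^d states
    M = np.empty((2**d, 2**d))
    for l, x in enumerate(states):
        pred = net_fn(x) ^ a ^ u                   # noise-free successor f(x^l) + a + u
        for j, x_next in enumerate(states):
            m = int(np.sum(pred ^ x_next))         # Hamming distance = number of flips
            M[l, j] = p**m * (1 - p)**(d - m)
    return M
```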
The matrix-form representation of the intervention reward function associated with a and u can be expressed as:

(R^a(a, u))_{lj} = R^a(x^l, a, u, x^j),   for l, j = 1, ..., 2^d.   (12)

The expected intervention reward in state x^l after taking actions (a, u) and before observing the next state can be computed as:

R^a(x^l, a, u) = E_{x′ | x^l, a, u}[ R^a(x^l, a, u, x′) ]
             = Σ_{j=1}^{2^d} P(x_k = x^j | x_{k−1} = x^l, a_{k−1} = a, u_{k−1} = u) R^a(x^l, a, u, x^j),   (13)

for l = 1, ..., 2^d. The expected reward in (13) can be rewritten according to (11) and (12) as:

R^a(x^l, a, u) = Σ_{j=1}^{2^d} (R^a(a, u))_{lj} (M(a, u))_{lj}.   (14)

We define the expected intervention reward function in vector form as R^a_{a,u} = [R^a(x^1, a, u), ..., R^a(x^{2^d}, a, u)]^T. This vector can be computed using the following matrix-form computation:

R^a_{a,u} = (R^a(a, u) ⊙ M(a, u)) 1_{2^d × 1},   (15)

for a ∈ A and u ∈ U_i; where 1_{2^d × 1} is a vector of size 2^d with all elements equal to 1, and ⊙ is the Hadamard product.
According to the controlled transition matrix M(a, u) and the vector-form reward function R^a_{a,u}, the Q-values defined in (10) can be calculated as:

[ Q^{a,U_i}_V(x^1, a, u), ..., Q^{a,U_i}_V(x^{2^d}, a, u) ]^T = R^a_{a,u} + γ M(a, u) V,   (16)

for a ∈ A and u ∈ U_i and any given state value function V.
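In code, (15)-(16) reduce to a Hadamard product, a row sum, and a matrix-vector product; the sketch below handles one joint action (a, u) and assumes dense 2^d × 2^d arrays.

```python
import numpy as np

def q_values(R_mat, M_mat, V, gamma):
    """Eqs. (15)-(16) for a single joint action (a, u):
    R_mat[l, j] = R^a(x^l, a, u, x^j),  M_mat = M(a, u)."""
    r_vec = (R_mat * M_mat).sum(axis=1)     # (R^a(a,u) o M(a,u)) 1, Eq. (15)
    return r_vec + gamma * M_mat @ V        # length-2^d vector of Q(x^l, a, u), Eq. (16)
```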
Let π^a denote a collection of 2^d probability vectors over A (one per state), and π^u a collection of 2^d probability vectors over U_i. Consider Q^{a,U_i}_V(x, ., .) as the payoff matrix of a matrix-form zero-sum game. We define the Bellman operator T for any x ∈ X as [33]:

(T[V])(x) = Value[ Q^{a,U_i}_V(x, ., .) ]
          = max_{π^a} min_{π^u} Σ_{a ∈ A} Σ_{u ∈ U_i} π^a(a | x) π^u(u | x) Q^{a,U_i}_V(x, a, u),   (17)

which should meet the conditions Σ_{a ∈ A} π^a(a | x) = Σ_{u ∈ U_i} π^u(u | x) = 1. The solution of the min-max optimization in (17) can be obtained using a linear programming technique.
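One standard way to compute Value[Q] in (17) is the classical linear program for zero-sum matrix games; the sketch below uses scipy.optimize.linprog and returns the game value together with the maximizing intervention mixture (the cell's minimizing mixture can be read off the LP dual variables or obtained by solving the transposed game).

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(Q):
    """Value and maximin policy of the zero-sum matrix game with payoff Q
    (rows: intervention actions, columns: cell actions), as used in Eq. (17)."""
    n_a, n_u = Q.shape
    # Variables z = [pi^a(1..n_a), v]; maximize v  <=>  minimize -v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For every cell action u:  v - sum_a pi^a(a) Q[a, u] <= 0.
    A_ub = np.hstack([-Q.T, np.ones((n_u, 1))])
    b_ub = np.zeros(n_u)
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])   # sum_a pi^a(a) = 1
    b_eq = np.ones(1)
    bounds = [(0, None)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n_a]          # game value, pi^a(. | x)
```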
The Bellman operator T is γ-contractive in the L-∞ norm, and the unique solution to the Bellman equation corresponds to the optimal value function, denoted as V* = T[V*] [33]. This fixed-point solution represents an optimal Nash equilibrium for the Markov game associated with the cell space U_i. Therefore, starting from any arbitrary V, we can repeatedly apply V_{t+1} = T[V_t] for t = 0, 1, ..., and compute a fixed-point solution for the value vector.

Let V_0 = [0, ..., 0]^T denote the initial value vector with all elements set to 0. During the rth iteration of the value iteration method, the new vector V_{r+1} is obtained by applying the Bellman operator to the previous value V_r as:

V_{r+1}(x^l) = Value[ Q^{a,U_i}_{V_r}(x^l, ., .) ],   for l = 1, ..., 2^d,   (18)

where Q^{a,U_i}_{V_r}(x^l, ., .) consists of the Q-values for all joint pairs (a, u). In practice, the iterations continue until the maximum difference between the elements of the value vectors in two consecutive iterations becomes smaller than a predetermined threshold ϵ > 0, expressed as:

max_{l ∈ {1,...,2^d}} |V_T(l) − V_{T−1}(l)| < ϵ.
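Putting (16)-(18) together gives the value-iteration loop below; this is a sketch that reuses the hypothetical matrix_game_value helper from the previous snippet and assumes the reward and transition matrices are stored as dense arrays indexed by action.

```python
import numpy as np

def shapley_value_iteration(R, M, gamma, eps):
    """Value iteration for the zero-sum Markov game, Eq. (18).
    R, M: arrays of shape (|A|, |U_i|, 2^d, 2^d) holding R^a(a,u) and M(a,u)."""
    n_a, n_u, n_states, _ = M.shape
    V = np.zeros(n_states)
    while True:
        V_prev = V.copy()
        r_vec = (R * M).sum(axis=-1)                              # Eq. (15) for all (a, u)
        Q = r_vec + gamma * np.einsum('aulj,j->aul', M, V_prev)   # Eq. (16)
        Q = np.transpose(Q, (2, 0, 1))                            # shape (2^d, |A|, |U_i|)
        V = np.array([matrix_game_value(Q[l])[0] for l in range(n_states)])  # Eq. (18)
        if np.max(np.abs(V - V_prev)) < eps:                      # stopping criterion
            return V, Q
```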
Let V_T = V* be the fixed-point solution obtained after conducting the value iteration method. The Q-values associated with V* can be computed as:

[ Q^{a,U_i}_{V*}(x^1, a, u), ..., Q^{a,U_i}_{V*}(x^{2^d}, a, u) ]^T = R^a_{a,u} + γ M(a, u) V*,   for a ∈ A, u ∈ U_i.   (19)

After computation of the optimal Q-values, the optimal policies for the intervention and the cell can be calculated as:

( π^{a*,U_i}(. | x), π^{u*,U_i}(. | x) ) = argmax_{π^a} argmin_{π^u} Σ_{a ∈ A} Σ_{u ∈ U_i} π^a(a | x) π^u(u | x) Q^{a,U_i}_{V*}(x, a, u),   (20)

for any x ∈ X, where π^{a*,U_i}(a | x) and π^{u*,U_i}(u | x) are non-negative numbers that add up to 1 for any x ∈ X. The solution to the Nash equilibrium policy in (20) can be obtained using a linear programming technique. Repeating the above process for all cell spaces leads to the computation of the space-specific Nash policies in the offline step.
Algorithm 1 Bayesian Intervention Policy
1: Input: intervention space A; cell spaces U_1, ..., U_M; intervention reward (R^a(a, u))_{lj} = R^a(x^l, a, u, x^j); controlled transition matrix M(a, u); threshold ϵ > 0.
Offline Step
2: for U_i ∈ {U_1, ..., U_M} do
3:   Set V = 0_{2^d × 1}.
4:   repeat
5:     V′ = V.
6:     [ Q^{a,U_i}_{V′}(x^1, a, u), ..., Q^{a,U_i}_{V′}(x^{2^d}, a, u) ]^T = (R^a(a, u) ⊙ M(a, u)) 1_{2^d × 1} + γ M(a, u) V′, for a ∈ A and u ∈ U_i.
7:     Bellman operator: V(x^l) = Value[ Q^{a,U_i}_{V′}(x^l, ., .) ], for l = 1, ..., 2^d   [Eq. (17)]
8:   until max_{l ∈ {1,...,2^d}} |V(x^l) − V′(x^l)| < ϵ
9:   For any given x ∈ X, use a linear programming approach over Q^{a,U_i}_V(x, ., .) to obtain π^{a*,U_i}(. | x) and π^{u*,U_i}(. | x).
10: end for
Online Step
11: Input: initial state x_0, and initial probability of the cell spaces p_0 = [P(U_1), ..., P(U_M)].
12: for k = 0, 1, 2, ... do
13:   Compute the Bayesian intervention policy μ^{a,B}_k(a | x_k) = Σ_{i=1}^{M} π^{a*,U_i}(a | x_k) p_k(i), for a ∈ A, and select an action accordingly: a_k ∼ μ^{a,B}_k(. | x_k).
14:   Apply the intervention a_k and receive the next system state x_{k+1}.
15:   Posterior update:
      p_{k+1}(i) = [ Σ_{u ∈ U_i} (p / (1 − p))^{||f(x_k) ⊕ a_k ⊕ u ⊕ x_{k+1}||_1} π^{u*,U_i}(u | x_k) p_k(i) ] / [ Σ_{j=1}^{M} Σ_{u ∈ U_j} (p / (1 − p))^{||f(x_k) ⊕ a_k ⊕ u ⊕ x_{k+1}||_1} π^{u*,U_j}(u | x_k) p_k(j) ],   i = 1, ..., M.
16: end for
5.2. Online Step Computation
This section describes a recursive and online computation of the Bayesian in-
tervention policy, obtained according to the space-specific Nash equilibrium poli-
cies computed during the offline step. Let p_k contain the posterior probabilities of the cell spaces and x_k be the system state at time step k. An intervention at time step k can be selected according to the Bayesian policy in (8) as:

a_k ∼ μ^{a,B}_k(. | x_k),   (21)

where

μ^{a,B}_k(a | x_k) = Σ_{i=1}^{M} π^{a*,U_i}(a | x_k) p_k(i),   for a ∈ A.   (22)

Upon performing the intervention a_k and observing the next state x_{k+1}, the posterior distribution of the cell spaces can be updated using (6) and (7) as:

p_{k+1}(i) = [ Σ_{u ∈ U_i} (p / (1 − p))^{||f(x_k) ⊕ a_k ⊕ u ⊕ x_{k+1}||_1} π^{u*,U_i}(u | x_k) p_k(i) ] / [ Σ_{j=1}^{M} Σ_{u ∈ U_j} (p / (1 − p))^{||f(x_k) ⊕ a_k ⊕ u ⊕ x_{k+1}||_1} π^{u*,U_j}(u | x_k) p_k(j) ],   (23)

for i = 1, ..., M.
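The online loop can be sketched by chaining the hypothetical helpers from the previous snippets (bayesian_policy, sample_intervention, posterior_update); env_step stands in for the true, unknown cell dynamics and is an assumption of this sketch.

```python
import numpy as np

def run_online(x0, p0, actions, nash_policies, cell_spaces, cell_policies,
               net_fn, p, state_index, env_step, n_steps, rng):
    """Online step of Algorithm 1: draw a_k ~ mu^{a,B}_k(.|x_k), apply it,
    observe x_{k+1}, and update the posterior over cell spaces via Eq. (23)."""
    x, post = x0, p0.copy()
    for _ in range(n_steps):
        mix = bayesian_policy(nash_policies, post, state_index(x))    # Eq. (22)
        a = actions[sample_intervention(mix, rng)]
        x_next = env_step(x, a)             # the (unknown) cell responds inside env_step
        post = posterior_update(post, x, a, x_next, cell_spaces, cell_policies,
                                net_fn, p, state_index)               # Eq. (23)
        x = x_next
    return post
```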
The diagram in Fig. 3 represents the processes of the computation of the pro-
posed intervention policy in the offline and online steps. Algorithm 1 provides the
details of the computations in both steps. Meanwhile, the complexity of each step
is provided in Table 1. The offline step has a computational complexity of order O(2^{2d} × |A| × max_{i=1,...,M} |U_i| × L), where the 2^{2d} factor is due to the transition matrices involved, L represents the number of steps of the value iteration method before termination, |A| is the size of the intervention space, and |U_i| is the size of the ith cell space. In the online step, the computation of the Bayesian intervention has a complexity of order O(M), whereas the posterior update's complexity is of order O(M × max_{i=1,...,M} |U_i|). Overall, the complexity of the online step is significantly lower than that of the offline step, enabling a recursive computation of the proposed intervention policy.
6. Performance Analysis and Comparison with State-of-the-Art Methods
This section analyzes the performance of the proposed Bayesian intervention
policy with the system under no intervention and some of the existing interven-
tion policies.

Figure 3: The schematic diagram of processes in the offline and online steps of the proposed Bayesian intervention policy.

Table 1: Computational complexity of the proposed Bayesian intervention policy.
Offline step (cell space U_i): O(2^{2d} × |A| × |U_i| × L_i)
Bayesian intervention: O(M)
Posterior update: O(M × max{|U_1|, ..., |U_M|})

First, consider a system with no intervention under the aggressive
response of cells, e.g., representing uncontrolled cancerous conditions. The best
cell policy under no intervention is deterministic. Let π^u : X → U be a deterministic cell policy, which assigns a cell action in U to each system state. The optimal cell response under no intervention can be computed as:

π^{u*,a=0}(x) = argmin_{π^u} E[ Σ_{t=0}^{∞} γ^t R^a(x_t, a_t = 0, u_t, x_{t+1}) | x_0 = x, u_{0:∞} ∼ π^u ],   (24)

where π^u ∈ (U)^{2^d} and the minimization is used since the reward of the intervention is the negative of the cell reward function. The steady-state probability under no intervention can be expressed as:

Π^∗_{a=0}(j) = lim_{k→∞} P(x_k = x^j | u_{0:∞} ∼ π^{u*,a=0}, a_{0:∞} = 0),   (25)

for j = 1, ..., 2^d. One can see Π^∗_{a=0} as the long-term probability of the visitation of various states under no intervention.
Most conventional intervention methods assume non-responsive cells [19], wherein cells lack defense mechanisms to counteract interventions (i.e., U = {}). In this scenario, the Markov game can be represented by an MDP with a single agent/player, and since the intervention is derived under the assumption of no competition from cell responses, the optimal intervention policy becomes deterministic. This policy can be expressed as:

π^{a*,u=0}(x) = argmax_{π^a} E[ Σ_{t=0}^{∞} γ^t R^a(x_t, a_t, u_t = 0, x_{t+1}) | x_0 = x, a_{0:∞} ∼ π^a ],   (26)

where the maximization is over all deterministic intervention policies, i.e., (A)^{2^d}. The cell's aggressive response to the naive and deterministic intervention in (26) can be expressed as:

π^{u*,π^{a*,u=0}}(x) = argmin_{π^u} E[ Σ_{t=0}^{∞} γ^t R^a(x_t, a_t, u_t, x_{t+1}) | x_0 = x, a_{0:∞} ∼ π^{a*,u=0}, u_{0:∞} ∼ π^u ],   for x ∈ X.   (27)

The expected value function for the intervention under the no-cell-response policy in (26) and the cell response in (27) can be expressed through V^a_{π^{a*,u=0}, π^{u*,π^{a*,u=0}}}. The intervention gain obtained under this policy compared to the no-intervention case can be expressed as:

V^a_{π^{a*,u=0}, π^{u*,π^{a*,u=0}}}(x) − V^a_{0, π^{u*,a=0}}(x) ≥ 0,   (28)
for any x ∈ X. The positivity of the difference in the state values indicates that the intervention helps the system experience less undesirable conditions, compared to the case with no intervention. Meanwhile, the comparison with the optimal Nash policy (π^{a*,U*}, π^{u*,U*}) can be expressed as:

V^a_{π^{a*,u=0}, π^{u*,π^{a*,u=0}}}(x) ≤ V^a_{π^{a*,u=0}, π^{u*,U*}}(x) ≤ V^a_{π^{a*,U*}, π^{u*,U*}}(x),   (29)

for any x ∈ X, where the inequalities are obtained due to the fact that deviation of the intervention from the optimal Nash policy leads to a reduction in the intervention performance (see (3)). More specifically, if the intervention policy deviates from the Nash policy, the cell can take advantage of this and further shift the system toward undesirable conditions. Note that conventional intervention policies can achieve the same performance level as the optimal Nash policy if and only if the optimal Nash policy is deterministic, i.e., π^{a*,U*}(π^{a*,u=0}(x) | x) = 1, for all x ∈ X.
In this part, the difference between the state-value function of the proposed
Bayesian intervention policy and the optimal Nash policy is investigated. The
proposed Bayesian policy is adaptive, meaning that its policy becomes updated
according to the latest observed states. We represent the Bayesian policy after
time step k as μ^{a,B}_{k:∞} := [μ^{a,B}_k, μ^{a,B}_{k+1}, ...], where μ^{a,B}_{k+1} yields optimality with respect to the information up to time step k+1. Thus, we can express the difference between the state-value functions of the proposed Bayesian policy and the optimal Nash policy as:

V^a_{μ^{a,B}_{k:∞}, π^{u*,U*}}(x_k) − V^a_{π^{a*,U*}, π^{u*,U*}}(x_k) ≤ 0.   (30)
It can be shown that the state value function of the Bayesian policy becomes close
to the optimal Nash policy as time progresses. In fact, for a sufficiently large value
of k, the posterior distribution over the cell spaces is expected to become peaked
over the true cell space, and according to (8), the Bayesian policy becomes the
same as the optimal Nash policy. In particular, the difference between the pro-
posed Bayesian policy at time step k and the optimal Nash policy can be expressed as follows:

KL( π^{a*,U*}(. | x_k), μ^{a,B}_k(. | x_k) ) = Σ_{a ∈ A} π^{a*,U*}(a | x_k) log [ π^{a*,U*}(a | x_k) / μ^{a,B}_k(a | x_k) ]
    = Σ_{a ∈ A} π^{a*,U*}(a | x_k) [ log π^{a*,U*}(a | x_k) − log μ^{a,B}_k(a | x_k) ],   (31)
where KL indicates the Kullback-Leibler divergence. The KL approaches zero if
the posterior peaks over a single cell space (i.e., the true cell space). Finally, unlike
existing deterministic intervention policies, the stochastic nature of the proposed
policy aligns with the stochastic nature of the optimal Nash policy. This stochastic-
ity prevents the cell from predicting a single deterministic intervention in different
cases, helping to ensure short-term and long-term success during the intervention
process.
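The policy distance in (31) is an ordinary discrete KL divergence over the intervention space; a small sketch (with an epsilon guard added purely for numerical safety) is:

```python
import numpy as np

def policy_kl(pi_star, mu_b, eps=1e-12):
    """Eq. (31): KL( pi^{a*,U*}(.|x_k) || mu^{a,B}_k(.|x_k) ) over the intervention space A."""
    pi_star = np.asarray(pi_star, dtype=float)
    mu_b = np.asarray(mu_b, dtype=float)
    mask = pi_star > 0                         # terms with pi = 0 contribute 0
    return float(np.sum(pi_star[mask] *
                        (np.log(pi_star[mask]) - np.log(mu_b[mask] + eps))))
```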
7. Numerical Experiments
In this section, the performance of the proposed intervention policy is assessed
through two well-known gene regulatory networks: the p53-MDM2 Boolean net-
work model and the melanoma regulatory network.
7.1. P53-MDM2 Negative Feedback Loop Network
This paper utilizes a simplified p53-MDM2 Boolean network [44] with DNA
double-strand break (DNA-DSB) for the experiment. This network has been widely
studied for assessing the performance of various intervention policies. The p53
tumor suppressor is a crucial transcription factor that regulates essential cellular
processes, including DNA repair, cell cycle control, apoptosis, angiogenesis, and
senescence [45]. Fig. 4(a) illustrates the diagram of this network, where solid and
blunt arrows indicate activating and suppressive interactions, respectively. The
network consists of four genes, ATM, p53, WIP1, and MDM2, along with DNA-DSB, which is an external stress to the cell. The system state is represented using the following vector: x_k = [ATM_k, p53_k, WIP1_k, MDM2_k]. The Boolean model described in (1) represents the state transition of the healthy system as:

x_k = overline{ [  0   0  −1   0
                  +1   0  −1  −1
                   0  +1   0   0
                  −1  +1  +1   0 ] x_{k−1} + [ dna_dsb, 0, 0, 0 ]^T } ⊕ a_{k−1} ⊕ u_{k−1} ⊕ n_k,   (32)

where overline{v} denotes the function that maps each element of the vector v greater than 0 to 1 and all other elements to 0.
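A sketch of the network function in (32) follows; the connectivity matrix encodes the signs shown above, and dna_dsb defaults to 1 because the experiments below consider the stressed condition. The resulting function can be plugged into the bnp_step sketch from Section 2.

```python
import numpy as np

# Signed regulatory matrix of Eq. (32); rows/columns ordered as [ATM, p53, WIP1, MDM2].
C = np.array([[ 0,  0, -1,  0],
              [ 1,  0, -1, -1],
              [ 0,  1,  0,  0],
              [-1,  1,  1,  0]])

def p53_mdm2_fn(x, dna_dsb=1):
    """Network function f(x): threshold the signed regulatory input of Eq. (32),
    mapping strictly positive sums to 1 and everything else to 0."""
    b = np.array([dna_dsb, 0, 0, 0])
    return (C @ x.astype(int) + b) > 0     # Boolean vector of noise-free next states
```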
Figure 4: (a) The pathway diagram for the p53-MDM2 Boolean network. (b) The aver-
age reward gained by the Bayesian intervention policy, naive intervention policy, and the
Baseline. (c) The average absolute difference of the rewards.
In cells under normal conditions, the stress response is zero (i.e., dna_dsb =
0), whereas under stressed conditions, the stress is present (i.e., dna_dsb = 1).
For non-stressed cells, the genes' states are mostly at rest, i.e., the system remains
in the "0000" state. In stressed conditions, the activation and inactivation of p53
help the system control the genes’ activities and cell proliferation. However, when
p53, a tumor suppressor gene, undergoes a loss of function, other genes can exhibit
excessive activations and cell proliferation, leading to transitioning from a healthy
to a cancerous condition.
The cell defensive responses are modeled using single-gene and double-gene
perturbations. This represents realistic situations in which cells have the capability
to respond to therapies by altering the states of multiple genes simultaneously.
Therefore, the possible cell responses can be expressed through the following 7
actions:
u_1 = [0 0 0 0]^T, u_2 = [1 0 0 0]^T, u_3 = [0 0 1 0]^T, u_4 = [0 0 0 1]^T,
u_5 = [1 0 1 0]^T, u_6 = [1 0 0 1]^T, u_7 = [0 0 1 1]^T.   (33)
The cell might utilize one or multiple stimuli in response to interventions. In our
experiment, we consider the following cell space to be true but unknown:
U* = {u_2, u_6},   (34)
where u2alters the state value of ATM, and u6simultaneously alters the state of
ATM and MDM2.
Toward modeling the possible cell spaces, we consider cell spaces to contain
any subset of one, two, and three elements from the above 7 possible cell actions
in (33). This leads to M = \binom{7}{1} + \binom{7}{2} + \binom{7}{3} = 63 possible cell spaces. Among them, 7 contain a single action, denoted by U_1 to U_7; 21 contain two actions, indicated by U_8 to U_28; and 35 consist of three actions, indicated by U_29 to U_63. Note that the true cell space in (34) is the 17th space (i.e., U* = U_17), which is unknown during the intervention.
The space of intervention (i.e., drugs/therapies) is assumed to be:
A = { a_1 = [0 0 0 0]^T, a_2 = [1 0 0 0]^T, a_3 = [0 0 0 1]^T },   (35)
where the first intervention a1corresponds to no therapy, whereas the second and
third interventions alter the state value of the ATM and MDM2 genes, respectively.
Intervention aims to reduce cell proliferation in cancerous situations and re-
store the system to a normal condition. For the p53-MDM2 network, this can be
achieved by reducing the activation of ATM, WIP1, and MDM2. This can be ex-
pressed through the following intervention reward function:
R^a(x, a, u, x′) = −x′(1) − x′(3) − x′(4).   (36)
The activation of each of ATM, MDM2, and WIP1 yields a negative reward of
-1, resulting in an immediate reward ranging from -3 to 0. The objective of the
intervention is to maximize cumulative intervention rewards by maintaining ATM,
MDM2, and WIP1 in an inactivated state. Conversely, the cell with the opposing
reward seeks to increase the activation of these genes and drive the system closer
to states leading to uncontrolled cell proliferation.
We consider the optimal Nash policy associated with the true cell space (i.e., π^{u*,U*} and π^{a*,U*}) as a Baseline policy. The Baseline provides the best intervention outcomes that could be achieved by any intervention policy (since it assumes full knowledge of the true cell space). The following parameters are used for the numerical experiments: p = 0.05, γ = 0.95, ϵ = 0.01, and the initial state "1011", representing the cancerous condition.
The average reward over 100 independent runs obtained by the proposed Baye-
sian intervention policy, the naive intervention policy, and the Baseline is pre-
sented in Fig. 4(b). As can be seen, the reward gained by the proposed Bayesian
policy becomes closest to the Baseline after a few steps (i.e., a few numbers of
interventions). The performance of the naive intervention policy is notably poor,
with an average 2 out of 3 genes remaining activated. In contrast, the Bayesian
intervention policy demonstrates a significant improvement by effectively deac-
tivating approximately 2.4 of the genes, which highlights the superiority of the
proposed approach. Furthermore, Fig. 4(c) shows the average absolute difference
between the rewards obtained by the Baseline and the proposed Bayesian pol-
icy and the Baseline and the naive intervention policy. As can be seen, a much
smaller absolute reward difference is achieved for the proposed intervention pol-
icy. In particular, the absolute reward difference approaches zero for the proposed
Bayesian policy as time progresses, which means the proposed method achieves
intervention performance (i.e., reward) similar to the Baseline. On the other hand,
one can see the poor performance of the naive policy with a large absolute reward
difference over time.
The prior and average posterior probability over cell spaces is shown in Fig.
5(a). A uniform prior is considered over cell spaces (blue bars). The average pos-
teriors after 20 steps are shown with red bars. As can be seen, the proposed method
has been almost able to discern the true cell space, i.e., U17. Aside from the true
cell space, another cell space (i.e., U12 ={u1,u6}) has a large posterior proba-
bility. This set shares a single cell action with the true cell space, making it prob-
abilistically indistinguishable from the true cell space, given 20 observed states.
Furthermore, the average posterior of the true cell space over time is shown in
Fig. 5(b). The average posterior of the true cell space is increasing over time. The
reason for not approaching 1 is the existence of another cell space, U12, with a
similar space-specific Nash policy.
Figure 5: (a) The prior and posterior (after 20 steps) probability over cell spaces. (b) The
average posterior of the true cell space over time.
Fig. 6(a) represents the probability assigned to each intervention (a1,a2, and
a3) by both the optimal Nash equilibrium policy and the proposed Bayesian policy
in a single run. It can be seen that the proposed Bayesian policy and Baseline
behave similarly after a few initial steps. In fact, the average result reveals that
the Bayesian intervention policy empirically converges toward the optimal Nash
intervention policy after approximately 7 steps.
In this part, the KL divergence is used as a distance measure between the
optimal Nash equilibrium policy and the proposed Bayesian intervention policy.
Fig. 6(b) represents the average KL divergence computed over 100 independent
runs. The results indicate that these two policies become close to each other not
only in individual runs (as shown in Fig. 6(a)), but also on average. This indicates
the empirical convergence of the proposed policy to the optimal Nash policy as
more interventions are taken, and more data are observed.
In this part of the experiment, we investigate the reason for obtaining a large
posterior probability for a non-true cell space in Fig. 5(a). Fig. 7(a) illustrates the
space-specific Nash policies under the true cell space Uand the cell space U12.
The blue bars represent the probability assigned to each intervention at the 16
states under the true cell space’s Nash equilibrium policy, while the red bars rep-
resent the corresponding probabilities under the Nash policy associated with U12.
One can see the similarity between these two policies in different states.
Figure 6: (a) The proposed Bayesian intervention policy and the optimal Nash equilibrium
intervention policy (both stochastic) in one single run. (b) The average KL divergence
between the true Nash intervention policy and the proposed Bayesian intervention policy.
The average rate of state visitations under the proposed Bayesian policy is
shown in Fig. 7(b). One can see that the states {x^1, x^2, x^{10}, x^{12}} are the most
frequently visited states. At these most visited states, we can see the similarity
between the space-specific Nash policies associated with Uand U12 in Fig. 7(a).
This explains the reason behind the similar performance of the proposed Bayesian
policy to the Baseline, despite a large posterior probability for a non-true cell
space.
This section analyzes the impact of the system stochasticity on the perfor-
mance of the proposed Bayesian policy. Fig. 8(a) illustrates the average posterior
of the true cell space under two levels of state stochasticity. The solid line corre-
sponds to the small noise level, characterized by a Bernoulli process noise with
p= 0.001, whereas the dashed line represents a higher noise level with p= 0.15.
The results indicate that when there is less randomness in the system (low stochas-
ticity), the average posterior of the true cell space becomes closer to 1. However,
when the stochasticity level increases (high stochasticity), there is greater uncer-
tainty in determining the true cell space. Therefore, as expected, the proposed
method performs better for less chaotic systems.
Fig. 8(b) shows the average reward obtained by the proposed Bayesian in-
tervention policy and the naive intervention policy under low and high levels of
stochasticity. The average rewards obtained by both policies have more fluctuation
under a larger stochasticity level. The results indicate that the naive intervention
policy performs poorly when the stochasticity level is low. Under a high stochas-
Figure 7: (a) The space-specific Nash equilibrium intervention policy associated with U
and U12. (b) The average state visitation rate in 100 independent runs under the proposed
Bayesian intervention policy.
ticity level, it takes longer for the proposed policy to achieve a performance similar
to that of the optimal Nash equilibrium policy. However, the final average reward
obtained by the proposed policy under low and high stochasticity levels is similar.
This demonstrates that the proposed Bayesian policy exhibits greater robustness
compared to the naive policy. In fact, in more chaotic systems characterized by
higher levels of noise, decision-making becomes more challenging for both cells
and intervention, resulting in similar performance regardless of changes in the
noise level.
This section of numerical experiments investigates the robustness of the pro-
posed policy with respect to different cell and intervention spaces. Table 2 presents
the average reward obtained by various policies across 9 pairs of intervention and
true cell spaces. The Bayesian policy and the Baseline outperform the naive pol-
icy in all cases. For a fixed intervention space (i.e., the results in a single row),
a reduction in the reward can be seen for cell spaces with larger elements. This
is due to the greater power of cells with larger cell space to resist intervention.
Given a fixed true cell space (a column in the table), a stronger intervention space
yields a larger or similar average reward. The improvement in the result is more
significant when the size of the intervention space has increased from 2 to 3, and
less significant once it is increased to 4.
Figure 8: (a) Average posterior of the true cell space for systems with low (p= 0.001) and
high (p= 0.15) levels of stochasticity. (b) The average reward gained by the Bayesian
intervention policy and naive intervention policy under low (p= 0.001) and high (p=
0.15) levels of stochasticity.
7.2. Melanoma Regulatory Network
In this part of the numerical experiment, we evaluate the effectiveness of the
proposed Bayesian intervention policy using the melanoma regulatory network.
Melanoma is a deadly type of skin cancer arising from melanocytes’ malignant
conversion [21, 46, 47]. In this paper, we consider a well-known Boolean network
model of the melanoma network [21], which is widely studied in deriving genomics
interventions. Fig. 9(a) illustrates the regulatory relationships among the genes
in the network. This network consists of a total of 10 genes and 1,024 states. The
state vector shows the activation/inactivation of the following genes in sequential
order: WNT5A, pirin, S100P, RET1, MMP3, PHOC, MART1, HADHB, synu-
clein, and STC2. The network function can be expressed as:
f(x_k) = [f_1(x_k), f_2(x_k), ..., f_10(x_k)]^T,

where f_1 (WNT5A) is a Boolean function of S100P, MMP3, and PHOC; f_2 (pirin) of WNT5A, S100P, and MMP3; f_3 (S100P) of MART1; f_4 (RET1) of WNT5A, pirin, and RET1; f_5 (MMP3) of RET1 and synuclein; f_6 (PHOC) of RET1, MART1, and STC2; f_7 (MART1) of MART1; f_8 (HADHB) of WNT5A, MMP3, and synuclein; f_9 (synuclein) of RET1, MART1, and STC2; and f_10 (STC2) of S100P. The explicit logical expressions for these functions follow the melanoma Boolean network model in [21].

Table 2: Average steady-state reward gained by different policies under different intervention sets and true cell spaces

                                   U* = {u_2}                  U* = {u_2, u_6}             U* = {u_2, u_6, u_7}
A = {a_1, a_2}                     Baseline: −0.402 ± 0.008    Baseline: −1.044 ± 0.013    Baseline: −1.802 ± 0.021
                                   Bayesian: −0.415 ± 0.026    Bayesian: −1.057 ± 0.036    Bayesian: −1.885 ± 0.039
                                   Naive:    −1.319 ± 0.010    Naive:    −2.207 ± 0.012    Naive:    −2.602 ± 0.011

A = {a_1, a_2, a_3}                Baseline: −0.288 ± 0.011    Baseline: −0.627 ± 0.016    Baseline: −0.833 ± 0.026
                                   Bayesian: −0.297 ± 0.028    Bayesian: −0.637 ± 0.041    Bayesian: −0.846 ± 0.052
                                   Naive:    −1.188 ± 0.010    Naive:    −1.941 ± 0.011    Naive:    −2.131 ± 0.009

A = {a_1, a_2, a_3, [0 0 1 1]^T}   Baseline: −0.193 ± 0.008    Baseline: −0.565 ± 0.018    Baseline: −0.725 ± 0.028
                                   Bayesian: −0.209 ± 0.032    Bayesian: −0.602 ± 0.053    Bayesian: −0.744 ± 0.062
                                   Naive:    −1.051 ± 0.012    Naive:    −1.740 ± 0.014    Naive:    −1.969 ± 0.012
The intervention objective is to reduce the activation of two genes: WNT5A and pirin. This can be expressed using the following intervention reward function:

R^a(x, a, u, x′) = 2 − x′(1) − x′(2),   (37)

where the reward of 2 is reached if both genes are inactivated, 1 if one of them is activated, and 0 when both genes are activated.
Figure 9: (a) The pathway diagram for the melanoma regulatory network. (b) The aver-
age reward gained by the Bayesian intervention policy, naive intervention policy, and the
Baseline.
In our experiment, we consider modeling cell responses using single-gene per-
turbations, which lead to 11 distinct cell actions denoted as u1to u11. The action
u1represents no cell stimuli, and u2to u11 correspond to gene 1 to gene 10 stim-
uli, respectively. Similar to the previous experiment, cell spaces are assumed to
contain one, two, or three cell actions, resulting in 231 possible cell spaces. We
use the following true (unknown) cell space in our experiment:
U* = U_48 = {u_5, u_8},   (38)

where the cell has the capability to alter the state value of RET1 or MART1.
The intervention space contains three possible actions as A={a1,a2,a3}, where
a1indicates no intervention, and a2and a3represent interventions targeting RET1
and PHOC, respectively. All the parameters are the same as in the previous exper-
iment. The initial state is randomly selected from states with activated WNT5A
and pirin.
Fig. 9(b) represents the average reward obtained by the proposed Bayesian
intervention policy, naive intervention policy, and the Baseline. The average re-
ward achieved by the Bayesian policy gradually converges towards the Baseline
after a few steps. In contrast, the naive intervention policy performs poorly, with
an average reward of approximately half of the Bayesian policy. This difference
highlights the superiority of the Bayesian approach to probabilistically model the
cell space and fight back against internal cell responses through stochastic policy.
Figure 10: (a) The average posterior of the true cell space over time. (b) The average KL
divergence between the true Nash intervention policy and the proposed Bayesian inter-
vention policy.
Fig. 10(a) illustrates the average posterior of the true cell space over time. As
can be seen, the true cell space has the largest posterior probability, and its prob-
ability approaches 1 after about 15 steps. Furthermore, Fig. 10(b) shows the average KL divergence between the true Nash equilibrium intervention policy and the proposed Bayesian intervention policy. The KL divergence approaching zero indicates the empirical convergence of the Bayesian policy to the optimal Nash policy.
8. Conclusion
This paper develops a Bayesian intervention policy for gene regulatory net-
works (GRNs) that takes into account cell defensive responses. The temporal dy-
namics of GRNs are modeled using a Boolean network with perturbation (BNp)
model, and the interaction between the cell and the intervention is formulated as
a two-player zero-sum game. Given incomplete information about cell responses,
this paper provides a recursive and probabilistic method to capture the posterior
distribution of cell defensive responses. The Bayesian policy is introduced using
the combination of the cell-specific Nash policies for each cell space and the pos-
terior distribution associated with them. Our analytical results demonstrate the
superiority of the proposed intervention policy against several existing interven-
tion techniques. Meanwhile, the superiority of the proposed intervention policy
is demonstrated through comprehensive numerical experiments using the p53-
MDM2 negative feedback loop regulatory network and melanoma network.
Our future studies will explore the extension of the proposed game-theoretic
intervention policy to practical settings, including studying the partial observabil-
ity of the genes’ state through noisy gene-expression data, as well as addressing
scalability issues related to large gene regulatory networks and cell stimuli spaces.
Acknowledgment
The authors acknowledge the support of the National Institute of Health award
1R21EB032480-01, National Science Foundation awards IIS-2311969 and IIS-
2202395, ARMY Research Laboratory award W911NF2320179, ARMY Research
Office award W911NF2110299, and Office of Naval Research award N00014-23-
1-2850.
References
[1] H. Lähdesmäki, I. Shmulevich, O. Yli-Harja, On learning gene regulatory
networks under the Boolean network model, Machine learning 52 (2003)
147–167.
[2] A. Paul, J. Sil, Optimized time-lag differential method for constructing gene
regulatory network, Information Sciences 478 (2019) 222–238.
[3] Ž. Pušnik, M. Mraz, N. Zimic, M. Moškon, Review and assessment of Boolean approaches for inference of gene regulatory networks, Heliyon (2022).
[4] Z. Zou, H. Chen, P. Poduval, Y. Kim, M. Imani, E. Sadredini, R. Cammarota,
M. Imani, BioHD: an efficient genome sequence search platform using hy-
perdimensional memorization, in: Proceedings of the 49th Annual Interna-
tional Symposium on Computer Architecture, 2022, pp. 656–669.
[5] W.-P. Lee, Y.-T. Hsiao, Inferring gene regulatory networks using a hybrid
ga–pso approach with numerical constraints and network decomposition, In-
formation Sciences 188 (2012) 80–99.
[6] M. Alali, M. Imani, Inference of regulatory networks through temporally
sparse data, Frontiers in control engineering 3 (2022) 1017256.
[7] E. R. Dougherty, R. Pal, X. Qian, M. L. Bittner, A. Datta, Stationary and
structural control in gene regulatory networks: basic concepts, International
Journal of Systems Science 41 (2010) 5–16.
[8] A. Yerudkar, E. Chatzaroulas, C. Del Vecchio, S. Moschoyiannis, Sampled-
data control of probabilistic Boolean control networks: A deep reinforce-
ment learning approach, Information Sciences 619 (2023) 374–389.
[9] M. Takizawa, K. Kobayashi, Y. Yamashita, Design of reduced-order and
pinning controllers for probabilistic Boolean networks using reinforcement
learning, Applied Mathematics and Computation 457 (2023) 128211.
[10] S. Dai, B. Li, J. Lu, J. Zhong, Y. Liu, A unified transform method for general
robust property of probabilistic Boolean control networks, Applied Mathe-
matics and Computation 457 (2023) 128137.
[11] J. A. Aledo, E. Goles, M. Montalva-Medel, P. Montealegre, J. C. Valverde,
Symmetrizable Boolean networks, Information Sciences 626 (2023) 787–
804.
[12] A. Ravari, S. F. Ghoreishi, M. Imani, Optimal inference of hidden Markov
models through expert-acquired data, IEEE Transactions on Artificial Intel-
ligence (2024).
[13] C. Su, J. Pang, CABEAN: a software for the control of asynchronous
Boolean networks, Bioinformatics 37 (2021) 879–881.
[14] L. Van den Broeck, M. Gordon, D. Inzé, C. Williams, R. Sozzani, Gene
regulatory network inference: connecting plant biology and mathematical
modeling, Frontiers in genetics 11 (2020) 457.
[15] D. Mercatelli, L. Scalambra, L. Triboli, F. Ray, F. M. Giorgi, Gene regulatory
network inference resources: A practical overview, Biochimica et Biophys-
ica Acta (BBA)-Gene Regulatory Mechanisms 1863 (2020) 194430.
[16] Y. You, Z. Hua, An intelligent intervention strategy for patients to prevent
chronic complications based on reinforcement learning, Information Sci-
ences 612 (2022) 1045–1065.
[17] J. Zhong, Y. Liu, J. Lu, W. Gui, Pinning control for stabilization of Boolean
networks under knock-out perturbation, IEEE Transactions on Automatic
Control 67 (2021) 1550–1557.
[18] S. H. Hosseini, M. Imani, Learning to fight against cell stimuli: A game
theoretic perspective, in: 2023 IEEE Conference on Artificial Intelligence
(CAI), IEEE, 2023, pp. 285–287.
[19] R. Pal, A. Datta, E. R. Dougherty, Optimal infinite-horizon control for prob-
abilistic Boolean networks, IEEE Transactions on Signal Processing 54
(2006) 2375–2387.
[20] B. Faryabi, J.-F. Chamberland, G. Vahedi, A. Datta, E. R. Dougherty, Opti-
mal intervention in asynchronous genetic regulatory networks, IEEE Journal
of Selected Topics in Signal Processing 2 (2008) 412–423.
[21] X. Qian, E. R. Dougherty, Intervention in gene regulatory networks via
phenotypically constrained control policies based on long-run behavior,
IEEE/ACM Transactions on Computational Biology and Bioinformatics 9
(2011) 123–136.
[22] Q. Liu, Y. He, J. Wang, Optimal control for probabilistic Boolean networks
using discrete-time Markov decision processes, Physica A: Statistical Me-
chanics and its Applications 503 (2018) 1297–1307.
[23] M. Imani, U. M. Braga-Neto, Control of gene regulatory networks using
Bayesian inverse reinforcement learning, IEEE/ACM transactions on com-
putational biology and bioinformatics 16 (2018) 1250–1261.
[24] M. Imani, U. M. Braga-Neto, Control of gene regulatory networks with
noisy measurements and uncertain inputs, IEEE Transactions on Control of
Network Systems 5 (2017) 760–769.
[25] M. Imani, U. M. Braga-Neto, Finite-horizon LQR controller for partially-
observed Boolean dynamical systems, Automatica 95 (2018) 172–179.
[26] M. Imani, U. Braga-Neto, Multiple model adaptive controller for partially-
observed Boolean dynamical systems, in: 2017 American Control Confer-
ence (ACC), IEEE, 2017, pp. 1103–1108.
[27] M. Imani, M. Imani, S. F. Ghoreishi, Optimal Bayesian biomarker selection
for gene regulatory networks under regulatory model uncertainty, in: 2022
American Control Conference (ACC), IEEE, 2022, pp. 1379–1385.
[28] M. Imani, U. Braga-Neto, Point-based value iteration for partially-observed
Boolean dynamical systems with finite observation space, in: 2016 IEEE
55th Conference on Decision and Control (CDC), IEEE, 2016, pp. 4208–
4213.
[29] M. Imani, S. F. Ghoreishi, U. M. Braga-Neto, Bayesian control of large mdps
with unknown dynamics in data-poor environments, Advances in neural
information processing systems 31 (2018).
[30] M. Imani, U. Braga-Neto, Optimal control of gene regulatory networks
with unknown cost function, in: 2018 Annual American Control Confer-
ence (ACC), IEEE, 2018, pp. 3939–3944.
[31] I. Shmulevich, E. R. Dougherty, W. Zhang, From Boolean to probabilistic
Boolean networks as models of genetic regulatory networks, Proceedings of
the IEEE 90 (2002) 1778–1792.
[32] L. E. Chai, S. K. Loh, S. T. Low, M. S. Mohamad, S. Deris, Z. Zakaria, A
review on the computational approaches for gene regulatory network con-
struction, Computers in biology and medicine 48 (2014) 55–65.
[33] K. Zhang, Z. Yang, T. Başar, Multi-agent reinforcement learning: A selective
overview of theories and algorithms, Handbook of reinforcement learning
and control (2021) 321–384.
[34] K. Zhang, S. Kakade, T. Basar, L. Yang, Model-based multi-agent RL in
zero-sum Markov games with near-optimal sample complexity, Advances in
Neural Information Processing Systems 33 (2020) 1166–1178.
[35] K. Zhang, Z. Yang, T. Basar, Policy optimization provably converges to
nash equilibria in zero-sum linear quadratic games, Advances in Neural
Information Processing Systems 32 (2019).
[36] I. Bose, B. Ghosh, The p53-MDM2 network: from oscillations to apoptosis, Journal of Biosciences 32 (2007) 991–997.
[37] W. Abou-Jaoudé, M. Chaves, J.-L. Gouzé, A theoretical exploration of birhythmicity in the p53-Mdm2 network, PLoS ONE 6 (2011) e17075.
[38] J. S. Chauhan, M. Hölzel, J.-P. Lambert, F. M. Buffa, C. R. Goding, The MITF regulatory network in melanoma, Pigment Cell & Melanoma Research 35 (2022) 517–533.
[39] A. Ravari, S. F. Ghoreishi, M. Imani, Structure-based inverse reinforcement learning for quantification of biological knowledge, in: 2023 IEEE Conference on Artificial Intelligence (CAI), IEEE, 2023.
[40] A. Ravari, S. F. Ghoreishi, M. Imani, Optimal recursive expert-enabled infer-
ence in regulatory networks, IEEE Control Systems Letters 7 (2023) 1027–
1032.
[41] M. Alali, M. Imani, Reinforcement learning data-acquiring for causal inference of regulatory networks, in: 2023 American Control Conference (ACC), IEEE, 2023.
[42] L. S. Shapley, Stochastic games, Proceedings of the National Academy of Sciences 39 (1953) 1095–1100.
[43] A. Rubinstein, H. W. Kuhn, O. Morgenstern, J. Von Neumann, Theory of Games and Economic Behavior, Princeton University Press, 2007.
[44] E. Batchelor, A. Loewer, G. Lahav, The ups and downs of p53: understanding protein dynamics in single cells, Nature Reviews Cancer 9 (2009) 371–377.
[45] S. Nag, J. Qin, K. S. Srivenugopal, M. Wang, R. Zhang, The MDM2-p53 pathway revisited, The Journal of Biomedical Research 27 (2013) 254–271.
[46] J. Paluncic, Z. Kovacevic, P. J. Jansson, D. Kalinowski, A. M. Merlot, M. L.-
H. Huang, H. C. Lok, S. Sahni, D. J. Lane, D. R. Richardson, Roads to
melanoma: Key pathways and emerging players in melanoma progression
and oncogenic signaling, Biochimica et Biophysica Acta (BBA) - Molecular
Cell Research 1863 (2016) 770–784.
[47] W. Guo, H. Wang, C. Li, Signal pathways of melanoma and targeted therapy,
Signal Transduction and Targeted Therapy 6 (2021) 424.