Multivariate Cluster-based Discretization for Bayesian Network Structure Learning

Ahmed Mabrouk¹, Christophe Gonzales², Karine Jabet-Chevalier¹, and Eric Chojnaki¹

¹ Institut de radioprotection et de sûreté nucléaire, France
{Ahmed.Mabrouk,Karine.Chevalier-Jabet,Eric.Chojnaki}@irsn.fr
² Sorbonne Universités, UPMC Univ Paris 06, CNRS, UMR 7606, LIP6, France
Christophe.Gonzales@lip6.fr
Abstract. While there exist many efficient algorithms in the literature for learning Bayesian networks with discrete random variables, learning when some variables are discrete and others are continuous is still an issue. A common way to tackle this problem is to preprocess datasets by first discretizing continuous variables and then resorting to classical discrete variable-based learning algorithms. However, such a method is inefficient because the conditional dependences/arcs learnt during the learning phase bring valuable information that cannot be exploited by the discretization algorithm, thereby preventing it from being fully effective. In this paper, we advocate discretizing while learning, and we propose a new multivariate discretization algorithm that takes into account all the conditional dependences/arcs learnt so far. Unlike popular discretization methods, ours does not rely on entropy but on clustering, using an EM scheme based on a Gaussian mixture model. Experiments show that our method significantly outperforms the state-of-the-art algorithms.
Keywords: Multivariate discretization, Bayesian network learning.
1 Introduction
For several decades, Bayesian networks (BN) have been successfully exploited for dealing with uncertainties. However, while their learning and inference mechanisms are relatively well understood when they involve only discrete variables, the way they cope with continuous variables is still often unsatisfactory. One actually has to trade off expressiveness against computational complexity: on the one hand, conditional Gaussian models and their mixing with discrete variables are computationally efficient but they definitely lack some expressiveness [12]; on the other hand, mixtures of exponentials, basis functions or polynomials are very expressive but at the expense of tractability [15,20]. In between lie discretization methods which, by converting continuous variables into discrete ones, can provide a satisfactory trade-off between expressiveness and tractability.
Acknowledgment: This work was supported by the French Institute for Radioprotection and Nuclear Safety (IRSN), the Belgian nuclear safety authorities (Bel V) and the European project H2020-ICT-2014-1 #644425 Scissor.
In many real-world applications, BNs are learnt from data and, when there exist continuous attributes, those are often discretized prior to learning, thereby opening the path to exploiting efficient discrete variable-based learning algorithms. However, such an approach is doomed to be ineffective because the conditional dependences/arcs learnt during the learning phase bring valuable information that cannot be exploited by the discretization algorithm, thereby severely limiting its effectiveness. Yet, there exist surprisingly few papers on discretizing while learning, probably because it incurs substantial computational costs and it requires multivariate rather than merely univariate discretization. In this direction, the MDL and Bayesian scores used by search algorithms have been adapted to include multivariate discretizations taking into account the BN structure learnt so far [6,13]. But, to be naturally included into these scores, these discretizations heavily rely on entropy-related maximizations which, as we shall see, are not very well suited for BN learning. In [21], a non-linear dimensionality reduction process called GP-LVM, combined with a Gaussian mixture model-based discretization, is proposed for BN learning. Unfortunately, GP-LVM loses the random variables' semantics and the discretization does not rely on the BN structure. As a consequence, the method does not exploit all the useful information.
Unlike in BN learning, multivariate discretization has often been exploited in Machine Learning for supervised classification tasks [1,2,5,9,22]. But there the goal is only to maximize the classification power w.r.t. one target variable. As such, only the individual correlations of each variable with the target are of interest and, thus, only bivariate discretization is needed. BN structure learning is fundamentally different because the complete set of conditional dependences between all sets of variables is of interest, and multivariate discretization most often involves more than two variables. This makes these approaches not easily transferable to BN learning. In [11], the authors propose a general multivariate discretization relying on genetic algorithms to construct rulesets. However, the approach is very limited because it is designed to cope with only one target, and the domain size of this variable needs to be small to keep the method tractable.

Discretizations have also been exploited in unsupervised learning (UL), but those are essentially univariate [4,8,16,17], which makes them usable per se only as a preprocessing step prior to learning. However, BN learning can be related to UL in the sense that all the BN's variables can be thought of as targets whose discretized values are unobserved. This suggests that some key ideas underlying UL algorithms might be adapted for learning BN structures. Clustering is one such popular framework. In [14], for instance, multivariate discretization is performed by clustering but, unfortunately, independences between random variables are only considered given a latent variable. This considerably limits the range of applications of the method because numerous continuous variables require the latent one to have a large domain size in order to get good-quality discretizations. This approach is therefore limited to small datasets and, by not exploiting the BN structure, it is best suited as a BN learning preprocessing step. Finally, by relying on entropy, its effectiveness for BN learning is certainly not optimal. Here, in contrast, we advocate exploiting clustering methods for discretization w.r.t. BN learning.
More precisely, we propose a new clustering-based approach for multivariate
discretization that takes into account the conditional dependences among vari-
ables discovered during learning. By exploiting clustering rather than entropy,
it avoids the shortcomings induced by the latter and, by taking into account the
dependences between random variables, it significantly increases the quality of
the discretization compared to state-of-the-art clustering approaches.
The rest of the paper is organized as follows. Section 2 recalls BN learning
and discretizations. Then, in Section 3, we describe our approach and justify
its correctness. Its effectiveness is highlighted through experiments in Section 4.
Finally, some concluding remarks are given in Section 5.
2 Basics on BN Structure Learning and Discretization
Uppercase (resp. lowercase) letters X, Z, x, z, represent random variables and
their instantiations respectively. Boldface letters represent sets.
Definition 1. A (discrete) BN is a pair $(G, \theta)$ where $G = (\mathbf{X}, \mathbf{A})$ is a directed acyclic graph (DAG), $\mathbf{X} = \{X_1, \ldots, X_n\}$ represents a set of discrete random variables¹, $\mathbf{A}$ is a set of arcs, and $\theta = \{P(X_i \mid \mathrm{Pa}(X_i))\}_{i=1}^{n}$ is the set of the conditional probability distributions (CPTs) of the variables $X_i$ in $G$ given their parents $\mathrm{Pa}(X_i)$ in $G$. The BN encodes the joint probability over $\mathbf{X}$ as:

$$P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i)). \qquad (1)$$

¹ By abuse of notation, we use $X_i \in \mathbf{X}$ interchangeably to denote a node in the BN and its corresponding random variable.
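As a concrete illustration of Eq. (1), the following minimal Python sketch (all variable names and CPT values are hypothetical, not taken from the paper) evaluates the joint probability of a full instantiation from parent sets and CPTs:

# Minimal illustration of Eq. (1): the joint probability of a full instantiation
# factorizes as the product of each variable's CPT entry given its parents.
# CPTs are stored as dictionaries mapping a parent instantiation tuple to a
# {value: probability} dictionary.

parents = {"A": (), "B": ("A",), "C": ("A", "B")}   # a small hypothetical DAG
cpts = {
    "A": {(): {0: 0.4, 1: 0.6}},
    "B": {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}},
    "C": {(0, 0): {0: 0.5, 1: 0.5}, (0, 1): {0: 0.9, 1: 0.1},
          (1, 0): {0: 0.3, 1: 0.7}, (1, 1): {0: 0.6, 1: 0.4}},
}

def joint_probability(x):
    """P(X = x) = prod_i P(X_i = x_i | Pa(X_i) = x_{Pa(X_i)})."""
    p = 1.0
    for var, pa in parents.items():
        pa_values = tuple(x[q] for q in pa)
        p *= cpts[var][pa_values][x[var]]
    return p

print(joint_probability({"A": 1, "B": 0, "C": 1}))   # 0.6 * 0.2 * 0.7 = 0.084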
To avoid ambiguities between continuous variables and their discretized counterparts, letters adorned with a ring symbol, e.g., $\mathring{X}$, $\mathring{x}$, represent variables and their instantiations prior to discretization; otherwise they are discretized (for discrete variables, $X = \mathring{X}$ and $x = \mathring{x}$). In the rest of the paper, $n$ always denotes the number of variables in the BN, and we assume that $\mathring{X}_1, \ldots, \mathring{X}_l$ are discrete whereas $\mathring{X}_{l+1}, \ldots, \mathring{X}_n$ are continuous. $\mathring{D}$ and $D$ denote the input databases before and after discretization respectively and are assumed to be complete, i.e., they do not contain any missing data. $N$ refers to their number of records.
Given $\mathring{D} = \{\mathring{\mathbf{x}}^{(1)}, \mathring{\mathbf{x}}^{(2)}, \ldots, \mathring{\mathbf{x}}^{(N)}\}$, BN learning consists of finding the DAG $G$ that most likely accounts for the observed data in $\mathring{D}$. When all variables are discrete, i.e., $D = \mathring{D}$, there exist many efficient algorithms in the literature for solving this task. Those can be divided into 3 classes [10]: i) the search-based approaches that look for the structure optimizing a score (BD, BDeu, BIC, AIC, K2, etc.); ii) the constraint-based approaches that exploit statistical independence tests ($\chi^2$, $G^2$, etc.) to find the best structure $G$; and iii) the hybrid methods that exploit a combination of both. In the rest of the paper, we will focus on search-based approaches because our closest competitors, [13,6], belong to this class.
Basically, these algorithms start with a structure $G_0$ (often empty). Then, at each step, they look in the neighborhood of the current structure for another structure, say $G$, that increases the likelihood of structure $G$ given the observations $D$, i.e., $P(G \mid D)$. The neighborhood is often defined as the set of graphs that differ from the current one by only one atomic graphical modification (arc addition, arc deletion, arc reversal). $P(G \mid D)$ is computed locally through the aforementioned scores, their differences stemming essentially from different a priori hypotheses. More precisely, assuming a uniform prior over all structures $G$, we have that:

$$P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)} \;\propto\; P(D \mid G) = \int_{\theta} P(D \mid G, \theta)\, \pi(\theta \mid G)\, d\theta, \qquad (2)$$

where $\theta$ is the set of parameters of the CPTs of a (discrete) BN with structure $G$. Different hypotheses on the prior $\pi$ and on $\theta$ result in the different scores (see, e.g., [18] for the hypotheses underlying the BIC score used later).
When database $\mathring{D}$ contains continuous variables, those can be discretized. A discretization of a continuous variable $\mathring{X}$ is a function $f : \mathbb{R} \rightarrow \{0, \ldots, g\}$ defined by an increasing sequence of $g$ cut points $\{t_1, t_2, \ldots, t_g\}$ such that:

$$f(\mathring{x}) = \begin{cases} 0 & \text{if } \mathring{x} < t_1, \\ k & \text{if } t_k \leq \mathring{x} < t_{k+1}, \text{ for all } k \in \{1, \ldots, g-1\}, \\ g & \text{if } \mathring{x} \geq t_g. \end{cases}$$
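For illustration, this discretization function can be implemented directly from its definition (a Python sketch; the cut points are assumed to be sorted):

import bisect

def discretize(x, cut_points):
    """Map a continuous value x to its interval index given sorted cut points
    t_1 < ... < t_g: returns 0 if x < t_1, k if t_k <= x < t_{k+1}, g if x >= t_g."""
    return bisect.bisect_right(cut_points, x)

cut_points = [1.5, 3.0]            # g = 2 cut points define 3 intervals
print([discretize(v, cut_points) for v in (0.2, 1.5, 2.9, 4.1)])   # [0, 1, 1, 2]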
Let $F$ be a set of discretization functions, one for each continuous variable. Then, given $F$, if $D$ denotes the (unique) database resulting from the discretization of $\mathring{D}$ by $F$, Eq. (2) becomes:

$$P(G \mid \mathring{D}, F) \propto P(\mathring{D} \mid G, F) = P(D \mid \mathring{D}, G, F)\, P(\mathring{D} \mid G, F) = P(\mathring{D} \mid D, G, F)\, P(D \mid G, F)$$

(the first equality holds since $D$ is fully determined by $\mathring{D}$ and $F$, so that $P(D \mid \mathring{D}, G, F) = 1$). Assuming that all databases $\mathring{D}$ compatible with $D$ given $F$ are equiprobable, we thus have that:

$$P(G \mid \mathring{D}, F) \propto P(D \mid G, F) = \int_{\theta} P(D \mid G, F, \theta)\, \pi(\theta \mid F, G)\, d\theta. \qquad (3)$$

BN structure learning therefore amounts to finding the structure $G^*$ such that $G^* = \mathrm{Argmax}_G\, P(G \mid D, F)$. Note that $P(D \mid G, F, \theta)$ corresponds to a classical score over discrete data, and $\pi(\theta \mid F, G)$ is the prior over the parameters of the BN given $F$. Eq. (3) is precisely the one used when discretization is performed as a preprocessing step before learning.
When discretization is performed while learning, as in [6,13], both the structure and the discretization should be optimized simultaneously. In other words, the problem consists of computing $\mathrm{Argmax}_{F,G}\, P(G, F \mid \mathring{D})$, where finding the best discretization amounts to finding the best set of cut points (including the best size for this set) for each continuous random variable. And we have that:

$$P(G, F \mid \mathring{D}) = P(G \mid F, \mathring{D})\, P(F \mid \mathring{D}) \;\propto\; P(F \mid \mathring{D}) \int_{\theta} P(D \mid G, F, \theta)\, \pi(\theta \mid F, G)\, d\theta. \qquad (4)$$

As can be seen, the resulting equation combines the classical score on the discretized data (the integral) with a score $P(F \mid \mathring{D})$ for the discretization algorithm itself. The logarithm of the latter corresponds to what [6] and [13] call $DL_{\Lambda}(\Lambda) + DL_{\mathring{D} \rightarrow D}(\mathring{D}, \Lambda)$ and $S_c(\Lambda; \mathring{D})$, respectively.
Input: a database $\mathring{D}$, an initial graph $G$, a score function $sc$ on discrete variables
Output: the structure $G$ of the Bayesian network
1 repeat
2     Find the best discretization $F$ given $G$
3     $\{X_{l+1}, \ldots, X_n\} \leftarrow$ discretize variables $\{\mathring{X}_{l+1}, \ldots, \mathring{X}_n\}$ given $F$
4     $G \leftarrow$ $G$'s neighbor that maximizes scoring function $sc$ w.r.t. $\{X_1, \ldots, X_n\}$
5 until $G$ maximizes the score;
Algorithm 1: Our structure learning architecture.
3 A New Multivariate Discretization-Learning Algorithm
As mentioned earlier, we believe that taking into account the conditional dependences between random variables is important to provide high-quality discretizations. Our approach thus follows Eq. (4) and our goal is to compute $\mathrm{Argmax}_{F,G}\, P(G, F \mid \mathring{D})$. Optimizing jointly over $F$ and $G$ is too computationally intensive a task to be usable in practice. Fortunately, we can approximate it efficiently through an alternating (coordinate-wise) optimization scheme, optimizing over $F$ given a fixed structure $G$ and over $G$ given a fixed discretization $F$. This suggests the BN structure learning method described as Algo. 1.
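A high-level Python sketch of this alternation is given below; the discretization and neighbor-search routines are passed in as callables, since they are the components detailed in the remainder of the paper (all names are hypothetical, not the authors' implementation):

def learn_structure(continuous_db, initial_graph, discretize_given_graph,
                    best_scored_neighbor, max_iter=100):
    """Skeleton of Algo. 1: alternate between (i) discretizing the continuous
    variables given the current graph and (ii) one score-based search step on
    the discretized data, until the score stops improving."""
    graph, best_score = initial_graph, float("-inf")
    cut_points = None
    for _ in range(max_iter):
        # Lines 2-3: best discretization F given G, then discretize the database
        cut_points, discrete_db = discretize_given_graph(continuous_db, graph)
        # Line 4: neighbor of G (arc addition/deletion/reversal) maximizing the score
        graph, score = best_scored_neighbor(graph, discrete_db)
        if score <= best_score:        # Line 5: stop when the score no longer improves
            break
        best_score = score
    return graph, cut_points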
Multivariate discretization is much more time consuming than univariate discretization. Line 2 could thus incur a strong overhead to the learning algorithm because the discretization search space increases exponentially with the number of variables to discretize. To alleviate this problem without sacrificing too much accuracy, we suggest a local search algorithm that iteratively fixes the discretizations of all the continuous variables but one and optimizes the discretization of the latter (given the other variables) until some stopping criterion is met. Since discretizations are optimized one continuous variable at a time, the combinatorics and the computation time are significantly limited. Line 2 can thus be detailed as Algo. 2.
Input: a database $\mathring{D}$, a graph $G$, a scoring function $sc$ on discrete variables
Output: a discretization $F$
1 repeat
2     $i_0 \leftarrow$ select an element in $\{l+1, \ldots, n\}$
3     Discretize $\mathring{X}_{i_0}$ given $G$ and $\{X_1, \ldots, X_{i_0-1}, X_{i_0+1}, \ldots, X_n\}$
4 until stopping condition;
Algorithm 2: One-variable discretization architecture.

3.1 Discretization Criterion

To implement Algo. 2, a discretization criterion to be optimized is needed. Basic ideas include trying to find cut points minimizing the discrepancy between the frequencies or the sizes of the intervals $[t_k, t_{k+1})$. A more sophisticated approach
consists of limiting as much as possible the quantity of information lost after discretization or, equivalently, maximizing the quantity of information remaining after discretization. This naturally calls for maximizing an entropy. This is essentially what our closest competitors [6,13] do.
But entropy may not be the most appropriate measure when dealing with BNs. Actually, consider a variable $A$ with domain $\{a_1, a_2, a_3\}$. Then, it is possible that, for some BN, $P(A = a_1) = \frac{1}{6}$, $P(A = a_2) = \frac{1}{3}$ and $P(A = a_3) = \frac{1}{2}$. With a sufficiently large database $D$, the frequencies of observations of $a_1, a_2, a_3$ in $D$ would certainly lead to estimate $P(A) \approx [\frac{1}{6}, \frac{1}{3}, \frac{1}{2}]$. Now, assume that the observations in $D$ are noisy, say with a Gaussian noise with an infinitely small variance, as in Fig. 1. Then, after discretization, we should expect to have 3 intervals with respective frequencies $\frac{1}{6}$, $\frac{1}{3}$ and $\frac{1}{2}$, i.e., intervals similar to $(-\infty, t_1)$, $[t_1, t_2)$ and $[t_2, +\infty)$ of Fig. 1. However, w.r.t. entropy, the best discretization corresponds to intervals $(-\infty, s_1)$, $[s_1, s_2)$ and $[s_2, +\infty)$ of Fig. 1, whose frequencies are all approximately equal to $\frac{1}{3}$ (entropy is maximal for equiprobable intervals). Therefore, whatever the infinitesimal noise added to the data in $D$, an entropy-based discretization produces a discretized variable $A$ with distribution $[\frac{1}{3}, \frac{1}{3}, \frac{1}{3}]$ instead of $[\frac{1}{6}, \frac{1}{3}, \frac{1}{2}]$. This suggests that entropy is probably not the best criterion for discretizing continuous variables for BN learning.

Fig. 1. Discretization: entropy vs. clustering.
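The following small numerical sketch (Python/numpy; the sample size, support points and noise level are arbitrary) reproduces this phenomenon: equal-frequency cut points, which maximize the entropy, yield interval frequencies close to [1/3, 1/3, 1/3], whereas cut points placed between the three noise clusters recover [1/6, 1/3, 1/2]:

import numpy as np

rng = np.random.default_rng(0)
N = 60000
# discrete variable A with P(a1, a2, a3) = (1/6, 1/3, 1/2), observed with a tiny
# Gaussian noise around arbitrary support points 1.0, 2.0, 3.0
values = rng.choice([1.0, 2.0, 3.0], size=N, p=[1/6, 1/3, 1/2])
x = values + rng.normal(scale=1e-3, size=N)

# entropy-maximizing discretization: equal-frequency cut points
t_ent = np.quantile(x, [1/3, 2/3])
# clustering-like discretization: cut points between the noise clusters
t_clu = np.array([1.5, 2.5])

def interval_freqs(x, cuts):
    return np.bincount(np.digitize(x, cuts), minlength=len(cuts) + 1) / len(x)

print(interval_freqs(x, t_ent))   # ~ [0.333, 0.333, 0.333]
print(interval_freqs(x, t_clu))   # ~ [0.167, 0.333, 0.500]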
Fig. 1 suggests that clustering would probably be more appropriate: here, one cluster/interval per Gaussian would provide a better discretization. In this paper, we assume that, within every interval, each continuous random variable, say $\mathring{X}_{i_0}$, is distributed w.r.t. a truncated Gaussian. Over its whole domain of definition, it is thus distributed as a mixture of truncated Gaussians, the weights of the latter being precisely the CPT of $X_{i_0}$ in the discrete BN. In particular, if $\mathring{X}_{i_0}$ has some parents, there are as many mixtures as the product of the domain sizes of the parents. The parameters of such a discretization scheme are therefore: i) a set of $g$ cut points (defining $g+1$ intervals), and ii) a mean and a variance for each interval (defining its Gaussian). Fig. 1 actually illustrates the fact that the means of the Gaussians need not correspond to the middles of the intervals. For instance, the mean of the third Gaussian is $a_3$ whereas the third interval, $[t_2, +\infty)$, has no finite middle. Here, even finite interval middles, like that of $[t_1, t_2)$, do not correspond to the means of the Gaussians.
For each continuous random variable $\mathring{X}_{i_0}$, this joint optimization problem is really hard due to the normalization requirement that the integral of the truncated Gaussian over each interval must be equal to 1 (which cannot be expressed using closed-form formulas). Therefore, to alleviate the discretization computational burden, we propose to approximate the computation of the cut points, means and variances using a two-step process: first, we approximate the density of the joint distribution of $\{X_1, \ldots, X_{i_0-1}, \mathring{X}_{i_0}, X_{i_0+1}, \ldots, X_n\}$ as a mixture of untruncated Gaussians and we determine by maximum likelihood the number of cut points as well as the means and variances of the Gaussians. This can easily be done by an Expectation-Maximization (EM) approach. Then, in a second step, we compute the best cut points w.r.t. the Gaussians. As each Gaussian is associated with an interval, the parts of the Gaussian outside the interval can be considered as a loss of information, and we therefore look for cut points that minimize this loss. Now, let us delve into the details of the approach.
3.2 Discretization Exploiting the BN Structure
For the first discretization step of $\mathring{X}_{i_0}$, we estimate the number $g$ of cut points and the Gaussians' means and variances. Assume that structure $G$ is fixed and that all the other variables are discrete. The density over all the variables, $p(\mathring{\mathbf{X}})$, is equal to $p(\mathring{X}_{i_0} \mid \mathrm{Pa}(\mathring{X}_{i_0})) \prod_{i \neq i_0} P(X_i \mid \mathrm{Pa}(X_i))$, where $p(\mathring{X}_{i_0} \mid \mathrm{Pa}(\mathring{X}_{i_0}))$ represents a mixture of Gaussians for each value of $\mathring{X}_{i_0}$'s parents (there are a finite number of values since all the variables but $\mathring{X}_{i_0}$ are discrete). $P(X_i \mid \mathrm{Pa}(X_i))$ should be the CPT of discrete variable $X_i$ but, unfortunately, it is not well defined if $\mathring{X}_{i_0} \in \mathrm{Pa}(X_i)$ because, in this case, $\mathrm{Pa}(X_i)$ has infinitely many values.
This is a serious issue since this CPT is used in the computation of $P(D \mid G, F, \theta)$ of Eq. (4). Fortunately, this problem can be overcome by enforcing that $\mathring{X}_{i_0}$ has no child while guaranteeing that the density remains unchanged. Actually, in [19], an arc reversal operator is provided that, when applied, never alters the density/probability distribution. More precisely, when reversing arc $X \rightarrow Y$, Shachter showed that if all the parents of $X$ are added to $Y$ and all the parents of $Y$ except $X$ are added to $X$, then the resulting BN encodes the same distribution. As an example of these transformations, reversing arc $X \rightarrow V$ of Fig. 2.(a) results in Fig. 2.(b) and, then, reversing arc $X \rightarrow W$ results in Fig. 2.(c).

Fig. 2. Shachter's arc reversals.
Therefore, to enforce that $\mathring{X}_{i_0}$ has no child, if $\{i_1, \ldots, i_c\}$ denotes the set of indices of the children of $\mathring{X}_{i_0}$, sorted by a topological order of $G$, then, by sequentially reversing all the arcs $\mathring{X}_{i_0} \rightarrow X_{i_j}$, $j = 1, \ldots, c$, we get:

$$p(\mathring{\mathbf{X}}) = p(\mathring{X}_{i_0} \mid \mathrm{Pa}(\mathring{X}_{i_0})) \times \prod_{i \notin \{i_0, \ldots, i_c\}} P(X_i \mid \mathrm{Pa}(X_i)) \times \prod_{j=1}^{c} P(X_{i_j} \mid \mathrm{Pa}(X_{i_j})),$$
$$p(\mathring{\mathbf{X}}) = p(\mathring{X}_{i_0} \mid \mathrm{MB}(\mathring{X}_{i_0})) \times \prod_{i \notin \{i_0, \ldots, i_c\}} P(X_i \mid \mathrm{Pa}(X_i)) \times \prod_{j=1}^{c} P\Big(X_{i_j} \,\Big|\, \bigcup_{h=1}^{j} \big(\mathrm{Pa}(X_{i_h}) \setminus \{\mathring{X}_{i_0}\}\big) \cup \mathrm{Pa}(\mathring{X}_{i_0})\Big),$$
where $\mathrm{MB}(\mathring{X}_{i_0})$ is the Markov blanket of $\mathring{X}_{i_0}$ in $G$:

Definition 2. The Markov blanket of any node in $G$ is the set of its parents, its children and the other parents of its children.
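For illustration, the Markov blanket can be computed directly from the parent sets of the DAG; a minimal Python sketch on a small hypothetical graph (unrelated to Fig. 2):

# A DAG encoded as a dictionary mapping each node to the set of its parents.
parents = {
    "A": set(), "B": set(), "X": {"A", "B"},
    "C": {"X"}, "D": {"X", "E"}, "E": set(),
}

def markov_blanket(node, parents):
    """Parents of node, children of node, and the other parents of its children."""
    children = {n for n, pa in parents.items() if node in pa}
    co_parents = set().union(*(parents[c] for c in children)) - {node}
    return parents[node] | children | co_parents

print(sorted(markov_blanket("X", parents)))   # ['A', 'B', 'C', 'D', 'E']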
Note that, in the last expression of $p(\mathring{\mathbf{X}})$, only the first term involves $\mathring{X}_{i_0}$; hence all the other CPTs are well defined (they are finite CPTs). As a side effect, only $p(\mathring{X}_{i_0} \mid \mathrm{MB}(\mathring{X}_{i_0}))$ needs to be taken into account to discretize $\mathring{X}_{i_0}$, since none of the other terms is related to $\mathring{X}_{i_0}$. It shall be noted here that these arc reversals are applied only for determining the parameters of the discretization, i.e., the set of cut points and the means and variances of the Gaussians; they are never used to learn the BN structure. Now, let us see how the parameters of the mixture of Gaussians $p(\mathring{X}_{i_0} \mid \mathrm{MB}(\mathring{X}_{i_0}))$ maximizing the likelihood of dataset $\mathring{D}$ can easily be estimated using an EM algorithm.
3.3 Parameter Estimation by an EM Algorithm
Let $q_{i_0}$ represent the (finite) number of values of $\mathrm{MB}(\mathring{X}_{i_0})$. For simplicity, we will denote by $\{1, \ldots, q_{i_0}\}$ the set of values of the joint discrete random variable $\mathrm{MB}(\mathring{X}_{i_0})$. Let $g$ denote the number of cut points in the discretization and let $\{\mathcal{N}(\mu_k, \sigma_k) : k \in \{0, \ldots, g\}\}$ be the corresponding set of Gaussians. Then:

$$p(\mathring{X}_{i_0} = \mathring{x}_{i_0} \mid \mathrm{MB}(\mathring{X}_{i_0}) = j) = \sum_{k=0}^{g} \pi_{jk}\, f(\mathring{x}_{i_0} \mid \theta_k) \qquad \forall j \in \{1, \ldots, q_{i_0}\},$$

where $f(\cdot \mid \theta_k)$ represents the density of the normal distribution of parameters $\theta_k = (\mu_k, \sigma_k)$, and $\pi_{jk}$ represents the weights of the mixture (with the constraints that $\pi_{jk} \geq 0$ for all $j, k$ and $\sum_{k=0}^{g} \pi_{jk} = 1$ for all $j$). Remember that each value of $\mathrm{MB}(\mathring{X}_{i_0})$ induces its own set of weights $\{\pi_{j0}, \ldots, \pi_{jg}\}$. Now, we propose to estimate parameters $\theta_k$ from $\mathring{D}$ by maximum likelihood. For this, EM is well known to efficiently provide good approximations [3] (due to the mixture, direct maximum likelihood estimation is actually hard). Assuming that the data in $\mathring{D}$ are i.i.d., the log-likelihood of $\mathring{D}$ given $\Theta = \bigcup_{k=0}^{g} \big(\bigcup_{j=1}^{q_{i_0}} \{\pi_{jk}\} \cup \{\theta_k\}\big)$ is equal to:

$$L(\mathring{D} \mid \Theta) = \sum_{m=1}^{N} \log p\big(\mathring{X}_{i_0} = \mathring{x}_{i_0}^{(m)} \,\big|\, \mathrm{MB}(\mathring{x}_{i_0})^{(m)}, \Theta\big),$$

where $\mathring{x}_{i_0}^{(m)}$ represents the observed value of $\mathring{X}_{i_0}$ in the $m$th record of $\mathring{D}$. Thus:

$$L(\mathring{D} \mid \Theta) = \sum_{j=1}^{q_{i_0}} \;\sum_{m : \mathrm{MB}(\mathring{x}_{i_0})^{(m)} = j} \log \Bigg[ \sum_{k=0}^{g} \pi_{jk}\, f\big(\mathring{x}_{i_0}^{(m)} \mid \theta_k\big) \Bigg]. \qquad (5)$$
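For concreteness, Eq. (5) translates directly into code (a numpy sketch with hypothetical array names: x holds the N observed values of the variable being discretized, mb the index j of the Markov-blanket configuration of each record, and pi, mu, sigma the mixture parameters):

import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def log_likelihood(x, mb, pi, mu, sigma):
    """Eq. (5): sum over records of log sum_k pi[j, k] * f(x_m | mu_k, sigma_k),
    where j = mb[m] is the Markov-blanket configuration of record m.
    Assumed shapes: x, mb -> (N,), pi -> (q, g+1), mu, sigma -> (g+1,)."""
    dens = gaussian_pdf(x[:, None], mu[None, :], sigma[None, :])    # (N, g+1)
    mixture = np.sum(pi[mb] * dens, axis=1)                         # (N,)
    return np.sum(np.log(mixture))

# toy usage with arbitrary parameters
x = np.array([0.1, 0.2, 2.1, 1.9])
mb = np.array([0, 0, 1, 1])                  # MB configuration index j per record
pi = np.array([[0.8, 0.2], [0.3, 0.7]])      # one weight vector per MB configuration
mu, sigma = np.array([0.0, 2.0]), np.array([0.5, 0.5])
print(log_likelihood(x, mb, pi, mu, sigma))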
To solve $\mathrm{Argmax}_{\Theta}\, L(\mathring{D} \mid \Theta)$, EM [3] iteratively alternates expectations (E-step) and maximizations (M-step) until convergence toward a local maximum, which is guaranteed to correspond to the Argmax we look for due to the concavity of the log-likelihood function. In this paper, we just need to apply the standard EM, considering for weights $\pi_{jk}$ only the records in the database that correspond to $\mathrm{MB}(\mathring{x}_{i_0})^{(m)} = j$. More precisely, for each record of $\mathring{D}$, let $Z^{(m)}$ be a random variable whose domain is $\{0, \ldots, g\}$ and such that $Z^{(m)} = k$ if and only if observation $\mathring{x}_{i_0}^{(m)}$ has been generated from the $k$th Gaussian. Let $Q_m^t(Z^{(m)}) = P(Z^{(m)} \mid \mathring{x}_{i_0}^{(m)}, \Theta^t)$, i.e., $Q_m^t(Z^{(m)})$ represents the distribution that describes, at the $t$th step of the algorithm, which Gaussian $\mathring{x}_{i_0}^{(m)}$ is believed to have been generated by. Then, EM is described in Algo. 3.

In the EM algorithm, only the M-step can be computationally intensive. Fortunately, here, we can derive in closed form the optimal values of Line 4:
Proposition 1. At the E-step, probability

$$Q_m^{t+1}(k) = \frac{\pi_{jk}^t\, f\big(\mathring{x}_{i_0}^{(m)} \mid \theta_k^t\big)}{\sum_{k'=0}^{g} \pi_{jk'}^t\, f\big(\mathring{x}_{i_0}^{(m)} \mid \theta_{k'}^t\big)},$$

where $j = \mathrm{MB}(\mathring{x}_{i_0})^{(m)}$, and $\pi_{jk}^t$ and $\theta_k^t$ are the weights, means and variances in $\Theta^t$. The optimal parameters of the M-step are respectively:

$$\pi_{jk}^{t+1} = \frac{\sum_{m : \mathrm{MB}(\mathring{x}_{i_0})^{(m)} = j} Q_m^{t+1}(k)}{\sum_{m : \mathrm{MB}(\mathring{x}_{i_0})^{(m)} = j} \sum_{k'=0}^{g} Q_m^{t+1}(k')},$$

$$\mu_k^{t+1} = \frac{\sum_{m=1}^{N} Q_m^{t+1}(k)\, \mathring{x}_{i_0}^{(m)}}{\sum_{m=1}^{N} Q_m^{t+1}(k)}, \qquad \sigma_k^{t+1} = \sqrt{\frac{\sum_{m=1}^{N} Q_m^{t+1}(k)\big(\mathring{x}_{i_0}^{(m)} - \mu_k^{t+1}\big)^2}{\sum_{m=1}^{N} Q_m^{t+1}(k)}}.$$
Using Algo. 3 with the formulas of Proposition 1, it is thus possible to determine the means and variances of the Gaussians. However, our ultimate goal is not to compute them but to exploit them to discretize variable $\mathring{X}_{i_0}$, i.e., to determine the best cut points $t_1, \ldots, t_g$. Let us see how this task can be performed.
Input: a database $\mathring{D}$, a number $g$ of cut points
Output: an optimal set of parameters $\Theta$
1 Select (randomly) an initial value $\Theta^0$
2 repeat
      // E-step (expectation)
3     $Q_m^{t+1}(Z^{(m)}) \leftarrow P(Z^{(m)} \mid \mathring{x}_{i_0}^{(m)}, \Theta^t) \quad \forall m \in \{1, \ldots, N\}$
      // M-step (maximization)
4     $\Theta^{t+1} \leftarrow \mathrm{Argmax}_{\Theta} \displaystyle\sum_{j=1}^{q_{i_0}} \;\sum_{m : \mathrm{MB}(\mathring{x}_{i_0})^{(m)} = j} \;\sum_{k=0}^{g} Q_m^{t+1}(k) \log\Bigg[\frac{\pi_{jk}\, f\big(\mathring{x}_{i_0}^{(m)} \mid \theta_k\big)}{Q_m^{t+1}(k)}\Bigg]$
5 until convergence;
Algorithm 3: The EM algorithm.
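Algo. 3 with the closed-form updates of Proposition 1 can be sketched as follows (Python/numpy, same hypothetical array layout as in the log-likelihood sketch above; initialization and the convergence test are deliberately simplistic):

import numpy as np

def gaussian_pdf(x, mu, sigma):
    # same helper as in the previous sketch
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_mixture(x, mb, q, g, n_iter=200, seed=0):
    """Fit p(X | MB = j) = sum_k pi[j, k] * N(mu_k, sigma_k) by EM,
    using the closed-form E- and M-steps of Proposition 1."""
    rng = np.random.default_rng(seed)
    pi = np.full((q, g + 1), 1.0 / (g + 1))
    mu = rng.choice(x, size=g + 1, replace=False).astype(float)
    sigma = np.full(g + 1, x.std() + 1e-3)
    for _ in range(n_iter):
        # E-step: Q[m, k] = P(Z^(m) = k | x_m, Theta^t)
        dens = pi[mb] * gaussian_pdf(x[:, None], mu[None, :], sigma[None, :])
        Q = dens / dens.sum(axis=1, keepdims=True)
        # M-step (Proposition 1): pi per MB configuration, then mu and sigma
        for j in range(q):
            rows = Q[mb == j]
            if len(rows):
                pi[j] = rows.sum(axis=0) / rows.sum()
        weights = Q.sum(axis=0)
        mu = (Q * x[:, None]).sum(axis=0) / weights
        sigma = np.sqrt((Q * (x[:, None] - mu) ** 2).sum(axis=0) / weights) + 1e-9
    return pi, mu, sigma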
3.4 Determination of the Cut Points
As mentioned at the end of Subsection 3.1, each Gaussian $\mathcal{N}(\mu_k, \sigma_k)$ is associated with an interval $[t_k, t_{k+1})$ (without loss of generality, we consider here that the $\mu_k$'s resulting from the EM algorithm are sorted in increasing order) and the parts of the Gaussian outside the interval can be considered as a loss of information. The optimal set of cut points $\hat{T} = \{\hat{t}_1, \ldots, \hat{t}_g\}$ is thus that which minimizes this loss. In other words, it is equal to:

$$\hat{T} = \mathrm{Argmin}_{\{t_1, \ldots, t_g\}} \sum_{k=1}^{g} \left[ \int_{t_k}^{+\infty} f(x \mid \theta_{k-1})\, dx + \int_{-\infty}^{t_k} f(x \mid \theta_k)\, dx \right],$$
where $\theta_k$ represents the pair $(\mu_k, \sigma_k)$. As each Gaussian $\mathcal{N}(\mu_k, \sigma_k)$ is associated with interval $[t_k, t_{k+1})$, we can assume that $\hat{t}_k \in [\mu_{k-1}, \mu_k)$, for all $k$. Therefore:

$$\hat{T} = \left\{ \mathrm{Argmin}_{t_k \in [\mu_{k-1}, \mu_k)} \int_{t_k}^{+\infty} f(x \mid \theta_{k-1})\, dx + \int_{-\infty}^{t_k} f(x \mid \theta_k)\, dx \;:\; k \in \{1, \ldots, g\} \right\}. \qquad (6)$$

All the $\hat{t}_k$ can thus be determined independently. In addition, as shown below, their values are the solution of a quadratic equation:
Proposition 2. Let $u(t_k)$ represent the sum of the integrals in Eq. (6). Let $\alpha_k$ be a solution (if any) within interval $(\mu_{k-1}, \mu_k)$ of the quadratic equation in $t_k$:

$$t_k^2 \left(\frac{1}{\sigma_{k-1}^2} - \frac{1}{\sigma_k^2}\right) + 2 t_k \left(\frac{\mu_k}{\sigma_k^2} - \frac{\mu_{k-1}}{\sigma_{k-1}^2}\right) + \left(\frac{\mu_{k-1}^2}{\sigma_{k-1}^2} - \frac{\mu_k^2}{\sigma_k^2}\right) - 2\log\frac{\sigma_k}{\sigma_{k-1}} = 0. \qquad (7)$$

Then $\hat{t}_k$ is, among $\{\mu_{k-1}, \mu_k, \alpha_k\}$, the element with the lowest value of $u(\cdot)$ (which can be quickly approximated using a table of the Normal distribution).
Proof. Let $g(\cdot)$ and $h(\cdot)$ be two functions such that $\partial g(x)/\partial x = f(x \mid \theta_{k-1})$ and $\partial h(x)/\partial x = f(x \mid \theta_k)$. Then:

$$\hat{t}_k = \mathrm{Argmin}_{t_k \in [\mu_{k-1}, \mu_k)} u(t_k) = \mathrm{Argmin}_{t_k \in [\mu_{k-1}, \mu_k)} \left[ \int_{t_k}^{+\infty} \frac{\partial g(x)}{\partial x}\, dx + \int_{-\infty}^{t_k} \frac{\partial h(x)}{\partial x}\, dx \right] = \mathrm{Argmin}_{t_k \in [\mu_{k-1}, \mu_k)} \left[ h(t_k) - g(t_k) + \lim_{t \to +\infty} g(t) - \lim_{t \to -\infty} h(t) \right].$$
Let us relax the optimization problem and try to find the Argmin over $\mathbb{R}$. Then the min is obtained when $\partial u(t_k)/\partial t_k = 0$ or, equivalently, when $\partial (h(t_k) - g(t_k))/\partial t_k = f(t_k \mid \theta_k) - f(t_k \mid \theta_{k-1}) = 0$. Since $f(\cdot \mid \theta)$ represents the density of the Normal distribution of parameters $\theta$, this is equivalent to:

$$-\frac{1}{\sqrt{2\pi}\,\sigma_{k-1}} \exp\left[-\frac{1}{2}\left(\frac{t_k - \mu_{k-1}}{\sigma_{k-1}}\right)^2\right] + \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left[-\frac{1}{2}\left(\frac{t_k - \mu_k}{\sigma_k}\right)^2\right] = 0,$$

or, equivalently:

$$\frac{\sigma_k}{\sigma_{k-1}} = \frac{\exp\left[-\frac{1}{2}\left(\frac{t_k - \mu_k}{\sigma_k}\right)^2\right]}{\exp\left[-\frac{1}{2}\left(\frac{t_k - \mu_{k-1}}{\sigma_{k-1}}\right)^2\right]} = \exp\left[\frac{1}{2}\left(\frac{t_k - \mu_{k-1}}{\sigma_{k-1}}\right)^2 - \frac{1}{2}\left(\frac{t_k - \mu_k}{\sigma_k}\right)^2\right],$$
which, by a log transformation, is equivalent to:

$$2 \log \frac{\sigma_k}{\sigma_{k-1}} = \frac{t_k^2}{\sigma_{k-1}^2} - \frac{2\mu_{k-1} t_k}{\sigma_{k-1}^2} + \frac{\mu_{k-1}^2}{\sigma_{k-1}^2} - \frac{t_k^2}{\sigma_k^2} + \frac{2\mu_k t_k}{\sigma_k^2} - \frac{\mu_k^2}{\sigma_k^2}.$$

This corresponds precisely to Eq. (7). So, to summarize, if the optimal solution lies inside interval $(\mu_{k-1}, \mu_k)$, then it satisfies Eq. (7). Otherwise, $u(t_k)$ is either strictly increasing or strictly decreasing within $(\mu_{k-1}, \mu_k)$, which implies that the optimal solution for $\hat{t}_k$ is either $\mu_{k-1}$ or $\mu_k$, which completes the proof.
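In code, Eq. (7) together with the candidate comparison of Proposition 2 gives the following Python sketch for a single cut point between two consecutive Gaussians (the candidate minimizing the lost mass u(.) is kept, and u is evaluated with the standard normal CDF):

import math

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lost_mass(t, mu0, s0, mu1, s1):
    """u(t): mass of N(mu0, s0) above t plus mass of N(mu1, s1) below t."""
    return (1.0 - std_normal_cdf((t - mu0) / s0)) + std_normal_cdf((t - mu1) / s1)

def cut_point(mu0, s0, mu1, s1):
    """Best cut point between consecutive Gaussians (mu0 <= mu1), cf. Proposition 2:
    solve the quadratic Eq. (7) and keep, among {mu0, mu1, valid roots},
    the candidate with the smallest lost mass u(.)."""
    a = 1.0 / s0 ** 2 - 1.0 / s1 ** 2
    b = 2.0 * (mu1 / s1 ** 2 - mu0 / s0 ** 2)
    c = mu0 ** 2 / s0 ** 2 - mu1 ** 2 / s1 ** 2 - 2.0 * math.log(s1 / s0)
    candidates = [mu0, mu1]
    if abs(a) < 1e-12:                    # equal variances: Eq. (7) becomes linear
        if abs(b) > 1e-12:
            candidates.append(-c / b)
    else:
        disc = b * b - 4.0 * a * c
        if disc >= 0.0:
            candidates += [(-b - math.sqrt(disc)) / (2.0 * a),
                           (-b + math.sqrt(disc)) / (2.0 * a)]
    candidates = [t for t in candidates if mu0 <= t <= mu1]
    return min(candidates, key=lambda t: lost_mass(t, mu0, s0, mu1, s1))

print(cut_point(0.0, 1.0, 4.0, 1.0))      # symmetric case: the midpoint 2.0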
3.5 Score and Number of Cut Points
To complete the description of the algorithm, it remains to determine the number of cut points. Of course, the higher the number of cut points, the higher the likelihood but the lower the compactness of the representation. To reach a good trade-off, we simply propose to exploit the penalty functions included in the score used for the evaluation of the different BN structures (see Line 5 of Algo. 1). Here, we used the BIC score [18], which can be locally expressed as:

$$\mathrm{BIC}(\mathring{X}_{i_0} \mid \mathrm{MB}(\mathring{X}_{i_0})) = L(\mathring{D} \mid \Theta) - \frac{|\Theta|}{2} \log(N), \qquad (8)$$

where $L(\mathring{D} \mid \Theta)$ is the log-likelihood with the parameters estimated by EM, given the current structure $G$. $|\Theta|$ represents the number of parameters, i.e., $|\Theta| = q_{i_0} \times g + 2 \times (g+1)$: the first and second terms correspond to the numbers of parameters $\pi_{jk}$ and $(\mu_k, \sigma_k)$ needed to encode the conditional distributions (recall that there are $g+1$ Gaussians and that $q_{i_0}$ represents the domain size of $\mathrm{MB}(\mathring{X}_{i_0})$). Now, the best number of cut points is simply that which maximizes Eq. (8).
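Putting the pieces together, a sketch of the resulting model selection loop (reusing the hypothetical em_mixture and log_likelihood helpers sketched above; the range of values tried for g is arbitrary):

import numpy as np

def best_num_cut_points(x, mb, q, g_max=8):
    """Choose the number g of cut points maximizing the local BIC of Eq. (8):
    BIC = log-likelihood - |Theta| / 2 * log(N), with |Theta| = q*g + 2*(g+1)."""
    N = len(x)
    best = None
    for g in range(g_max + 1):
        pi, mu, sigma = em_mixture(x, mb, q, g)        # EM fit, cf. the Algo. 3 sketch
        bic = log_likelihood(x, mb, pi, mu, sigma) - (q * g + 2 * (g + 1)) / 2 * np.log(N)
        if best is None or bic > best[0]:
            best = (bic, g, pi, mu, sigma)
    return best    # (bic, g, pi, mu, sigma) of the best model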
4 Experiments
In this section, we highlight the effectiveness of our method, hereafter denoted MGCD (for Mixture of Gaussians Clustering-based Discretization), by comparing it with the algorithms provided in [17] and [6], hereafter called Ruichu and Friedman respectively. Step 4 of Algo. 1 was performed using a simple Tabu search. For the comparisons, three criteria have been taken into account: i) the quality of the structure learnt by the algorithm (which strongly depends on that of the discretization); ii) the computation time; and iii) the quality of the learnt CPT parameters, which has been evaluated by their predictive power on the values taken by some variables given observations.
For the first two criteria, we randomly generated discrete BNs following the guidelines given in [7]. Those contained from 10 to 30 nodes and from 12 to 56 arcs. Each node had at most 6 parents and its domain size was randomly chosen between 2 and 5. The CPTs of these BNs represented the $\pi_{jk}$ of the preceding section. From these BNs, we generated continuous datasets containing from 1000 to 10000 records as follows: for each random variable $X_i$, we mapped its finite set of values into a set of consecutive intervals $\{[t_{k-1}, t_k)\}_{k=1}^{|X_i|}$ of arbitrary lengths. Then, we assigned a truncated Gaussian to each interval, the parameters $(\mu_k, \sigma_k)$ of which were randomly chosen. Finally, to generate a continuous record, we first generated a discrete record from the discrete BN using a logic sampling algorithm. Then, this record was mapped into a continuous one by sampling from the truncated Gaussians. Overall, 350 continuous datasets were generated.
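A Python sketch of this generation scheme for a single variable (interval bounds, means, variances and the discrete distribution below are arbitrary; truncated-Gaussian sampling is done by simple rejection):

import numpy as np

rng = np.random.default_rng(1)

def sample_truncated_gaussian(low, high, mu, sigma):
    """Rejection sampling of one value from N(mu, sigma) truncated to [low, high)."""
    while True:
        v = rng.normal(mu, sigma)
        if low <= v < high:
            return v

# hypothetical mapping for a 3-valued discrete variable: value -> interval and Gaussian
intervals = {0: (0.0, 1.0), 1: (1.0, 2.5), 2: (2.5, 4.0)}
params = {0: (0.4, 0.3), 1: (2.0, 0.5), 2: (3.0, 0.6)}

def to_continuous(discrete_value):
    low, high = intervals[discrete_value]
    mu, sigma = params[discrete_value]
    return sample_truncated_gaussian(low, high, mu, sigma)

# in the paper's protocol the discrete record comes from logic sampling of the BN;
# here an arbitrary distribution stands in for it
discrete_record = rng.choice([0, 1, 2], size=5, p=[0.2, 0.3, 0.5])
print([round(to_continuous(v), 3) for v in discrete_record])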
Fig. 3. Averages of the TPR (left) and FPR (right) metrics for BNs with 10 to 30 nodes as a function of the sample size.

To compare them, the BN structures produced by Ruichu, Friedman and MGCD were converted into their Markov equivalence class, i.e., into a partially directed acyclic graph (CPDAG). Such a transformation increases the quality of the comparisons since two BNs encode the same distribution iff they belong to the same equivalence class. The CPDAGs were then compared w.r.t. their true and false positive rate metrics (TPR and FPR). TPR (resp. FPR) represents the percentage of arcs/edges belonging to the learnt CPDAG that also exist (resp. do not exist) in the original CPDAG. Both metrics describe how well the dependences between variables are preserved by learning/discretization. Fig. 3 shows the average TPR and FPR over the 350 generated databases. As can be seen, MGCD outperforms the others for all dataset sizes: MGCD's TPR is about 10% higher than Ruichu's and 40% higher than Friedman's, and MGCD's FPR is between 20% and 40% lower than the other methods'. MGCD's performance w.r.t. Ruichu's can be explained by the fact that, unlike Ruichu's, it fully takes into account the conditional dependences between all the random variables. Its performance w.r.t. Friedman's can be explained by our choice of exploiting clustering rather than an entropy-based approach. Table 1 provides computation time ratios (other method's runtime / MGCD's runtime). As can be seen, our method slightly outperforms Ruichu's (while being 10% better in terms of TPR and more than 20% better in terms of FPR), and it significantly outperforms Friedman's (about 3 times faster) while at the same time being 40% higher in terms of TPR.
Table 1. Runtime ratio comparisons between discretization approaches (other method's runtime / MGCD's runtime).

Approach / Dataset size | 1000      | 5000      | 7500      | 10000
Friedman                | 2.762444  | 3.350782  | 3.404958  | 3.361540
Ruichu                  | 0.8872389 | 1.1402535 | 1.1334637 | 1.1032982
Finally, we compared the discretizations w.r.t. the quality of the produced CPTs. To do so, we generated 100 continuous databases from two classical BNs, Child and Sachs, using the same process as above except that: i) the distributions inside the intervals were uniform instead of Gaussians (to penalize our approach, since the data then do not fit its hypotheses), and ii) some small sets of variables were kept discrete and served as multilabel targets. Databases were split into a learning (2/3) and a test (1/3) part. For each record in the latter, we computed the distribution (learnt by each of the 3 algorithms on the learning database) of each target given some observations on its Markov blanket, and we estimated the value of the target by sampling it from the learnt distribution. The percentages of correct predictions are shown in Table 2. As we can see, our algorithm outperforms the other algorithms, especially Ruichu's, which fails to provide correct predictions because its univariate discretization does not take into account the conditional dependences among random variables. Friedman's results are closer to ours, but recall that it is about 3 times slower.
5 Conclusion
We have proposed a new multivariate discretization algorithm designed for BN structure learning, taking into account the dependences among variables acquired during learning. Our experiments highlighted its efficiency and effectiveness compared to state-of-the-art algorithms, but more experiments are of course needed to better assess the strengths and shortcomings of our proposed approach. For future work, we plan to improve our algorithm, notably by directly working with truncated Gaussians instead of the current approximation by a mixture of untruncated Gaussians. However, such an improvement is not trivial because, in this case, no closed-form solution exists for determining the cut points.
Table 2. Prediction accuracy rates (%) for discrete target variables in the Child and Sachs standard BNs (http://www.bnlearn.com/bnrepository/) w.r.t. the percentage of observed variables in the Markov blanket.

                      | 30% Markov blanket       | 60% Markov blanket       | 100% Markov blanket
BN     Dataset size   | MGCD   Ruichu  Friedman  | MGCD   Ruichu  Friedman  | MGCD   Ruichu  Friedman
Child  1000           | 60.90  59.30   58.99     | 62.76  60.90   59.96     | 67.53  65.34   63.33
Child  2000           | 61.59  56.62   58.05     | 62.41  59.24   60.55     | 67.29  64.71   63.03
Child  5000           | 64.88  62.29   60.05     | 66.07  62.95   61.94     | 69.42  65.39   63.82
Child  10000          | 65.81  62.48   61.75     | 67.44  63.85   63.51     | 70.49  66.92   65.79
Sachs  1000           | 56.63  54.78   57.59     | 57.22  55.04   58.74     | 65.67  61.06   64.65
Sachs  2000           | 56.96  56.16   54.02     | 59.72  57.58   56.64     | 65.80  62.24   60.22
Sachs  5000           | 57.69  55.00   55.15     | 59.80  57.96   56.38     | 65.51  64.15   64.47
Sachs  10000          | 60.35  57.50   57.33     | 61.67  58.26   59.22     | 70.04  65.74   64.61
References

1. Boullé, M.: MODL: a Bayes optimal discretization method for continuous attributes. Machine Learning 65(1), 131–165 (2006)
2. Boullé, M.: Khiops: A statistical discretization method of continuous attributes. Machine Learning 55(1), 53–69 (2004)
3. Dempster, A.P., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, pp. 1–38 (1977)
4. Dougherty, J., Kohavi, R., Sahami, M.: Supervised and unsupervised discretization of continuous features. In: Proc. of ICML'95. pp. 194–202 (1995)
5. Fayyad, U., Irani, K.: Multi-interval discretization of continuous-valued attributes for classification learning. In: Proc. of IJCAI'93. pp. 1022–1029 (1993)
6. Friedman, N., Goldszmidt, M.: Discretizing continuous attributes while learning Bayesian networks. In: Proc. of ICML'96. pp. 157–165 (1996)
7. Ide, J.S., Cozman, F.G., Ramos, F.T.: Generating random Bayesian networks with constraints on induced width. In: Proc. of ECAI'04. pp. 323–327 (2004)
8. Jiang, S., Li, X., Zheng, Q., Wang, L.: Approximate equal frequency discretization method. In: Proc. of GCIS'09. pp. 514–518 (2009)
9. Kerber, R.: ChiMerge: Discretization of numeric attributes. In: Proc. of AAAI'92. pp. 123–128 (1992)
10. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press (2009)
11. Kwedlo, W., Krętowski, M.: An evolutionary algorithm using multivariate discretization for decision rule induction. In: Proc. of PKDD'99. pp. 392–397 (1999)
12. Lauritzen, S., Wermuth, N.: Graphical models for associations between variables, some of which are qualitative and some quantitative. Annals of Statistics 17(1), 31–57 (1989)
13. Monti, S., Cooper, G.: A multivariate discretization method for learning Bayesian networks from mixed data. In: Proc. of UAI'98. pp. 404–413 (1998)
14. Monti, S., Cooper, G.: A latent variable model for multivariate discretization. In: Proc. of AIS'99. pp. 249–254 (1999)
15. Moral, S., Rumí, R., Salmerón, A.: Mixtures of truncated exponentials in hybrid Bayesian networks. In: Proc. of ECSQARU'01. Lecture Notes in Artificial Intelligence, vol. 2143, pp. 156–167 (2001)
16. Ratanamahatana, C.: CloNI: Clustering of sqrt(n)-interval discretization. In: Proc. of Int. Conf. on Data Mining & Comm Tech. (2003)
17. Ruichu, C., Zhifeng, H., Wen, W., Lijuan, W.: Regularized Gaussian mixture model based discretization for gene expression data association mining. Applied Intelligence 39(3), 607–613 (2013)
18. Schwarz, G.: Estimating the dimension of a model. Annals of Statistics 6(2), 461–464 (1978)
19. Shachter, R.: Evaluating influence diagrams. Operations Research 34(6), 871–882 (1986)
20. Shenoy, P., West, J.: Inference in hybrid Bayesian networks using mixtures of polynomials. International Journal of Approximate Reasoning 52(5), 641–657 (2011)
21. Song, D., Ek, C., Huebner, K., Kragic, D.: Multivariate discretization for Bayesian network structure learning in robot grasping. In: Proc. of ICRA'11. pp. 1944–1950 (2011)
22. Zighed, D., Rabaséda, S., Rakotomalala, R.: FUSINTER: a method for discretization of continuous attributes. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6(3), 307–326 (1998)
... However, it is worth observing that an evaluation of its efficacy was notably absent in the original paper. Mabrouk et al. [16] proposed a multivariate quantization algorithm able to systematically assess learned dependencies while employing clustering techniques based on expectation maximization through the Gaussian mixture model. Furthermore, this algorithm outperformed the method devised by Friedman and Goldszmidt [8] in terms of prediction accuracy, quality of learned structures and computational efficiency. ...
... This approach is a component of a multi-variable quantization technique, wherein each continuous variable is discretized based on its interactions (dependencies) with other variables [18]. An alternative methodology for BN structural learning involves the application of expectation maximization on the Gaussian mixture model, which is utilized to represent bins and determine quantizations by considering the clustering patterns within the distribution of variables [16]. In addition, an alternative approach in BN structural learning involves choosing the quantization with the highest probability based on the available data, considering the dependencies imposed by the BN structure. ...
... Thus, BNs with 5 and 6 edges had higher penalization as a result of model complexity and achieved a worse score than that found for the best BNs with 3 and 4 edges. An analysis 16 Landscape analysis for all datasets with the expected BN value and score depicted in dashed red lines. For the Weather dataset, the expected BN is the one found using the PC algorithm. ...
Article
Full-text available
Bayesian Networks (BN) are robust probabilistic graphical models mainly used with discrete random variables requiring discretization and quantization of continuous data. Quantization is known to affect model accuracy, speed and interpretability, and there are various quantization methods and performance comparisons proposed in literature. Therefore, this paper introduces a novel approach called CPT limit-based quantization (CLBQ) aimed to address the trade-off among model quality, data fidelity and structure score. CLBQ sets CPT size limitation based on how large the dataset is so as to optimize the balance between the structure score of BNs and mean squared error. For such a purpose, a range of quantization values for each variable was evaluated and a Pareto set was designed considering structure score and mean squared error (MSE). A quantization value was selected from the Pareto set in order to balance MSE and structure score, and the method’s effectiveness was tested using different datasets, such as discrete variables with added noise, continuous variables and real continuous data. In all tests, CLBQ was compared to another quantization method known as Dynamic Discretization. Moreover, this study assesses the suitability of CLBQ for the search and score of BN structure learning, in addition to examining the landscape of BN structures while varying dataset sizes and confirming its consistency. It was sought to find the expected structure location through a landscape analysis and optimal BNs on it so as to confirm whether the expected results were actually achieved in the search and score of BN structure learning. Results demonstrate that CLBQ is quite capable of striking a balance between model quality, data fidelity and structure score, in addition to evidencing its potential application in the search and score of BN structure learning, thus further research should explore different structure scores and quantization methods through CLBQ. Furthermore, its code and used datasets have all been made available.
... Discretization is the process of transforming quantitative data into nominal qualitative data. This process leads to a loss of information in the data and finding an optimal discretization is an NP-complete task (Mabrouk & Gonzales, 2010). ...
... Therefore, many approaches tried to perform the discretization based on the structure of the BN. This leads to the development of multiple Bayesian Discretization-Learning algorithms which performs the discretization alongside with the search for the best structure for the BN (Friedman & Goldszmidt, 1996;Mabrouk & Gonzales, 2010;Stefano Monti & Cooper, 1998). ...
... Mabrouk et al. (2010) have shown that entropy-based discretization result is suboptimal and developed an algorithm which performs discretization for BNs based on clustering scheme which has outperformed the previous methods. In their method, first, they approximate the joint distribution of continuous variables with a mixture of non-truncated Gaussian distributions and then use the EM method to determine the number of cut point and the mean values and the variances of the Gaussian distributions. ...
Thesis
Full-text available
Bayesian networks in additive manufacturing and reliability engineering A Bayesian network (BN) is a powerful tool to represent the quantitative and qualitative features of a system in an intuitive yet sophisticated manner. The qualitative aspect is represented with a directed acyclic graph (DAG), depicting dependency relations between the random variables of the system. In a DAG, the variables of the system are shown with a set of nodes and the dependencies between them are shown with a directed edge. A DAG in the Bayesian network can be a causal graph under certain circumstances. The quantitative aspect is the local conditional probabilities associated with each variable, which is a factorization of the joint probability distribution of the variables in the system based on the dependency relation represented in the DAG. In this study, the benefits of using BNs in reliability engineering and additive manufacturing is investigated. In the case of reliability engineering, there are several methods to create predictive models for reliability features of a system. Predicting the possibility and the time of a possible failure is one of the important tasks in the reliability engineering principle. The quality of the cor-rective maintenance after each failure is affecting consecutive failure times. If a maintenance task after each failure involves replacing all the components of an equipment, called perfect maintenance , it is considered that the equipment is restored to an "as good as new" (AGAN) condition, and based on that, the consecutive failure times are considered independent. Not only in most of the cases the maintenance is not perfect, but the environment of the equipment and the usage patterns have a significant effect on the consecutive failure times. In this study, this effect is investigated by using Bayesian network structural learning algorithms to learn a BN based on the failure data of an industrial water pump. In additive manufacturing (AM) field, manufacturing systems are normally a complex combination of multiple components. This complex nature and the associated uncertainties in design and manufacturing parameters in additive manufacturing promotes the need for models that can handle uncertainties and are efficient in calculations. Moreover, the lack of AM knowledge in practitioners is one of the main obstacles for democratizing it. In this study, a method is developed for creating Bayesian network models for AM systems that includes experts' and domain knowledge. To form the structure of the model, causal graphs obtained through dimensional analysis conceptual modeling (DACM) framework is used as the DAG for a Bayesian network after some modifications. DACM is a framework for extracting the causal graph and the governing equations between the variables of a complex system. The experts' knowledge is extracted through a probability assessment process, called the analytical hierarchy process (AHP) and encoded into local probability tables associated with the independent variables of the model. To complete the model, a sampling technique is used along with the governing equations between the intermediate and output variables to obtain the rest of the probability tables. Such models can be used in many use cases, namely domain knowledge representation, defect prognosis and diagnosis and design space exploration. 
The qualitative aspect of the model is obtained from the physical phenomena in the system and the quantitative aspect is obtained from the experts' knowledge, therefore the model can interactively represent the domain and the experts' knowledge. In prognosis tasks, the probability distribution for the values that an output variable can take is calculated based on the values chosen for the input variables. In diagnosis tasks, the designer can investigate the reason for having a specific value in an output variable among the inputs. Finally, the model can be used to perform design space exploration. The model reduces the design space into a discretized and interactive Bayesian network space which is very convenient for design space exploration.
... It represents compactly mixed probability distributions. The model is derived from the result of an algorithm for learning BNs from datasets containing both discrete and continuous random variables [18]. ...
... and ii) will the loss of information affect significantly the results of inference? A possible answer to the first question consists of exploiting "conditional truncated densities" [18]. The answer to the second question of course strongly depends on the discretization performed but, as we shall see, conditional truncated densities can limit the discrepancy between the exact a posteriori marginal density functions of the continuous random variables and the approximation they provide. ...
... But the expressive power shall be increased. In some sense, a ctdBN learning algorithm has already been provided in [18]. This algorithm raises some issues, notably the fact that computing discretizations conditionally to the nodes in the Markov blankets of each discretized node limits the ctdBNs that can be learnt. ...
Article
Full-text available
The majority of Bayesian networks learning and inference algorithms rely on the assumption that all random variables are discrete, which is not necessarily the case in real-world problems. In situations where some variables are continuous, a trade-off between the expressive power of the model and the computational complexity of inference has to be done: on one hand, conditional Gaussian models are computationally efficient but they lack expressive power; on the other hand, mixtures of exponentials (MTE), basis functions (MTBF) or polynomials (MOP) are expressive but this comes at the expense of tractability. In this paper, we introduce an alternative model called a ctdBN that lies in between. It is composed of a “discrete” Bayesian network (BN) combined with a set of univariate conditional truncated densities modeling the uncertainty over the continuous random variables given their discrete counterpart resulting from a discretization process. We prove that ctdBNs can approximate (arbitrarily well) any Lipschitz mixed probability distribution. They can therefore be exploited in many practical situations. An efficient inference algorithm is also provided and its computational complexity justifies theoretically why inference computation times in ctdBNs are very close to those in discrete BNs. Experiments confirm the tractability of the model and highlight its expressive power, notably by comparing it with BNs on classification problems and with MTEs and MOPs on marginal distributions estimations.
... Using an arbitrary discretization by frequency or by even intervals generally leads to sub-optimal networks. Even though this discretization arbitrariness can be reduced using multi-modal approaches (Boullé, 2006;Mabrouk et al., 2015;Chen et al., 2017), these often come at a considerable computing cost. ...
... We are also planning to work on improvements of our discretization process, for instance with a multivariate process intertwined with the BN learning (Mabrouk et al., 2015) or by using mixtures of truncated Gaussians (Gonzales, 2019) instead of Gaussians in order to avoid spurious discretization effects. ...
Conference Paper
Full-text available
The last decades improvements in processing abilities have quickly led to an increasing use of data analyses implying massive data-sets. To retrieve insightful information from any data driven approach, a pivotal aspect to ensure is good data quality. Manual correction of massive data-sets requires tremendous efforts, is prone to errors, and results being really costly. If knowledge in a specific field can often allow the development of efficient models for anomaly detection and data correction, this knowledge can sometimes be unavailable and a more generic approach should be found. This paper presents a novel approach to anomaly detection and correction in mixed tabu-lar data using Bayesian Networks. We present an algorithm for detecting anomalies and offering correction hints based on Jensen scores computed within the Markov Blankets of considered variables. We also discuss the incremental corrections of detection model using user's feedback, as well as additional aspects related to discretization in mixed data and its effects on detection efficiency. Finally we also discuss how functional dependencies can be managed to detect errors while improving faithfulness and computation speed.
... As such, this graphical model is able to represent very faithfully very complex probability density functions. A ctdBN learning algorithm is described in [12] and we use it in this work. It allows us to learn a good discretization but also to learn the structure of a ctdBN and its parameters (both the CPTs and the conditional truncated densities). ...
Conference Paper
In this paper, we propose an approach for detecting events online in video sequences. This one requires no prior knowledge, the events being defined as spatio-temporal breaks. For this purpose, we propose to combine non-stationary dynamic Bayesian networks (nsDBN) to model the scene and particle filter (PF) to track objects in the sequence. In this framework, an event corresponds to a significant difference between a new particle set provided by PF and the sampled density encoded by the nsDBN. Whenever an event is detected, the particle set is exploited to learn a new nsDBN representing the scene. Unfortunately, nsDBNs are designed for discrete random variables and particles are instantiations of continuous ones. We therefore propose to discretize them using a new discretization method well suited for nsDBNs. Our approach has been tested on real video sequences and allowed to detect two different events (forbidden stop and fight).
Article
This research utilizes an Object-Oriented Bayesian Network (OOBN) to model the relationships between the Sustainable Development Goal (SDGs) and resilience and sustainability at national, regional, and global levels. The ability of the OOBN to learn the parameters, i.e., the conditional probability distributions between the variables included in the network, was exploited to explore the impacts of progress of SDGs on the sustainability and resilience of nations. The resulting OOBN is used to examine different situations pertinent to policy analysis and design at the times of disasters, particularly in the wake of the COVID-19 pandemic. Three case studies are used to illustrate the step by step process of using the proposed OOBN as well as the expected results of its application in policy analysis and evaluation contexts. The proposed is able to provide insight regarding which SDGs will have more significant impacts on both resilience and sustainability as well as their constituent components. The results of this research indicate how data induced OOBNs can be utilised by policy makers to prioritize new policies and evaluate the impacts of existing policies on both the resilience and sustainability of societies.
Chapter
Uncertain reasoning over both continuous and discrete random variables is important for many applications in artificial intelligence. Unfortunately, dealing with continuous variables is not an easy task. In this tutorial, we will study some of the methods and models developed in the literature for this purpose. We will start with the discretization of continuous random variables. A special focus will be made on the numerous issues they raise, ranging from which discretization criterion to use, to the appropriate way of using them during structure learning. These issues will justify the exploitation of hybrid models designed to encode mixed probability distributions. Several such models have been proposed in the literature. Among them, Conditional Linear Gaussian models are very popular. They can be used very efficiently for inference but they lack flexibility in the sense that they impose that the continuous random variables follow conditional Normal distributions and are related to other variables through linear relations. Other popular models are mixtures of truncated exponentials, mixtures of polynomials and mixtures of truncated basis functions. Through a clever use of mixtures of distributions, these models can approximate very well arbitrary mixed probability distributions. However, exact inference can be very time consuming in these models. Therefore, when choosing which model to exploit, one has to trade-off between the flexibility of the uncertainty model and the computational complexity of its learning and inference mechanisms.
Article
The CESAM FP7 project (Van Dorsselaere et al., 2015) of EURATOM was conducted from April 2013 until March 2017 in the aftermath of the Fukushima Dai-ichi accidents. Nineteen international partners from Europe and India, including the European Joint Research Centre, participated under the coordination of GRS and with a strong involvement of IRSN, both of which are ASTEC code developers. The project objectives were to understand all relevant phenomena during the Fukushima Dai-ichi accidents and their importance for Severe Accident Management (SAM) measures, and to improve the ASTEC computer code so as to simulate plant behaviour throughout accident sequences, including SAM measures. The starting point was the analysis of current SAM measures implemented in European nuclear power plants. To achieve these goals, simulations of relevant experiments that allow a solid validation of the ASTEC code against single and separate effect tests have been conducted. The validation topics covered in the CESAM project were grouped into 9 areas, among which are re-flooding of degraded cores, pool scrubbing, hydrogen combustion, and spent fuel pool behaviour. Furthermore, modelling improvements have been implemented in the current ASTEC V2.1 series for the estimation of source term consequences in the environment and the prediction of plant status in emergency centres. Finally, ASTEC reference input decks have been created for all reactor types operated in Europe today as well as for spent fuel pools. These reference input decks generically describe plant types such as PWR, VVER, PHWR, and BWR without defining proprietary data of a specific plant, and they account for the best recommendations from code developers and users. In addition, a generic input deck for a spent fuel pool was elaborated. These input decks can be used as a basis by all (and especially new) ASTEC users in order to understand the code's basic requirements and model features and to implement the specificities of their own NPP type. Based on these generic inputs, benchmark calculations have been performed with other codes (such as MELCOR, MAAP, ATHLET-CD, COCOSYS…) with a focus on the applicability of ASTEC models to currently implemented SAM measures. This article provides a final summary of the CESAM project: an overview of the improved modelling capabilities of the recent ASTEC V2.1 version is given, followed by the validation status of ASTEC V2.1 as concluded after CESAM. Further, plant applications performed by CESAM partners are summarized, with a special focus on the simulation of SAM measures in various NPP types, and insights gained on SAM measures are derived.
Article
Association rules have shown their usefulness in gene-expression-based disease diagnosis thanks to their good interpretability. The large number of rules generated from high-dimensional gene expression data is one of the main challenges for their application. In this work, we reveal that discretization preprocessing is one of the causes of this rule-number explosion. To alleviate the problem, a Regularized Gaussian Mixture Model (RGMM) is proposed to discretize continuous gene expression data. RGMM takes into account both the complexity of the discretization model and the information loss of the discretization procedure, under the Minimum Description Length framework. Extensive experiments show the effectiveness of RGMM on real-life gene expression data sets.
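As an illustration of the underlying idea (a minimal sketch only: the MDL-based regularisation that defines RGMM is omitted and the component count is fixed by hand), a fitted Gaussian mixture can be turned into cut points by cutting wherever the most probable component changes along the axis:

import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_cut_points(values, n_components=3, grid_size=1000):
    # Fit a 1-D Gaussian mixture and place a cut point wherever the most
    # probable component switches along a regular grid over the data range.
    gmm = GaussianMixture(n_components=n_components, random_state=0)
    gmm.fit(values.reshape(-1, 1))
    grid = np.linspace(values.min(), values.max(), grid_size).reshape(-1, 1)
    labels = gmm.predict(grid)
    switches = np.where(labels[1:] != labels[:-1])[0]
    return grid[switches + 1].ravel()

values = np.concatenate([np.random.normal(0, 1, 500), np.random.normal(5, 1, 500)])
cuts = gmm_cut_points(values, n_components=2)
discretized = np.digitize(values, cuts)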
Article
We define and investigate classes of statistical models for the analysis of associations between variables, some of which are qualitative and some quantitative. In the cases where only one kind of variable is present, the models are well-known models for either contingency tables or covariance structures. We characterize the subclass of decomposable models, for which the statistical theory is especially simple. All models can be represented by a graph with one vertex for each variable. The vertices are possibly connected with arrows or lines corresponding to directional or symmetric associations being present. Pairs of vertices that are not connected are conditionally independent given some of the remaining variables, according to specific rules.
Article
In this paper we address the problem of discretization in the context of learning Bayesian networks (BNs) from data containing both continuous and discrete variables. We describe a new technique for multivariate discretization, whereby each continuous variable is discretized while taking into account its interaction with the other variables. The technique is based on the use of a Bayesian scoring metric that scores the discretization policy for a continuous variable given a BN structure and the observed data. Since the metric is relative to the BN structure currently being evaluated, the discretization of a variable needs to be dynamically adjusted as the BN structure changes.
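To make the idea of scoring a discretization policy relative to the current structure concrete, here is a self-contained toy score of our own devising (a BIC-style penalised log-likelihood of the continuous data under a piecewise-uniform model per parent configuration; it is not the metric of the cited paper):

import numpy as np

def score_policy(values, parent, cuts):
    # values: continuous child variable; parent: discrete parent labels taken
    # from the current structure; cuts: candidate cut points for the child.
    edges = np.concatenate([[values.min()], np.sort(cuts), [values.max()]])
    widths = np.diff(edges)
    child = np.digitize(values, edges[1:-1])
    n, ll = len(values), 0.0
    for pa in np.unique(parent):
        counts = np.bincount(child[parent == pa], minlength=len(widths))
        probs = (counts + 1) / (counts.sum() + len(counts))   # Laplace smoothing
        # density of x inside bin k given the parent value: P(bin k | pa) / width_k
        ll += np.sum(counts * (np.log(probs) - np.log(widths)))
    n_params = len(cuts) * len(np.unique(parent))
    return ll - 0.5 * n_params * np.log(n)                    # BIC-style penalty

rng = np.random.default_rng(0)
parent = np.repeat([0, 1], 500)
values = np.concatenate([rng.normal(0, 1, 500), rng.normal(4, 1, 500)])
print(score_policy(values, parent, np.array([2.0])))   # informative cut
print(score_policy(values, parent, np.array([6.0])))   # poor cut, lower score

Because such a score depends on the parent set, it has to be re-evaluated whenever the structure changes, which is exactly the dynamic adjustment described above.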
Article
Many algorithms for data mining and machine learning can only process discrete attributes. In order to use these algorithms when some attributes are numeric, the numeric attributes must be discretized. Because of the prevalence of the normal distribution, an approximate equal-frequency discretization method based on the normal distribution is presented. The method is simple to implement. Its computational complexity is nearly linear in the size of the dataset, so it can be applied to large datasets. We compare this method with other discretization methods on the UCI datasets. The experimental results show that this unsupervised discretization method is effective and practicable.
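A minimal sketch of this idea (assuming the fitted distribution is a single Normal estimated from the data; the exact formulation of the cited method may differ):

import numpy as np
from scipy.stats import norm

def normal_equal_freq_cuts(values, n_bins=5):
    # Approximate equal-frequency binning: take equal-probability quantiles of a
    # Normal distribution fitted to the data instead of sorting the whole dataset.
    mu, sigma = values.mean(), values.std(ddof=1)
    probs = np.arange(1, n_bins) / n_bins            # interior probability masses
    return norm.ppf(probs, loc=mu, scale=sigma)      # cut points = Normal quantiles

values = np.random.normal(10.0, 2.0, size=1000)
cuts = normal_equal_freq_cuts(values, n_bins=4)
labels = np.digitize(values, cuts)                   # discrete labels 0..3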
Conference Paper
We describe a new method for multivariate discretization based on the use of a latent variable model. The method is proposed as a tool to extend the scope of applicability of machine learning algorithms that handle discrete variables only.
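One possible reading of this idea, sketched under our own assumptions (a Gaussian mixture as the latent variable model and cut points placed at midpoints between component means, which need not match the cited method):

import numpy as np
from sklearn.mixture import GaussianMixture

def latent_variable_cuts(data, n_components=3):
    # data: (n_samples, n_vars) matrix of continuous variables. The latent mixture
    # component plays the role of the hidden variable shared by all dimensions.
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(data)
    cuts = {}
    for j in range(data.shape[1]):
        means = np.sort(gmm.means_[:, j])
        cuts[j] = (means[1:] + means[:-1]) / 2.0     # midpoints between adjacent means
    return cuts

rng = np.random.default_rng(0)
data = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(300, 2)),
                  rng.normal([5.0, 8.0], 1.0, size=(300, 2))])
cuts = latent_variable_cuts(data, n_components=2)
discretized = {j: np.digitize(data[:, j], c) for j, c in cuts.items()}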
Conference Paper
In this paper we propose the use of mixtures of truncated exponentials (MTE) in hybrid Bayesian networks. We study the properties of the MTE distribution and show how exact probability propagation can be carried out by means of a local computation algorithm. One feature of this model is that no restriction is imposed on the ordering of the variables, whether discrete or continuous. Computations are performed over a representation of probabilistic potentials based on probability trees, expanded to allow discrete and continuous variables simultaneously. Finally, a Markov chain Monte Carlo algorithm is described with the aim of dealing with complex networks.
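For reference, over a single continuous variable x restricted to an interval, an MTE potential has the standard form (as commonly written in the literature)

f(x) = a_0 + \sum_{i=1}^{m} a_i \, e^{b_i x}, \qquad x \in [\alpha, \beta],

with real coefficients a_i, b_i; in the multivariate case, b_i x is replaced by a linear combination of the variables, and the domain is partitioned into hypercubes on each of which such an expression holds.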
Conference Paper
We develop an algorithm that can evaluate any well-formed influence diagram and determine the optimal policy for its decisions. Since the diagram can be analyzed directly, there is no need to construct other representations such as a decision tree. As a result, the analysis can be performed using the decision maker's perspective on the problem. Questions of sensitivity and the value of information are natural and easily posed. Modifications to the model suggested by such analyses can be made directly to the problem formulation, and then evaluated directly.