Multivariate Cluster-based Discretization for
Bayesian Network Structure Learning

Ahmed Mabrouk¹, Christophe Gonzales², Karine Jabet-Chevalier¹, and Eric Chojnaki¹

¹ Institut de radioprotection et de sûreté nucléaire, France
{Ahmed.Mabrouk,Karine.Chevalier-Jabet,Eric.Chojnaki}@irsn.fr
² Sorbonne Universités, UPMC Univ Paris 06, CNRS, UMR 7606, LIP6, France
Christophe.Gonzales@lip6.fr
Abstract. While there exist many efficient algorithms in the literature
for learning Bayesian networks with discrete random variables, learning
when some variables are discrete and others are continuous is still an
issue. A common way to tackle this problem is to preprocess datasets
by first discretizing continuous variables and, then, resorting to classical
discrete variable-based learning algorithms. However, such a method is
inefficient because the conditional dependences/arcs learnt during the
learning phase bring valuable information that cannot be exploited by
the discretization algorithm, thereby preventing it to be fully effective.
In this paper, we advocate to discretize while learning and we propose a
new multivariate discretization algorithm that takes into account all the
conditional dependences/arcs learnt so far. Unlike popular discretization
methods, ours does not rely on entropy but on clustering using an EM
scheme based on a Gaussian mixture model. Experiments show that our
method significantly outperforms the state-of-the-art algorithms.
Keywords: Multivariate discretization, Bayesian network learning.
1 Introduction
For several decades, Bayesian networks (BN) have been successfully exploited
for dealing with uncertainties. However, while their learning and inference mech-
anisms are relatively well understood when they involve only discrete variables,
the way they cope with continuous variables is still often unsatisfactory. One actually
has to trade-off between expressiveness and computational complexity: on one
hand, conditional Gaussian models and their combination with discrete variables are
computationally efficient but they definitely lack some expressiveness [12]; on the
other hand, mixtures of truncated exponentials, basis functions or polynomials are
very expressive but at the expense of tractability [15,20]. In between lie discretization meth-
ods which, by converting continuous variables into discrete ones, can provide a
satisfactory trade-off between expressiveness and tractability.
Acknowledgment: This work was supported by the French Institute for Radiopro-
tection and Nuclear Safety (IRSN), the Belgium’s nuclear safety authorities (Bel V)
and European project H2020-ICT-2014-1 #644425 Scissor.
In many real-world applications, BNs are learnt from data and, when there
exist continuous attributes, those are often discretized prior to learning, thereby
opening the path to exploiting efficient discrete variable-based learning algorithms.
However, such an approach is doomed to be ineffective because the
conditional dependences/arcs learnt during the learning phase bring valuable
information that cannot be exploited by the discretization algorithm, thereby
severely limiting its effectiveness. Yet there exist surprisingly few papers
on discretizing while learning, probably because it incurs substantial computa-
tional costs and it requires multivariate discretization instead of just a univariate
one. In this direction, MDL and Bayesian scores used by search algorithms have
been adapted to include multivariate discretizations taking into account the BN
structure learnt so far [6,13]. But, to fit naturally into these scores, such
discretizations heavily rely on entropy-related maximizations which, as we shall see, are
not very well suited to BN learning. In [21], a non-linear dimensionality reduc-
tion process called GP-LVM combined with a Gaussian mixture model-based
discretization is proposed for BN learning. Unfortunately, GP-LVM loses the
random variable’s semantics and the discretization does not rely on the BN struc-
ture. As a consequence, the method does not exploit all the useful information.
Unlike in BN learning, multivariate discretization has often been exploited
in Machine Learning for supervised classification tasks [1,2,5,9,22]. But the goal
is only to maximize the classification power w.r.t. one target variable. As such,
only the individual correlations of each variable with the target are of inter-
est and, thus, only bivariate discretization is needed. BN structure learning is
fundamentally different because the complete set of conditional dependences be-
tween all sets of variables is of interest and multivariate discretization shall most
often involve more than two variables. This makes these approaches not easily
transferable to BN learning. In [11], the authors propose a general multivariate
discretization relying on genetic algorithms to construct rulesets. However, the
approach is very limited because it is designed to cope with only one target and
the domain size of this variable needs to be small to keep the method tractable.
Discretizations have also been exploited in unsupervised learning (UL), but
those are essentially univariate [4,8,16,17], which makes them usable per se only
as a preprocess prior to learning. However, BN learning can be related to UL
in the sense that all the BN’s variables can be thought of as targets whose dis-
cretized values are unobserved. This suggests that some key ideas underlying UL
algorithms might be adapted for learning BN structures. Clustering is one such
popular framework. In [14], for instance, multivariate discretization is performed
by clustering but, unfortunately, independences between random variables are
only considered given a latent variable. This limits considerably the range of
applications of the method because numerous continuous variables require the
latent one to have a large domain size in order to get good quality discretizations.
This approach is therefore limited to small datasets and, by not exploiting the
BN structure, it is best suited as a BN learning preprocess. Finally, by relying on
entropy, its effectiveness for BN learning is certainly not optimal. However, here,
we advocate to exploit clustering methods for discretization w.r.t. BN learning.
More precisely, we propose a new clustering-based approach for multivariate
discretization that takes into account the conditional dependences among vari-
ables discovered during learning. By exploiting clustering rather than entropy,
it avoids the shortcomings induced by the latter and, by taking into account the
dependences between random variables, it significantly increases the quality of
the discretization compared to state-of-the-art clustering approaches.
The rest of the paper is organized as follows. Section 2 recalls BN learning
and discretizations. Then, in Section 3, we describe our approach and justify
its correctness. Its effectiveness is highlighted through experiments in Section 4.
Finally, some concluding remarks are given in Section 5.
2 Basics on BN Structure Learning and Discretization
Uppercase letters X, Z (resp. lowercase letters x, z) represent random variables
(resp. their instantiations). Boldface letters represent sets.
Definition 1. A (discrete) BN is a pair (G, θ) where G = (X, A) is a directed
acyclic graph (DAG), X = {X_1, ..., X_n} represents a set of discrete random
variables¹, A is a set of arcs, and θ = {P(X_i | Pa(X_i))}_{i=1}^n is the set of the
conditional probability distributions (CPT) of the variables X_i in G given their
parents Pa(X_i) in G. The BN encodes the joint probability over X as:

P(X) = ∏_{i=1}^n P(X_i | Pa(X_i)).   (1)
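As a concrete illustration of the factorization of Eq. (1) (our example, not part of the paper: a toy two-node BN A → B with made-up CPT entries), the joint distribution is recovered by multiplying each node's CPT given its parents:

```python
# Sketch of Eq. (1): the joint distribution of a discrete BN is the product
# of each node's CPT given its parents. Hypothetical 2-node BN: A -> B.
P_A = {0: 0.3, 1: 0.7}                  # P(A)
P_B_given_A = {0: {0: 0.9, 1: 0.1},     # P(B | A = 0)
               1: {0: 0.4, 1: 0.6}}     # P(B | A = 1)

def joint(a, b):
    """P(A=a, B=b) = P(A=a) * P(B=b | A=a)."""
    return P_A[a] * P_B_given_A[a][b]

# Sanity check: the joint sums to 1 over all configurations.
total = sum(joint(a, b) for a in (0, 1) for b in (0, 1))
```

For larger BNs the same product simply extends over all n CPTs.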
To avoid ambiguities between continuous variables and their discretized
counterparts, letters superscripted by "◦", e.g., ˚X, ˚x, represent variables and
their instantiations prior to discretization; otherwise they are discretized (for
discrete variables, X = ˚X and x = ˚x). In the rest of the paper, n always denotes
the number of variables in the BN, and we assume that ˚X_1, ..., ˚X_l are discrete
whereas ˚X_{l+1}, ..., ˚X_n are continuous. ˚D and D denote the input databases before
and after discretization respectively and are assumed to be complete, i.e., they do
not contain any missing data. N refers to their number of records.
Given ˚D = {˚x^(1), ˚x^(2), ..., ˚x^(N)}, BN learning consists of finding the DAG G that
most likely accounts for the observed data in ˚D. When all variables are discrete,
i.e., D = ˚D, there exist many efficient algorithms in the literature for solving
this task. Those can be divided into 3 classes [10]: i) the search-based approaches
that look for the structure optimizing a score (BD, BDeu, BIC, AIC, K2, etc.);
ii) the constraint-based approaches that exploit statistical independence tests
(χ², G², etc.) to find the best structure G; iii) the hybrid methods that exploit
a combination of both. In the rest of the paper, we will focus on search-based
approaches because our closest competitors, [13,6], belong to this class.
Basically, these algorithms start with a structure G_0 (often empty). Then, at
each step, they look in the neighborhood of the current structure for another
structure, say G, that increases the likelihood of structure G given observations
D, i.e., P(G|D). The neighborhood is often defined as the set of graphs that differ
from the current one only by one atomic graphical modification (arc addition, arc
deletion, arc reversal). P(G|D) is computed locally through the aforementioned
scores, their differences stemming essentially from different a priori hypotheses.
More precisely, assuming a uniform prior on all structures G, we have that:

P(G|D) = P(D|G)P(G) / P(D) ∝ P(D|G) = ∫_θ P(D|G, θ) π(θ|G) dθ,   (2)

where θ is the set of parameters of the CPTs of a (discrete) BN with structure
G. Different hypotheses on prior π and on θ result in the different scores (see,
e.g., [18] for the hypotheses for the BIC score used later).

¹ By abuse of notation, we use interchangeably X_i ∈ X to denote a node in the BN
and its corresponding random variable.
When database ˚D contains continuous variables, those can be discretized. A
discretization of a continuous variable ˚X is a function f : R → {0, ..., g} defined
by an increasing sequence of g cut points {t_1, t_2, ..., t_g} such that:

f(˚x) = 0 if ˚x < t_1,
f(˚x) = k if t_k ≤ ˚x < t_{k+1}, for all k ∈ {1, ..., g − 1},
f(˚x) = g if ˚x ≥ t_g.
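For concreteness, this piecewise definition of f can be sketched directly (our illustration; `bisect_right` counts the cut points ≤ ˚x, which is exactly the interval index k above):

```python
import bisect

def make_discretizer(cut_points):
    """Return f: R -> {0, ..., g} for increasing cut points t_1 < ... < t_g,
    as defined above: f(x) = k iff t_k <= x < t_{k+1}
    (with the conventions t_0 = -inf and t_{g+1} = +inf)."""
    ts = sorted(cut_points)
    return lambda x: bisect.bisect_right(ts, x)

# g = 3 cut points define 4 intervals, labeled 0..3.
f = make_discretizer([1.0, 2.5, 4.0])
```

Applying f record-wise to a column of ˚D yields the corresponding discretized column of D.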
Let F be a set of discretization functions, one for each continuous variable. Then,
given F, if D denotes the (unique) database resulting from the discretization of
˚D by F, Eq. (2) becomes:

P(G|˚D, F) ∝ P(˚D|G, F) = P(D|˚D, G, F) P(˚D|G, F) = P(˚D|D, G, F) P(D|G, F).

Assuming that all databases ˚D compatible with D given F are equiprobable, we
thus have that:

P(G|˚D, F) ∝ P(D|G, F) = ∫_θ P(D|G, F, θ) π(θ|F, G) dθ.   (3)
BN structure learning therefore amounts to finding the structure G* such that
G* = Argmax_G P(G|D, F). Note that P(D|G, F, θ) corresponds to a classical score
over discrete data. π(θ|F, G) is the prior over the parameters of the BN given F.
Eq. (3) is precisely the one used when discretization is performed as a preprocess
before learning.

When discretization is performed while learning, like in [6,13], both the
structure and the discretization should be optimized simultaneously. In other words,
the problem consists of computing Argmax_{F,G} P(G, F|˚D), where finding the best
discretization amounts to finding the best set of cut points (including the best size
for this set) for each continuous random variable. And we have that:
P(G, F|˚D) = P(G|F, ˚D) P(F|˚D) ∝ P(F|˚D) ∫_θ P(D|G, F, θ) π(θ|F, G) dθ.   (4)

As can be seen, the resulting equation combines the classical score on the
discretized data (the integral) with a score P(F|˚D) for the discretization
algorithm itself. The logarithm of the latter corresponds to what [6] and [13] call
DL_Λ(Λ) + DL_{˚D→D}(˚D, Λ) and S^c(Λ; ˚D) respectively.
Input: a database ˚D, an initial graph G, a score function sc on discrete variables
Output: the structure G of the Bayesian network
1 repeat
2     Find the best discretization F given G
3     {X_{l+1}, ..., X_n} ← discretize variables {˚X_{l+1}, ..., ˚X_n} given F
4     G ← G's neighbor that maximizes scoring function sc w.r.t. {X_1, ..., X_n}
5 until G maximizes the score;
Algorithm 1: Our structure learning architecture.
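The control flow of Algo. 1 — alternate between optimizing F given G and G given F, and stop when the score no longer improves — can be sketched as follows. This is our toy stand-in, not the paper's implementation: the subroutines of Lines 2–4 are abstracted into a precomputed sequence of scores, one per joint (discretize, search) step, so only the stopping logic of Line 5 is shown.

```python
def alternate_until_no_gain(step_scores):
    """Toy sketch of Algo. 1's outer loop: each entry of `step_scores`
    simulates the score of the structure obtained after one joint
    (discretization, neighbor-search) step; stop as soon as a step
    fails to improve the best score seen so far.
    Returns (number of improving steps taken, best score)."""
    best, steps = float("-inf"), 0
    for s in step_scores:
        if s <= best:          # Line 5: no improvement, local optimum reached
            break
        best, steps = s, steps + 1
    return steps, best

# Hypothetical score trace: improves three times, then plateaus.
steps, best = alternate_until_no_gain([-120.0, -95.5, -90.1, -90.1, -89.0])
```

In the real algorithm each score would come from evaluating sc on the freshly discretized database.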
3 A New Multivariate Discretization-Learning Algorithm
As mentioned earlier, we believe that taking into account the conditional de-
pendences between random variables is important to provide high-quality dis-
cretizations. Our approach thus follows Eq. (4) and our goal is to compute
Argmax
F,G
P (G, F|
˚
D). Optimizing jointly over F and G is too computation-
ally intensive a task to be usable in practice. Fortunately, we can approximate
it efficiently through a gradient descent, alternating optimizations over F given
a fixed structure G and optimizations over G given a fixed discretization F. This
suggests the BN structure learning method described as Algo. 1.
Multivariate discretization is much more time consuming than univariate
discretization. Line 2 could thus incur a strong overhead on the learning
algorithm because the discretization search space increases exponentially with
the number of variables to discretize. To alleviate this problem without
sacrificing too much accuracy, we suggest a local search algorithm that iteratively
fixes the discretizations of all the continuous variables but one and optimizes
the discretization of the latter (given the other variables) until some stopping
criterion is met. As such, discretizations being optimized one continuous variable
at a time, the combinatorics and the computation time are significantly limited.
Line 2 can thus be detailed as Algo. 2.
3.1 Discretization Criterion

To implement Algo. 2, a discretization criterion to be optimized is needed.

Input: a database ˚D, a graph G, a scoring function sc on discrete variables
Output: a discretization F
1 repeat
2     i_0 ← Select an element in {l + 1, ..., n}
3     Discretize ˚X_{i_0} given G and {X_1, ..., X_{i_0−1}, X_{i_0+1}, ..., X_n}
4 until stopping condition;
Algorithm 2: One-variable discretization architecture.

Basic ideas include trying to find cut points minimizing the discrepancy between
the frequencies or the sizes of the intervals [t_k, t_{k+1}). A more sophisticated approach
consists of limiting as much as possible the quantity of information lost after
discretization or, equivalently, maximizing the quantity of information remaining
after it. This naturally calls for maximizing an entropy. This is
essentially what our closest competitors, [6,13], do.
But entropy may not be the most appropriate measure when dealing with
BNs. Actually, consider a variable A with domain {a_1, a_2, a_3}. Then, it is
possible that, for some BN, P(A = a_1) = 1/6, P(A = a_2) = 1/3 and P(A = a_3) = 1/2.
With a sufficiently large database D, the frequencies of observations of a_1, a_2, a_3
in D would certainly lead to estimate P(A) ≈ [1/6, 1/3, 1/2]. Now, assume that the
observations in D are noisy, say with a Gaussian noise with an infinitely small
variance, as in Fig. 1. Then, after discretization, we shall expect to have 3
intervals with respective frequencies 1/6, 1/3 and 1/2, i.e., intervals similar to
(−∞, t_1), [t_1, t_2) and [t_2, +∞) of Fig. 1. However, w.r.t. entropy, the best
discretization corresponds to intervals (−∞, s_1), [s_1, s_2) and [s_2, +∞) of Fig. 1,
whose frequencies are all approximately equal to 1/3 (entropy is maximal for
equiprobable intervals). Therefore, whatever the infinitesimal noise added to data in D, an
entropy-based discretization produces a discretized variable A with distribution
[1/3, 1/3, 1/3] instead of [1/6, 1/3, 1/2]. This suggests that entropy is probably not the best
criterion for discretizing continuous variables for BN learning.
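The argument is easy to verify numerically (our check, using only the two candidate distributions above): Shannon entropy is indeed larger for the equiprobable discretization, however poorly it matches the generating distribution.

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

true_dist = [1/6, 1/3, 1/2]   # frequencies under cut points t_1, t_2
uniform   = [1/3, 1/3, 1/3]   # frequencies under cut points s_1, s_2
```

Since entropy peaks at the uniform distribution (log 3 here), an entropy maximizer will always prefer the s-cut points, whatever the true frequencies.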
Fig. 1 suggests that clustering would probably be more appropriate: here, one
cluster/interval per Gaussian would provide a better discretization. In this paper,
we assume that, within every interval, each continuous random variable, say ˚X_{i_0},
is distributed w.r.t. a truncated Gaussian. Over its whole domain of definition,
it is thus distributed as a mixture of truncated Gaussians, the weights of the
latter being precisely the CPT of X_{i_0} in the discrete BN. In particular, if ˚X_{i_0}
has some parents, there are as many mixtures as the product of the domain sizes
of the parents. The parameters of such a discretization scheme are therefore: i) a
set of g cut points (to define g + 1 intervals) and ii) a mean and a variance for
each interval (to define its Gaussian). Fig. 1 actually illustrates the fact that the
means of the Gaussians need not necessarily correspond to the middles of the
intervals. For instance, the mean of the third Gaussian is a_3 whereas the third
interval, [t_2, +∞), has no finite middle. Here, even finite interval middles, like
that of [t_1, t_2), do not correspond to the means of the Gaussians.

[Figure omitted: three Gaussian densities with means a_1, a_2, a_3, together with the cut points t_1, t_2 and s_1, s_2.]
Fig. 1. Discretization: entropy vs. clustering.

For each continuous random variable ˚X_{i_0}, this joint optimization problem is
really hard due to the normalization requirement that the integral of the
truncated Gaussian over each interval must equal 1 (which cannot be expressed
using closed-form formulas). Therefore, to alleviate the discretization computational
burden, we propose to approximate the computation of the cut points,
means and variances using a two-step process: first, we approximate the density
of the joint distribution of {X_1, ..., X_{i_0−1}, ˚X_{i_0}, X_{i_0+1}, ..., X_n} as a mixture of
untruncated Gaussians and we determine by maximum likelihood the number
of cut points as well as the means and variances of the Gaussians. This can be
easily done by an Expectation-Maximization (EM) approach. Then, in a second
step, we compute the best cut points w.r.t. the Gaussians. As each Gaussian is
associated with an interval, the parts of the Gaussian outside the interval can
be considered as a loss of information and we will therefore look for cut points
that minimize this loss. Now, let us delve into the details of the approach.
3.2 Discretization Exploiting the BN Structure

For the first discretization step of ˚X_{i_0}, we estimate the number g of cut points
and the Gaussians' means and variances. Assume that structure G is fixed and
that all the other variables are discrete. The density over all the variables, p(˚X),
is equal to p(˚X_{i_0} | Pa(˚X_{i_0})) ∏_{i≠i_0} P(X_i | Pa(X_i)), where p(˚X_{i_0} | Pa(˚X_{i_0}))
represents a mixture of Gaussians for each value of ˚X_{i_0}'s parents (there is a finite
number of values since all the variables but ˚X_{i_0} are discrete). P(X_i | Pa(X_i))
should be the CPT of discrete variable X_i but, unfortunately, it is not well
defined if ˚X_{i_0} ∈ Pa(X_i) because, in this case, Pa(X_i) has infinitely many values.
This is a serious issue since this CPT is used in the computation of P(D|G, F, θ)
of Eq. (4). Fortunately, this problem can be overcome by enforcing that ˚X_{i_0} has
no child while guaranteeing that the density remains unchanged. Actually, in
[19], an arc reversal operator is provided that, when applied, never alters the
density/probability distribution. More precisely, when reverting arc X → Y,
Shachter showed that if all the parents of X are added to Y and all the parents
of Y except X are added to X, then the resulting BN encodes the same
distribution. As an example of these transformations, reversing arc X → V of Fig. 2.(a)
results in Fig. 2.(b) and, then, reversing arc X → W results in Fig. 2.(c).

Therefore, to enforce that ˚X_{i_0} has no child, if {i_1, ..., i_c} denotes the set of
indices of the children variables of ˚X_{i_0}, sorted by a topological order of G, then,
by sequentially reversing all the arcs ˚X_{i_0} → X_{i_j}, j = 1, ..., c, we get:

p(˚X) = p(˚X_{i_0} | Pa(˚X_{i_0})) × ∏_{i ∉ {i_0,...,i_c}} P(X_i | Pa(X_i)) × ∏_{j=1}^c P(X_{i_j} | Pa(X_{i_j})),
[Figure omitted: three DAGs over nodes P, Q, U, V, W, X, Y, Z, labeled (a), (b), (c).]
Fig. 2. Shachter's arc reversals.
p(˚X) = p(˚X_{i_0} | MB(˚X_{i_0})) × ∏_{i ∉ {i_0,...,i_c}} P(X_i | Pa(X_i))
        × ∏_{j=1}^c P(X_{i_j} | ⋃_{h=1}^j (Pa(X_{i_h}) \ {˚X_{i_0}}) ∪ Pa(˚X_{i_0})),

where MB(˚X_{i_0}) is the Markov blanket of ˚X_{i_0} in G:

Definition 2. The Markov blanket of any node in G is the set of its parents, its
children and the other parents of its children.
Note that, in the last expression of p(˚X), only the first term involves ˚X_{i_0},
hence all the other CPTs are well defined (they are finite CPTs). As a side effect,
only p(˚X_{i_0} | MB(˚X_{i_0})) needs to be taken into account to discretize ˚X_{i_0} since
none of the other terms is related to ˚X_{i_0}. It shall be noted here that these arc
reversals are applied only for determining the parameters of the discretization, i.e.,
the set of cut points, means and variances of the Gaussians; they are never used to
learn the BN structure. Now, let us see how the parameters of the mixture of
Gaussians p(˚X_{i_0} | MB(˚X_{i_0})) maximizing the likelihood of dataset ˚D can be easily
estimated using an EM algorithm.
3.3 Parameter Estimation by an EM Algorithm

Let q_{i_0} represent the (finite) number of values of MB(˚X_{i_0}). For simplicity, we
will denote by {1, ..., q_{i_0}} the set of values of the joint discrete random variable
MB(˚X_{i_0}). Let g denote the number of cut points in the discretization and let
{N(µ_k, σ_k) : k ∈ {0, ..., g}} be the corresponding set of Gaussians. Then:

p(˚X_{i_0} = ˚x_{i_0} | MB(˚X_{i_0}) = j) = ∑_{k=0}^g π_{jk} f(˚x_{i_0} | θ_k)   ∀ j ∈ {1, ..., q_{i_0}},

where f(·|θ_k) represents the density of the normal distribution of parameters
θ_k = (µ_k, σ_k), and π_{jk} represents the weights of the mixture (with the constraints
that π_{jk} ≥ 0 for all j, k and ∑_{k=0}^g π_{jk} = 1 for all j). Remember that each value
of MB(˚X_{i_0}) induces its own set of weights {π_{j0}, ..., π_{jg}}. Now, we propose to
estimate parameters θ_k from ˚D by maximum likelihood. For this, EM is well
known to efficiently provide good approximations [3] (due to the mixture, the
direct maximum likelihood is actually hard to estimate). Assuming that the data
in ˚D are i.i.d., the log-likelihood of ˚D given Θ = ⋃_{k=0}^g (⋃_{j=1}^{q_{i_0}} {π_{jk}} ∪ {θ_k}) is
equal to:

L(˚D|Θ) = ∑_{m=1}^N log p(˚X_{i_0} = ˚x_{i_0}^{(m)} | MB(˚x_{i_0})^{(m)}, Θ),

where ˚x_{i_0}^{(m)} represents the observed value of ˚X_{i_0} in the mth record of ˚D. Thus:

L(˚D|Θ) = ∑_{j=1}^{q_{i_0}} ∑_{m: MB(˚x_{i_0})^{(m)} = j} log [ ∑_{k=0}^g π_{jk} f(˚x_{i_0}^{(m)} | θ_k) ].   (5)
To solve Argmax_Θ L(˚D|Θ), EM [3] iteratively alternates expectations (E-step)
and maximizations (M-step) until convergence toward a local maximum of the
log-likelihood function. In this paper, we just need to apply the standard
EM, considering for the weights π_{jk} only the records in the database that
correspond to MB(˚x_{i_0})^{(m)} = j. More precisely, for each record of ˚D, let Z^{(m)}
be a random variable whose domain is {0, ..., g}, and such that Z^{(m)} = k if and
only if observation ˚x_{i_0}^{(m)} has been generated from the kth Gaussian. Let
Q_m^t(Z^{(m)}) = P(Z^{(m)} | ˚x_{i_0}^{(m)}, Θ^t), i.e., Q_m^t(Z^{(m)}) represents the distribution
that, at the tth step of the algorithm, ˚x_{i_0}^{(m)} is believed to have been generated
by such and such Gaussian. Then, EM is described in Algo. 3.
In the EM algorithm, only the M-step can be computationally intensive.
Fortunately, here, we can derive in closed form the optimal values of Line 4:

Proposition 1. At the E-step, probability

Q_m^{t+1}(k) = π_{jk}^t f(˚x_{i_0}^{(m)} | θ_k^t) / ∑_{k'=0}^g π_{jk'}^t f(˚x_{i_0}^{(m)} | θ_{k'}^t),

where π_{jk}^t and θ_k^t are the weights, means and variances in Θ^t. The optimal
parameters of the M-step are respectively:

π_{jk}^{t+1} = ∑_{m: MB(˚x_{i_0})^{(m)} = j} Q_m^{t+1}(k) / ∑_{m: MB(˚x_{i_0})^{(m)} = j} ∑_{k'=0}^g Q_m^{t+1}(k'),

µ_k^{t+1} = ∑_{m=1}^N Q_m^{t+1}(k) ˚x_{i_0}^{(m)} / ∑_{m=1}^N Q_m^{t+1}(k),

σ_k^{t+1} = sqrt( ∑_{m=1}^N Q_m^{t+1}(k) (˚x_{i_0}^{(m)} − µ_k^{t+1})² / ∑_{m=1}^N Q_m^{t+1}(k) ).
Using Algo. 3 with the formulas of Proposition 1, it is thus possible to
determine the means and variances of the Gaussians. However, our ultimate goal is not
to compute them but to exploit them to discretize variable ˚X_{i_0}, i.e., to determine
the best cut points t_1, ..., t_g. Let us see how this task can be performed.
Input: a database ˚D, a number g of cut points
Output: an optimal set of parameters Θ
1 Select (randomly) an initial value Θ^0
2 repeat
      // E-step (expectation)
3     Q_m^{t+1}(Z^{(m)}) ← P(Z^{(m)} | ˚x_{i_0}^{(m)}, Θ^t)   ∀ m ∈ {1, ..., N}
      // M-step (maximization)
4     Θ^{t+1} ← Argmax_Θ ∑_{j=1}^{q_{i_0}} ∑_{m: MB(˚x_{i_0})^{(m)} = j} ∑_{k=0}^g Q_m^{t+1}(k) log [ π_{jk} f(˚x_{i_0}^{(m)} | θ_k) / Q_m^{t+1}(k) ]
5 until convergence;
Algorithm 3: The EM algorithm.
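As an illustrative sketch (ours, not the paper's code), the E- and M-steps of Algo. 3 with the closed-form updates of Proposition 1 can be written for the simplest case q_{i_0} = 1, i.e., a single Markov-blanket configuration and hence a single weight vector; the quantile-based initialization is an arbitrary choice of ours:

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_gmm_1d(xs, g, n_iter=50):
    """EM for a 1-D mixture of g+1 Gaussians (the q_i0 = 1 case of Algo. 3).
    Returns (weights, means, sigmas)."""
    K = g + 1
    xs_sorted = sorted(xs)
    # Deterministic initialization (our choice): means at evenly spaced quantiles.
    mus = [xs_sorted[(2 * k + 1) * len(xs) // (2 * K)] for k in range(K)]
    spread = (xs_sorted[-1] - xs_sorted[0]) / (2 * K) or 1.0
    sigmas = [spread] * K
    pis = [1.0 / K] * K
    for _ in range(n_iter):
        # E-step: responsibilities Q_m(k) proportional to pi_k * f(x_m | theta_k).
        Q = []
        for x in xs:
            w = [pis[k] * normal_pdf(x, mus[k], sigmas[k]) for k in range(K)]
            s = sum(w) or 1e-300
            Q.append([wk / s for wk in w])
        # M-step: closed-form updates of Proposition 1.
        for k in range(K):
            nk = sum(q[k] for q in Q)
            if nk == 0:
                continue
            pis[k] = nk / len(xs)
            mus[k] = sum(q[k] * x for q, x in zip(Q, xs)) / nk
            var = sum(q[k] * (x - mus[k]) ** 2 for q, x in zip(Q, xs)) / nk
            sigmas[k] = max(math.sqrt(var), 1e-6)
    return pis, mus, sigmas

# Toy usage: 400 points from two well-separated Gaussians; with g = 1 cut point
# (two components), EM should recover means near 0 and 10.
rng = random.Random(42)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(10.0, 1.0) for _ in range(200)]
pis, mus, sigmas = em_gmm_1d(xs, g=1)
```

The general q_{i_0} > 1 case keeps one weight vector per Markov-blanket value j but shares the (µ_k, σ_k) across values of j, exactly as in Proposition 1.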
3.4 Determination of the Cut Points

As mentioned at the end of Subsection 3.1, each Gaussian N(µ_k, σ_k) is associated
with an interval [t_k, t_{k+1})² and the parts of the Gaussian outside the interval
can be considered as a loss of information. The optimal set of cut points
T̂ = {t̂_1, ..., t̂_g} is thus that which minimizes this loss. In other words, it is equal to:

T̂ = Argmin_{t_1,...,t_g} ∑_{k=1}^g [ ∫_{t_k}^{+∞} f(x|θ_{k−1}) dx + ∫_{−∞}^{t_k} f(x|θ_k) dx ],
where θ_k represents the pair (µ_k, σ_k). As each Gaussian N(µ_k, σ_k) is associated
with interval [t_k, t_{k+1}), we can assume that t̂_k ∈ [µ_{k−1}, µ_k) for all k. Therefore:

T̂ = { Argmin_{t_k ∈ [µ_{k−1}, µ_k)} ∫_{t_k}^{+∞} f(x|θ_{k−1}) dx + ∫_{−∞}^{t_k} f(x|θ_k) dx : k ∈ {1, ..., g} }.   (6)

All the t̂_k can thus be determined independently. In addition, as shown below,
their values are the solutions of a quadratic equation:
Proposition 2. Let u(t_k) represent the sum of the integrals in Eq. (6). Let α_k
be a solution (if any) within interval (µ_{k−1}, µ_k) of the quadratic equation in t_k:

t_k² (1/σ_{k−1}² − 1/σ_k²) + 2 t_k (µ_k/σ_k² − µ_{k−1}/σ_{k−1}²) + µ_{k−1}²/σ_{k−1}² − µ_k²/σ_k² − 2 log(σ_k/σ_{k−1}) = 0.   (7)

Then t̂_k is, among {µ_{k−1}, µ_k, α_k}, the element with the smallest value of u(·)
(which can be quickly approximated using a table of the Normal distribution).
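Proposition 2 can be sketched as follows (our illustration): solve the quadratic Eq. (7) for the root(s) inside (µ_{k−1}, µ_k) and keep, among the interval bounds and these roots, the candidate minimizing the loss u(·) of Eq. (6), computed here with the exact Normal CDF (via math.erf) rather than a table:

```python
import math

def cut_point(mu0, s0, mu1, s1):
    """Cut point between adjacent Gaussians N(mu0, s0) and N(mu1, s1),
    mu0 < mu1, following Prop. 2: solve Eq. (7), then keep the candidate
    in {mu0, mu1, roots in (mu0, mu1)} minimizing the information loss u."""
    def u(t):  # mass of the left Gaussian above t + mass of the right below t
        F = lambda x, mu, s: 0.5 * (1 + math.erf((x - mu) / (s * math.sqrt(2))))
        return (1 - F(t, mu0, s0)) + F(t, mu1, s1)
    # Coefficients of Eq. (7): a*t^2 + b*t + c = 0.
    a = 1 / s0 ** 2 - 1 / s1 ** 2
    b = 2 * (mu1 / s1 ** 2 - mu0 / s0 ** 2)
    c = mu0 ** 2 / s0 ** 2 - mu1 ** 2 / s1 ** 2 - 2 * math.log(s1 / s0)
    candidates = [mu0, mu1]
    if abs(a) < 1e-12:                     # equal variances: Eq. (7) is linear
        if abs(b) > 1e-12:
            candidates.append(-c / b)
    else:
        disc = b * b - 4 * a * c
        if disc >= 0:
            for sign in (1, -1):
                r = (-b + sign * math.sqrt(disc)) / (2 * a)
                if mu0 < r < mu1:
                    candidates.append(r)
    return min(candidates, key=u)
```

With equal variances the cut point falls at the midpoint of the two means, as expected; with unequal variances it shifts toward the narrower Gaussian.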
Proof. Let g(·) and h(·) be two functions such that ∂g(x)/∂x = f(x|θ_{k−1}) and
∂h(x)/∂x = f(x|θ_k). Then:

t̂_k = Argmin_{t_k ∈ [µ_{k−1}, µ_k)} u(t_k)
    = Argmin_{t_k ∈ [µ_{k−1}, µ_k)} [ ∫_{t_k}^{+∞} (∂g(x)/∂x) dx + ∫_{−∞}^{t_k} (∂h(x)/∂x) dx ]
    = Argmin_{t_k ∈ [µ_{k−1}, µ_k)} [ −g(t_k) + h(t_k) + lim_{t→+∞} (g(t) − h(−t)) ].
Let us relax the optimization problem and try to find the Argmin over R. Then
the min is obtained when ∂u(t_k)/∂t_k = 0 or, equivalently, when
∂(−g(t_k) + h(t_k))/∂t_k = −f(t_k|θ_{k−1}) + f(t_k|θ_k) = 0. Since f(·|θ) represents
the density of the Normal distribution of parameters θ, this is equivalent to:

− (1/(√(2π) σ_{k−1})) exp[−(1/2) ((t_k − µ_{k−1})/σ_{k−1})²] + (1/(√(2π) σ_k)) exp[−(1/2) ((t_k − µ_k)/σ_k)²] = 0,
or, equivalently:
² Without loss of generality, we consider here that the µ_k's resulting from the EM
algorithm are sorted in increasing order.
σ_k / σ_{k−1} = exp[−(1/2) ((t_k − µ_k)/σ_k)²] / exp[−(1/2) ((t_k − µ_{k−1})/σ_{k−1})²]
            = exp[ (1/2) ((t_k − µ_{k−1})/σ_{k−1})² − (1/2) ((t_k − µ_k)/σ_k)² ],

which, by a log transformation, is equivalent to:

2 log(σ_k/σ_{k−1}) = t_k²/σ_{k−1}² − 2µ_{k−1}t_k/σ_{k−1}² + µ_{k−1}²/σ_{k−1}² − t_k²/σ_k² + 2µ_k t_k/σ_k² − µ_k²/σ_k².
This corresponds precisely to Eq. (7). So, to summarize, if the optimal solution
lies inside interval (µ_{k−1}, µ_k), then it satisfies Eq. (7). Otherwise, u(t_k)
is either strictly increasing or strictly decreasing within (µ_{k−1}, µ_k), which implies
that the optimal solution for t̂_k is either µ_{k−1} or µ_k, which completes the proof.
3.5 Score and Number of Cut Points

To complete the description of the algorithm, there remains to determine the
number of cut points. Of course, the higher the number of cut points, the higher
the likelihood but the lower the compactness of the representation. To reach
a good trade-off, we simply propose to exploit the penalty functions included
in the score used for the evaluation of the different BN structures (see Line 5 of
Algo. 1). Here, we used the BIC score [18], which can be locally expressed as:

BIC(˚X_{i_0} | MB(˚X_{i_0})) = L(˚D|Θ) − (|Θ|/2) log(N),   (8)

where L(˚D|Θ) is the log-likelihood with the parameters estimated by EM, given
the current structure G. |Θ| represents the number of parameters, i.e.,
|Θ| = q_{i_0} × g + 2 × (g + 1): the 1st and 2nd terms correspond to the number of
parameters π_{jk} and of (µ_k, σ_k) needed to encode the conditional distributions
(recall that there are g + 1 Gaussians and q_{i_0} represents the domain size of
MB(˚X_{i_0})). Now, the best number of cut points is simply that which optimizes Eq. (8).
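The selection of g by Eq. (8) can be sketched as follows (our illustration; in practice the log-likelihoods would come from the EM runs of Section 3.3, whereas here they are hypothetical numbers):

```python
import math

def bic(loglik, g, q, N):
    """Local BIC of Eq. (8): penalize the q*g free mixture weights and the
    2*(g+1) Gaussian parameters (one mean and one sigma per interval)."""
    n_params = q * g + 2 * (g + 1)
    return loglik - 0.5 * n_params * math.log(N)

def best_g(logliks_by_g, q, N):
    """Number of cut points maximizing the BIC score; `logliks_by_g[g]` is
    the EM log-likelihood obtained with g cut points."""
    return max(logliks_by_g, key=lambda g: bic(logliks_by_g[g], g, q, N))
```

Typically the likelihood keeps improving with g while the penalty grows linearly, so the maximizer is finite and small.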
4 Experiments
In this section, we highlight the effectiveness of our method, hereafter denoted
MGCD (for Mixture of Gaussians Clustering-based Discretization), by compar-
ing it with the algorithms provided in [17] and [6], hereafter called Ruichu and
Friedman respectively. Step 4 of Algo. 1 was performed using a simple Tabu
search method. For the comparisons, three criteria have been taken into account:
i) the quality of the structure learnt by the algorithm (which strongly depends
on that of the discretization); ii) the computation time and iii) the quality of
the learnt CPT parameters, which has been evaluated by their prediction power
on the values taken by some variables given observations.
For the first two criteria, we randomly generated discrete BNs following the
guidelines given in [7]. Those contained from 10 to 30 nodes and from 12 to 56
arcs. Each node had at most 6 parents and its domain size was randomly chosen
between 2 and 5. The CPTs of these BNs represented the π_{jk} of the preceding
section. From these BNs, we generated continuous datasets containing from 1000
to 10000 records as follows: for each random variable X_i, we mapped its finite set
of values into a set of consecutive intervals {[t_{k−1}, t_k)}_{k=1}^{|X_i|} of arbitrary
lengths. Then, we assigned a truncated Gaussian to each interval, the parameters
(µ_k, σ_k) of which were randomly chosen. Finally, to generate a continuous record,
we first generated a discrete record from the discrete BN using a logic sampling
algorithm. Then, this record was mapped into a continuous one by sampling from
the truncated Gaussians. Overall, 350 continuous datasets were generated.

[Figure omitted: TPR (left) and FPR (right) curves for MGCD, Ruichu and Friedman.]
Fig. 3. Averages of the TPR (left) and FPR (right) metrics for BNs with 10 to 30
nodes as a function of the sample sizes.
To compare them, the BN structures produced by Ruichu, Friedman and
MGCD were converted into their Markov equivalence class, i.e., into a partially
directed DAG (CPDAG). Such a transformation increases the quality of com-
parisons since two BNs encode the same distribution iff they belong to the same
equivalence class. The CPDAGs were then compared w.r.t. their true and false
positive rate metrics (TPR and FPR). TPR (resp. FPR) represents the percent-
age of arcs/edges belonging to the learnt CPDAG that also exist (resp. do not
exist) in the original CPDAG. Both metrics describe how well the dependences
between variables are preserved by learning/discretization. Fig. 3 shows the average TPR and FPR over the 350 generated databases. As can be seen, MGCD outperforms the others for all dataset sizes: MGCD's TPR is about 10% higher than Ruichu's and 40% higher than Friedman's, and MGCD's FPR is between 20% and 40% lower than both other methods'. MGCD's improvement over Ruichu's method can be explained by the fact that, unlike the latter, it fully takes into account the conditional dependences between all the random variables. Its improvement over Friedman's can be explained by our choice of exploiting clustering rather than an entropy-based approach. Table 1 provides computation time ratios (other method's runtime / MGCD's runtime). As can be seen, our method's runtime is comparable to or slightly better than Ruichu's (while being 10% better in terms of TPR and more than 20% better in terms of FPR) and it is significantly faster than Friedman's (about 3 times) while at the same time being 40% higher in terms of TPR.
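To make the metrics concrete, here is a minimal sketch of one common way to compute TPR/FPR from two CPDAGs, comparing adjacencies irrespective of orientation. The paper's exact computation may differ in detail, and all names below are ours.

```python
from itertools import combinations

def skeleton(edges):
    """Adjacencies of a CPDAG, ignoring edge orientation."""
    return {frozenset(e) for e in edges}

def tpr_fpr(true_edges, learnt_edges, nodes):
    """TPR: fraction of true adjacencies recovered in the learnt graph.
    FPR: fraction of truly absent pairs that the learnt graph connects."""
    truth, learnt = skeleton(true_edges), skeleton(learnt_edges)
    pairs = {frozenset(p) for p in combinations(nodes, 2)}
    tp = len(truth & learnt)          # correctly recovered adjacencies
    fp = len(learnt - truth)          # spurious adjacencies
    negatives = len(pairs - truth)    # pairs absent from the true graph
    return tp / len(truth), (fp / negatives if negatives else 0.0)
```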
Approaches / Dataset sizes 1000 5000 7500 10000
Friedman 2.762444 3.350782 3.404958 3.361540
Ruichu 0.8872389 1.1402535 1.1334637 1.1032982
Table 1. Runtime ratio comparisons between discretization approaches.
Finally, we compared the discretizations w.r.t. the quality of the produced CPTs. To do so, we generated 100 continuous databases from two classical BNs, Child and Sachs, using the same process as above, except that: i) the distributions inside intervals were uniform instead of Gaussian (to penalize our approach, since the data then do not fit its hypotheses), and ii) some small sets of variables were kept discrete and served as multilabel targets. Databases were split
into a learning (2/3) and a test (1/3) part. For each record in the latter, we
computed the distribution (learnt by each of the 3 algorithms on the learning
database) of each target given some observations on their Markov blanket and
we estimated the value of the target by sampling it from the learnt distribution.
The percentages of correct predictions are shown in Table 2. As can be seen, our algorithm outperforms the other two, especially Ruichu's, whose univariate discretization fails to yield accurate predictions because it does not take into account the conditional dependences among random variables. Friedman's results are closer to ours, but recall that his algorithm is about 3 times slower.
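The prediction protocol described above can be sketched as follows; `posterior_fn` is a stand-in for inference in the learnt BN, and all names are ours.

```python
import random

def predict_by_sampling(posterior):
    """Sample a value for the target from its learnt posterior distribution,
    given as a dict {value: probability}."""
    values, probs = zip(*posterior.items())
    return random.choices(values, weights=probs, k=1)[0]

def accuracy(test_records, posterior_fn, target):
    """Fraction of test records whose sampled prediction matches the
    record's actual target value; posterior_fn(evidence) must return
    P(target | evidence) as a dict."""
    hits = 0
    for rec in test_records:
        # everything except the target serves as evidence
        evidence = {v: x for v, x in rec.items() if v != target}
        if predict_by_sampling(posterior_fn(evidence)) == rec[target]:
            hits += 1
    return hits / len(test_records)
```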
5 Conclusion
We have proposed a new multivariate discretization algorithm designed for BN
structure learning, taking into account the dependences among variables acquired during learning. Our experiments highlighted its efficiency and effectiveness compared to state-of-the-art algorithms, but more experiments are of course needed to better assess the strengths and the shortcomings of our proposed approach. For future work, we plan to improve our algorithm, notably by working directly with truncated Gaussians instead of the current approximation by mixtures of Gaussians. Such an improvement is not trivial, however, because no closed-form solution then exists for determining the cut points.
dataset     30% Markov blanket        60% Markov blanket        100% Markov blanket
sizes     MGCD   Ruichu Friedman    MGCD   Ruichu Friedman    MGCD   Ruichu Friedman
Child
 1000     60.90  59.30  58.99       62.76  60.90  59.96       67.53  65.34  63.33
 2000     61.59  56.62  58.05       62.41  59.24  60.55       67.29  64.71  63.03
 5000     64.88  62.29  60.05       66.07  62.95  61.94       69.42  65.39  63.82
 10000    65.81  62.48  61.75       67.44  63.85  63.51       70.49  66.92  65.79
Sachs
 1000     56.63  54.78  57.59       57.22  55.04  58.74       65.67  61.06  64.65
 2000     56.96  56.16  54.02       59.72  57.58  56.64       65.80  62.24  60.22
 5000     57.69  55.00  55.15       59.80  57.96  56.38       65.51  64.15  64.47
 10000    60.35  57.50  57.33       61.67  58.26  59.22       70.04  65.74  64.61
Table 2. Prediction accuracy rates for discrete target variables in the Child and Sachs
standard BNs (http://www.bnlearn.com/bnrepository/) w.r.t. the percentage of ob-
served variables in the Markov blanket.
References
1. Boullé, M.: MODL: a Bayes optimal discretization method for continuous attributes. Machine Learning 65(1), 131–165 (2006)
2. Boullé, M.: Khiops: A statistical discretization method of continuous attributes. Machine Learning 55(1), 53–69 (2004)
3. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39(1), 1–38 (1977)
4. Dougherty, J., Kohavi, R., Sahami, M.: Supervised and unsupervised discretization of continuous features. In: Proc. of ICML'95. pp. 194–202 (1995)
5. Fayyad, U., Irani, K.: Multi-interval discretization of continuous-valued attributes for classification learning. In: Proc. of IJCAI'93. pp. 1022–1029 (1993)
6. Friedman, N., Goldszmidt, M.: Discretizing continuous attributes while learning Bayesian networks. In: Proc. of ICML'96. pp. 157–165 (1996)
7. Ide, J.S., Cozman, F.G., Ramos, F.T.: Generating random Bayesian networks with constraints on induced width. In: Proc. of ECAI'04. pp. 323–327 (2004)
8. Jiang, S., Li, X., Zheng, Q., Wang, L.: Approximate equal frequency discretization method. In: Proc. of GCIS'09. pp. 514–518 (2009)
9. Kerber, R.: ChiMerge: Discretization of numeric attributes. In: Proc. of AAAI'92. pp. 123–128 (1992)
10. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press (2009)
11. Kwedlo, W., Krętowski, M.: An evolutionary algorithm using multivariate discretization for decision rule induction. In: Proc. of PKDD'99. pp. 392–397 (1999)
12. Lauritzen, S., Wermuth, N.: Graphical models for associations between variables, some of which are qualitative and some quantitative. Annals of Statistics 17(1), 31–57 (1989)
13. Monti, S., Cooper, G.: A multivariate discretization method for learning Bayesian networks from mixed data. In: Proc. of UAI'98. pp. 404–413 (1998)
14. Monti, S., Cooper, G.: A latent variable model for multivariate discretization. In: Proc. of AIS'99. pp. 249–254 (1999)
15. Moral, S., Rumí, R., Salmerón, A.: Mixtures of truncated exponentials in hybrid Bayesian networks. In: Proc. of ECSQARU'01. Lecture Notes in Artificial Intelligence, vol. 2143, pp. 156–167 (2001)
16. Ratanamahatana, C.: CloNI: Clustering of sqrt(n)-interval discretization. In: Proc. of Int. Conf. on Data Mining & Comm Tech. (2003)
17. Ruichu, C., Zhifeng, H., Wen, W., Lijuan, W.: Regularized Gaussian mixture model based discretization for gene expression data association mining. Applied Intelligence 39(3), 607–613 (2013)
18. Schwarz, G.: Estimating the dimension of a model. Annals of Statistics 6(2), 461–464 (1978)
19. Shachter, R.: Evaluating influence diagrams. Operations Research 34(6), 871–882 (1986)
20. Shenoy, P., West, J.: Inference in hybrid Bayesian networks using mixtures of polynomials. International Journal of Approximate Reasoning 52(5), 641–657 (2011)
21. Song, D., Ek, C., Huebner, K., Kragic, D.: Multivariate discretization for Bayesian network structure learning in robot grasping. In: Proc. of ICRA'11. pp. 1944–1950 (2011)
22. Zighed, D., Rabaséda, S., Rakotomalala, R.: FUSINTER: a method for discretization of continuous attributes. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6(3), 307–326 (1998)