Calibrating Probability with Undersampling
for Unbalanced Classification
Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson, Gianluca Bontempi
Machine Learning Group, Computer Science Department, Université Libre de Bruxelles, Brussels, Belgium.
Email: adalpozz@ulb.ac.be
Fraud Risk Management Analytics, Worldline S.A., Brussels, Belgium.
Email: olivier.caelen@worldline.com
iCeNSA, Computer Science and Engineering Department, University of Notre Dame, Notre Dame IN, USA.
Email: rjohns15@nd.edu
Interuniversity Institute of Bioinformatics in Brussels (IB)², Brussels, Belgium.
Email: gbonte@ulb.ac.be
Abstract—Undersampling is a popular technique for unbal-
anced datasets to reduce the skew in class distributions. However,
it is well-known that undersampling one class modifies the
priors of the training set and consequently biases the poste-
rior probabilities of a classifier [9]. In this paper, we study
analytically and experimentally how undersampling affects the
posterior probability of a machine learning model. We formalize
the problem of undersampling and explore the relationship
between conditional probability in the presence and absence of
undersampling. Although the bias due to undersampling does not
affect the ranking order returned by the posterior probability, it
significantly impacts the classification accuracy and probability
calibration. We use Bayes Minimum Risk theory to find the
correct classification threshold and show how to adjust it after
undersampling. Experiments on several real-world unbalanced
datasets validate our results.
I. INTRODUCTION
In several binary classification problems, the two classes
are not equally represented in the dataset. For example, in fraud
detection, fraudulent transactions are normally outnumbered
by genuine ones [8]. When one class is underrepresented in a
dataset, the data is said to be unbalanced. In such problems,
typically, the minority class is the class of interest. Having
few instances of one class means that the learning algorithm
is often unable to generalize the behavior of the minority
class well, hence the algorithm performs poorly in terms of
predictive accuracy [16].
A common strategy for dealing with unbalanced classi-
fication tasks is to under-sample the majority class in the
training set before learning a classifier [1]. The assumption
behind this strategy is that in the majority class there are
many redundant observations and randomly removing some
of them does not change the estimation of the within-class
distribution. If we make the assumption that training and
testing sets come from the same distribution, then when the
training is unbalanced, the testing set has a skewed distribution
as well. By removing majority class instances, the training set
is artificially rebalanced. As a consequence, we obtain different
distributions for the training and testing sets, violating the basic
assumption in machine learning that the training and testing
sets are drawn from the same underlying distribution.
In this paper, we study the impact of the bias introduced
by undersampling on classification tasks with unbalanced data.
We start by discussing literature results showing how the
posterior probability of an algorithm learnt in the presence
of undersampling is related to the conditional probability of
the original distribution. Using synthetic data we see that the
larger the overlap between the two within-class distributions
(i.e. the greater the non-separability of the classification task),
the larger the bias in the posterior probability. The mismatch
between the posterior probability obtained with the original
dataset and after undersampling is assessed in terms of loss
measure (Brier Score), predictive accuracy (G-mean) and rank-
ing (AUC).
Based on the previous works of Saerens et al. [21] and
Elkan [13], we propose an analytical method to correct the
bias introduced by undersampling that can produce well-
calibrated probabilities. The method is equivalent to adjusting
the posterior probability in the presence of new priors. The use
of unbiased probability estimates requires an adjustment to the
probability threshold used to classify instances. When using
class priors as misclassification costs, we show that this new
threshold corresponds to the one used before undersampling.
In order to have complete control over the data generation
process, we have first recourse to synthetic datasets. This
allows us to simulate problems of different difficulty and see
the impact of undersampling on the probability estimates. To
confirm the results obtained with the simulated data, we also
run our experiments on several UCI datasets and a real-world
fraud detection dataset made available to the public.
This paper has the following contributions. First, we review
how undersampling can induce a bias in the posterior proba-
bilities generated by machine learning methods. Second, we
leverage this understanding to develop an analytical method
that can counter and reduce this bias. Third, we show how
to use unbiased probability estimates for decision making in
unbalanced classification. We note that while the framework
we derive in this work is theoretically equivalent to the
problem of a change in class priors [21], our perspective is
different. We interpret undersampling as a problem of sample
selection bias, wherein the bias is not intrinsic to the data but
rather introduced artificially [19].
The paper is organized as follows. Section II introduces
some well-known methods for unbalanced datasets and sec-
tion III formalizes the sampling selection bias due to under-
sampling. Undersampling is responsible for a shift in the pos-
terior probability which leads to biased probability estimates,
for which we propose a corrective method. Section IV shows
how to set the classification threshold to take into account
the change in the priors. Finally, section VI uses real-world
datasets to validate the probability transformation presented in
section III and the use of the classification threshold proposed
in section IV.
II. SAMPLING FOR UNBALANCED CLASSIFICATION
Let us consider a binary classification task where the
distribution of the target class is highly skewed. When the
data is unbalanced, standard machine learning algorithms that
maximise overall accuracy tend to classify all observations
as majority class instances [16]. This translates into poor
accuracy on the minority class (low recall), which is typically
the class of interest. There are several methods that deal with
this problem, which we can distinguish between methods that
operate at the data and algorithmic levels [6].
At the data level, the unbalanced strategies are used as a
pre-processing step to re-balance the two classes before any
algorithm is applied. At the algorithmic level, algorithms are
themselves adjusted to deal with the minority class detec-
tion [2]. Here we will restrict ourselves to consider a subset
of data-level methods known as sampling techniques.
Undersampling [11] consists of down-sizing the majority
class by removing observations at random until the dataset
is balanced. In an unbalanced problem, it is often realistic
to assume that many observations of the majority class are
redundant and that by removing some of them at random the
data distribution will not change significantly. However the
risk of removing relevant observations from the dataset is still
present, since the removal is performed in an unsupervised
manner. In practice, this technique is often adopted since it is
simple and speeds up the learning phase.
Oversampling [11] consists of up-sizing the minority class
at random, decreasing the level of class imbalance. By repli-
cating the minority class until the two classes have equal
frequency, oversampling increases the risk of over-fitting by
biasing the model towards the minority class. Other drawbacks
of the approach are that it does not add any new valuable
minority examples and that it increases the training time. This
can be particularly ineffective when the original dataset is
fairly large.
SMOTE [7] over-samples the minority class by generating
synthetic minority examples in the neighborhood of observed
ones. The idea is to form new minority examples by interpo-
lating between examples of the same class. This has the effect
of creating clusters around each minority observation.
In this paper we focus on understanding how under-
sampling affects the posterior probability of a classification
algorithm.
III. THE IMPACT OF SAMPLING ON POSTERIOR
PROBABILITIES
In binary classification we typically learn a model on
training data and use it to generate predictions (class or
posterior probability) on a testing set with the assumption that
both come from the same distribution. When this assumption
does not hold, we encounter the so-called problem of sampling
selection bias [19]. Sampling selection bias can occur due to
a bad choice of the training set. For example, consider the
problem where a bank wants to predict whether someone who
is applying for a credit card will be able to repay the credit at
the end of the month. The bank has data available on customers
whose applications have been approved, but has no information
on rejected customers. This means that the data available to
the bank is a biased sample of the whole population. The bias
in this case is intrinsic to the dataset collected by the bank.
A. Sample Selection Bias due to undersampling
Rebalancing unbalanced data is just the sample selection
bias problem with a known selection bias introduced by design
(rather than by constraint or accident) [19]. In this section,
we investigate the sampling selection bias that occurs when
undersampling a skewed training set.
To begin, let us consider a binary classification task where
the goal is to learn a classifier f : R^n → {0, 1}, where X ⊆ R^n is the input domain and Y ∈ {0, 1} the output domain. Let us call class 0 negative and class 1 positive. Further, assume that the number of positive observations is small compared to the number of negatives, with rebalancing performed via undersampling.

Let us denote as (X, Y) the original unbalanced training sample and as (X_s, Y_s) a balanced sample of (X, Y); that is, (X_s, Y_s) ⊂ (X, Y) contains all the positives and a subset of the negatives in (X, Y). Let us define s as a random binary selection variable for each of the N samples in (X, Y), which takes the value 1 if the point is in (X_s, Y_s) and 0 otherwise. It is possible to derive the relationship between the posterior probability of a model learnt on the balanced subset and the one learnt on the original unbalanced dataset.
We assume that the selection variable s is independent of the input x given the class y (class-dependent selection): p(s|y, x) = p(s|y). This assumption implies p(x|y, s) = p(x|y), i.e. by removing observations at random from the majority class we do not change the within-class distributions. With undersampling there is a change in the prior probabilities (p(y|s = 1) ≠ p(y)) and, as a consequence, the posterior probabilities are different as well: p(y|x, s = 1) ≠ p(y|x). The probability that a point (x, y) is included in the balanced training sample is given by p(s = 1|y, x). Let the sign + denote y = 1 and − denote y = 0, e.g. p(+, x) = p(y = 1, x) and p(−, x) = p(y = 0, x). From Bayes' rule, using p(s|y, x) = p(s|y), we can write:

p(+|x, s = 1) = p(s = 1|+) p(+|x) / [ p(s = 1|+) p(+|x) + p(s = 1|−) p(−|x) ]    (1)
As shown in our previous work [9], since p(s = 1|+) = 1 we can write (1) as:

p(+|x, s = 1) = p(+|x) / [ p(+|x) + p(s = 1|−) p(−|x) ]    (2)

Let us denote by β = p(s = 1|−) the probability of selecting a negative instance with undersampling, by p = p(+|x) the posterior probability of the positive class on the original dataset, and by p_s = p(+|x, s = 1) the posterior probability after sampling. We can rewrite equation (2) as:

p_s = p / ( p + β (1 − p) )    (3)

Using (3) we can obtain an expression of p as a function of p_s:

p = β p_s / ( β p_s − p_s + 1 )    (4)
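As an illustration of equations (3) and (4), the following R sketch (with hypothetical probability values) warps a vector of original posteriors p into p_s for a given β and then inverts the transformation to recover p exactly.

  # Warp of the posterior probability caused by undersampling, eq. (3),
  # and its inverse, eq. (4)
  warp_posterior   <- function(p, beta)  p / (p + beta * (1 - p))
  unwarp_posterior <- function(ps, beta) beta * ps / (beta * ps - ps + 1)

  beta <- 1000 / 9000                 # beta = N+/N- for a 10% positive dataset
  p    <- c(0.05, 0.10, 0.50, 0.90)   # hypothetical original posteriors
  ps   <- warp_posterior(p, beta)     # inflated posteriors after undersampling
  round(ps, 3)                        # 0.321 0.500 0.900 0.988
  all.equal(unwarp_posterior(ps, beta), p)   # TRUE: eq. (4) inverts eq. (3)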
Balancing an unbalanced problem corresponds to the case β = p(+)/p(−) ≈ N+/N−, where N+ and N− denote the number of positive and negative instances in the dataset. In the following we will assume that N+/N− provides an accurate estimation of the ratio of the prior probabilities. For such a level of β, a small variation at high values of p_s induces a large change in p, while the opposite occurs for small values of p_s [9]. When β = 1, all the negative instances are used for training, while for β < 1 a subset of the negative instances is included in the training set. As β decreases towards N+/N−, the resulting training set becomes more balanced. Note that N+/N− is the minimum value for β, as for β < N+/N− we would have more positives than negatives.

Let us suppose we have an unbalanced problem where the positives account for 10% of 10,000 observations (i.e., we have 1,000 positives and 9,000 negatives). Suppose we want to have a balanced dataset: β = N+/N− ≈ 0.11, so that 88.9% (8,000/9,000) of the negative instances are discarded. Table I shows how, by reducing β, the original unbalanced dataset becomes more balanced and smaller as negative instances are removed. After undersampling, the number of negatives is N−_s = βN−, while the number of positives stays the same, N+_s = N+. The percentage of negatives (perc) in the dataset decreases as N−_s → N+.
TABLE I. Undersampling a dataset with 1,000 positives in 10,000 observations. N_s defines the size of the dataset after undersampling and N−_s (N+_s) the number of negative (positive) instances for a given β. When β = 0.11 the negative samples represent 50% of the observations in the dataset.

N_s     N−_s    N+_s    β      perc
2,000   1,000   1,000   0.11   50.00
2,800   1,800   1,000   0.20   64.29
3,700   2,700   1,000   0.30   72.97
4,600   3,600   1,000   0.40   78.26
5,500   4,500   1,000   0.50   81.82
6,400   5,400   1,000   0.60   84.38
7,300   6,300   1,000   0.70   86.30
8,200   7,200   1,000   0.80   87.80
9,100   8,100   1,000   0.90   89.01
10,000  9,000   1,000   1.00   90.00
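The rows of Table I follow directly from the definitions N−_s = βN− and N+_s = N+; the short R sketch below (assuming the same 1,000/9,000 split) reproduces the table for the grid of β values.

  # Reproduce Table I: size and class proportions after undersampling
  N_pos <- 1000
  N_neg <- 9000
  beta  <- c(N_pos / N_neg, seq(0.2, 1.0, by = 0.1))   # first value ~0.11

  N_neg_s <- round(beta * N_neg)       # negatives kept after undersampling
  N_s     <- N_pos + N_neg_s           # size of the undersampled dataset
  perc    <- 100 * N_neg_s / N_s       # percentage of negative instances

  data.frame(Ns = N_s, Nneg_s = N_neg_s, Npos_s = N_pos,
             beta = round(beta, 2), perc = round(perc, 2))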
B. Bias and class separability
In this section we show how the impact of the bias depends on the degree of separability of the classification task. Let ω+ and ω− denote the class-conditional probabilities p(x|+) and p(x|−), and π+ (π+_s) the prior of the positive class before (after) undersampling. It is possible to derive the relation between the bias and the difference δ = ω+ − ω− between the class-conditional distributions. From Bayes' theorem we have:

p = ω+ π+ / ( ω+ π+ + ω− π− )    (5)

Since δ = ω+ − ω−, i.e. ω− = ω+ − δ, we can write (5) as:

p = ω+ π+ / ( ω+ π+ + (ω+ − δ) π− ) = ω+ π+ / ( ω+ (π+ + π−) − δ π− ) = ω+ π+ / ( ω+ − δ π− )    (6)

since π+ + π− = 1. Similarly, since ω+ does not change with undersampling:

p_s = ω+ π+_s / ( ω+ − δ π−_s )    (7)

Now we can write p_s − p as:

p_s − p = ω+ π+_s / ( ω+ − δ π−_s ) − ω+ π+ / ( ω+ − δ π− )    (8)

Since p_s ≥ p because of (3), 1 ≥ p_s ≥ 0 and 1 ≥ p ≥ 0, we have 1 ≥ p_s − p ≥ 0. In Figure 1 we plot p_s − p as a function of δ when π+_s = 0.5 and π+ = 0.1. For small values of the class-conditional densities it appears that the bias takes its highest values for δ close to zero. This means that the bias is higher for similar class-conditional probabilities (i.e. low-separability configurations).

Fig. 1. p_s − p as a function of δ, where δ = ω+ − ω−, for values of ω+ ∈ {0.01, 0.1} when π+_s = 0.5 and π+ = 0.1. Note that δ is upper bounded to guarantee 1 ≥ p_s ≥ 0 and 1 ≥ p ≥ 0.
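The bias p_s − p in (8) can be evaluated directly; the R sketch below reproduces the computation behind Figure 1 for, e.g., ω+ = 0.01, with π+_s = 0.5 and π+ = 0.1 as above (the grid of δ values is arbitrary).

  # Bias ps - p as a function of delta = omega+ - omega-, eqs. (6)-(8)
  bias_vs_delta <- function(omega_pos, delta, pi_pos = 0.1, pi_pos_s = 0.5) {
    pi_neg   <- 1 - pi_pos
    pi_neg_s <- 1 - pi_pos_s
    p  <- omega_pos * pi_pos   / (omega_pos - delta * pi_neg)    # eq. (6)
    ps <- omega_pos * pi_pos_s / (omega_pos - delta * pi_neg_s)  # eq. (7)
    ps - p                                                       # eq. (8)
  }

  omega_pos <- 0.01
  delta <- seq(-0.05, omega_pos, length.out = 200)  # delta is upper-bounded by omega+
  plot(delta, bias_vs_delta(omega_pos, delta), type = "l",
       xlab = "delta", ylab = "ps - p")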
C. Adjusting posterior probabilities to new priors
Equation (3) shows how the conditional distribution of the
balanced configuration relates to the conditional distribution in
the original unbalanced setting. However, after a classification
model is learnt on a balanced training set, it is normally used
to predict a testing set, which is likely to have an unbalanced
distribution similar to the original training set. This means that
the posterior probability of a model learnt on the balanced
training set should be adjusted for the change in priors between
the training and testing sets. In this paper we propose to use
equation (4) to correct the posterior probability estimates after
undersampling. Let us call p′ the bias-corrected probability obtained from p_s using (4):

p′ = β p_s / ( β p_s − p_s + 1 )    (9)
Equation (9) can be seen as a special case of the framework proposed by Saerens et al. [21] and Elkan [13] for correcting the posterior probability when the testing and training sets share the same priors (see Appendix). When the priors of the testing set are known, we can correct the probability with Elkan's and Saerens' equations. However, these priors are usually unknown and must be estimated. If we make the assumption that training and testing sets have the same priors, we can use (9) to calibrate p_s. Note that the above transformation does not affect the ranking produced by p_s: equation (9) defines a monotone transformation, hence the ranking of p_s is the same as that of p′. While p is estimated using all the samples in the unbalanced dataset, p_s and p′ are computed on a subset of the original samples, and therefore their estimates are subject to a higher variance [9]. The variance effect is typically addressed by the use of averaging strategies (e.g. UnderBagging [23]), but this is not the focus of our paper.
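Since (9) is a monotone transformation of p_s, correcting the probabilities cannot change their ranking; the short R sketch below checks this on hypothetical scores, with a hypothetical β.

  # Correct undersampled posteriors with eq. (9) and check the ranking is kept
  correct_posterior <- function(ps, beta) beta * ps / (beta * ps - ps + 1)

  set.seed(1)
  beta    <- 0.1          # hypothetical N+/N-
  ps_hat  <- runif(20)    # hypothetical probability estimates after undersampling
  p_prime <- correct_posterior(ps_hat, beta)

  identical(order(ps_hat), order(p_prime))   # TRUE: same ranking, hence same AUC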
D. Synthetic datasets
We now use two synthetic datasets to analyze the bias introduced by undersampling and to understand how it affects the posterior probability. Given the simulated setting we are able to control the true posterior probability p and measure the sampling bias embedded in p_s. We see that the bias is larger when the two classes overlap and that stronger undersampling induces a larger bias.

Let us consider two binary classification tasks, wherein positive and negative observations are drawn randomly from two distinct normal distributions. For both datasets we set the number of positives to 10% of 10,000 observations, with ω− ~ N(0, σ) and ω+ ~ N(µ, σ), where µ > 0. The distance between the two normal distributions, µ, is used to control the degree of separation between the classes. When µ is large the two classes are well-separated, while for small µ they strongly overlap. In the first dataset we simulate a classification problem with a very low degree of separation (using µ = 3), in the second a task with well-separated classes (using µ = 15, see Figure 2). The first simulates a difficult classification task, the latter an easy one. For both datasets we set σ = 3.
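The two synthetic datasets can be generated in a few lines of R (a sketch under the settings above: 10,000 points, 10% positives, σ = 3 and µ ∈ {3, 15}; the function and object names are ours).

  # Univariate two-class data with Gaussian class-conditional distributions
  make_synthetic <- function(n = 10000, prop_pos = 0.1, mu = 3, sigma = 3) {
    n_pos <- round(n * prop_pos)
    n_neg <- n - n_pos
    data.frame(
      x = c(rnorm(n_neg, mean = 0,  sd = sigma),   # negatives ~ N(0, sigma)
            rnorm(n_pos, mean = mu, sd = sigma)),  # positives ~ N(mu, sigma)
      y = c(rep(0, n_neg), rep(1, n_pos)))
  }

  set.seed(42)
  hard <- make_synthetic(mu = 3)    # overlapping classes (difficult task)
  easy <- make_synthetic(mu = 15)   # well-separated classes (easy task)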
Fig. 2. Synthetic datasets with positive and negative observations sampled from two different normal distributions. Positives account for 10% of the 10,000 random values. On the left we have a difficult problem with overlapping classes (µ = 3), on the right an easy problem where the classes are well-separated (µ = 15).
Figure 3 shows how p_s changes with β (p corresponds to β = 1). As β decreases towards N+/N−, the probability curve shifts to the left, allowing for higher probabilities on the right-hand side of the chart (where the positive observations are located). In other words, removing negative samples with undersampling increases the positive posterior probability, moving the classification boundary so that more samples are classified as positive. The stronger the undersampling, the larger the shift, i.e. the drift of p_s from p. The drift is larger in the dataset with non-separable classes, confirming the results of Section III-B.

Fig. 3. Posterior probability as a function of β. On the left the task with µ = 3 and on the right the one with µ = 15. Note that p corresponds to β = 1 and p_s to β < 1.
Figure 4 displays p_s, p′ and p for β = N+/N− in the dataset with overlapping classes (µ = 3); we see that p′ closely approximates p. As p′ ≈ p, we can say that the transformation based on (9) is able to correct the probability drift that occurs with undersampling. The correction seems particularly effective on the left-hand side (where the majority class is located), while it is less precise on the right-hand side, where we expect a larger variance of p′ due to the small number of positive samples.
Fig. 4. Posterior probabilities p_s, p′ and p for β = N+/N− in the dataset with overlapping classes (µ = 3).
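As an empirical counterpart of Figure 4, the sketch below undersamples the negatives of the difficult synthetic task at rate β = N+/N−, fits a logistic regression (a simple stand-in for the classifiers used in Section VI), and applies (9); hard and make_synthetic come from the previous sketch.

  # Fit a model on the undersampled data and correct its posteriors with eq. (9)
  set.seed(42)
  train <- hard                                   # overlapping-classes dataset
  pos   <- which(train$y == 1)
  neg   <- which(train$y == 0)
  beta  <- length(pos) / length(neg)              # beta = N+/N-

  keep     <- c(pos, sample(neg, round(beta * length(neg))))  # keep all positives
  balanced <- train[keep, ]

  fit_s <- glm(y ~ x, family = binomial, data = balanced)  # learnt after undersampling
  fit   <- glm(y ~ x, family = binomial, data = train)     # learnt on the full data

  grid    <- data.frame(x = seq(-10, 15, by = 0.5))
  ps_hat  <- predict(fit_s, grid, type = "response")        # biased posterior
  p_hat   <- predict(fit,   grid, type = "response")        # reference posterior
  p_prime <- beta * ps_hat / (beta * ps_hat - ps_hat + 1)   # corrected, eq. (9)

  # p_prime should track p_hat much more closely than ps_hat does (cf. Figure 4)
  round(cbind(grid, ps_hat, p_prime, p_hat)[seq(1, 51, by = 10), ], 3)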
IV. CLASSIFICATION THRESHOLD WITH UNBIASED
PROBABILITIES
In the previous section we showed how undersampling
induces biased posterior probabilities and presented a method
to correct for this bias. We now want to investigate how to use
them for classification.
A. Threshold with Bayes Minimum Risk
The standard decision-making process based on Bayes decision theory, as developed in most textbooks on pattern recognition or machine learning (see for example [24], [3], [12]), defines the optimal class of a sample as the one minimizing the risk (the expected value of the loss function). In a binary classification problem, the risks of the positive and negative class are defined as follows:

r+ = (1 − p) l_{1,0} + p l_{1,1}
r− = (1 − p) l_{0,0} + p l_{0,1}

where p = p(+|x) and l_{i,j} is the loss (cost) incurred when deciding i and the true class is j.
TABLE II. Loss matrix.

                     Actual Positive   Actual Negative
Predicted Positive   l_{1,1}           l_{1,0}
Predicted Negative   l_{0,1}           l_{0,0}
Bayes decision rule for minimizing the risk can be stated as follows: assign the positive class to samples for which r+ ≤ r−, and the negative class otherwise. This is equivalent to predicting a sample as positive when p > τ, where the threshold τ is:

τ = (l_{1,0} − l_{0,0}) / (l_{1,0} − l_{0,0} + l_{0,1} − l_{1,1})

Typically the cost of a correct prediction is zero, hence l_{0,0} = 0 and l_{1,1} = 0. In an unbalanced problem, the cost of missing a positive instance (false negative) is usually higher than the cost of missing a negative (false positive). When the costs of a false negative and a false positive are unknown, a natural solution is to set the costs using the priors. Let l_{1,0} = π+ and l_{0,1} = π−, where π+ = p(+) and π− = p(−). Then, since π− > π+, we have l_{0,1} > l_{1,0} as desired. We can then write:

τ = l_{1,0} / (l_{1,0} + l_{0,1}) = π+ / (π+ + π−) = π+    (10)

since π+ + π− = 1. This is also the optimal threshold in a cost-sensitive application where the costs are defined using the priors [13].
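A small R sketch of the threshold computation: with zero cost for correct predictions and costs set from hypothetical priors, the Bayes minimum risk threshold reduces to π+, as in (10).

  # Bayes minimum risk threshold from a loss matrix l[i, j]:
  # cost of predicting class i when the true class is j (i, j in {0, 1})
  bayes_threshold <- function(l) {
    (l["1", "0"] - l["0", "0"]) /
      (l["1", "0"] - l["0", "0"] + l["0", "1"] - l["1", "1"])
  }

  pi_pos <- 0.1                      # hypothetical priors
  pi_neg <- 1 - pi_pos
  l <- matrix(c(0, pi_pos,           # column "actual = 0": l[0,0], l[1,0]
                pi_neg, 0),          # column "actual = 1": l[0,1], l[1,1]
              nrow = 2, dimnames = list(predicted = c("0", "1"),
                                        actual    = c("0", "1")))
  bayes_threshold(l)                 # 0.1 = pi_pos, cf. eq. (10)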
B. Classification threshold adjustment
Even if undersampling produces biased probability estimates, it is often used to balance datasets with skewed class distributions because several classifiers have empirically shown better performance when trained on balanced datasets [25], [14]. Let τ_s denote the threshold used to classify an observation after undersampling; from (10) we have τ_s = π+_s, where π+_s is the positive class prior after undersampling. In the case of undersampling with β = N+/N− (balanced training set) we have τ_s = 0.5.
When correcting p_s with (9), we must also correct the probability threshold to maintain the predictive accuracy defined by τ_s (this is needed, otherwise we would be using different misclassification costs for p′). Let τ′ be the threshold for the unbiased probability p′. From Elkan [13]:

(τ′ / (1 − τ′)) · ((1 − τ_s) / τ_s) = β    (11)

τ′ = β τ_s / ((β − 1) τ_s + 1)    (12)

Using τ_s = π+_s, (12) becomes:

τ′ = β π+_s / ((β − 1) π+_s + 1)

and, substituting π+_s = N+ / (N+ + βN−),

τ′ = (β N+ / (N+ + βN−)) / ((β − 1) N+ / (N+ + βN−) + 1) = N+ / (N+ + N−) = π+
The optimal threshold to use with p′ is thus equal to the one for p. As an alternative to classifying observations using p_s with τ_s, we can obtain equivalent results using p′ with τ′. In summary, as a result of undersampling, a higher number of observations are predicted as positive, but the posterior probabilities are biased due to the change in the priors. Equation (12) allows us to find the threshold that guarantees equal accuracy after the posterior probability correction. Therefore, in order to classify observations with unbiased probabilities after undersampling, we have to first obtain p′ from p_s with (9) and then use τ′ as the classification threshold.
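The equivalence between classifying with (p_s, τ_s) and with (p′, τ′) can be checked numerically; the sketch below uses hypothetical scores and the 1,000/9,000 class counts used earlier.

  # Decisions on the biased probabilities with tau_s equal decisions on the
  # corrected probabilities with tau'
  set.seed(7)
  N_pos <- 1000; N_neg <- 9000
  beta  <- N_pos / N_neg

  ps_hat  <- runif(100)                                    # hypothetical p_s estimates
  p_prime <- beta * ps_hat / (beta * ps_hat - ps_hat + 1)  # eq. (9)

  tau_s     <- N_pos / (N_pos + beta * N_neg)              # pi+_s = 0.5 here
  tau_prime <- beta * tau_s / ((beta - 1) * tau_s + 1)     # eq. (12)

  tau_prime                                                # 0.1 = N+/(N+ + N-) = pi+
  all((ps_hat > tau_s) == (p_prime > tau_prime))           # TRUE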
V. MEASURES OF CLASSIFICATION ACCURACY AND PROBABILITY CALIBRATION
The choice of balancing the training set or leaving it
unbalanced has a direct influence on the classification model
that is learnt. A model learnt on a balanced training set has the
two classes equally represented. In the case of an unbalanced
training set, the model learns from a dataset skewed towards
one class. Hence, the classification model learnt after under-
sampling is different from the one learnt on the original dataset.
In this section we compare the probability estimates of two
models, one learnt in the presence and the other in the absence
of undersampling. The probabilities are evaluated in terms of
ranking produced, classification accuracy and calibration.
To assess the impact of undersampling, we first use accuracy measures based on the confusion matrix (Table III).
TABLE III. CONFUSION MATRIX
Actual Positive Actual Negative
Predicted Positive TP FP
Predicted Negative FN TN
In an unbalanced class problem, it is well-known that quantities like TPR = TP/(TP + FN), TNR = TN/(FP + TN) and average accuracy (TP + TN)/(TP + FN + FP + TN) are misleading assessment measures [10]. Let us define Precision = TP/(TP + FP) and Recall = TP/(TP + FN). Typically we want to have high confidence that observations predicted as positive are actually positive (high Precision) as well as a high detection rate of the positives (high Recall). However, Precision and Recall share an inverse relationship, whereby high Precision comes at the cost of low Recall and vice versa. An accuracy measure based on both Precision and Recall is the F-measure, also known as F1-score or F-score: F-measure = 2 · Precision · Recall / (Precision + Recall). The F-measure and the G-mean = sqrt(TPR × TNR) are often considered to be useful and effective performance measures for unbalanced datasets.
An alternative way to measure the quality of a probability
estimate is to look at the ranking produced by the probability.
A good probability estimate should rank first all the minority
class observations and then those from the majority class.
In other words, if ˆp is a good estimate of p(+|x), then ˆp should give high probability to the positive examples and small probability to the negatives. A well-accepted ranking measure for unbalanced datasets is AUC (Area Under the ROC
curve) [5]. To avoid the problem of different misclassification
costs, we use an estimation of AUC based on the Mann-
Whitney statistic [10]. This estimate measures the probability
that a random minority class example ranks higher than a
random majority class example [15].
In order to measure probability calibration, we use the Brier Score (BS) [4]. The BS is a measure of the average squared loss between the estimated probabilities and the actual class values. It allows us to evaluate how well the probabilities are calibrated: the lower the BS, the more accurate the probabilistic predictions of a model. Let ˆp(y_i|x_i) be the probability estimate that sample x_i has class y_i ∈ {1, 0}; BS is defined as:

BS = (1/N) Σ_{i=1}^{N} ( y_i − ˆp(y_i|x_i) )²    (13)
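The three measures used in the evaluation can be computed in a few lines of base R (a sketch; y is the 0/1 label vector and prob the estimated probability of the positive class, so the Brier Score below is the standard binary form of (13)).

  # G-mean, AUC (Mann-Whitney estimate) and Brier Score
  g_mean <- function(y, prob, threshold) {
    pred <- as.integer(prob > threshold)
    tpr  <- sum(pred == 1 & y == 1) / sum(y == 1)   # recall on the positives
    tnr  <- sum(pred == 0 & y == 0) / sum(y == 0)   # recall on the negatives
    sqrt(tpr * tnr)
  }

  auc_mw <- function(y, prob) {
    # probability that a random positive is ranked above a random negative
    r <- rank(prob)
    n_pos <- sum(y == 1); n_neg <- sum(y == 0)
    (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  }

  brier <- function(y, prob) mean((y - prob)^2)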
VI. EXPERIMENTAL RESULTS
In the previous sections we used synthetic datasets to
study the effect of undersampling. We now consider real-world
unbalanced datasets from the UCI repository used in [9] (listed in Table IV). For
each dataset we adopt a 10-fold cross validation (CV) to test
our models and we repeated the CV 10 times. In particular,
we used a stratified CV, where the class proportion in the
datasets is kept the same over all the folds. As the original
datasets are unbalanced, the resulting folds are unbalanced as
well. For each fold of CV we learn two models: one using
all the observations and the other with the ones remaining
after undersampling. Then both models are tested on the
same testing set. We used several supervised classification
algorithms available in R [20]: Random Forest [18], SVM [17],
and Logit Boost [22].
We denote by ˆp_s and ˆp the posterior probability estimates obtained with and without undersampling, and by ˆp′ the bias-corrected probability obtained from ˆp_s with equation (9). Let τ, τ_s and τ′ be the probability thresholds used for ˆp, ˆp_s and ˆp′ respectively, where τ = π+, τ_s = π+_s and τ′ = π+. The goal of these experiments is to compare which probability estimates return the best ranking (AUC), calibration (BS) and classification accuracy (G-mean) when coupled with the thresholds defined before. In undersampling, the amount of sampling defined by β is usually set equal to N+/N−, leading to a balanced dataset where π+_s = π−_s = 0.5. However, there is no reason to believe that this is the optimal sampling rate; often, the optimal rate can be found only a posteriori after trying different values of β [9]. For this reason we replicate the CV with different values of β such that N+/N− ≤ β ≤ 1 and, for each CV, the accuracy is computed as the average G-mean (or AUC) over all the folds.
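A skeleton of this protocol (stratified folds and a grid of β values) might look as follows; the learner is again a logistic regression on a single feature x, as a placeholder for the RF, SVM and LogitBoost models, and auc_mw and brier are the helpers sketched in Section V.

  # Stratified k-fold CV repeated for several undersampling rates beta
  stratified_folds <- function(y, k = 10) {
    folds <- integer(length(y))
    for (cl in unique(y)) {                   # keep class proportions per fold
      idx <- which(y == cl)
      folds[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
    }
    folds
  }

  run_cv <- function(data, beta_grid, k = 10) {
    folds <- stratified_folds(data$y, k)
    out <- list()
    for (b in beta_grid) {
      for (f in seq_len(k)) {
        train <- data[folds != f, ]
        test  <- data[folds == f, ]
        pos <- which(train$y == 1); neg <- which(train$y == 0)
        keep  <- c(pos, sample(neg, round(b * length(neg))))   # undersample at rate b
        fit_s <- glm(y ~ x, family = binomial, data = train[keep, ])
        fit   <- glm(y ~ x, family = binomial, data = train)
        ps_hat  <- predict(fit_s, test, type = "response")
        p_hat   <- predict(fit,   test, type = "response")
        p_prime <- b * ps_hat / (b * ps_hat - ps_hat + 1)      # eq. (9)
        out[[length(out) + 1]] <- data.frame(
          beta = b, fold = f,
          auc_p = auc_mw(test$y, p_hat), auc_ps = auc_mw(test$y, ps_hat),
          bs_p  = brier(test$y, p_hat),  bs_ps  = brier(test$y, ps_hat),
          bs_pp = brier(test$y, p_prime))
      }
    }
    do.call(rbind, out)
  }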
TABLE IV. Datasets from the UCI repository used in [9].

Dataset        N      N+     N−     N+/N
ecoli          336    35     301    0.10
glass          214    17     197    0.08
letter-a       20000  789    19211  0.04
letter-vowel   20000  3878   16122  0.19
ism            11180  260    10920  0.02
letter         20000  789    19211  0.04
oil            937    41     896    0.04
page           5473   560    4913   0.10
pendigits      10992  1142   9850   0.10
PhosS          11411  613    10798  0.05
satimage       6430   625    5805   0.10
segment        2310   330    1980   0.14
boundary       3505   123    3382   0.04
estate         5322   636    4686   0.12
cam            18916  942    17974  0.05
compustat      13657  520    13137  0.04
covtype        38500  2747   35753  0.07

In Table V we report the results over all the datasets. For each dataset, we rank the probability estimates ˆp_s, ˆp and ˆp′ from the worst to the best performing for different values of β. We then sum the ranks over all the values of β and over all the datasets. More formally, let R_{i,k,b} ∈ {1, 2, 3} be the rank of probability i on dataset k when β = b. The probability with the highest accuracy on k when β = b has R_{i,k,b} = 3 and the one with the lowest has R_{i,k,b} = 1. Then the sum of ranks for probability i is defined as Σ_k Σ_b R_{i,k,b}. The higher the sum, the higher the number of times that one probability has higher accuracy than the others.
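The rank-sum aggregation can be written compactly: given a matrix with one row per (dataset, β) combination and one column per probability estimate, rank each row (3 = best) and sum the ranks column-wise. A sketch with hypothetical AUC values:

  # Sum of ranks over (dataset, beta) combinations, as in Table V
  set.seed(3)
  auc <- matrix(runif(5 * 3, 0.7, 1.0), ncol = 3,
                dimnames = list(NULL, c("p", "ps", "p_prime")))  # hypothetical AUCs
  ranks    <- t(apply(auc, 1, rank))   # row-wise ranks: highest AUC gets rank 3
  rank_sum <- colSums(ranks)
  rank_sum                             # analogue of the sums reported in Table V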
For AUC, a higher rank sum means a higher AUC and
hence a better ranking returned by the probability. Similarly,
with G-mean, a higher rank sum corresponds to higher predic-
tive accuracy. However, in the case of BS, a higher rank sum
means poorer probability calibration (a larger bias). Table V shows in bold the probabilities with the best rank sum according to the different metrics. For each metric and classifier it also reports the p-values of the paired t-test on the ranks, between ˆp and ˆp′ and between ˆp and ˆp_s.
In terms of AUC, we see that ˆp_s and ˆp′ have better performance than ˆp for LB and SVM. The rank sum is the same for ˆp_s and ˆp′ since the two probabilities are linked by a monotone transformation (equation (9)). If we look at G-mean, ˆp_s and ˆp′ return better accuracy than ˆp two times out of three. In this case, the rank sums of ˆp_s and ˆp′ are the same since we used τ_s and τ′ as the classification thresholds, where τ′ is obtained from τ_s using (12). The corresponding p-values, however, do not allow us to reject the null hypothesis that the accuracies of ˆp_s and ˆp come from the same distribution. For all classifiers, ˆp is the probability estimate with the best calibration (lowest rank sum with BS), followed by ˆp′ and ˆp_s, and here the p-values let us strongly reject the null hypothesis that the BS ranks of ˆp_s and ˆp come from the same distribution. The rank sum of ˆp′ is always lower than that of ˆp_s, indicating that ˆp′ has a lower bias than ˆp_s. This result confirms our theory that equation (9) allows one to reduce the bias introduced by undersampling.

In summary, from this experiment we can conclude that undersampling does not always improve the ranking or classification accuracy of an algorithm, but when it does, we should use ˆp′ instead of ˆp_s because the former always has better calibration.
We now consider a real-world dataset, composed of credit card transactions from September 2013 made available by our industrial partner¹. It contains a subset of online transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions, and the minimum value of β is 0.00173.

In Figure 5 we report the AUC for different values of β.

¹ The dataset is available at http://www.ulb.ac.be/di/map/adalpozz/data/creditcard.Rdata
TABLE V. Sum of ranks and p-values of the paired t-test between the ranks of ˆp and ˆp′ and between ˆp and ˆp_s for different metrics. In bold the probabilities with the best rank sum (higher for AUC and G-mean, lower for BS).

Metric  Algo  ΣR(ˆp)    ΣR(ˆp_s)  ΣR(ˆp′)   ρ(Rˆp, Rˆp_s)  ρ(Rˆp, Rˆp′)
AUC     LB    22,516    23,572    23,572    0.322          0.322
AUC     RF    24,422    22,619    22,619    0.168          0.168
AUC     SVM   19,595    19,902.5  19,902.5  0.873          0.873
G-mean  LB    23,281    23,189.5  23,189.5  0.944          0.944
G-mean  RF    22,986    23,337    23,337    0.770          0.770
G-mean  SVM   19,550    19,925    19,925    0.794          0.794
BS      LB    19,809.5  29,448.5  20,402    0.000          0.510
BS      RF    18,336    28,747    22,577    0.000          0.062
BS      SVM   17,139    23,161    19,100    0.001          0.156
The boxplots of ˆp_s and ˆp′ are identical because of (9); they increase as β decreases towards N+/N− and have a higher median than those of ˆp. This example shows how, in the case of extreme class imbalance, undersampling can improve the predictive accuracy of several classification algorithms.
Fig. 5. Boxplots of AUC for different values of β in the Credit-card dataset (one panel per classifier: LB, RF, SVM).
Fig. 6. Boxplots of BS for different values of β in the Credit-card dataset (one panel per classifier: LB, RF, SVM).
In Figure 6 we report the BS for different values of β. The boxplots of ˆp′ show in general a smaller calibration error (lower BS) than those of ˆp_s, and the latter have a higher BS especially for small values of β. This supports our previous results, which found that the loss in probability calibration for ˆp_s is greater the stronger the undersampling.
VII. CONCLUSION
In this paper, we study the bias introduced in the posterior
probabilities that occurs as an artifact of undersampling. We
use several synthetic datasets to analyze this problem from a
theoretical perspective, and then ground our findings with an
empirical evaluation over several real-world datasets.
The first result of the paper is that the bias due to the instance selection procedure in undersampling is essentially equivalent to the bias that occurs with a change in the priors when the within-class distributions remain stable. With undersampling, we create a different training set, where the classes are less unbalanced. However, if we make the assumption that the training and testing sets come from the same distribution, it follows that the probability estimates obtained after undersampling are biased. As a result of undersampling, the posterior probability ˆp_s is shifted away from the true distribution, and the optimal separation boundary moves towards the majority class, so that more cases are classified into the minority class.
By making the assumption that the prior probabilities do not change between training and testing, i.e. that both come from the same data-generating process, we propose the transformation given in (9), which allows us to remove the drift in ˆp_s due to undersampling. The bias in ˆp_s registered by BS gets larger for small values of β, which means that stronger undersampling produces probabilities with poorer calibration (a larger loss). On the synthetic, UCI and Credit-card datasets, the drift-corrected probability ˆp′ has significantly better calibration than ˆp_s (a lower Brier Score).
Even if undersampling produces poorly calibrated probability estimates ˆp_s, several studies have shown that it often provides better predictive accuracy than ˆp [25], [14]. To improve the calibration of ˆp_s we propose to use ˆp′, since this transformation does not affect the ranking. In order to maintain the accuracy obtained with ˆp_s and the probability threshold τ_s, we proposed to use ˆp′ together with τ′ to account for the change in priors. By changing the undersampling rate β we give different costs to false positives and false negatives; combining ˆp′ with τ′ allows one to maintain the same misclassification costs as a classification strategy based on ˆp_s and τ_s, for any value of β.
Finally, we considered a highly unbalanced dataset (Credit-
card), where the minority class accounts for only 0.172% of all
observations. In this dataset, the large improvement in accuracy
obtained with undersampling was coupled with poorly calibrated probabilities (large BS). By correcting the posterior probability
and changing the threshold we were able to improve calibration
without losing predictive accuracy. Obtaining well-calibrated
classifiers is particularly important in decision systems based
on fraud detection. This is one of the rare papers making
available the fraud detection dataset used for testing.
ACKNOWLEDGMENTS
A. Dal Pozzolo is supported by the Doctiris scholarship
funded by Innoviris, Brussels, Belgium. G. Bontempi is sup-
ported by the BridgeIRIS and BruFence projects funded by
Innoviris, Brussels, Belgium.
APPENDIX
Let p_t = p(y_t = +|x_t) be the posterior probability for a testing instance (x_t, y_t), where the testing set has priors π−_t = N−_t / N_t and π+_t = N+_t / N_t. In the unbalanced training set we have π− = N−/N, π+ = N+/N and p = p(+|x). After undersampling the training set, π−_s = βN− / (N+ + βN−), π+_s = N+ / (N+ + βN−) and p_s = p(+|x, s = 1). If we assume that the class-conditional distributions p(x|+) and p(x|−) remain the same between the training and testing sets, Saerens et al. [21] show that, given different priors between the training and testing sets, the posterior probability can be corrected with the following equation:

p_t = (π+_t / π+_s) p_s / [ (π+_t / π+_s) p_s + (π−_t / π−_s) (1 − p_s) ]    (14)

Let us assume that the training and testing sets share the same priors, π+_t = π+ and π−_t = π−:

p_t = (π+ / π+_s) p_s / [ (π+ / π+_s) p_s + (π− / π−_s) (1 − p_s) ]

Then, since

π+ / π+_s = ( N+ / (N+ + N−) ) / ( N+ / (N+ + βN−) ) = (N+ + βN−) / (N+ + N−)    (15)

π− / π−_s = ( N− / (N+ + N−) ) / ( βN− / (N+ + βN−) ) = (N+ + βN−) / ( β (N+ + N−) )    (16)

we can write

p_t = [ (N+ + βN−) / (N+ + N−) ] p_s / { [ (N+ + βN−) / (N+ + N−) ] p_s + [ (N+ + βN−) / ( β (N+ + N−) ) ] (1 − p_s) } = β p_s / ( β p_s − p_s + 1 )

The transformation proposed by Saerens et al. [21] is therefore equivalent to equation (4) and to the one developed independently by Elkan [13] for cost-sensitive learning:

p_t = π+_t ( p_s − π+_s p_s ) / ( π+_s − π+_s p_s + π+_t p_s − π+_t π+_s )    (17)

which can be rewritten as

p_t = (1 − π+_s) p_s / [ (π+_s / π+_t) (1 − p_s) + p_s − π+_s ]

Using π+_s = N+ / (N+ + βN−) and π+_t = π+ = N+ / (N+ + N−):

p_t = [ βN− / (N+ + βN−) ] p_s / { [ (N+ + N−) / (N+ + βN−) ] (1 − p_s) + p_s − N+ / (N+ + βN−) } = β p_s / ( β p_s − p_s + 1 )
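A sketch of the general prior-shift correction (14) in R, together with a numerical check that, when the testing priors equal the original training priors, it reduces to equation (9), as derived above.

  # General correction for a change in priors (Saerens et al. [21]), eq. (14):
  # ps was estimated under training prior pi_pos_s; the test set has pi_pos_t
  adjust_to_new_priors <- function(ps, pi_pos_s, pi_pos_t) {
    num <- (pi_pos_t / pi_pos_s) * ps
    num / (num + ((1 - pi_pos_t) / (1 - pi_pos_s)) * (1 - ps))
  }

  N_pos <- 1000; N_neg <- 9000
  beta     <- N_pos / N_neg                    # balanced undersampling
  pi_pos_s <- N_pos / (N_pos + beta * N_neg)   # prior after undersampling (0.5)
  pi_pos   <- N_pos / (N_pos + N_neg)          # original prior (0.1)

  ps <- seq(0.05, 0.95, by = 0.15)
  all.equal(adjust_to_new_priors(ps, pi_pos_s, pi_pos),
            beta * ps / (beta * ps - ps + 1))  # TRUE: (14) reduces to (9)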
REFERENCES
[1] Rehan Akbani, Stephen Kwek, and Nathalie Japkowicz. Applying
support vector machines to imbalanced datasets. In Machine Learning:
ECML 2004, pages 39–50. Springer, 2004.
[2] Urvesh Bhowan, Michael Johnston, Mengjie Zhang, and Xin Yao.
Evolving diverse ensembles using genetic programming for classifica-
tion with unbalanced data. Evolutionary Computation, IEEE Transac-
tions on, 17(3):368–386, 2013.
[3] Christopher M Bishop et al. Pattern recognition and machine learning, volume 4. Springer, New York, 2006.
[4] Glenn W Brier. Verification of forecasts expressed in terms of
probability. Monthly weather review, 78(1):1–3, 1950.
[5] Nitesh V Chawla. Data mining for imbalanced datasets: An overview.
In Data mining and knowledge discovery handbook, pages 853–867.
Springer, 2005.
[6] Nitesh V Chawla, Nathalie Japkowicz, and Aleksander Kotcz. Editorial:
special issue on learning from imbalanced data sets. ACM SIGKDD
Explorations Newsletter, 6(1):1–6, 2004.
[7] NV Chawla, KW Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer.
SMOTE: synthetic minority over-sampling technique. Journal of Artificial
Intelligence Research (JAIR), 16:321–357, 2002.
[8] Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen, Cesare Alippi,
and Gianluca Bontempi. Credit card fraud detection and concept-drift
adaptation with delayed supervised information. In Neural Networks
(IJCNN), 2015 International Joint Conference on. IEEE, 2015.
[9] Andrea Dal Pozzolo, Olivier Caelen, and Gianluca Bontempi. When is
undersampling effective in unbalanced classification tasks? In Machine
Learning and Knowledge Discovery in Databases. Springer, 2015.
[10] Andrea Dal Pozzolo, Olivier Caelen, Yann-Ael Le Borgne, Serge
Waterschoot, and Gianluca Bontempi. Learned lessons in credit card
fraud detection from a practitioner perspective. Expert Systems with
Applications, 41(10):4915–4928, 2014.
[11] C. Drummond and R.C. Holte. C4.5, class imbalance, and cost
sensitivity: why under-sampling beats over-sampling. In Workshop on
Learning from Imbalanced Datasets II. Citeseer, 2003.
[12] Richard O Duda, Peter E Hart, and David G Stork. Pattern classifica-
tion. John Wiley & Sons, 2012.
[13] C. Elkan. The foundations of cost-sensitive learning. In International
Joint Conference on Artificial Intelligence, volume 17, pages 973–978.
Citeseer, 2001.
[14] Andrew Estabrooks, Taeho Jo, and Nathalie Japkowicz. A multiple
resampling method for learning from imbalanced data sets. Computa-
tional Intelligence, 20(1):18–36, 2004.
[15] David J Hand and Robert J Till. A simple generalisation of the area
under the roc curve for multiple class classification problems. Machine
Learning, 45(2):171–186, 2001.
[16] N. Japkowicz and S. Stephen. The class imbalance problem: A
systematic study. Intelligent data analysis, 6(5):429–449, 2002.
[17] Alexandros Karatzoglou, Alex Smola, Kurt Hornik, and Achim Zeileis.
kernlab – an S4 package for kernel methods in R, 2004.
[18] Andy Liaw and Matthew Wiener. Classification and regression by
randomforest. R News, 2(3):18–22, 2002.
[19] Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer,
and Neil D Lawrence. Dataset shift in machine learning. The MIT
Press, 2009.
[20] R Development Core Team. R: A Language and Environment for
Statistical Computing. R Foundation for Statistical Computing, Vienna,
Austria, 2011. ISBN 3-900051-07-0.
[21] Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting
the outputs of a classifier to new a priori probabilities: a simple
procedure. Neural computation, 14(1):21–41, 2002.
[22] Jarek Tuszynski. caTools: Tools: moving window statistics, GIF, Base64,
ROC AUC, etc., 2013. R package version 1.16.
[23] Shuo Wang, Ke Tang, and Xin Yao. Diversity exploration and negative
correlation learning on imbalanced data sets. In Neural Networks,
2009. IJCNN 2009. International Joint Conference on, pages 3259–
3266. IEEE, 2009.
[24] Andrew R Webb. Statistical pattern recognition. John Wiley & Sons,
2003.
[25] Gary M Weiss and Foster Provost. The effect of class distribution on
classifier learning: an empirical study. Rutgers Univ, 2001.