
Calibrating Probability with Undersampling for Unbalanced Classification

December 2015

DOI:10.1109/SSCI.2015.33

Conference: 2015 IEEE Symposium Series on Computational Intelligence (SSCI)
At: Cape Town, South Africa

Authors:

Andrea Dal Pozzolo

Université Libre de Bruxelles

Olivier Caelen

Worldline

Reid A Johnson

University of Notre Dame

Gianluca Bontempi

Université Libre de Bruxelles

Undersampling is a popular technique for unbalanced datasets to reduce the skew in class distributions. However, it is well-known that undersampling one class modifies the priors of the training set and consequently biases the posterior probabilities of a classifier [9]. In this paper, we study analytically and experimentally how undersampling affects the posterior probability of a machine learning model. We formalize the problem of undersampling and explore the relationship between conditional probability in the presence and absence of undersampling. Although the bias due to undersampling does not affect the ranking order returned by the posterior probability, it significantly impacts the classification accuracy and probability calibration. We use Bayes Minimum Risk theory to find the correct classification threshold and show how to adjust it after undersampling. Experiments on several real-world unbalanced datasets validate our results.

Andrea Dal Pozzolo∗, Olivier Caelen†, Reid A. Johnson‡, Gianluca Bontempi∗§

∗Machine Learning Group, Computer Science Department, Université Libre de Bruxelles, Brussels, Belgium.

Email: adalpozz@ulb.ac.be

†Fraud Risk Management Analytics, Worldline S.A., Brussels, Belgium.

Email: olivier.caelen@worldline.com

‡iCeNSA, Computer Science and Engineering Department, University of Notre Dame, Notre Dame IN, USA.

Email: rjohns15@nd.edu

§Interuniversity Institute of Bioinformatics in Brussels (IB)², Brussels, Belgium.

Email: gbonte@ulb.ac.be

I. INTRODUCTION

In several binary classification problems, the two classes are not equally represented in the dataset. For example, in fraud detection, fraudulent transactions are normally outnumbered by genuine ones [8]. When one class is underrepresented in a dataset, the data is said to be unbalanced. In such problems, the minority class is typically the class of interest. Having few instances of one class means that the learning algorithm is often unable to generalize the behavior of the minority class well, hence the algorithm performs poorly in terms of predictive accuracy [16].

A common strategy for dealing with unbalanced classification tasks is to under-sample the majority class in the training set before learning a classifier [1]. The assumption behind this strategy is that the majority class contains many redundant observations, so randomly removing some of them does not change the estimation of the within-class distribution. If we assume that training and testing sets come from the same distribution, then when the training set is unbalanced, the testing set has a skewed distribution as well. By removing majority class instances, the training set is artificially rebalanced. As a consequence, we obtain different distributions for the training and testing sets, violating the basic assumption in machine learning that the training and testing sets are drawn from the same underlying distribution.

In this paper, we study the impact of the bias introduced by undersampling on classification tasks with unbalanced data. We start by discussing literature results showing how the posterior probability of an algorithm learnt in the presence of undersampling is related to the conditional probability of the original distribution. Using synthetic data we see that the larger the overlap between the two within-class distributions (i.e. the greater the non-separability of the classification task), the larger the bias in the posterior probability. The mismatch between the posterior probability obtained with the original dataset and after undersampling is assessed in terms of loss measure (Brier Score), predictive accuracy (G-mean) and ranking (AUC).

Based on the previous works of Saerens et al. [21] and Elkan [13], we propose an analytical method to correct the bias introduced by undersampling that can produce well-calibrated probabilities. The method is equivalent to adjusting the posterior probability in the presence of new priors. The use of unbiased probability estimates requires an adjustment to the probability threshold used to classify instances. When using class priors as misclassification costs, we show that this new threshold corresponds to the one used before undersampling.

In order to have complete control over the data generation process, we first resort to synthetic datasets. This allows us to simulate problems of different difficulty and see the impact of undersampling on the probability estimates. To confirm the results obtained with the simulated data, we also run our experiments on several UCI datasets and a real-world fraud detection dataset made available to the public.

This paper makes the following contributions. First, we review how undersampling can induce a bias in the posterior probabilities generated by machine learning methods. Second, we leverage this understanding to develop an analytical method that can counter and reduce this bias. Third, we show how to use unbiased probability estimates for decision making in unbalanced classification. We note that while the framework we derive in this work is theoretically equivalent to the problem of a change in class priors [21], our perspective is different. We interpret undersampling as a problem of sample selection bias, wherein the bias is not intrinsic to the data but rather introduced artificially [19].

The paper is organized as follows. Section II introduces some well-known methods for unbalanced datasets and Section III formalizes the sampling selection bias due to undersampling. Undersampling is responsible for a shift in the posterior probability which leads to biased probability estimates, for which we propose a corrective method. Section IV shows how to set the classification threshold to take into account the change in the priors. Section V describes the measures of classification accuracy and probability calibration used in the experiments. Finally, Section VI uses real-world datasets to validate the probability transformation presented in Section III and the use of the classification threshold proposed in Section IV.

II. SAMPLING FOR UNBALANCED CLASSIFICATION

Let us consider a binary classification task where the distribution of the target class is highly skewed. When the data is unbalanced, standard machine learning algorithms that maximise overall accuracy tend to classify all observations as majority class instances [16]. This translates into poor accuracy on the minority class (low recall), which is typically the class of interest. Several methods deal with this problem; they can be divided into methods that operate at the data level and methods that operate at the algorithmic level [6].

At the data level, balancing strategies are used as a pre-processing step to re-balance the two classes before any algorithm is applied. At the algorithmic level, the algorithms themselves are adjusted to deal with the detection of the minority class [2]. Here we restrict ourselves to a subset of data-level methods known as sampling techniques.

Undersampling [11] consists of down-sizing the majority class by removing observations at random until the dataset is balanced. In an unbalanced problem, it is often realistic to assume that many observations of the majority class are redundant and that by removing some of them at random the data distribution will not change significantly. However, the risk of removing relevant observations from the dataset is still present, since the removal is performed in an unsupervised manner. In practice, this technique is often adopted since it is simple and speeds up the learning phase.
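The random removal described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `undersample`, the parameter `keep_fraction` (the fraction of majority-class observations retained) and the toy data are all assumptions introduced here.

```python
import random


def undersample(X, y, keep_fraction=None, seed=0):
    """Randomly drop negatives (label 0); keep every positive (label 1).

    keep_fraction is the share of negatives to retain. When it is None,
    it defaults to N+/N-, which yields a balanced sample.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, label in enumerate(y) if label == 1]
    neg_idx = [i for i, label in enumerate(y) if label == 0]
    if keep_fraction is None:
        keep_fraction = len(pos_idx) / len(neg_idx)
    n_keep = round(keep_fraction * len(neg_idx))
    kept = sorted(pos_idx + rng.sample(neg_idx, n_keep))
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy data (hypothetical): 10 positives and 90 negatives.
y = [1] * 10 + [0] * 90
X = list(range(100))
X_bal, y_bal = undersample(X, y)
print(len(y_bal), sum(y_bal))
```

Because the negatives are dropped without looking at their features, informative observations may be discarded, which is the risk noted in the text.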

Oversampling [11] consists of up-sizing the minority class at random, decreasing the level of class imbalance. By replicating the minority class until the two classes have equal frequency, oversampling increases the risk of over-fitting by biasing the model towards the minority class. Other drawbacks of the approach are that it does not add any new valuable minority examples and that it increases the training time. This can be particularly ineffective when the original dataset is fairly large.

SMOTE [7] over-samples the minority class by generating synthetic minority examples in the neighborhood of observed ones. The idea is to form new minority examples by interpolating between examples of the same class. This has the effect of creating clusters around each minority observation.

In this paper we focus on understanding how undersampling affects the posterior probability of a classification algorithm.

III. THE IMPACT OF SAMPLING ON POSTERIOR PROBABILITIES

In binary classification we typically learn a model on training data and use it to generate predictions (class or posterior probability) on a testing set with the assumption that both come from the same distribution. When this assumption does not hold, we encounter the so-called problem of sampling selection bias [19]. Sampling selection bias can occur due to a bad choice of the training set. For example, consider the problem where a bank wants to predict whether someone who is applying for a credit card will be able to repay the credit at the end of the month. The bank has data available on customers whose applications have been approved, but has no information on rejected customers. This means that the data available to the bank is a biased sample of the whole population. The bias in this case is intrinsic to the dataset collected by the bank.

A. Sample Selection Bias due to undersampling

Rebalancing unbalanced data is just the sample selection bias problem with a known selection bias introduced by design (rather than by constraint or accident) [19]. In this section, we investigate the sampling selection bias that occurs when undersampling a skewed training set.

To begin, let us consider a binary classification task where the goal is to learn a classifier $f: \mathbb{R}^n \to \{0, 1\}$, where $X \in \mathbb{R}^n$ is the input and $Y \in \{0, 1\}$ the output domain. Let us call class 0 negative and class 1 positive. Further, assume that the number of positive observations is small compared to the number of negatives, with rebalancing performed via undersampling.

Let us denote as $(\mathcal{X}, \mathcal{Y})$ the original unbalanced training sample and as $(X, Y)$ a balanced sample of $(\mathcal{X}, \mathcal{Y})$. This means that $(X, Y) \subset (\mathcal{X}, \mathcal{Y})$ and it contains a subset of the negatives in $(\mathcal{X}, \mathcal{Y})$. Let us define $s$ as a random binary selection variable for each of the $N$ samples in $(\mathcal{X}, \mathcal{Y})$, which takes the value 1 if the point is in $(X, Y)$ and 0 otherwise. It is possible to derive the relationship between the posterior probability of a model learnt on a balanced subset and the one learnt on the original unbalanced dataset.

We assume that the selection variable $s$ is independent of the input $x$ given the class $y$ (class-dependent selection): $p(s|y, x) = p(s|y)$. This assumption implies $p(x|y, s) = p(x|y)$, i.e. by removing observations at random in the majority class we do not change the within-class distributions. With undersampling there is a change in the prior probabilities ($p(y|s = 1) \neq p(y)$) and as a consequence the conditional probabilities of the class given the input are different as well, $p(y|x, s = 1) \neq p(y|x)$. The probability that a point $(x, y)$ is included in the balanced training sample is given by $p(s = 1|y, x)$.

Let the sign $+$ denote $y = 1$ and $-$ denote $y = 0$, e.g. $p(+, x) = p(y = 1, x)$ and $p(-, x) = p(y = 0, x)$. From Bayes' rule, using $p(s|y, x) = p(s|y)$, we can write:

$$p(+|x, s = 1) = \frac{p(s = 1|+)\, p(+|x)}{p(s = 1|+)\, p(+|x) + p(s = 1|-)\, p(-|x)} \tag{1}$$

As shown in our previous work [9], since $p(s = 1|+) = 1$ (all positive instances are kept) we can write (1) as:

$$p(+|x, s = 1) = \frac{p(+|x)}{p(+|x) + p(s = 1|-)\, p(-|x)} \tag{2}$$

Let us denote $\beta = p(s = 1|-)$ as the probability of selecting a negative instance with undersampling, $p = p(+|x)$ as the posterior probability of the positive class on the original dataset, and $p_s = p(+|x, s = 1)$ as the posterior probability after sampling. We can rewrite equation (2) as:

$$p_s = \frac{p}{p + \beta(1 - p)} \tag{3}$$

Using (3) we can obtain an expression of $p$ as a function of $p_s$:

$$p = \frac{\beta p_s}{\beta p_s - p_s + 1} \tag{4}$$
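As a quick numerical check of equations (3) and (4), the following Python sketch maps a posterior $p$ to its undersampled counterpart $p_s$ and back. The function names and the value of β are illustrative assumptions, not part of the paper.

```python
def undersampled_posterior(p: float, beta: float) -> float:
    """Equation (3): posterior after undersampling, given the original p."""
    return p / (p + beta * (1.0 - p))


def corrected_posterior(p_s: float, beta: float) -> float:
    """Equation (4): original posterior recovered from the undersampled p_s."""
    return beta * p_s / (beta * p_s - p_s + 1.0)


beta = 1000 / 9000  # hypothetical selection probability for negatives
for p in (0.01, 0.1, 0.5, 0.9):
    p_s = undersampled_posterior(p, beta)
    p_back = corrected_posterior(p_s, beta)
    print(f"p={p:.2f}  p_s={p_s:.4f}  recovered p={p_back:.4f}")
```

With β = 1 the two functions reduce to the identity, which matches the case where no negative instance is removed.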

Balancing an unbalanced problem corresponds to the case when $\beta = \frac{p(+)}{p(-)} \approx \frac{N^+}{N^-}$, where $N^+$ and $N^-$ denote the number of positive and negative instances in the dataset. In the following we will assume that $\frac{N^+}{N^-}$ provides an accurate estimation of the ratio of the prior probabilities. For such a level of $\beta$, a small variation at high values of $p_s$ induces a large change in $p$, while the opposite occurs for small values of $p_s$ [9]. When $\beta = 1$, all the negative instances are used for training, while for $\beta < 1$, only a subset of the negative instances is included in the training set. As $\beta$ decreases towards $\frac{N^+}{N^-}$, the resulting training set becomes more balanced. Note that $\frac{N^+}{N^-}$ is the minimum value for $\beta$, as for $\beta < \frac{N^+}{N^-}$ we would have more positives than negatives.

Suppose we have an unbalanced problem where the positives account for 10% of 10,000 observations (i.e., we have 1,000 positives and 9,000 negatives). Suppose we want a balanced dataset, $\beta = \frac{N^+}{N^-} \approx 0.11$, in which $\approx 88.9\%$ (8000/9000) of the negative instances are discarded. Table I shows how, by reducing $\beta$, the original unbalanced dataset becomes more balanced and smaller as negative instances are removed. After undersampling, the number of negatives is $N_s^- = \beta N^-$, while the number of positives stays the same, $N_s^+ = N^+$. The percentage of negatives ($perc^-$) in the dataset decreases as $N_s^- \to N^+$.

TABLE I. Undersampling a dataset with 1,000 positives in 10,000 observations. $N_s$ defines the size of the dataset after undersampling and $N_s^-$ ($N_s^+$) the number of negative (positive) instances for a given $\beta$. When $\beta = 0.11$ the negative samples represent 50% of the observations in the dataset.

| $N_s$ | $N_s^-$ | $N_s^+$ | $\beta$ | $perc^-$ |
|---|---|---|---|---|
| 2,000 | 1,000 | 1,000 | 0.11 | 50.00 |
| 2,800 | 1,800 | 1,000 | 0.20 | 64.29 |
| 3,700 | 2,700 | 1,000 | 0.30 | 72.97 |
| 4,600 | 3,600 | 1,000 | 0.40 | 78.26 |
| 5,500 | 4,500 | 1,000 | 0.50 | 81.82 |
| 6,400 | 5,400 | 1,000 | 0.60 | 84.38 |
| 7,300 | 6,300 | 1,000 | 0.70 | 86.30 |
| 8,200 | 7,200 | 1,000 | 0.80 | 87.80 |
| 9,100 | 8,100 | 1,000 | 0.90 | 89.01 |
| 10,000 | 9,000 | 1,000 | 1.00 | 90.00 |
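The rows of Table I follow directly from $N_s^- = \beta N^-$ and $N_s^+ = N^+$. The short Python sketch below recomputes them; the helper name `undersampling_row` is an assumption introduced for illustration, and rounding $\beta N^-$ to the nearest integer is a choice made here.

```python
def undersampling_row(n_pos: int, n_neg: int, beta: float):
    """Return (N_s, N_s^-, N_s^+, perc^-) for a given beta."""
    n_neg_s = round(beta * n_neg)   # negatives kept after undersampling
    n_s = n_pos + n_neg_s           # all positives are kept
    perc_neg = 100.0 * n_neg_s / n_s
    return n_s, n_neg_s, n_pos, perc_neg


N_POS, N_NEG = 1_000, 9_000
betas = [N_POS / N_NEG] + [b / 10 for b in range(2, 11)]
for beta in betas:
    n_s, n_neg_s, n_pos_s, perc = undersampling_row(N_POS, N_NEG, beta)
    print(f"{n_s:>6,} {n_neg_s:>6,} {n_pos_s:>6,} {beta:5.2f} {perc:6.2f}")
```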

B. Bias and class separability

In this section we show how the impact of the bias depends on the degree of separability of the classification task. Let $\omega^+$ and $\omega^-$ denote the class conditional probabilities $p(x|+)$ and $p(x|-)$, and $\pi^+$ ($\pi_s^+$) the class priors before (after) undersampling. It is possible to derive the relation between the bias and the difference $\delta = \omega^+ - \omega^-$ between the class conditional distributions. From Bayes' theorem we have:

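$$p = \frac{\omega^+ \pi^+}{\omega^+ \pi^+ + \omega^- \pi^-} \tag{5}$$

Since $\delta = \omega^+ - \omega^-$, i.e. $\omega^- = \omega^+ - \delta$, we can write (5) as:

$$p = \frac{\omega^+ \pi^+}{\omega^+ \pi^+ + (\omega^+ - \delta)\pi^-} = \frac{\omega^+ \pi^+}{\omega^+ (\pi^+ + \pi^-) - \delta \pi^-} = \frac{\omega^+ \pi^+}{\omega^+ - \delta \pi^-} \tag{6}$$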

since $\pi^+ + \pi^- = 1$. Similarly, since $\omega^+$ does not change with undersampling:

$$p_s = \frac{\omega^+ \pi_s^+}{\omega^+ - \delta \pi_s^-} \tag{7}$$

Now we can write $p_s - p$ as:

$$p_s - p = \frac{\omega^+ \pi_s^+}{\omega^+ - \delta \pi_s^-} - \frac{\omega^+ \pi^+}{\omega^+ - \delta \pi^-} \tag{8}$$

Since $p_s \geq p$ because of (3), $1 \geq p_s \geq 0$ and $1 \geq p \geq 0$, we have $1 \geq p_s - p \geq 0$. In Figure 1 we plot $p_s - p$ as a function of $\delta$ when $\pi_s^+ = 0.5$ and $\pi^+ = 0.1$. For small values of the class conditional densities it appears that the bias takes the highest values for $\delta$ values close to zero. This means that the bias is higher for similar class conditional probabilities (i.e. low separable configurations).

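Fig. 1. $p_s - p$ as a function of $\delta$, where $\delta = \omega^+ - \omega^-$, for values of $\omega^+ \in \{0.01, 0.1\}$ when $\pi_s^+ = 0.5$ and $\pi^+ = 0.1$. Note that $\delta$ is upper bounded to guarantee $1 \geq p_s \geq 0$ and $1 \geq p \geq 0$.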
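The quantity plotted in Fig. 1 can be evaluated numerically from the difference between the posterior after and before undersampling, each written in the form of equation (6). The Python sketch below is only an illustration under the figure's settings ($\pi^+ = 0.1$, $\pi_s^+ = 0.5$); the function names and the grid of $\delta$ values are assumptions.

```python
def posterior(omega_pos: float, delta: float, prior_pos: float) -> float:
    """Form of equation (6): omega+ * pi+ / (omega+ - delta * pi-)."""
    prior_neg = 1.0 - prior_pos
    return omega_pos * prior_pos / (omega_pos - delta * prior_neg)


def bias(omega_pos, delta, prior_pos=0.1, prior_pos_s=0.5):
    """p_s - p: the same expression evaluated with the two sets of priors."""
    return (posterior(omega_pos, delta, prior_pos_s)
            - posterior(omega_pos, delta, prior_pos))


omega_pos = 0.1
# delta must keep omega- = omega+ - delta inside [0, 1].
for delta in (-0.9, -0.4, -0.1, 0.0, 0.05, 0.09):
    print(f"delta={delta:+.2f}  p_s - p = {bias(omega_pos, delta):.3f}")
```

When $\delta = 0$ the two class conditional probabilities coincide, both posteriors reduce to the priors, and the bias equals $\pi_s^+ - \pi^+ = 0.4$.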

C. Adjusting posterior probabilities to new priors

Equation (3) shows how the conditional distribution of the balanced configuration relates to the conditional distribution in the original unbalanced setting. However, after a classification model is learnt on a balanced training set, it is normally used to predict a testing set, which is likely to have an unbalanced distribution similar to the original training set. This means that the posterior probability of a model learnt on the balanced training set should be adjusted for the change in priors between the training and testing sets. In this paper we propose to use equation (4) to correct the posterior probability estimates after undersampling. Let us call $p'$ the bias-corrected probability obtained from $p_s$ using (4):

$$p' = \frac{\beta p_s}{\beta p_s - p_s + 1} \tag{9}$$

Equation (9) can be seen as a special case of the framework proposed by Saerens et al. [21] and Elkan [13] for correcting the posterior probability under a change in priors, namely the case in which the testing set shares the priors of the original (unbalanced) training set (see Appendix). When we know the priors in the testing set we can correct the probability with Elkan's and Saerens' equations. However, these priors are usually unknown and must be estimated. If we assume that the training and testing sets have the same priors, we can use (9) for calibrating $p_s$. Note that the above transformation will not affect the ranking produced by $p_s$: equation (9) defines a monotone transformation, hence the ranking of $p_s$ will be the same as that of $p'$. While $p$ is estimated using all the samples in the unbalanced dataset, $p_s$ and $p'$ are computed considering a subset of the original samples, and therefore their estimates are subject to higher variance [9]. The variance effect is typically addressed by the use of averaging strategies (e.g. UnderBagging [23]), but is not the focus of our paper.
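The claim that the correction in equation (9) leaves the ranking untouched can be checked empirically. The following Python sketch applies the transformation to random scores and compares the sort order before and after; the scores, the seed and the value of β are arbitrary choices made for illustration.

```python
import random


def corrected_posterior(p_s: float, beta: float) -> float:
    """Equation (9): bias-corrected probability from the undersampled one."""
    return beta * p_s / (beta * p_s - p_s + 1.0)


random.seed(0)
beta = 0.11  # illustrative value, roughly N+/N- for a 10% minority class
p_s_values = [random.random() for _ in range(1000)]
p_corr = [corrected_posterior(v, beta) for v in p_s_values]

# Indices sorted by score, before and after the correction.
order_before = sorted(range(len(p_s_values)), key=p_s_values.__getitem__)
order_after = sorted(range(len(p_corr)), key=p_corr.__getitem__)
same_ranking = order_before == order_after
print("ranking preserved:", same_ranking)
```

Any ranking-based measure computed on the corrected scores is therefore identical to the one computed on the uncorrected scores.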

D. Synthetic datasets

We now use two synthetic datasets to analyze the bias introduced by undersampling and understand how it affects the posterior probability. Given the simulated setting we are able to control the true posterior probability $p$ and measure the sampling bias embedded in $p_s$. We see that the bias is larger when the two classes are overlapping and that stronger undersampling induces a larger bias.

Let us consider two binary classification tasks, wherein positive and negative observations are drawn randomly from two distinct normal distributions. For both datasets we set the number of positives to be 10% of 10,000 observations, with $\omega^- \sim \mathcal{N}(0, \sigma)$ and $\omega^+ \sim \mathcal{N}(\mu, \sigma)$, where $\mu > 0$. The distance between the two normal distributions, $\mu$, is used to control the degree of separation between the classes. When $\mu$ is large, the two classes are well-separated, while for small $\mu$ they strongly overlap. In the first dataset, we simulate a classification problem with a very low degree of separation (using $\mu = 3$), in the second a task with well-separated classes using $\mu = 15$ (see Figure 2). The first simulates a difficult classification task, the latter an easy one. For both datasets we set $\sigma = 3$.

[Figure 2: two histogram panels; y-axis: Count; legend: class.]

Fig. 2. Synthetic datasets with positive and negative observations sampled from two different normal distributions. Positives account for 10% of the 10,000 random values. On the left we have a difficult problem with overlapping classes ($\mu = 3$), on the right an easy problem where the classes are well-separated ($\mu = 15$).
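For the two synthetic tasks the true posterior is available in closed form through Bayes' theorem, which makes the effect of changing the priors easy to inspect. The Python sketch below is a hedged illustration: it assumes that σ denotes the standard deviation of both normal distributions, and the function names are introduced here.

```python
import math


def normal_pdf(x: float, mean: float, std: float) -> float:
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def true_posterior(x, mu, sigma=3.0, prior_pos=0.1):
    """p(+|x) with negatives ~ N(0, sigma) and positives ~ N(mu, sigma)."""
    w_pos = normal_pdf(x, mu, sigma) * prior_pos
    w_neg = normal_pdf(x, 0.0, sigma) * (1.0 - prior_pos)
    return w_pos / (w_pos + w_neg)


for mu in (3.0, 15.0):
    for x in (0.0, mu / 2, mu):
        p = true_posterior(x, mu)                   # original priors
        p_s = true_posterior(x, mu, prior_pos=0.5)  # balanced priors
        print(f"mu={mu:4.1f} x={x:5.2f}  p={p:.4f}  p_s={p_s:.4f}")
```

At the midpoint $x = \mu/2$ the two class densities are equal, so each posterior equals the corresponding prior (0.1 before and 0.5 after balancing).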

Figure 3 shows how $p_s$ changes with $\beta$ ($p$ corresponds to $\beta = 1$). When $\beta \to \frac{N^+}{N^-}$ the probability shifts to the left, allowing for higher probabilities on the right-hand side of the chart (where positive observations are located). In other words, removing negative samples with undersampling increases the positive posterior probability, moving the classification boundary so that more samples are classified as positive. The stronger the undersampling, the larger the shift, i.e. the drift of $p_s$ from $p$. The drift is larger in the dataset with non-separable classes, confirming the results of Section III-B.

Fig. 3. Posterior probability as a function of $\beta$. On the left the task with $\mu = 3$ and on the right the one with $\mu = 15$. Note that $p$ corresponds to $\beta = 1$ and $p_s$ to $\beta < 1$.

Figure 4 displays $p_s$, $p'$ and $p$ for $\beta = \frac{N^+}{N^-}$ in the dataset with overlapping classes ($\mu = 3$), and we see that $p'$ closely approximates $p$. As $p' \approx p$, we can say that the above transformation based on (9) is able to correct the probability drift that occurs with undersampling. The correction seems particularly effective on the left-hand side (where the majority class is located), while it is less precise on the right-hand side, where we expect to have larger variance on $p'$ due to the small number of positive samples.

[Figure 4: axis labels: Posterior probability, Probability.]

Fig. 4. Posterior probabilities $p_s$, $p'$ and $p$ for $\beta = \frac{N^+}{N^-}$ in the dataset with overlapping classes ($\mu = 3$).

IV. CLASSIFICATION THRESHOLD WITH UNBIASED PROBABILITIES

In the previous section we showed how undersampling induces biased posterior probabilities and presented a method to correct for this bias. We now investigate how to use the corrected probabilities for classification.

A. Threshold with Bayes Minimum Risk

The standard decision-making process based on Bayes decision theory, developed in most textbooks on pattern recognition or machine learning (see for example [24], [3], [12]), defines the optimal class of a sample as the one minimizing the risk (expected value of the loss function). In a binary classification problem, the risks of the positive and negative classes are defined as follows:

$$r^+ = (1 - p)\, l_{1,0} + p\, l_{1,1}$$

$$r^- = (1 - p)\, l_{0,0} + p\, l_{0,1}$$

where $p = p(+|x)$ and $l_{i,j}$ is the loss (cost) incurred in deciding $i$ when the true class is $j$.

TABLE II. Loss matrix

| | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | $l_{1,1}$ | $l_{1,0}$ |
| Predicted Negative | $l_{0,1}$ | $l_{0,0}$ |

The Bayes decision rule for minimizing the risk can be stated as follows: assign the positive class to samples with $r^+ \leq r^-$, and the negative class otherwise. This is equivalent to predicting a sample as positive when $p > \tau$, where the threshold $\tau$ is:

$$\tau = \frac{l_{1,0} - l_{0,0}}{l_{1,0} - l_{0,0} + l_{0,1} - l_{1,1}}$$

Typically the cost of a correct prediction is zero, hence $l_{0,0} = 0$ and $l_{1,1} = 0$. In an unbalanced problem, the cost of missing a positive instance (false negative) is usually higher than the cost of misclassifying a negative (false positive). When the costs of a false negative and false positive are unknown, a natural solution is to set the costs using the priors. Let $l_{1,0} = \pi^+$ and $l_{0,1} = \pi^-$, where $\pi^+ = p(+)$ and $\pi^- = p(-)$. Then, since $\pi^- > \pi^+$, we have $l_{0,1} > l_{1,0}$ as desired. We can then write:

$$\tau = \frac{l_{1,0}}{l_{1,0} + l_{0,1}} = \frac{\pi^+}{\pi^+ + \pi^-} = \pi^+ \tag{10}$$

since $\pi^+ + \pi^- = 1$. This is also the optimal threshold in a cost-sensitive application where the costs are defined using the priors [13].
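The threshold formula and its special case (10) translate directly into code. The Python sketch below is a minimal illustration; the function names and the example prior are assumptions made here.

```python
def bayes_threshold(l10: float, l01: float, l00: float = 0.0, l11: float = 0.0) -> float:
    """Bayes minimum risk threshold; l_ij is the loss of deciding i when the truth is j."""
    return (l10 - l00) / (l10 - l00 + l01 - l11)


def classify(p: float, tau: float) -> int:
    """Predict the positive class (1) when the posterior exceeds the threshold."""
    return 1 if p > tau else 0


prior_pos = 0.1  # hypothetical pi+
# Costs set from the priors: l10 = pi+ (false positive), l01 = pi- (false negative).
tau = bayes_threshold(l10=prior_pos, l01=1.0 - prior_pos)
print(f"tau = {tau:.3f}")
print(classify(0.2, tau), classify(0.05, tau))
```

With equal misclassification costs the same function returns the familiar threshold of 0.5.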

B. Classiﬁcation threshold adjustment

Even if undersampling produces biased probability estimates, it is often used to balance datasets with skewed class distributions because several classifiers have empirically shown better performance when trained on balanced datasets [25], [14]. Let $\tau_s$ denote the threshold used to classify an observation after undersampling; from (10) we have $\tau_s = \pi_s^+$, where $\pi_s^+$ is the positive class prior after undersampling. In the case of undersampling with $\beta = \frac{N^+}{N^-}$ (balanced training set) we have $\tau_s = 0.5$.

When correcting $p_s$ with (9), we must also correct the probability threshold to maintain the predictive accuracy defined by $\tau_s$ (otherwise we would use different misclassification costs for $p'$). Let $\tau'$ be the threshold for the unbiased probability $p'$. From Elkan [13]:

$$\frac{\tau'}{1 - \tau'} \cdot \frac{1 - \tau_s}{\tau_s} = \beta \tag{11}$$

$$\tau' = \frac{\beta \tau_s}{(\beta - 1)\tau_s + 1} \tag{12}$$

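Using $\tau_s = \pi_s^+$, (12) becomes:

$$\tau' = \frac{\beta \pi_s^+}{(\beta - 1)\pi_s^+ + 1}$$

and, substituting $\pi_s^+ = \frac{N^+}{N^+ + \beta N^-}$:

$$\tau' = \frac{\beta \frac{N^+}{N^+ + \beta N^-}}{(\beta - 1)\frac{N^+}{N^+ + \beta N^-} + 1} = \frac{N^+}{N^+ + N^-} = \pi^+$$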

The optimal threshold to use with $p'$ is equal to the one for $p$. As an alternative to classifying observations using $p_s$ with $\tau_s$, we can obtain equivalent results using $p'$ with $\tau'$. In summary, as a result of undersampling, a higher number of observations are predicted as positive, but the posterior probabilities are biased due to a change in the priors. Equation (12) allows us to find the threshold that guarantees equal accuracy after the posterior probability correction. Therefore, in order to classify observations with unbiased probabilities after undersampling, we first have to obtain $p'$ from $p_s$ with (9) and then use $\tau'$ as the classification threshold.
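The equivalence between classifying with $(p_s, \tau_s)$ and with $(p', \tau')$ can be verified on a handful of scores. The Python sketch below applies equations (9) and (12); the class counts and the scores are hypothetical values chosen for illustration.

```python
def corrected_posterior(p_s: float, beta: float) -> float:
    """Equation (9)."""
    return beta * p_s / (beta * p_s - p_s + 1.0)


def adjusted_threshold(tau_s: float, beta: float) -> float:
    """Equation (12)."""
    return beta * tau_s / ((beta - 1.0) * tau_s + 1.0)


n_pos, n_neg = 1_000, 9_000                 # hypothetical class counts
beta = n_pos / n_neg                        # balanced undersampling
tau_s = n_pos / (n_pos + beta * n_neg)      # pi_s^+, about 0.5
tau_corr = adjusted_threshold(tau_s, beta)  # should match pi^+ = 0.1

scores = [0.05, 0.3, 0.49, 0.51, 0.7, 0.95]  # hypothetical p_s values
labels_biased = [int(s > tau_s) for s in scores]
labels_corrected = [int(corrected_posterior(s, beta) > tau_corr) for s in scores]
print(f"tau_s={tau_s:.3f}  tau'={tau_corr:.3f}")
print(labels_biased, labels_corrected)
```

Both decision rules label the same observations as positive because (9) and (12) are the same monotone transformation applied to the score and to the threshold.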

V. MEASURES OF CLASSIFICATION ACCURACY AND PROBABILITY CALIBRATION

The choice of balancing the training set or leaving it unbalanced has a direct influence on the classification model that is learnt. A model learnt on a balanced training set has the two classes equally represented. In the case of an unbalanced training set, the model learns from a dataset skewed towards one class. Hence, the classification model learnt after undersampling is different from the one learnt on the original dataset. In this section we compare the probability estimates of two models, one learnt in the presence and the other in the absence of undersampling. The probabilities are evaluated in terms of ranking produced, classification accuracy and calibration.

To assess the impact of undersampling, we first use accuracy measures based on the confusion matrix (Table III).

TABLE III. Confusion matrix

| | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | TP | FP |
| Predicted Negative | FN | TN |

In an unbalanced class problem, it is well-known that quantities like TPR ($\frac{TP}{TP + FN}$), TNR ($\frac{TN}{FP + TN}$) and average accuracy ($\frac{TP + TN}{TP + FN + FP + TN}$) are misleading assessment measures [10]. Let us define Precision $= \frac{TP}{TP + FP}$ and Recall $= \frac{TP}{TP + FN}$. Typically we want to have high confidence that observations predicted as positive are actually positive (high Precision) as well as a high detection rate of the positives (high Recall). However, Precision and Recall share an inverse relationship, whereby high Precision comes at the cost of low Recall and vice versa. An accuracy measure based on both Precision and Recall is the F-measure, also known as F1-score or F-score. F-measure ($\frac{2 \times Precision \times Recall}{Precision + Recall}$) and G-mean ($\sqrt{TPR \times TNR}$) are often considered to be useful and effective performance measures for unbalanced datasets.
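The measures defined above are straightforward to compute from the four cells of the confusion matrix. The Python sketch below uses hypothetical counts for a classifier that rarely predicts the positive class, to illustrate how average accuracy can look good while Recall and G-mean stay low; the function name and the counts are assumptions.

```python
import math


def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy measures derived from the confusion matrix of Table III."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # equals TPR
    tnr = tn / (fp + tn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * tnr)
    return {
        "precision": precision,
        "recall": recall,
        "tnr": tnr,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }


# Hypothetical counts: 100 positives, 900 negatives, few positives detected.
m = confusion_metrics(tp=10, fp=10, fn=90, tn=890)
for name, value in m.items():
    print(f"{name:>10}: {value:.3f}")
```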

An alternative way to measure the quality of a probability estimate is to look at the ranking produced by the probability. A good probability estimate should rank first all the minority class observations and then those from the majority class. In other words, if $\hat{p}$ is a good estimate of $p(+|x)$, then $\hat{p}$ should give high probability to the positive examples and small probability to the negatives. A well-accepted ranking measure for unbalanced datasets is AUC (Area Under the ROC curve) [5]. To avoid the problem of different misclassification costs, we use an estimation of AUC based on the Mann-Whitney statistic [10]. This estimate measures the probability that a random minority class example ranks higher than a random majority class example [15].
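The Mann-Whitney view of AUC can be written as a pairwise comparison between positive and negative scores. The Python sketch below is a simple quadratic-time illustration, not the estimator used in the experiments; counting ties as one half is a common convention assumed here, and the scores are made up.

```python
def auc_mann_whitney(scores_pos, scores_neg) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def corrected_posterior(p_s: float, beta: float) -> float:
    """Equation (9), used to show that AUC is unchanged by the correction."""
    return beta * p_s / (beta * p_s - p_s + 1.0)


pos = [0.9, 0.6, 0.4]        # hypothetical scores of minority examples
neg = [0.5, 0.3, 0.2, 0.1]   # hypothetical scores of majority examples
beta = 0.11
auc_raw = auc_mann_whitney(pos, neg)
auc_corr = auc_mann_whitney(
    [corrected_posterior(s, beta) for s in pos],
    [corrected_posterior(s, beta) for s in neg],
)
print(f"AUC raw={auc_raw:.4f}  AUC corrected={auc_corr:.4f}")
```

Since the correction is monotone, every pairwise comparison has the same outcome before and after it, so the two AUC values coincide.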

In order to measure the probability calibration, we used the Brier Score (BS) [4]. BS is a measure of the average squared loss between the estimated probabilities and the actual class value. It allows us to evaluate how well the probabilities are calibrated: the lower the BS, the more accurate the probabilistic predictions of a model. Let $\hat{p}(y_i|x_i)$ be the probability estimate of sample $x_i$ to have class $y_i \in \{1, 0\}$; BS is defined as:

$$BS = \frac{1}{N} \sum_{i=1}^{N} \{y_i - \hat{p}(y_i|x_i)\}^2 \tag{13}$$
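A direct implementation of the Brier Score makes the calibration effect of undersampling visible on a toy example. The Python sketch below assumes that the estimate passed in is the predicted probability of the positive class and that labels are coded as 0/1; the data are made up for illustration.

```python
def brier_score(y_true, p_hat) -> float:
    """Average squared difference between 0/1 labels and predicted probabilities."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, p_hat)) / n


def corrected_posterior(p_s: float, beta: float) -> float:
    """Equation (9)."""
    return beta * p_s / (beta * p_s - p_s + 1.0)


# Toy setting: 10% positives and an uninformative model that always
# outputs the class prior of the data it was trained on.
y = [0] * 9 + [1]
beta = 1 / 9
p_s = [0.5] * len(y)                               # balanced training prior
p_corr = [corrected_posterior(v, beta) for v in p_s]

print(f"BS with p_s : {brier_score(y, p_s):.3f}")
print(f"BS with p'  : {brier_score(y, p_corr):.3f}")
```

In this toy case the corrected probabilities equal the true prior of 0.1 and obtain a lower Brier Score than the uncorrected ones.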

VI. EXPERIMENTAL RESULTS

In the previous sections we used synthetic datasets to study the effect of undersampling. We now consider real-world unbalanced datasets from the UCI repository used in [9]. For each dataset we adopt a 10-fold cross validation (CV) to test our models and we repeat the CV 10 times. In particular, we use a stratified CV, where the class proportion in the datasets is kept the same over all the folds. As the original datasets are unbalanced, the resulting folds are unbalanced as well. For each fold of CV we learn two models: one using all the observations and the other using only those remaining after undersampling. Both models are then tested on the same testing set. We use several supervised classification algorithms available in R [20]: Random Forest [18], SVM [17], and Logit Boost [22].

We denote as $\hat{p}_s$ and $\hat{p}$ the posterior probability estimates obtained with and without undersampling and as $\hat{p}'$ the bias-corrected probability obtained from $\hat{p}_s$ with equation (9). Let $\tau$, $\tau_s$ and $\tau'$ be the probability thresholds used for $\hat{p}$, $\hat{p}_s$ and $\hat{p}'$ respectively, where $\tau = \pi^+$, $\tau_s = \pi_s^+$ and $\tau' = \pi^+$. The goal of these experiments is to compare which probability estimates return the best ranking (AUC), calibration (BS) and classification accuracy (G-mean) when coupled with the thresholds defined before. In undersampling, the amount of sampling defined by $\beta$ is usually set equal to $\frac{N^+}{N^-}$, leading to a balanced dataset where $\pi_s^+ = \pi_s^- = 0.5$. However, there is no reason to believe that this is the optimal sampling rate. Often, the optimal rate can be found only a posteriori after trying different values of $\beta$ [9]. For this reason we replicate the CV with different $\beta$ such that $\frac{N^+}{N^-} \leq \beta \leq 1$, and for each CV the accuracy is computed as the average G-mean (or AUC) over all the folds.

TABLE IV. DATASETS FROM THE UCI REPOSITORY USED IN [9].

| Datasets | $N$ | $N^+$ | $N^-$ | $N^+/N$ |
|---|---|---|---|---|
| ecoli | 336 | 35 | 301 | 0.10 |
| glass | 214 | 17 | 197 | 0.08 |
| letter-a | 20000 | 789 | 19211 | 0.04 |
| letter-vowel | 20000 | 3878 | 16122 | 0.19 |
| ism | 11180 | 260 | 10920 | 0.02 |
| letter | 20000 | 789 | 19211 | 0.04 |
| oil | 937 | 41 | 896 | 0.04 |
| page | 5473 | 560 | 4913 | 0.10 |
| pendigits | 10992 | 1142 | 9850 | 0.10 |
| PhosS | 11411 | 613 | 10798 | 0.05 |
| satimage | 6430 | 625 | 5805 | 0.10 |
| segment | 2310 | 330 | 1980 | 0.14 |
| boundary | 3505 | 123 | 3382 | 0.04 |
| estate | 5322 | 636 | 4686 | 0.12 |
| cam | 18916 | 942 | 17974 | 0.05 |
| compustat | 13657 | 520 | 13137 | 0.04 |
| covtype | 38500 | 2747 | 35753 | 0.07 |

In Table V we report the results over all the datasets. For each dataset, we rank the probability estimates $\hat{p}_s$, $\hat{p}$ and $\hat{p}'$ from the worst to the best performing for different values of $\beta$. We then sum the ranks over all the values of $\beta$ and over all datasets. More formally, let $R_{i,k,b} \in \{1, 2, 3\}$ be the rank of probability $i$ on dataset $k$ when $\beta = b$. The probability with the highest accuracy in $k$ when $\beta = b$ has $R_{i,k,b} = 3$ and the one with the lowest has $R_{i,k,b} = 1$. Then the sum of ranks for the probability $i$ is defined as $\sum_k \sum_b R_{i,k,b}$. The higher the sum, the higher the number of times that one probability has higher accuracy than the others.
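The rank-sum computation can be illustrated with a short Python sketch. The function names and the accuracy values are hypothetical, and the treatment of ties is an assumption: tied estimates are given their average rank here, which would be consistent with the half-integer sums that appear in Table V, but the source does not state the rule.

```python
def rank_scores(scores):
    """Rank values from 1 (worst) upward; tied values share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks

def rank_sums(results):
    """results[(dataset, beta)] = {estimate: accuracy}; returns the sum of ranks per estimate."""
    totals = {}
    for scores_by_name in results.values():
        names = list(scores_by_name)
        ranks = rank_scores([scores_by_name[n] for n in names])
        for name, r in zip(names, ranks):
            totals[name] = totals.get(name, 0.0) + r
    return totals

# Made-up accuracies for one dataset at two values of beta (illustration only).
example = {
    ("ecoli", 0.5): {"p": 0.80, "p_s": 0.85, "p_prime": 0.85},
    ("ecoli", 1.0): {"p": 0.80, "p_s": 0.78, "p_prime": 0.79},
}
print(rank_sums(example))
```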

For AUC, a higher rank sum means a higher AUC and hence a better ranking returned by the probability. Similarly, with G-mean, a higher rank sum corresponds to higher predictive accuracy. However, in the case of BS, a higher rank sum means poorer probability calibration (larger bias). Table V shows in bold the probabilities with the best rank sum according to the different metrics. For each metric and classifier it reports the p-values of the paired t-test based on the ranks between $\hat{p}$ and $\hat{p}'$ and between $\hat{p}$ and $\hat{p}_s$.

In terms of AUC, we see that $\hat{p}_s$ and $\hat{p}'$ perform better than $\hat{p}$ for LB and SVM. The rank sum is the same for $\hat{p}_s$ and $\hat{p}'$ since the two probabilities are linked by a monotone transformation (equation (9)). If we look at G-mean, $\hat{p}_s$ and $\hat{p}'$ return better accuracy than $\hat{p}$ two times out of three. In this case, the rank sums of $\hat{p}_s$ and $\hat{p}'$ are the same since we used $\tau_s$ and $\tau'$ as the classification thresholds, where $\tau'$ is obtained from $\tau_s$ using (12). If we look at the p-values, we can strongly reject the null hypothesis that the accuracy of $\hat{p}_s$ and $\hat{p}$ are from the same distribution. For all classifiers, $\hat{p}$ is the probability estimate with the best calibration (lowest rank sum with BS), followed by $\hat{p}'$ and $\hat{p}_s$. The rank sum of $\hat{p}'$ is always lower than that of $\hat{p}_s$, indicating that $\hat{p}'$ has lower bias than $\hat{p}_s$. This result confirms our theory that equation (9) allows one to reduce the bias introduced by undersampling.
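A small synthetic check makes both observations concrete: a monotone map leaves AUC unchanged, while the Brier score improves after correction. This is a toy sketch, not the paper's experiment. The scores are random numbers standing in for $\hat{p}_s$, the labels are drawn so that the corrected value is the true posterior (a simplifying assumption), $\beta = 0.1$ is arbitrary, and the helper names are hypothetical.

```python
import random

def correct(p_s, beta):
    """Bias-corrected probability beta*p_s / (beta*p_s - p_s + 1), as in equation (9)."""
    return beta * p_s / (beta * p_s - p_s + 1.0)

def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

def brier(scores, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)

rng = random.Random(0)
beta = 0.1
p_s = [rng.random() for _ in range(2000)]                  # stand-in for undersampled estimates
p_prime = [correct(v, beta) for v in p_s]                  # corrected estimates
labels = [1 if rng.random() < v else 0 for v in p_prime]   # labels drawn from p'

print("AUC  p_s:", auc(p_s, labels), " p':", auc(p_prime, labels))
print("BS   p_s:", brier(p_s, labels), " p':", brier(p_prime, labels))
```

Because the correction preserves the ordering of the scores, the two AUC values coincide, whereas the Brier score of the uncorrected scores is inflated by the shift in priors.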

In summary, from this experiment we can conclude that undersampling does not always improve the ranking or classification accuracy of an algorithm, but when it does we should use $\hat{p}'$ instead of $\hat{p}_s$ because the former always has better calibration.

We now consider a real-world dataset, composed of credit card transactions from September 2013 made available by our industrial partner.¹ It contains a subset of online transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions, and the minimum value of $\beta$ is $\approx 0.00173$. In Figure 5 we have the AUC for different values of $\beta$.

¹ The dataset is available at http://www.ulb.ac.be/di/map/adalpozz/data/creditcard.Rdata

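TABLE V. SUM OF RANKS AND P-VALUES OF THE PAIRED T-TEST BETWEEN THE RANKS OF $\hat{p}$ AND $\hat{p}'$ AND BETWEEN $\hat{p}$ AND $\hat{p}_s$ FOR DIFFERENT METRICS. IN BOLD THE PROBABILITIES WITH THE BEST RANK SUM (HIGHER FOR AUC AND G-MEAN, LOWER FOR BS).

| Metric | Algo | $\sum R_{\hat{p}}$ | $\sum R_{\hat{p}_s}$ | $\sum R_{\hat{p}'}$ | $\rho(R_{\hat{p}}, R_{\hat{p}_s})$ | $\rho(R_{\hat{p}}, R_{\hat{p}'})$ |
|---|---|---|---|---|---|---|
| AUC | LB | 22,516 | **23,572** | **23,572** | 0.322 | 0.322 |
| AUC | RF | **24,422** | 22,619 | 22,619 | 0.168 | 0.168 |
| AUC | SVM | 19,595 | **19,902.5** | **19,902.5** | 0.873 | 0.873 |
| G-mean | LB | **23,281** | 23,189.5 | 23,189.5 | 0.944 | 0.944 |
| G-mean | RF | 22,986 | **23,337** | **23,337** | 0.770 | 0.770 |
| G-mean | SVM | 19,550 | **19,925** | **19,925** | 0.794 | 0.794 |
| BS | LB | **19,809.5** | 29,448.5 | 20,402 | 0.000 | 0.510 |
| BS | RF | **18,336** | 28,747 | 22,577 | 0.000 | 0.062 |
| BS | SVM | **17,139** | 23,161 | 19,100 | 0.001 | 0.156 |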

The boxplots of $\hat{p}_s$ and $\hat{p}'$ are identical because of (9); they increase as $\beta \to N^+/N^-$ and have a higher median than that of $\hat{p}$. This example shows how, in the case of extreme class imbalance, undersampling can improve the predictive accuracy of several classification algorithms.
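The class proportions quoted for the Credit-card dataset follow directly from the two counts given in the text (492 frauds out of 284,807 transactions). The short check below recomputes them; the variable names are illustrative, and the minimum $\beta$ is taken to be $N^+/N^-$, the value that balances the two classes.

```python
n_total = 284_807          # transactions in the Credit-card dataset
n_pos = 492                # frauds (positive class)
n_neg = n_total - n_pos    # genuine transactions

pi_pos = n_pos / n_total   # prior of the positive class
beta_min = n_pos / n_neg   # smallest beta, which balances the two classes

print(f"positive share: {100 * pi_pos:.4f}%")
print(f"minimum beta:   {beta_min:.5f}")
```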

[Figure 5: boxplots for the Credit-card dataset; panel title: SVM; x-axis: beta (0.1 to 1); y-axis: AUC (0.900 to 1.000); faceted by Probability.]

Fig. 5. Boxplot of AUC for different values of $\beta$ in the Credit-card dataset.

[Figure 6: boxplots for the Credit-card dataset; panel title: SVM; x-axis: beta (0.1 to 1); y-axis ticks: 3e-04 to 9e-04; faceted by Probability.]

Fig. 6. Boxplot of BS for different values of $\beta$ in the Credit-card dataset.

In Figure 6 we have the BS for different values of $\beta$. The boxplots of $\hat{p}'$ show in general smaller calibration error (lower BS) than those of $\hat{p}_s$, and the latter have higher BS especially for small values of $\beta$. This supports our previous results, which found that the loss in probability calibration for $\hat{p}_s$ is greater the stronger the undersampling.
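The dependence of the drift on $\beta$ can be seen by inverting the correction $p = \beta p_s / (\beta p_s - p_s + 1)$, which gives $p_s = p / (p + \beta (1 - p))$. The sketch below evaluates this for a single hypothetical instance with true posterior $p = 0.1$; the function name and the chosen values of $\beta$ are illustrative only.

```python
def undersampled_posterior(p, beta):
    """Posterior p_s obtained after undersampling when the true posterior is p."""
    return p / (p + beta * (1.0 - p))

p = 0.1  # hypothetical true posterior for one instance
for beta in (1.0, 0.5, 0.1, 0.01):
    p_s = undersampled_posterior(p, beta)
    # The squared gap between p_s and p is this instance's extra calibration error.
    print(f"beta={beta:<5} p_s={p_s:.3f} squared drift={(p_s - p) ** 2:.3f}")
```

With $\beta = 1$ there is no drift; as $\beta$ shrinks, $p_s$ moves towards 1 and the squared drift grows.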

VII. CONCLUSION

In this paper, we study the bias introduced in the posterior probabilities that occurs as an artifact of undersampling. We use several synthetic datasets to analyze this problem from a theoretical perspective, and then ground our findings with an empirical evaluation over several real-world datasets.

The first result of the paper is that the bias due to the instance selection procedure in undersampling is essentially equivalent to the bias that occurs with a change in the priors when the within-class distributions remain stable. With undersampling, we create a different training set, where the classes are less unbalanced. However, if we make the assumption that the training and testing sets come from the same distribution, it follows that the probability estimates obtained after undersampling are biased. As a result of undersampling, the posterior probability $\hat{p}_s$ is shifted away from the true distribution, and the optimal separation boundary moves towards the majority class so that more cases are classified into the minority class.

By making the assumption that prior probabilities do not change between training and testing, i.e. that both come from the same data generating process, we propose the transformation given in (9), which allows us to remove the drift in $\hat{p}_s$ due to undersampling. The bias on $\hat{p}_s$ registered by BS gets larger for small values of $\beta$, which means stronger undersampling produces probabilities with poorer calibration (larger loss). With synthetic, UCI and Credit-card datasets, the drift-corrected probability ($\hat{p}'$) has significantly better calibration than $\hat{p}_s$ (lower Brier Score).

Even if undersampling produces poorly calibrated probability estimates $\hat{p}_s$, several studies have shown that it often provides better predictive accuracy than $\hat{p}$ [25], [14]. To improve the calibration of $\hat{p}_s$ we propose to use $\hat{p}'$, since this transformation does not affect the ranking. In order to maintain the accuracy obtained with $\hat{p}_s$ and the probability threshold $\tau_s$, we propose to use $\hat{p}'$ together with $\tau'$ to account for the change in priors. By changing the undersampling rate $\beta$ we give different costs to false positives and false negatives; combining $\hat{p}'$ with $\tau'$ allows one to maintain the same misclassification costs of a classification strategy with $\hat{p}_u$ and $\tau_u$ for any value of $\beta$.
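The claim that pairing $\hat{p}'$ with $\tau'$ preserves the decisions made with $\hat{p}_s$ and $\tau_s$ can be checked numerically. The sketch below assumes that (12) applies the same monotone map as (9) to the threshold; equation (12) is not reproduced in this section, so that form is an assumption, although it does recover $\tau' = \pi^+$ as stated in the experimental setup. The counts are those of the Credit-card dataset, while $\beta = 0.01$ and the scores are made up.

```python
def correct(value, beta):
    """Monotone map beta*v / (beta*v - v + 1), as in equation (9)."""
    return beta * value / (beta * value - value + 1.0)

n_pos, n_neg, beta = 492, 284_315, 0.01     # Credit-card counts, hypothetical beta
tau_s = n_pos / (n_pos + beta * n_neg)      # tau_s = pi_s^+
tau_prime = correct(tau_s, beta)            # assumed form of (12)
pi_pos = n_pos / (n_pos + n_neg)            # pi^+, the value tau' should take

scores_s = [0.02, 0.10, 0.20, 0.60, 0.95]   # hypothetical p_s estimates
before = [p > tau_s for p in scores_s]                     # decisions with (p_s, tau_s)
after = [correct(p, beta) > tau_prime for p in scores_s]   # decisions with (p', tau')

print(before == after, abs(tau_prime - pi_pos) < 1e-12)
```

Since the map is strictly increasing on $[0, 1]$ for $0 < \beta \le 1$, any score above $\tau_s$ is mapped above $\tau'$, so the predicted classes are unchanged while the probabilities themselves are recalibrated.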

Finally, we considered a highly unbalanced dataset (Credit-card), where the minority class accounts for only 0.172% of all observations. In this dataset, the large improvement in accuracy obtained with undersampling was coupled with poorly calibrated probabilities (large BS). By correcting the posterior probability and changing the threshold we were able to improve calibration without losing predictive accuracy. Obtaining well-calibrated classifiers is particularly important in decision systems based on fraud detection. This is one of the rare papers making available the fraud detection dataset used for testing.

ACKNOWLEDGMENTS

A. Dal Pozzolo is supported by the Doctiris scholarship funded by Innoviris, Brussels, Belgium. G. Bontempi is supported by the BridgeIRIS and BruFence projects funded by Innoviris, Brussels, Belgium.

APPENDIX

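Let $p_t = p(y_t = + \mid x_t)$ be the posterior probability for a testing instance $(x_t, y_t)$, where the testing set has priors $\pi_t^- = N_t^- / N_t$ and $\pi_t^+ = N_t^+ / N_t$. In the unbalanced training set we have $\pi^- = N^- / N$, $\pi^+ = N^+ / N$ and $p = p(+ \mid x)$. After undersampling the training set, $\pi_s^- = \frac{\beta N^-}{N^+ + \beta N^-}$, $\pi_s^+ = \frac{N^+}{N^+ + \beta N^-}$ and $p_s = p(+ \mid x, s = 1)$. If we assume that the class conditional distributions $p(x \mid +)$ and $p(x \mid -)$ remain the same between the training and testing sets, Saerens et al. [21] show that, given different priors between the training and testing sets, the posterior probability can be corrected with the following equation:

$$p_t = \frac{\frac{\pi_t^+}{\pi_s^+} p_s}{\frac{\pi_t^+}{\pi_s^+} p_s + \frac{\pi_t^-}{\pi_s^-} (1 - p_s)} \tag{14}$$

Let us assume that the training and testing sets share the same priors, $\pi_t^+ = \pi^+$ and $\pi_t^- = \pi^-$:

$$p_t = \frac{\frac{\pi^+}{\pi_s^+} p_s}{\frac{\pi^+}{\pi_s^+} p_s + \frac{\pi^-}{\pi_s^-} (1 - p_s)}$$

Then, since

$$\frac{\pi^+}{\pi_s^+} = \frac{\frac{N^+}{N^+ + N^-}}{\frac{N^+}{N^+ + \beta N^-}} = \frac{N^+ + \beta N^-}{N^+ + N^-} \tag{15}$$

$$\frac{\pi^-}{\pi_s^-} = \frac{\frac{N^-}{N^+ + N^-}}{\frac{\beta N^-}{N^+ + \beta N^-}} = \frac{N^+ + \beta N^-}{\beta (N^+ + N^-)} \tag{16}$$

we can write

$$p_t = \frac{\frac{N^+ + \beta N^-}{N^+ + N^-} p_s}{\frac{N^+ + \beta N^-}{N^+ + N^-} p_s + \frac{N^+ + \beta N^-}{\beta (N^+ + N^-)} (1 - p_s)} = \frac{\beta p_s}{\beta p_s - p_s + 1}$$

The transformation proposed by Saerens et al. [21] is equivalent to equation (4) and to the one developed independently by Elkan [13] for cost-sensitive learning:

$$p_t = \pi_t^+ \frac{p_s - \pi_s^+ p_s}{\pi_s^+ - \pi_s^+ p_s + \pi_t^+ p_s - \pi_t^+ \pi_s^+} \tag{17}$$

$$p_t = \frac{(1 - \pi_s^+) p_s}{\frac{\pi_s^+}{\pi_t^+} (1 - p_s) + p_s - \pi_s^+}$$

Using (15), $\pi_t^+ = \pi^+$ and $\pi_t^- = \pi^-$:

$$p_t = \frac{\frac{\beta N^-}{N^+ + \beta N^-} p_s}{\frac{N^+ + N^-}{N^+ + \beta N^-} (1 - p_s) + p_s - \frac{N^+}{N^+ + \beta N^-}} = \frac{\beta p_s}{\beta p_s - p_s + 1}$$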
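The equivalence derived above can also be verified numerically. The sketch below evaluates the prior-shift correction (14), Elkan's adjustment (17), and the closed form $\beta p_s / (\beta p_s - p_s + 1)$ for the same inputs, under the appendix's assumption that the testing priors equal the original training priors. The function names are illustrative, the class counts are those of ecoli in Table IV, and $\beta = 0.3$ is an arbitrary choice.

```python
def saerens(p_s, pi_t, pi_s):
    """Prior-shift correction (14) for the positive class."""
    num = (pi_t / pi_s) * p_s
    return num / (num + ((1.0 - pi_t) / (1.0 - pi_s)) * (1.0 - p_s))

def elkan(p_s, pi_t, pi_s):
    """Base-rate adjustment (17)."""
    return pi_t * (p_s - pi_s * p_s) / (pi_s - pi_s * p_s + pi_t * p_s - pi_t * pi_s)

def closed_form(p_s, beta):
    """beta * p_s / (beta * p_s - p_s + 1)."""
    return beta * p_s / (beta * p_s - p_s + 1.0)

n_pos, n_neg, beta = 35, 301, 0.3        # ecoli counts, hypothetical beta
pi = n_pos / (n_pos + n_neg)             # pi^+ in the original training set
pi_s = n_pos / (n_pos + beta * n_neg)    # pi_s^+ after undersampling

for p_s in (0.05, 0.5, 0.9):
    print(p_s, saerens(p_s, pi, pi_s), elkan(p_s, pi, pi_s), closed_form(p_s, beta))
```

The three columns agree up to floating-point rounding for every $p_s$, as the algebra predicts.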

REFERENCES

[1] Rehan Akbani, Stephen Kwek, and Nathalie Japkowicz. Applying support vector machines to imbalanced datasets. In Machine Learning: ECML 2004, pages 39–50. Springer, 2004.

[2] Urvesh Bhowan, Michael Johnston, Mengjie Zhang, and Xin Yao. Evolving diverse ensembles using genetic programming for classification with unbalanced data. Evolutionary Computation, IEEE Transactions on, 17(3):368–386, 2013.

[3] Christopher M Bishop et al. Pattern recognition and machine learning, volume 4. Springer New York, 2006.

[4] Glenn W Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.

[5] Nitesh V Chawla. Data mining for imbalanced datasets: An overview. In Data mining and knowledge discovery handbook, pages 853–867. Springer, 2005.

[6] Nitesh V Chawla, Nathalie Japkowicz, and Aleksander Kotcz. Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6(1):1–6, 2004.

[7] NV Chawla, KW Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research (JAIR), 16:321–357, 2002.

[8] Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen, Cesare Alippi, and Gianluca Bontempi. Credit card fraud detection and concept-drift adaptation with delayed supervised information. In Neural Networks (IJCNN), 2015 International Joint Conference on. IEEE, 2015.

[9] Andrea Dal Pozzolo, Olivier Caelen, and Gianluca Bontempi. When is undersampling effective in unbalanced classification tasks? In Machine Learning and Knowledge Discovery in Databases. Springer, 2015.

[10] Andrea Dal Pozzolo, Olivier Caelen, Yann-Ael Le Borgne, Serge Waterschoot, and Gianluca Bontempi. Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications, 41(10):4915–4928, 2014.

[11] C. Drummond and R.C. Holte. C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Datasets II. Citeseer, 2003.

[12] Richard O Duda, Peter E Hart, and David G Stork. Pattern classification. John Wiley & Sons, 2012.

[13] C. Elkan. The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence, volume 17, pages 973–978. Citeseer, 2001.

[14] Andrew Estabrooks, Taeho Jo, and Nathalie Japkowicz. A multiple resampling method for learning from imbalanced data sets. Computational Intelligence, 20(1):18–36, 2004.

[15] David J Hand and Robert J Till. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45(2):171–186, 2001.

[16] N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429–449, 2002.

[17] Alexandros Karatzoglou, Alex Smola, Kurt Hornik, and Achim Zeileis. kernlab - an S4 package for kernel methods in R. 2004.

[18] Andy Liaw and Matthew Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002.

[19] Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. The MIT Press, 2009.

[20] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2011. ISBN 3-900051-07-0.

[21] Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Computation, 14(1):21–41, 2002.

[22] Jarek Tuszynski. caTools: Tools: moving window statistics, GIF, Base64, ROC AUC, etc., 2013. R package version 1.16.

[23] Shuo Wang, Ke Tang, and Xin Yao. Diversity exploration and negative correlation learning on imbalanced data sets. In Neural Networks, 2009. IJCNN 2009. International Joint Conference on, pages 3259–3266. IEEE, 2009.

[24] Andrew R Webb. Statistical pattern recognition. John Wiley & Sons, 2003.

[25] Gary M Weiss and Foster Provost. The effect of class distribution on classifier learning: an empirical study. Rutgers Univ, 2001.
