Accurately Quantifying a Billion Instances per Second*

Waqar Hassan, ICMC-USP, São Carlos, Brazil (waqar@usp.br)
André Maletzke, UNIOESTE/ICMC-USP, Foz do Iguaçu, Brazil (andre.maletzke@unioeste.br)
Gustavo Batista, CSE-UNSW, Sydney, Australia (g.batista@unsw.edu.au)

*This work has been partially funded by CNPq-TWAS (139467/2017-3).
Abstract—Quantification is a thriving research area that develops methods to estimate the class prior probabilities in an unlabelled set of observations. Quantification and classification share several similarities. For instance, the most straightforward quantification method, Classify & Count (CC), directly counts the output of a classifier. However, CC has a systematic bias that makes it increasingly misestimate the counts as the class distribution drifts away from a distribution it perfectly quantifies. This issue has motivated the development of more reliable quantification methods. Such newer methods can consistently outperform CC at the cost of a significant increase in processing requirements. Yet, for a large number of applications, quantification speed is an additional criterion that must be considered. Frequently, quantification methods need to deal with large amounts of data or fast-paced streams, as is the case of news feeds, tweets and sensor data. In this paper, we propose Sample Mean Matching (SMM), a highly efficient algorithm able to quantify billions of data instances per second. We compare SMM to a set of 14 established and state-of-the-art quantifiers in an empirical analysis comprising 25 benchmark and real-world datasets. We show that SMM is competitive with state-of-the-art methods with no statistical difference in counting accuracy, and it is orders of magnitude faster than the vast majority of the algorithms.

Index Terms—Machine Learning, Quantification, Mixture Methods
I. INTRODUCTION
Quantification is the research area that develops methods to estimate the class distribution in an unlabelled sample. This area finds applications in tasks concerned with understanding the behaviour of groups rather than predicting the class of individual observations.
For instance, Forman [1] uses quantification methods to
estimate the number of terrorism-related news in the last
month. Milli and colleagues [2] determine the approximate
percentage of unemployed for a given period or according to
different geographical regions. Gao and Sebastiani [3] count
tweets as positive, negative or neutral about a particular topic
under debate in society. Finally, Silva and colleagues [4]
estimate the number of disease-carrying mosquitoes captured
by an insect trap.
Quantification shares a series of similarities with classifica-
tion. The simplest quantification method is a direct application
of classification. Such an approach, known as Classify & Count (CC), merely counts the output of a classifier. However,
CC has a systematic bias that makes it increasingly misesti-
mate the counts as the class distribution drifts away from a
distribution that CC perfectly quantifies [5], [6]. Such a flaw
has motivated a growing number of researchers to propose
more reliable quantification methods.
Therefore, most of the research in quantification has fo-
cused on counting accuracy. However, for a large number
of applications, quantification speed is an additional criterion
that must be considered. Frequently, quantification methods
need to deal with large amounts of data or fast-paced streams,
as it is the case of news feeding, tweets and sensor data.
Those applications demand approaches that are both fast and
accurate.
In this paper, we propose Sample Mean Matching (SMM), a highly efficient algorithm able to quantify billions of data instances per second. SMM is orders of magnitude faster than the state-of-the-art, yet it provides estimates of comparable precision. Our proposal is inspired by the mixture methods of the DyS family [7], such as HDy [8]. However, our approach is significantly more straightforward and, consequently, more efficient than other methods of this family.
We compare SMM to a set of 14 established and state-
of-the-art quantifiers in an empirical analysis comprising 25
benchmark and real-world datasets. We show that SMM is
competitive with state-of-the-art methods with no statistical
difference in performance, and it is much more efficient than
the vast majority of algorithms in terms of time complexity.
Fig. 1 provides an overview of our empirical evaluation.
This article is organized as follows: Section II presents
background concepts regarding classification and quantifica-
tion tasks and the primary difference between these tasks.
Section III summarizes the relevant literature in quantification,
describing the counting methods assessed in this paper. Section
IV analyzes the time complexity of the training and test phases
of the quantification methods reviewed in the previous section.
Section V presents our proposal based on the idea of mixture
models. Section VI describes the experimental setup proposed
in this paper. Section VII discusses and analyses the empirical
results. Finally, Section VIII concludes this work and presents
directions for future research.
Fig. 1: A time and accuracy comparison plot encompassing 15 quantifiers assessed on 25 datasets. The axes contrast counting-accuracy rank with the number of instances quantified per second (log scale), dividing the plane into fast/slow and accurate/inaccurate regions; the literature has no methods lying in the gap region of the plot. Our proposal, SMM, stands out as one of the most efficient methods while remaining highly competitive in terms of counting accuracy. The counting accuracy is expressed as a rank to avoid averaging this measure over different datasets.
II. BACKGROUND
Classification is a task that induces a predictive model using a training set $D = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$, where $\mathbf{x}_i \in \mathcal{X}$ is a vector with $m$ attributes in the feature space $\mathcal{X}$, and $y_i \in \mathcal{Y} = \{c_1, \dots, c_l\}$ is its respective class label.
The objective of classification is to correctly predict the class labels of individual observations in an unlabeled test set using their feature values. Therefore, a classifier is a model $h$ induced from $D$ such that
$$h: \mathcal{X} \to \{c_1, \dots, c_l\}$$
A. Scorer
Classifiers employ different mechanisms to decide which class will be assigned to any given observation. In binary classification, one of the two classes is denominated the positive class ($c_1 = \oplus$), while the other is denominated the negative class ($c_2 = \ominus$). In this setting, one can induce a scorer $h_S(\mathbf{x})$. A scorer is a model induced from $D$ such that
$$h_S: \mathcal{X} \to \mathbb{R}$$
A scorer produces a numerical value called score that correlates with the posterior probability of the positive class, that is, $P(Y = \oplus \mid \mathbf{x})$. Consequently, the greater the score is, the higher is the chance of $\mathbf{x}$ belonging to the positive class.
For classification purposes, if such a score is greater than
a certain threshold, th, the observation is classified as pos-
itive. Otherwise, it is classified as negative [9]. We refer to
scores of negative observations simply as negative scores and
analogously refer to scores of positive observations as positive
scores.
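To make the scorer concrete, the following minimal sketch (an illustration only, assuming scikit-learn and binary labels encoded as 0/1, with 1 as the positive class) derives a scorer from a random forest, the same learner we use in the experiments of Section VI, by taking the predicted probability of the positive class; `classify` then recovers crisp predictions from a threshold.

```python
# Minimal sketch of a scorer h_S and its thresholded classifier.
# Assumptions: scikit-learn is available; labels are encoded as 0 (negative) / 1 (positive).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_scorer(X_train, y_train, n_trees=200, seed=0):
    """Induce a scorer from labelled data: maps instances to scores in [0, 1]."""
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    clf.fit(X_train, y_train)
    pos_col = list(clf.classes_).index(1)            # column of the positive class
    return lambda X: clf.predict_proba(X)[:, pos_col]

def classify(scores, threshold=0.5):
    """Turn scores into crisp positive (1) / negative (0) predictions."""
    return (np.asarray(scores) >= threshold).astype(int)
```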
B. Quantification
Quantification and classification have common characteristics, such as the representation of data. However, their objectives differ. A classifier provides an outcome for each input instance. Conversely, a quantifier assesses the overall quantity of observations that belong to a specific class or a set of classes [10]. Therefore, a quantifier is a model induced from $D$ that predicts the prevalence of each class in a sample, such that
$$q: S_{\mathcal{X}} \to [0,1]^l$$
$S_{\mathcal{X}}$ denotes the universe of possible samples from $\mathcal{X}$. For a given test sample $S \in S_{\mathcal{X}}$, the quantifier outputs a vector $\hat{\mathbf{p}} = [\hat{p}_1, \dots, \hat{p}_l]$, where $\hat{p}_i$ estimates the prior probability for class $c_i$, such that $\sum_{j=1}^{l} \hat{p}_j = 1$. The objective is for $[\hat{p}_1, \dots, \hat{p}_l]$ to be as close as possible to the true prior ratios $[P(c_1), \dots, P(c_l)]$ of the probability distribution from which $S$ was sampled.
In classification, we usually assume that the data are in-
dependent and identically distributed (i.i.d). “Identically dis-
tributed” means that all observations, from either the training
or test sets, share the same underlying distribution. “Indepen-
dently distributed” means that the observations are indepen-
dent of each other. In other words, the occurrence of one
observation does not affect the probability of the appearance
of any other particular observation.
Similarly to classification, in quantification, we still assume that observations are independent. However, the training and test instances do not come from the same underlying distribution. As the main task is to measure the prior probabilities of the classes in $S$, the class distribution can change significantly from the training set (which supports the induction of $q$) to the test sample $S$ [11].
III. RELATED WORK
Quantification has been explored and has evolved over the last decades, resulting in several proposals under different names for the quantification task, such as prevalence estimation [12], class probability re-estimation [13], class prior estimation [14], and class distribution estimation [8]. González and colleagues [10] organized most of the prior work according to similarities between algorithms, resulting in a taxonomy of quantification methods consisting of three groups:
I. Classify, count & correct: methods that first classify each instance and then count how many belong to each class. Methods that apply any correction to their predictions are included in this group as well;
II. Adapting traditional classification algorithms: algorithms that modify the mechanics of traditional classification learning methods so that they become quantifiers;
III. Distribution matching: algorithms that parametrically model the training distribution and later search for the parameters that produce the best match against the test set.
A. Classify, Count & Correct
The most straightforward approach is Classify & Count
(CC). It is a naive adaptation of classifiers to quantification
problems. Forman [5] has demonstrated that CC has a systematic error that monotonically increases as we move away from a distribution for which CC provides optimal counting.
CC uses a classifier to label each instance in the test sample. Afterwards, it counts the number of examples belonging to each class. CC provides optimal quantification results with a perfect classifier. However, a classifier with balanced errors, for instance, a binary classifier that commits an equal number of false-positive and false-negative errors, is also optimal for CC. Intuitively, in these situations, CC benefits from the fact that opposite mistakes nullify each other.
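As a reference point for the corrected methods discussed below, here is a minimal sketch of CC; `train_scorer` and `classify` are the hypothetical helpers from the scorer sketch in Section II-A, so the names are assumptions rather than part of the original method description.

```python
def classify_and_count(scorer, X_test, threshold=0.5):
    """CC: classify every test instance and report the fraction predicted positive."""
    predictions = classify(scorer(X_test), threshold)
    return float(predictions.mean())   # estimated prevalence of the positive class
```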
Fig. 2 illustrates the probability density functions of the
scores for a binary classification problem for a hypothetical
test set. We chose the threshold so that the number of
false-positives matches the false-negatives. Therefore, the CC
quantifier provides perfect quantification, albeit the underlying
classifier is not perfect.
Fig. 2: Illustration of a scenario in which a CC quantifier
provides flawless quantification even though the underlying
classifier is not perfect [15].
The CC outcome is the count of every observation with
score above the threshold. In other words, it is the sum of
true-positives and false-positives. However, the actual count
is the sum of true-positives and false-negatives. Fig. 2 helps
us to understand the motivation behind several quantification
methods that we can separate into two sub-groups. The first
sub-group corrects the counts by estimating false-positive
and false-negative errors. The second sub-group searches for
suitable threshold values, such as the ones that provide more
reliable estimates for the false-positive and false-negative rates.
A well-known approach of the first sub-group is Adjusted
Classify & Count (ACC) [1]. In absolute numbers, ACC’s
correction factor adds the false-negatives to CC’s output and
then subtracts the false-positives. However, ACC is more
commonly expressed as frequencies, in the following manner:
$$\hat{P}_{ACC}(\oplus) = \frac{\hat{P}_{CC}(\oplus) - P(\oplus \mid \ominus)}{P(\oplus \mid \oplus) - P(\oplus \mid \ominus)} \quad (1)$$
where $\hat{P}_{CC}(\oplus)$ is the positive class prevalence provided by CC in the test set, $P(\oplus \mid \ominus)$ is the false-positive rate, and $P(\oplus \mid \oplus)$ is the true-positive rate.
If we knew the true-positive and false-positive rates in the
test set, then ACC would be a perfect quantifier. However, as
the test set is unlabelled, the best we can do is to estimate these
quantities in the training set. As Fig. 1 suggests, estimating
these values in the training set makes ACC far from being
perfect and not as accurate as the state-of-the-art.
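A minimal sketch of the ACC correction in Eq. 1 follows, reusing the hypothetical `classify_and_count` helper above; the tpr and fpr arguments are assumed to have been estimated on the training set via cross-validation, as described in the text.

```python
import numpy as np

def adjusted_classify_and_count(scorer, X_test, tpr, fpr, threshold=0.5):
    """ACC: correct the CC estimate with training-set tpr/fpr estimates (Eq. 1)."""
    cc = classify_and_count(scorer, X_test, threshold)
    if tpr - fpr == 0:
        return cc                          # degenerate classifier: no correction possible
    acc = (cc - fpr) / (tpr - fpr)
    return float(np.clip(acc, 0.0, 1.0))   # prevalences must lie in [0, 1]
```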
Probabilistic Classify & Count (PCC) and Probabilistic Adjusted Classify & Count (PACC) [16] assume that probabilities in the form of calibrated scores carry richer information than crisp predictions. Instead of counting the positive predictions as in CC, PCC averages the scores provided by the scorer $h_S$ to estimate the class proportion. PACC is similar to PCC but uses the same correction factor as ACC. Similar to CC, when the class distribution changes, PCC also overestimates or underestimates the actual class proportion [17]. Moreover, these probabilistic methods suffer from a chicken-and-egg problem since the calibration of classifiers depends on the class distribution [5].
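For illustration, a sketch of PCC is shown below under the assumption that `calibrated_scorer` already returns calibrated posterior probabilities (e.g., after the isotonic calibration used in our setup); PACC then applies the same correction factor as in the ACC sketch, but to this average.

```python
import numpy as np

def probabilistic_classify_and_count(calibrated_scorer, X_test):
    """PCC: average the calibrated positive-class probabilities over the test sample."""
    return float(np.mean(calibrated_scorer(X_test)))
```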
The second sub-group selects the threshold value for a classifier using the training data [18]. There are several strategies to choose a proper threshold value, such as the one that equates the false-negative and false-positive rates, or values that can possibly provide more reliable error estimates in imbalanced datasets. Some of the most used strategies are the following.
X: sets the threshold to a value that equates the false-negative and false-positive rates;
Max: maximizes the denominator of ACC in Eq. 1 by finding the threshold value which maximizes the difference between the true-positive and false-positive rates;
T50: adjusts the threshold so that the true-positive rate is 50%;
Median sweep (MS): returns the median of several applications of the ACC method for a predefined range of thresholds. It uses the true-positive and false-positive rates estimated from the training set using cross-validation (a sketch of MS is given after this list). Forman [18] also proposes a variant of MS named MS2, which considers only thresholds for which the denominator of Eq. 1, i.e., the difference between the true-positive and false-positive rates, is greater than 0.25.
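The sketch below illustrates Median Sweep under our reading of the method; the threshold-indexed tpr/fpr dictionaries are hypothetical inputs standing in for the rates estimated on the training set via cross-validation, and setting `min_gap=0.25` yields the MS2 variant.

```python
import numpy as np

def median_sweep(scorer, X_test, thresholds, tpr_by_th, fpr_by_th, min_gap=0.0):
    """MS: median of ACC estimates over a range of thresholds.
    tpr_by_th / fpr_by_th map each threshold to rates estimated on the training set;
    min_gap=0.25 gives the MS2 variant, which skips unreliable denominators."""
    estimates = []
    for th in thresholds:
        tpr, fpr = tpr_by_th[th], fpr_by_th[th]
        if tpr - fpr <= min_gap:
            continue                       # denominator too small to be reliable
        cc = classify_and_count(scorer, X_test, threshold=th)
        estimates.append(np.clip((cc - fpr) / (tpr - fpr), 0.0, 1.0))
    # fall back to the uncorrected count at the default threshold if all were skipped
    return float(np.median(estimates)) if estimates else classify_and_count(scorer, X_test)
```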
B. Adapting Traditional Classification Algorithms
Quantification Trees (QT) is a decision tree learning algo-
rithm for quantification [2]. Similar to other decision trees,
QT greedily selects the splitting-feature and splitting-threshold
at a decision node that optimizes a given criterion. However,
instead of using typical classification measures, such as the
ones based on information theory, the proposal uses alternative
measures tailored for quantification problems. Two evaluation
metrics proposed in [2] are the following:
Classification Error Balancing: for each possible split and class $c_i$, the difference between false-negatives ($FN_i$) and false-positives ($FP_i$) is computed as:
$$E_{c_i} = |FP_i - FN_i|$$
where the optimum value for quantification is achieved when $FP_i$ equals $FN_i$, i.e., $E_{c_i} = 0$;
Classification-Quantification Balancing: this approach is an improvement over the previous one. The authors made a trade-off between quantification and classification accuracy. Thus, for each possible split and class $c_i$, it computes the following equation:
$$E'_{c_i} = |FP_i - FN_i| \times |FP_i + FN_i|$$
where the right side of the multiplication is a measure of classification performance; perfect classification is achieved if $FP_i = FN_i = 0$.
For a possible split, the values of the chosen evaluation metric are aggregated in a new vector, $E = [E_{c_1}, \dots, E_{c_l}]$ or $E' = [E'_{c_1}, \dots, E'_{c_l}]$. The final score is given by the L2-norm of $E$. To calculate the goodness of the split, the quantification accuracy of the parent node (before splitting) and child node (after splitting) are compared:
$$\Delta = \|E_{parent}\|_2 - \|E_{child}\|_2$$
where $\Delta$ denotes the goodness of the split: the larger the delta, the better the split. The growing process is terminated if $\Delta \le 0$. Additionally, the authors performed experiments with a simple decision tree and with Random Forest.
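The rough sketch below shows how the Classification Error Balancing criterion could be evaluated for a candidate split; it assumes that each node predicts its majority class for every instance it receives and that the error vectors of the two children are summed, which are simplifications on our part rather than details given in [2].

```python
import numpy as np

def node_error_vector(y_node, classes):
    """E = [E_c1, ..., E_cl] with E_ci = |FP_i - FN_i|, assuming the node predicts
    its majority class for every instance that reaches it."""
    y_node = np.asarray(y_node)
    values, counts = np.unique(y_node, return_counts=True)
    majority = values[np.argmax(counts)]
    errors = []
    for c in classes:
        fp = np.sum((majority == c) & (y_node != c))   # predicted c, true class differs
        fn = np.sum((majority != c) & (y_node == c))   # true class c, predicted otherwise
        errors.append(abs(int(fp) - int(fn)))
    return np.array(errors, dtype=float)

def split_goodness(y_parent, y_left, y_right, classes):
    """Delta = ||E_parent||_2 - ||E_child||_2; larger is better, stop growing if <= 0."""
    e_parent = node_error_vector(y_parent, classes)
    e_child = node_error_vector(y_left, classes) + node_error_vector(y_right, classes)
    return float(np.linalg.norm(e_parent) - np.linalg.norm(e_child))
```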
C. Distribution Matching
We can separate the distribution matching methods into two classes of approaches. The first one uses a variation of the well-known Expectation-Maximization (EM) algorithm. The second class comprises algorithms that mix training-set distributions to match the test-set distribution.
Expectation-Maximization Quantifier (EMQ) [19] is an iterative approach to estimate the class prevalence in imbalanced class distributions based on the Expectation-Maximization algorithm. EMQ is initialized with the class prevalence in the training set. Then, at each iteration, it updates the estimated class ratios to better approximate the class distribution in the test set. Formally:
$$\hat{P}^{(0)}_{EMQ}(c_i) = \hat{P}_{Tr}(c_i)$$
$$\hat{P}^{(t)}_{EMQ}(c_i \mid \mathbf{x}_k) = \frac{\frac{\hat{P}^{(t)}_{EMQ}(c_i)}{\hat{P}_{Tr}(c_i)}\,\hat{P}_{Tr}(c_i \mid \mathbf{x}_k)}{\sum_{j=1}^{l} \frac{\hat{P}^{(t)}_{EMQ}(c_j)}{\hat{P}_{Tr}(c_j)}\,\hat{P}_{Tr}(c_j \mid \mathbf{x}_k)}$$
$$\hat{P}^{(t+1)}_{EMQ}(c_i) = \frac{1}{n}\sum_{k=1}^{n} \hat{P}^{(t)}_{EMQ}(c_i \mid \mathbf{x}_k)$$
where $\hat{P}_{Tr}(c_i)$ is the prior probability of class $c_i$ estimated in the training set $Tr$, and $\hat{P}_{Tr}(c_i \mid \mathbf{x}_k)$ is the posterior probability for the class $c_i$ given a test instance $\mathbf{x}_k$. Such a quantity is estimated by a probabilistic classifier learned over the training set. For each iteration $t$, $\hat{P}^{(t)}_{EMQ}(c_i \mid \mathbf{x}_k)$ and $\hat{P}^{(t+1)}_{EMQ}(c_i)$ are re-estimated sequentially for each instance $\mathbf{x}_k$ in the test set and each class $c_i \in \mathcal{Y}$.
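A compact sketch of the EMQ iteration is given below; it is a vectorized reading of the equations above, where `posteriors` is an n-by-l matrix of classifier posteriors for the test instances, and the iteration cap and tolerance are assumptions of ours.

```python
import numpy as np

def emq(posteriors, train_priors, max_iters=100, tol=1e-6):
    """EMQ sketch: iteratively re-estimate class priors from test-set posteriors."""
    posteriors = np.asarray(posteriors, dtype=float)      # shape (n, l)
    train_priors = np.asarray(train_priors, dtype=float)  # shape (l,)
    priors = train_priors.copy()                          # P^(0)_EMQ(c_i)
    for _ in range(max_iters):
        # E-step: rescale each posterior by the ratio of current to training priors
        adjusted = posteriors * (priors / train_priors)
        adjusted /= adjusted.sum(axis=1, keepdims=True)
        # M-step: the new priors are the average of the adjusted posteriors
        new_priors = adjusted.mean(axis=0)
        if np.abs(new_priors - priors).max() < tol:       # simple convergence check
            priors = new_priors
            break
        priors = new_priors
    return priors
```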
Other distribution matching methods consider the scores
obtained on an unlabeled set to follow a parametric mixture
between two known distributions (one for the positive and
another for the negative class). In general, these methods use
a search mechanism to find the parameters that best match a
mixture of positive and negative training set score distributions
with the unlabelled score distribution of the test set. The
computation of the parameters of this mixture leads to the
quantification estimate.
The first distribution matching method uses the Kolmogorov-Smirnov statistic and PP-Area to measure the difference between the positive ($S^{\oplus}$) and the negative ($S^{\ominus}$) score distributions [1]. A more recent proposal, the HDy algorithm, represents each score distribution as a histogram [8]. A weighted sum of these histograms gives the mixture between the positive and negative score distributions, where the weights sum up to 1. The weights that minimize the Hellinger Distance (HD) between the mixture and the unlabeled (test) score distribution ($S$) are considered to be the proportion of the corresponding classes in the unlabeled sample. The next equation details this computation:
$$\hat{P}_{HDy}(\oplus) = \arg\min_{0 \le \alpha \le 1} HD\big(\alpha H[S^{\oplus}] + (1 - \alpha) H[S^{\ominus}],\; H[S]\big)$$
where $HD$ represents the Hellinger distance and $H[\cdot]$ indicates an operation that converts a set of scores into a histogram. Fig. 3 illustrates this process.
Fig. 3: HDy searches for an $\alpha$ that minimizes the Hellinger Distance [20].
HDy uses histograms to represent the positive, negative and unlabelled score distributions. A histogram is a discrete representation that has a relevant parameter, the number of bins¹. The HDy authors recommend applying the method over a range of bins from 10 to 110 with an increment of 10. The final output is the median of the estimated positive class proportions across all bin values.
The original HDy method uses a linear search to find the alpha that minimizes the Hellinger distance. Some minor improvements to HDy are the use of ternary search to make HDy more efficient and the use of Laplace smoothing to compute the bin values [21].
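The sketch below illustrates the original HDy procedure as we understand it: histograms over a range of bin counts, a linear search over alpha (a step of 0.01 is our assumption), and the median of the per-bin estimates as output.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete (histogram) distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def hdy(pos_scores, neg_scores, test_scores, bins_range=range(10, 111, 10)):
    """HDy sketch: for each bin count, linearly search for the alpha whose mixture of
    training histograms best matches the test histogram; report the median estimate."""
    alphas = np.linspace(0.0, 1.0, 101)      # linear search with step 0.01 (assumed)
    estimates = []
    for b in bins_range:
        edges = np.linspace(0.0, 1.0, b + 1)
        h_pos = np.histogram(pos_scores, bins=edges)[0] / len(pos_scores)
        h_neg = np.histogram(neg_scores, bins=edges)[0] / len(neg_scores)
        h_test = np.histogram(test_scores, bins=edges)[0] / len(test_scores)
        dists = [hellinger(a * h_pos + (1 - a) * h_neg, h_test) for a in alphas]
        estimates.append(alphas[int(np.argmin(dists))])
    return float(np.median(estimates))
```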
The HDy method inspired a recently proposed framework named DyS [7] that supports the use of different distance measures besides HD:
$$\hat{P}_{DyS}(\oplus) = \arg\min_{0 \le \alpha \le 1} DS\big(\alpha R[S^{\oplus}] + (1 - \alpha) R[S^{\ominus}],\; R[S]\big)$$
where $DS$ is a dissimilarity measure to estimate the match between the distributions of training scores and test scores, and $R[\cdot]$ is an operation that converts a set of scores into a representation, such as a histogram. In [7], the authors also propose SORD as part of the DyS framework. SORD uses an efficient variation of the Earth Mover's Distance to operate directly over the score values, instead of using an intermediate representation.
Moreover, the authors of [7] also analyze the significance of the number of bins on quantification performance and provide recommendations for this parameter. Histograms with many bins incur sparseness and demand additional training and test observations to measure the distributions, whereas histograms with a small number of bins may not convey all the information necessary to characterize the data distribution. Based on the findings in [7], in this paper, we vary the number of bins from 2 to 20 with a step size of 2 and report the final positive class prevalence as the median of the estimated class ratios over all bins. In addition, as the DyS article compared distinct similarity functions and provides a ranking of these functions, we included the best-ranked distance, Topsøe, in our experiments.
Until now, the mixture model methods proposed in the literature depend on four main factors that influence their performance: the number of bins required to represent a distribution; the distance function to compare two distributions; the approach to represent a distribution; and the search procedure for the alpha that minimizes the distance between the distributions. This work investigates a method that is independent of these parameters and proposes a simplified and highly efficient version of mixture model methods.
¹ Bins divide the entire range of score values into a series of intervals, so we can count how many values fall into each interval.
IV. TIME COMPLEXITY
In this section, we discuss the time complexity of quantifi-
cation methods reviewed in Section III. Table I summarizes
the time complexity of each quantifier for the training and
test phases.
For the training phase, consider $Tr$ as the training set, $M(n)$ the cost of training either a scorer or a classifier, and $C(n)$ the time for scoring $n$ observations. Suppose that we also apply, depending on the quantifier, $k$-fold cross-validation to obtain the positive ($S^{\oplus}$) and negative ($S^{\ominus}$) scores, as well as the true-positive ($tpr$) and false-positive ($fpr$) rates, for either a given threshold ($th$) or a set of thresholds ($rth$). In this case, the time of the $k$-fold cross-validation step is defined by $O\big(k\,(M(|Tr| - |Tr|/k) + C(|Tr|/k))\big)$. All quantifiers except CC, PCC, QT, and EMQ have their time complexity impacted by the cross-validation step in the training phase. Moreover, quantifiers that count by averaging the scores need an extra step to convert the scores into calibrated probabilities. We use an isotonic regression that runs in $O(|Tr| \log |Tr|)$ [22].
For the testing phase, all methods are impacted by the cost of scoring each test instance, $O(C(|Te|))$. For CC, ACC, MAX, X, T50, and QT, $C(|Te|)$ represents the entire running cost. Those are followed by MS and MS2, which apply ACC for a set of thresholds ($rth$) and then report the median of the ACC results. MS2 is slightly more efficient than MS due to its reduced number of threshold possibilities.
Although PCC and PACC are very simple methods, they require a calibration of the test scores. The aforementioned isotonic regression method requires $O(\log |Te|)$ in the testing phase [22]. The use of different calibration methods can impact the cost of both the PCC and PACC quantifiers. We point the interested reader to the work of Naeini and Cooper [22] on binary calibration.
calibration. Next, in terms of test efficiency, appears the
mixture model methods. DySand HDy both are impacted by
the dissimilarity function cost, the number of bins (b), and the
range of αvalues being searched. In our experiments, we use
DySwith Topsøe dissimilarity that is recommended by the
DySauthors [7] whereas HDy uses Hellinger distance. Both
distances run in O(b). However, DySis less expensive than
HDy due to the Ternary Search mechanism used for searching
the best αvalue. Ternary search runs in O(log 1/E), where
Eis the accepted error for the ternary search (e.g. 105).
Summing up, beyond the cost of scoring, DyScosts is given
by the Topsøe distance cost multiplied by the number of bins,
and the ternary search cost, i.e., O(blog 1/E). Differently,
apart from scoring cost, HDy makes a sequential search to
find the best αgiven a range of values, and thus it runs in
O(b|α|).
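For reference, a minimal sketch of the ternary search used to locate alpha is shown below; `distance_of_alpha` is a hypothetical callable that returns, e.g., the Topsøe distance between the alpha-mixture of training histograms and the test histogram, and is assumed to be unimodal in alpha.

```python
def ternary_search_alpha(distance_of_alpha, tol=1e-5):
    """Ternary search for the alpha in [0, 1] minimising a unimodal distance function.
    Runs in O(log 1/E) evaluations, where E is the accepted error (tol)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if distance_of_alpha(m1) < distance_of_alpha(m2):
            hi = m2        # the minimum cannot lie in (m2, hi]
        else:
            lo = m1        # the minimum cannot lie in [lo, m1)
    return (lo + hi) / 2.0
```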
Finally, SORD and EMQ are the most expensive methods. The former is based on the Earth Mover's Distance (EMD), which works directly on the score values, requiring the cost of scoring the test set plus the cost of the EMD computation, totalling $O(C(|Te|) + (|Tr| + |Te|)\log(|Tr| + |Te|))$. The latter is based on the Expectation-Maximization algorithm, which is an iterative process. Its cost is determined by the number of test instances ($|Te|$), the number of iterations ($t$), and the number of classes ($l$). Thus, the overall running cost is $O(C(|Te|) + |Te|\,t\,l)$.
V. PROPOSED METHOD
In this section, we introduce our simple proposal to quan-
tify the class distribution. The distribution matching methods
(Section III-C) usually employ histograms to represent the
distribution of scores for the training and test sets [7], [8].
The time complexity of these methods depends directly on the number of bins. Our hypothesis is that these methods can achieve competitive or even better accuracy by replacing the entire histogram with the mean value of the scores for the training and test sets. Moreover, employing the mean of the scores directly, instead of going through the process of constructing histogram distributions for several bin counts, makes the method more straightforward and efficient.
The main inspiration for our method comes from the fact that if we split a set of values into groups, the mean of the entire set is equal to the weighted sum of the group means, where the weights are the fractions of the number of elements in each group [23]. In our case, we have the scores divided into two groups, the positive and the negative scores. The mean score calculated over all unlabelled scores in the test set is equal to a weighted sum of the mean of the positive scores and the mean of the negative scores. As we do not know the actual means of these scores in the test set, we use, as a surrogate, the means of the scores in the training set. Thus, SMM relies on the following equation:
$$\hat{P}_{SMM}(\oplus) = \arg\min_{0 \le \alpha \le 1} \big|\alpha\,\mu[S^{\oplus}] + (1 - \alpha)\,\mu[S^{\ominus}] - \mu[S]\big|$$
where $\mu[\cdot]$ represents the operation that computes the average of a set of scores.
Therefore, SMM is a member of the DyS framework that uses simple means to represent the score distributions of the positive, negative and unlabelled scores.
TABLE I: Training and testing time complexities for the assessed quantifiers. $T_{cv}$ abbreviates the cross-validated training cost $O\big(M(|Tr|) + k\,(M(|Tr| - |Tr|/k) + C(|Tr|/k))\big)$.
CC: training $O(M(|Tr|))$; test $O(C(|Te|))$
ACC: training $T_{cv}$; test $O(C(|Te|))$
PCC: training $O(M(|Tr|) + |Tr|\log|Tr|)$; test $O(C(|Te|) + \log|Te|)$
PACC: training $T_{cv} + O(|Tr|\log|Tr|)$; test $O(C(|Te|) + \log|Te|)$
X: training $T_{cv}$; test $O(C(|Te|))$
T50: training $T_{cv}$; test $O(C(|Te|))$
MAX: training $T_{cv}$; test $O(C(|Te|))$
MS: training $T_{cv}$; test $O(C(|Te|) + |rth|)$
MS2: training $T_{cv}$; test $O(C(|Te|) + |rth|)$
QT: training $O(M(|Tr|))$; test $O(C(|Te|))$
EMQ: training $O(M(|Tr|))$; test $O(C(|Te|) + |Te|\,t\,l)$
DyS-TS: training $T_{cv}$; test $O(C(|Te|) + b\log 1/E)$
SORD: training $T_{cv}$; test $O(C(|Te|) + (|Tr| + |Te|)\log(|Tr| + |Te|))$
HDy: training $T_{cv}$; test $O(C(|Te|) + b\,|\alpha|)$
Since SMM works with point-wise estimates, we do not need to apply a search mechanism to determine $\alpha$. Instead, we can obtain $\alpha$ using a closed-form equation:
$$\alpha = \frac{\mu[S] - \mu[S^{\ominus}]}{\mu[S^{\oplus}] - \mu[S^{\ominus}]} \quad (2)$$
where $\alpha$ is the final quantification for the positive class. Such a value should be clipped into the $[0, 1]$ range. This equation has a very simple derivation, presented in Appendix A.
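As a purely illustrative example with made-up numbers: if the mean positive and negative training scores are $\mu[S^{\oplus}] = 0.8$ and $\mu[S^{\ominus}] = 0.2$, and the unlabelled test scores average $\mu[S] = 0.65$, then Equation 2 gives $\alpha = (0.65 - 0.2)/(0.8 - 0.2) = 0.75$, i.e., an estimated positive prevalence of 75%.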
Algorithm 1 formalizes the training phase of the SMM approach. Since classifying examples seen during training generates biased (overly optimistic) scores, we recommend the use of 10-fold cross-validation (line 3). In each fold, we train a classifier and generate scores for the examples in the validation set (lines 5-9). After 10 runs, we have scores for the entire training set. In line 10, the algorithm computes the average scores for the positive and negative classes. Finally, SMM induces a classifier $h$ over the entire training set $Tr$.

Algorithm 1 Sample Mean Matching (SMM) training phase.
1: Input: Training set, $Tr$
2: Output: The means of the positive and negative training scores, $\mu[S^{\oplus}]$, $\mu[S^{\ominus}]$; classification model, $h$
3: $[T_r^{1 \dots k}, T_v^{1 \dots k}] \leftarrow$ cross-validation($k$, $Tr$)  {$k$ is the number of folds}
4: $S_{Tr} \leftarrow \emptyset$
5: for $i = 1 \dots k$ do
6:   $h_i \leftarrow$ train($T_r^i$)
7:   $S_i \leftarrow$ score($T_v^i$, $h_i$)
8:   $S_{Tr} \leftarrow S_{Tr} \cup S_i$
9: end for
10: $[\mu[S^{\oplus}], \mu[S^{\ominus}]] \leftarrow$ mean-by-class($S_{Tr}$)
11: $h \leftarrow$ train($Tr$)  {creates a model over all training data}
12: return $[\mu[S^{\oplus}], \mu[S^{\ominus}], h]$
Algorithm 2 presents the test phase of SMM. When a test set becomes available, SMM computes the score of each test set example using $h$ (line 3). Afterwards, it computes the mean of these scores (line 4) and the $\alpha$ parameter (line 5) according to Equation 2. Lines 6 and 7 ensure that $\alpha$ is contained in the $[0, 1]$ range.

Algorithm 2 Sample Mean Matching (SMM) test phase.
1: Input: The means of the positive and negative training scores, $\mu[S^{\oplus}]$, $\mu[S^{\ominus}]$; classification model, $h$; test set, $Te$
2: Output: Positive class prevalence $\alpha$
3: $S \leftarrow$ score($Te$, $h$)
4: $\mu[S] \leftarrow$ mean($S$)
5: $\alpha \leftarrow \dfrac{\mu[S] - \mu[S^{\ominus}]}{\mu[S^{\oplus}] - \mu[S^{\ominus}]}$
6: $\alpha \leftarrow \max(\alpha, 0)$
7: $\alpha \leftarrow \min(\alpha, 1)$
8: return $\alpha$
SMM is an essentially parameter-free method. Unlike other distribution matching methods, the performance of SMM is independent of the four parameters discussed earlier, i.e., the method to represent the distribution of scores, the number of bins required to represent a distribution, the dissimilarity function to compare the distributions, and the approach required to search for the $\alpha$ that minimizes the distance. Instead, SMM employs the mean value as a summary of the entire distribution. Such a feature makes it very simple to train and to reproduce its results. SMM also has very low memory and processing requirements. The time complexity for training is $O\big(M(|Tr|) + k\,(M(|Tr| - |Tr|/k) + C(|Tr|/k))\big)$, therefore similar to other quantification methods that rely on cross-validation to compute unbiased scores.
The SMM test phase takes as input just two scalar values ($\mu[S^{\oplus}]$, $\mu[S^{\ominus}]$) and a classification model ($h$). The time complexity of the test phase is $O(C(|Te|))$.
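To make the two algorithms concrete, the sketch below is a compact Python rendering of them; it is not the authors' R implementation, and it assumes scikit-learn, binary labels encoded as 0/1 NumPy arrays (1 being the positive class), and the 200-tree random forest scorer described in Section VI.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def smm_train(X_train, y_train, n_trees=200, folds=10, seed=0):
    """Training phase (Algorithm 1): unbiased scores via cross-validation, then the
    per-class score means and a final model fitted on all training data."""
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    scores = cross_val_predict(clf, X_train, y_train, cv=folds,
                               method="predict_proba")[:, 1]  # positive-class column
    mu_pos = scores[y_train == 1].mean()
    mu_neg = scores[y_train == 0].mean()
    clf.fit(X_train, y_train)            # model over the entire training set
    return mu_pos, mu_neg, clf

def smm_test(mu_pos, mu_neg, clf, X_test):
    """Test phase (Algorithm 2): closed-form alpha from the mean test score (Eq. 2)."""
    mu_test = clf.predict_proba(X_test)[:, 1].mean()
    alpha = (mu_test - mu_neg) / (mu_pos - mu_neg)
    return float(np.clip(alpha, 0.0, 1.0))
```

As in Algorithm 2, the test phase only needs the two stored means and one pass over the test scores, which is what keeps SMM's memory and processing requirements so low.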
VI. EXPERIMENTAL SETUP
We conduct a comprehensive experimental evaluation involving 15 state-of-the-art and established quantifiers. Table II summarizes the methods assessed in this paper.
TABLE II: Assessed Quantifiers.

| Quantifier | Brief description | Reference |
|---|---|---|
| CC | Classify & Count | [1] |
| ACC | Adjusted Classify & Count | [1] |
| PCC | Probabilistic Classify & Count | [16] |
| PACC | Probabilistic Adjusted Classify & Count | [16] |
| X | Classifier with decision threshold set to fnr = fpr | [18] |
| T50 | Classifier with decision threshold set to tpr = 50% | [18] |
| MAX | Classifier with decision threshold such that tpr - fpr is maximum | [18] |
| MS | Median Sweep | [18] |
| MS2 | Median Sweep with a subset of thresholds such that tpr - fpr > 0.25 | [18] |
| QT | Quantification Tree | [2] |
| EMQ | Expectation-Maximization Quantifier | [19] |
| DyS-TS | Mixture model of the DyS family with Topsøe distance | [7] |
| SORD | Sample-ORD method | [7] |
| HDy | Mixture model with Hellinger distance | [8] |
| SMM | Sample Mean Matching (proposal) | |
Due to the large number of methods included in the ex-
periments, we decided to divide the results into two parts.
The first part compares our proposal, SMM, with some of the
most accurate quantification methods. Therefore, we compare
SMM with DyS-TS, SORD and HDy. According to our
experiments, DyS-TS and SORD are the two most accurate
quantifiers considering all datasets. Although HDy is not as
competitive as those methods, we decided to include it since
SMM is a simplification of HDy. While HDy represents the
distribution using a histogram, SMM employs a single scalar,
the distribution mean.
In the second evaluation part, we extend our assessment
to include all 15 methods listed in Table II. This second
part gives an overall perspective of the accuracy and runtime
performance of SMM in the light of a larger portion of the
literature.
We use different experimental setups to assess runtime and accuracy performance. For the runtime analysis, we simulate the classification scores to estimate the cost of each method, varying the test set sizes from $10^4$ to $10^9$ with increments of $10^4$. For each size variation, we run each quantifier 1,000 times under the same software and hardware conditions². We used artificial values as scores due to the simplicity of running a large-scale simulation and the fact that we only measure the running time of the quantifiers without considering their accuracy. Inspired by the models used in [24] to simulate the variability in the feature distribution, we simplified that idea, sampling synthetic scores from a continuous uniform distribution over $[0, 1]$.
² All the experiments were performed on the same computer, a 32-core Intel Xeon CPU E5-2620 v4 @ 2.10GHz with 98 GB of RAM, with no processes other than operating-system-related ones running in parallel.
For the counting accuracy experiments, we use 25 real datasets. Table III briefly describes the main features of the datasets, obtained from the UCI [25], OpenML [26], PROMISE [27], and Reis [21] repositories³. We used binary quantification datasets. We converted the datasets that contain multiple classes into binary problems by fixing one class as positive and all other classes as negative.
TABLE III: Datasets Description.

| ID | Dataset | Size | Features | (%) Pos. instances | Repository |
|---|---|---|---|---|---|
| A | AedesSex | 24,000 | 27 | 50 | Reis |
| B | AedesQuinx | 24,000 | 27 | 50 | Reis |
| C | Anuran Calls | 6,585 | 22 | 33 | UCI |
| D | ArabicDigit | 8,800 | 27 | 50 | UCI |
| E | BNG (vote) | 39,366 | 9 | 34 | OpenML |
| F | Bank Marketing | 45,211 | 16 | 12 | UCI |
| G | Click Prediction | 39,948 | 11 | 17 | OpenML |
| H | CMC | 1,473 | 9 | 43 | UCI |
| I | Covertype-reduced | 8,715 | 54 | 46 | UCI |
| J | Credit Card | 30,000 | 23 | 22 | UCI |
| K | EEG Eye State | 14,980 | 14 | 45 | OpenML |
| L | Enc. Stock Market | 96,320 | 22 | 49 | OpenML |
| M | Handwritten-QG | 4,014 | 63 | 50 | Reis |
| N | HTRU2 | 17,898 | 8 | 9 | UCI |
| O | JM1 | 10,880 | 21 | 19 | PROMISE |
| P | Letter Recognition | 20,000 | 16 | 19 | UCI |
| Q | Mozilla4 | 15,545 | 5 | 33 | OpenML |
| R | MAGIC Gamma | 19,020 | 10 | 35 | UCI |
| S | Nomao | 34,465 | 118 | 29 | OpenML |
| T | Occupancy Detec. | 20,560 | 5 | 23 | UCI |
| U | Phoneme | 5,404 | 5 | 29 | OpenML |
| V | Pollen | 3,848 | 6 | 50 | OpenML |
| W | Spambase | 4,601 | 57 | 39 | UCI |
| X | Wine Type | 6,497 | 12 | 37 | UCI |
| Y | Wine Quality | 6,497 | 12 | 37 | UCI |
For training and evaluation of the models, each dataset is
divided into two subsets using stratified sampling, resulting
in training and test sets. From the training part, we estimate
scores with 10-fold cross-validation. These scores are used
by the distribution matching methods [7], [8], including the
proposal SMM.
Similarly, true-positive rate (tpr) and false-positive rate (fpr)
are also estimated using cross-validation for the ACC and
threshold selection methods [5], [18]. According to [16], PCC
and PACC require calibrated scores, which we obtained using
an isotonic regression calibration based on pair-adjacent vio-
lators [22]. For QT, we use the Classification Error Balancing
criterion to define the splits. The source code for QT algorithm
was provided by its authors [2]. The remaining quantifiers
used in our experiments, as well as the auxiliary functions,
are available from our R package4.
In all experiments, we use a random forest classifier with
200 trees to obtain scores and tpr and fpr rates.
We trained the random forest classifiers with balanced
training sets, i.e., an equal number of positive and negative
instances. Conversely, we vary the distribution and size of the
3Specific citations are requested for Credit Card [28], HTRU2 [29],
Mozilla4 [30], Bank Marketing [31], Nomao [32], and Occupancy Detection
[33]. Besides, we note that Jock A. Blackard and Colorado State University
preserve copyright over Covertype.
4https://github.com/andregustavom/mlquantify
7
test sets. Unlike classification, quantification methods require,
as input, a sample of observations. For a comprehensive
assessment, we extract several samples from the test set with
different class distributions. We vary class distribution for each
sample size from 0% to 100% with an increment of 1%.
The number of observations in the test set sample is a rele-
vant aspect of the assessment of quantification methods [34].
Therefore, we generate test samples with different sizes, from
10 instances to 100 instances with an increment of 10 and
from 100 to 500 instances with an increment of 100.
We use the Mean Absolute Error (MAE) [35] to evaluate the accuracy and the mean runtime to assess the efficiency of the quantifiers. MAE is the average absolute difference between the actual ($p$) and predicted ($\hat{p}$) class proportions. The online supplementary material contains all code and datasets⁵.
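The sketch below outlines the evaluation loop described above; the sampling with replacement, the 0/1 label encoding and the helper names are our assumptions for illustration, not details fixed by the protocol.

```python
import numpy as np

def sample_at_prevalence(X, y, size, positive_ratio, rng):
    """Draw a test sample of the given size with a prescribed positive-class ratio."""
    n_pos = int(round(size * positive_ratio))
    pos_idx = rng.choice(np.where(y == 1)[0], n_pos, replace=True)
    neg_idx = rng.choice(np.where(y == 0)[0], size - n_pos, replace=True)
    idx = np.concatenate([pos_idx, neg_idx])
    return X[idx], n_pos / size          # sample and its true positive prevalence

def mean_absolute_error(true_prevalences, predicted_prevalences):
    """MAE: average absolute difference between actual and predicted prevalences."""
    return float(np.mean(np.abs(np.asarray(true_prevalences) -
                                np.asarray(predicted_prevalences))))
```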
VII. EXPERIMENTAL RESULTS
We open this section by discussing the experimental comparison between SMM and three state-of-the-art quantifiers: DyS-TS, SORD and HDy. Fig. 4 summarizes the results by presenting a critical difference diagram [36] regarding quantification performance (measured with MAE) across all 25 datasets.
Fig. 4: Nemenyi's post hoc test for mean absolute quantification error. Groups of methods which are not significantly different at $p < 0.05$ are connected.
According to Fig. 4, SMM is less accurate than DyS-TS and SORD, but with no statistical difference. Conversely, SMM is statistically more accurate than HDy at the $p < 0.05$ significance level. Interestingly, all four methods are mixture models. The differences among them lie mostly in the manner in which the score distributions are represented and compared. SMM is the simplest among them, representing the score distribution by a single scalar value. Even though SMM uses less information than its counterparts, it is still able to achieve competitive results.
Such simplicity makes SMM a very efficient quantifier. We compared SMM with the same three state-of-the-art quantifiers. Fig. 5 shows each method's runtime as a function of the test set size. SMM is, on average, three orders of magnitude faster than DyS-TS, SORD, and HDy. The efficiency advantage of SMM over the other methods remains constant across the entire spectrum of test set sizes.
We conclude this first part of our analysis by emphasizing that although SMM is slightly less accurate than DyS-TS and SORD (with no statistical difference), it is orders of magnitude faster than these methods.
⁵ https://github.com/andregustavom/dsaa_smm
Fig. 5: Runtime of SMM and state-of-the-art quantifiers across different test set sizes.
Let us turn our attention to all 15 quantification methods.
Fig. 1 provides an overview of the results considering both
accuracy and efficiency. Since Fig. 1 presents average runtime
of algorithms over test sets of different sizes, we report the
number of instances processed per second. In contrast, Fig. 5
reports the number of samples processed per second.
Fig. 1 provides the overall ranking of quantifiers considering
all datasets. For a more detailed analysis, Table IV reveals
the MAE numerical results for each combination of quantifier
and dataset, averaged across all class distributions and test
set sizes. The values in bold indicate the best quantification
result for each dataset. According to Table IV, DyS-TS is
the most accurate quantifier with minimal MAE for 15 out
of 25 datasets. Similarly, SORD is the second-best quantifier
and holds the best MAE for six datasets (three of them tied
with DyS-TS). SMM ranks in third place and it does not
produce minimum MAE in any of the datasets. However, for
the majority of the datasets, errors introduced by SMM are
very close to the best results. Across all datasets, the average
MAE difference of SMM compared with the DyS-TS and
SORD is 2 percentual points.
To analyse the differences between SMM and the best
result, we underlined in Table IV values that represent a non-
statistically significant difference regarding the best quantifier
in each dataset. DyS-TS and SORD have an impressive perfor-
mance with no statistically significant differences against the
best quantifier for 23 and 24 of the datasets, respectively. SMM
also performs well, and it is outperformed with statistical
significance in only five of the datasets.
In the 20 out of 25 datasets where the SMM results show no statistically significant difference, SMM is, on average, three orders of magnitude faster than the best quantifier. The only exception is when MAX is the best quantifier, which happens in only two datasets (X and Y).
Fig. 6 shows the ranking of all quantifiers for all datasets
varying the test set size and class distribution. A relevant
aspect of the assessment of quantifiers is the size of the test
set [34]. Quantification methods receive as input a sample
of observations. Thus, these methods can estimate several
statistics over these test samples. Intuitively, some of these
statistics may require more data to be estimated reliably than
others. Therefore, we can expect that not every quantifier
will perform equally for different test set sizes. In particular, small test sets frequently pose a very challenging scenario for quantification methods.
TABLE IV: Mean absolute error for each dataset and quantifier, averaged over all test set sizes and class distributions. The values in bold indicate the best quantification result for each dataset. Underlined values indicate a non-statistically significant difference concerning the best quantifier in each dataset.

| Dataset | DyS-TS | SORD | SMM | MAX | EMQ | ACC | X | MS2 | MS | PACC | CC | PCC | HDy | T50 | QT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 0.010 | 0.010 | 0.012 | 0.012 | 0.010 | 0.012 | 0.014 | 0.025 | 0.025 | 0.012 | 0.013 | 0.015 | 0.020 | 0.128 | 0.137 |
| B | 0.045 | 0.045 | 0.046 | 0.052 | 0.074 | 0.053 | 0.054 | 0.052 | 0.053 | 0.071 | 0.093 | 0.120 | 0.056 | 0.071 | 0.185 |
| C | 0.012 | 0.011 | 0.013 | 0.012 | 0.011 | 0.012 | 0.015 | 0.025 | 0.025 | 0.012 | 0.012 | 0.018 | 0.040 | 0.068 | 0.129 |
| D | 0.008 | 0.008 | 0.014 | 0.008 | 0.013 | 0.008 | 0.008 | 0.024 | 0.024 | 0.007 | 0.006 | 0.009 | 0.021 | 0.081 | 0.174 |
| E | 0.010 | 0.010 | 0.011 | 0.011 | 0.009 | 0.011 | 0.011 | 0.025 | 0.025 | 0.012 | 0.012 | 0.016 | 0.023 | 0.025 | 0.070 |
| F | 0.044 | 0.050 | 0.050 | 0.045 | 0.110 | 0.088 | 0.051 | 0.058 | 0.065 | 0.061 | 0.286 | 0.250 | 0.282 | 0.067 | 0.314 |
| G | 0.125 | 0.133 | 0.131 | 0.154 | 0.269 | 0.213 | 0.230 | 0.151 | 0.156 | 0.508 | 0.468 | 0.329 | 0.296 | 0.154 | 0.451 |
| H | 0.088 | 0.096 | 0.097 | 0.109 | 0.155 | 0.112 | 0.207 | 0.104 | 0.096 | 0.240 | 0.217 | 0.204 | 0.261 | 0.117 | 0.240 |
| I | 0.052 | 0.056 | 0.055 | 0.064 | 0.096 | 0.063 | 0.064 | 0.061 | 0.061 | 0.091 | 0.107 | 0.142 | 0.065 | 0.084 | 0.225 |
| J | 0.087 | 0.092 | 0.089 | 0.099 | 0.190 | 0.106 | 0.165 | 0.093 | 0.097 | 0.235 | 0.301 | 0.259 | 0.280 | 0.098 | 0.247 |
| K | 0.027 | 0.028 | 0.032 | 0.032 | 0.052 | 0.034 | 0.033 | 0.038 | 0.039 | 0.040 | 0.050 | 0.068 | 0.060 | 0.064 | 0.230 |
| L | 0.415 | 0.381 | 0.415 | 0.454 | 0.251 | 0.459 | 0.451 | 0.343 | 0.342 | 0.508 | 0.252 | 0.252 | 0.415 | 0.465 | 0.251 |
| M | 0.006 | 0.004 | 0.007 | 0.005 | 0.008 | 0.004 | 0.005 | 0.023 | 0.023 | 0.004 | 0.003 | 0.004 | 0.007 | 0.088 | 0.130 |
| N | 0.024 | 0.027 | 0.027 | 0.027 | 0.026 | 0.029 | 0.028 | 0.037 | 0.037 | 0.028 | 0.076 | 0.078 | 0.096 | 0.066 | 0.124 |
| O | 0.111 | 0.104 | 0.107 | 0.126 | 0.251 | 0.153 | 0.146 | 0.112 | 0.121 | 0.466 | 0.430 | 0.299 | 0.144 | 0.119 | 0.312 |
| P | 0.016 | 0.019 | 0.022 | 0.017 | 0.019 | 0.027 | 0.019 | 0.032 | 0.032 | 0.047 | 0.026 | 0.037 | 0.065 | 0.062 | 0.353 |
| Q | 0.040 | 0.042 | 0.045 | 0.048 | 0.060 | 0.047 | 0.051 | 0.049 | 0.048 | 0.069 | 0.095 | 0.109 | 0.060 | 0.076 | 0.203 |
| R | 0.023 | 0.026 | 0.025 | 0.027 | 0.026 | 0.027 | 0.034 | 0.036 | 0.035 | 0.041 | 0.062 | 0.055 | 0.030 | 0.069 | 0.120 |
| S | 0.016 | 0.020 | 0.019 | 0.019 | 0.016 | 0.022 | 0.021 | 0.029 | 0.029 | 0.020 | 0.034 | 0.039 | 0.044 | 0.064 | 0.163 |
| T | 0.007 | 0.007 | 0.008 | 0.008 | 0.006 | 0.008 | 0.008 | 0.024 | 0.024 | 0.012 | 0.008 | 0.014 | 0.039 | 0.060 | 0.050 |
| U | 0.034 | 0.035 | 0.037 | 0.040 | 0.057 | 0.046 | 0.066 | 0.043 | 0.044 | 0.053 | 0.095 | 0.117 | 0.174 | 0.068 | 0.230 |
| V | 0.394 | 0.444 | 0.440 | 0.477 | 0.253 | 0.467 | 0.469 | 0.389 | 0.389 | 0.355 | 0.259 | 0.253 | 0.374 | 0.438 | 0.253 |
| W | 0.022 | 0.022 | 0.023 | 0.026 | 0.025 | 0.025 | 0.027 | 0.032 | 0.032 | 0.029 | 0.038 | 0.047 | 0.056 | 0.060 | 0.161 |
| X | 0.009 | 0.008 | 0.011 | 0.007 | 0.009 | 0.009 | 0.011 | 0.024 | 0.024 | 0.008 | 0.008 | 0.011 | 0.051 | 0.096 | 0.125 |
| Y | 0.010 | 0.008 | 0.010 | 0.008 | 0.009 | 0.009 | 0.012 | 0.024 | 0.024 | 0.009 | 0.009 | 0.011 | 0.060 | 0.076 | 0.127 |
| Std. dev. | 0.107 | 0.110 | 0.113 | 0.124 | 0.092 | 0.126 | 0.128 | 0.094 | 0.094 | 0.166 | 0.138 | 0.107 | 0.122 | 0.106 | 0.092 |
| Avg. rank | 3.00 | 3.30 | 5.10 | 5.46 | 6.38 | 6.94 | 7.96 | 8.36 | 8.60 | 8.60 | 9.14 | 10.52 | 11.40 | 11.68 | 13.56 |
Fig. 6: Aggregated rank positions for all 15 quantification algorithms and the 25 datasets.
Fig. 7 shows the average MAE for all 25 datasets split by
test set size. SMM performs similarly to DyS-TS and SORD,
the two best-ranked quantifiers. SMM performance is slightly
worse than these methods considering tiny test sets of size ten.
As the test set sizes increase, the MAE difference between
SMM and DyS-TS and SORD increases. However, even for
the largest test set sizes, the difference is still quite small.
The performance of the quantifiers varies significantly ac-
cording to their quantification strategies. Methods such as CC,
PCC and QT that do not obtain any information from the test
sample have an almost constant performance across all test
set sizes. EMQ performs surprisingly well for small test sets
sizes, but rather poorly for large ones. MS and MS2 slightly
outperform DyS-TS, SORD and SMM for small test sets but
are surpassed for large ones.
Fig. 7: Mean absolute quantification error segregated by test set size.
VIII. CONCLUSION
The trade-off between efficiency and efficacy is a recurrent topic in several areas of Computer Science, including Machine Learning. Frequently, to improve effectiveness, such as counting accuracy, researchers need to propose more sophisticated algorithms that, in turn, affect efficiency.
Our proposal, SMM, is a highly efficient quantifier that provides quantification accuracy only slightly lower than the state-of-the-art. This proposal is relevant for several application domains that need to process large quantities of data with limited time or processing power.
Mixture models for quantification are the main inspiration for our proposal. However, instead of using a multi-dimensional representation of the distributions, SMM uses a simple mean to summarize such information. Such a simplification leads to an algorithm with low requirements in terms of memory and processing. A comparison with other state-of-the-art mixture models reveals that SMM is three orders of magnitude faster. Although SMM's accuracy is lower than that of SORD and DyS-TS, the difference is not statistically significant.
As part of our intentions for future work, we will continue
to develop mixture models, improving their performance for
different sizes of test samples. Our experimental results show
that small test samples are particularly challenging for this
class of methods.
REFERENCES
[1] G. Forman, "Counting positives accurately despite inaccurate classification," in European Conference on Machine Learning. Springer, 2005, pp. 564–575.
[2] L. Milli, A. Monreale, G. Rossetti, F. Giannotti, D. Pedreschi, and F. Sebastiani, "Quantification trees," in 2013 IEEE 13th International Conference on Data Mining. IEEE, 2013, pp. 528–536.
[3] W. Gao and F. Sebastiani, "From classification to quantification in tweet sentiment analysis," Social Network Analysis and Mining, vol. 6, no. 1, 2016.
[4] D. Silva, V. Souza, D. Ellis, E. Keogh, and G. Batista, "Exploring low cost laser sensors to identify flying insect species," Journal of Intelligent & Robotic Systems, vol. 80, no. 1, pp. 313–330, 2015.
[5] G. Forman, "Quantifying counts and costs via classification," Data Mining and Knowledge Discovery, vol. 17, no. 2, pp. 164–206, 2008. [Online]. Available: http://dx.doi.org/10.1007/s10618-008-0097-y
[6] P. González, J. Díez, N. Chawla, and J. J. del Coz, "Why is quantification an interesting learning problem?" Progress in Artificial Intelligence, vol. 6, no. 1, pp. 53–58, 2017.
[7] A. Maletzke, D. dos Reis, E. Cherman, and G. E. A. P. A. Batista, "DyS: a framework for mixture models in quantification," in AAAI Conference on Artificial Intelligence, ser. AAAI '19, 2019.
[8] V. González-Castro, R. Alaiz-Rodríguez, and E. Alegre, "Class distribution estimation based on the Hellinger distance," Information Sciences, vol. 218, pp. 146–164, 2013.
[9] P. Flach, Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press, 2012.
[10] P. González, A. Castaño, N. V. Chawla, and J. J. Del Coz, "A review on quantification learning," ACM Computing Surveys (CSUR), vol. 50, no. 5, p. 74, 2017.
[11] D. d. Reis, M. de Souto, E. de Sousa, and G. Batista, "Quantifying with only positive training data," arXiv preprint arXiv:2004.10356, 2020.
[12] J. Barranquero, P. González, J. Díez, and J. J. del Coz, "On the study of nearest neighbor algorithms for prevalence estimation in binary problems," Pattern Recognition, vol. 46, no. 2, pp. 472–482, Feb. 2013.
[13] R. Alaiz-Rodríguez, A. Guerrero-Curieses, and J. Cid-Sueiro, "Class and subclass probability re-estimation to adapt a classifier in the presence of concept drift," Neurocomputing, vol. 74, no. 16, pp. 2614–2623, 2011.
[14] Y. S. Chan and H. T. Ng, "Estimating class priors in domain adaptation for word sense disambiguation," in Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ser. ACL-44. Stroudsburg, PA, USA: Association for Computational Linguistics, 2006, pp. 89–96.
[15] D. dos Reis, A. Maletzke, E. Cherman, and G. Batista, "One-class quantification," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2018, pp. 273–289.
[16] A. Bella, C. Ferri, J. Hernández-Orallo, and M. J. Ramirez-Quintana, "Quantification via probability estimators," in IEEE International Conference on Data Mining. IEEE, 2010, pp. 737–742.
[17] D. Tasche, "Exact fit of simple finite mixture models," Journal of Risk and Financial Management, vol. 7, no. 4, pp. 150–164, 2014.
[18] G. Forman, "Quantifying trends accurately despite classifier error and class imbalance," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '06. ACM, 2006, pp. 157–166.
[19] M. Saerens, P. Latinne, and C. Decaestecker, "Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure," Neural Computation, vol. 14, no. 1, pp. 21–41, 2002.
[20] A. Maletzke, D. dos Reis, E. Cherman, and G. Batista, "On the need of class ratio insensitive drift tests for data streams," in Proceedings of the Second International Workshop on Learning with Imbalanced Domains: Theory and Applications, ECML-PKDD, vol. 94. Dublin, Ireland: PMLR, 10 Sep 2018, pp. 110–124.
[21] D. dos Reis, A. Maletzke, D. F. Silva, and G. E. A. P. A. Batista, "Classifying and counting with recurrent contexts," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '18. ACM, 2018, pp. 1983–1992.
[22] M. P. Naeini and G. F. Cooper, "Binary classifier calibration using an ensemble of near isotonic regression models," in 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016, pp. 360–369.
[23] B. Everitt and A. Skrondal, The Cambridge Dictionary of Statistics. Cambridge University Press, 2002, vol. 106.
[24] D. Tasche, "Confidence intervals for class prevalences under prior probability shift," arXiv preprint arXiv:1906.04119, 2019.
[25] D. Dheeru and E. Karra Taniskidou, "UCI machine learning repository," 2017. [Online]. Available: http://archive.ics.uci.edu/ml
[26] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo, "OpenML: networked science in machine learning," ACM SIGKDD Explorations Newsletter, vol. 15, no. 2, pp. 49–60, 2013.
[27] J. Sayyad Shirabad and T. Menzies, "The PROMISE repository of software engineering databases," School of Information Technology and Engineering, University of Ottawa, Canada, 2005. [Online]. Available: http://promise.site.uottawa.ca/SERepository
[28] I.-C. Yeh and C.-h. Lien, "The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients," Expert Systems with Applications, vol. 36, no. 2, pp. 2473–2480, 2009.
[29] R. J. Lyon, B. Stappers, S. Cooper, J. Brooke, and J. Knowles, "Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach," Monthly Notices of the Royal Astronomical Society, vol. 459, no. 1, pp. 1104–1123, 2016.
[30] A. G. Koru, D. Zhang, and H. Liu, "Modeling the effect of size on defect proneness for open-source software," in Predictor Models in Software Engineering (PROMISE'07: ICSE Workshops 2007). IEEE, 2007, pp. 10–10.
[31] S. Moro, P. Cortez, and P. Rita, "A data-driven approach to predict the success of bank telemarketing," Decision Support Systems, vol. 62, pp. 22–31, 2014.
[32] L. Candillier and V. Lemaire, "Design and analysis of the Nomao challenge: active learning in the real-world," in Active Learning in Real-world Applications, Workshop ECML-PKDD, 2012.
[33] L. M. Candanedo and V. Feldheim, "Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models," Energy and Buildings, vol. 112, pp. 28–39, 2016.
[34] A. Maletzke, W. Hassan, D. d. Reis, and G. Batista, "The importance of the test set size in quantification assessment," in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, 2020, pp. 2640–2646. [Online]. Available: https://doi.org/10.24963/ijcai.2020/366
[35] G. Da San Martino, W. Gao, and F. Sebastiani, "Ordinal text quantification," in Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2016, pp. 937–940.
[36] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," J. Mach. Learn. Res., vol. 7, pp. 1–30, Dec. 2006.
APPENDIX A
The SMM $\alpha$ parameter can be computed in closed form. Equation 2 has a straightforward derivation:
$$\alpha\,\mu[S^{\oplus}] + (1 - \alpha)\,\mu[S^{\ominus}] = \mu[S]$$
$$\alpha\,\mu[S^{\oplus}] + \mu[S^{\ominus}] - \alpha\,\mu[S^{\ominus}] = \mu[S]$$
$$\alpha\,(\mu[S^{\oplus}] - \mu[S^{\ominus}]) = \mu[S] - \mu[S^{\ominus}]$$
$$\alpha = \frac{\mu[S] - \mu[S^{\ominus}]}{\mu[S^{\oplus}] - \mu[S^{\ominus}]} \quad (3)$$
Article
Quantification, variously called supervised prevalence estimation or learning to quantify , is the supervised learning task of generating predictors of the relative frequencies (a.k.a. prevalence values ) of the classes of interest in unlabelled data samples. While many quantification methods have been proposed in the past for binary problems and, to a lesser extent, single-label multiclass problems, the multi-label setting (i.e., the scenario in which the classes of interest are not mutually exclusive) remains by and large unexplored. A straightforward solution to the multi-label quantification problem could simply consist of recasting the problem as a set of independent binary quantification problems. Such a solution is simple but naïve, since the independence assumption upon which it rests is, in most cases, not satisfied. In these cases, knowing the relative frequency of one class could be of help in determining the prevalence of other related classes. We propose the first truly multi-label quantification methods, i.e., methods for inferring estimators of class prevalence values that strive to leverage the stochastic dependencies among the classes of interest in order to predict their relative frequencies more accurately. We show empirical evidence that natively multi-label solutions outperform the naïve approaches by a large margin. The code to reproduce all our experiments is available online.
Article
Quantification (or prevalence estimation) algorithms aim at predicting the class distribution of unseen sets (or bags) of examples. These methods are useful for two main tasks: 1) quantification applications, for instance when we need to track the proportions of several groups of interest over time, and 2) domain adaptation problems, in which we usually need to adapt a previously trained classifier to a different --albeit related-- target distribution according to the estimated prevalences. This paper analyzes several binary quantification algorithms showing that not only do they share a common framework but are, in fact, equivalent. Inspired by this study, we propose a new method that extends one of the approaches analyzed. After an empirical evaluation of all these methods using synthetic and benchmark datasets, the paper concludes recommending three of them due to their precision, efficiency, and diversity.
Chapter
Full-text available
This chapter concludes the book, discussing possible future developments in the quantification arena.
Chapter
Full-text available
This chapter looks at other aspects of the “quantification landscape” that have not been covered in the previous chapters, and discusses the evolution of quantification research, from its beginnings to the most recent quantification-based “shared tasks”; the landscape of quantification-based, publicly available software libraries; visualization tools specifically oriented to displaying the results of quantification-based experiments; and other tasks in data science that present important similarities with quantification. This chapter also presents the results of experiments, that we have carried out ourselves, in which we compare many of the methods discussed in Chapter 2 on a common testing infrastructure.
Chapter
Full-text available
This chapter provides the motivation for what is to come in the rest of the book by describing the applications that quantification has been put at, ranging from improving classification accuracy in domain adaptation, to measuring and improving the fairness of classification systems with respect to a sensitive attribute, to supporting research and development in fields that are usually more concerned with aggregate data than with individual data, such as the social sciences, political science, epidemiology, market research, ecological modelling, and others.
Chapter
Full-text available
In this chapter we discuss the experimental evaluation of quantification systems. We look at evaluation measures for the various types of quantification systems (binary, single-label multiclass, multi-label multiclass, ordinal), but also at evaluation protocols for quantification, that essentially consist in ways to extract multiple testing samples for use in quantification evaluation from a single classification test set. The chapter ends with a discussion on how to perform model selection (i.e., hyperparameter optimization) in a quantification-specific way.
Conference Paper
Full-text available
Quantification is a task similar to classification in the sense that it learns from a labeled training set. However, quantification is not interested in predicting the class of each observation, but rather measure the class distribution in the test set. The community has developed performance measures and experimental setups tailored to quantification tasks. Nonetheless, we argue that a critical variable, the size of the test sets, remains ignored. Such disregard has three main detrimental effects. First, it implicitly assumes that quantifiers will perform equally well for different test set sizes. Second, it increases the risk of cherry-picking by selecting a test set size for which a particular proposal performs best. Finally, it disregards the importance of designing methods that are suitable for different test set sizes. We discuss these issues with the support of one of the broadest experimental evaluations ever performed, with three main outcomes. (i) We empirically demonstrate the importance of the test set size to assess quantifiers. (ii) We show that current quantifiers generally have a mediocre performance on the smallest test sets. (iii) We propose a metalearning scheme to select the best quantifier based on the test size that can outperform the best single quantification method.
Article
Full-text available
Quantification is an expanding research topic in Machine Learning literature. While in classification we are interested in obtaining the class of individual observations, in quantification we want to estimate the total number of instances that belong to each class. This subtle difference allows the development of several algorithms that incur smaller and more consistent errors than counting the classes issued by a classifier. Among such new quantification methods, one particular family stands out due to its accuracy, simplicity, and ability to operate with imbalanced training samples: Mixture Models (MM). Despite these desirable traits, MM, as a class of algorithms, lacks a more in-depth understanding concerning the influence of internal parameters on its performance. In this paper, we generalize MM with a base framework called DyS: Distribution y-Similarity. With this framework, we perform a thorough evaluation of the most critical design decisions of MM models. For instance, we assess 15 dissimilarity functions to compare histograms with varying numbers of bins from 2 to 110 and, for the first time, make a connection between quantification accuracy and test sample size, with experiments covering 24 public benchmark datasets. We conclude that, when tuned, Topsøe is the histogram distance function that consistently leads to smaller quantification errors and, therefore, is recommended to general use, notwithstanding Hellinger Distance’s popularity. To rid MM models of the dependency on a choice for the number of histogram bins, we introduce two dissimilarity functions that can operate directly on observations. We show that SORD, one of such measures, presents performance that is slightly inferior to the tuned Topsøe, while not requiring the sensible parameterization of the number of bins.
Article
Full-text available
Point estimation of class prevalences in the presence of dataset shift has been a popular research topic for more than two decades. Less attention has been paid to the construction of confidence and prediction intervals for estimates of class prevalences. One little considered question is whether or not it is necessary for practical purposes to distinguish confidence and prediction intervals. Another question so far not yet conclusively answered is whether or not the discriminatory power of the classifier or score at the basis of an estimation method matters for the accuracy of the estimates of the class prevalences. This paper presents a simulation study aimed at shedding some light on these and other related questions.
Conference Paper
Full-text available
Quantification is an expanding research topic in Machine Learning literature. While in classification we are interested in obtaining the class of individual observations, in quantification we want to estimate the total number of instances that belong to each class. This subtle difference allows the development of several algorithms that incur smaller and more consistent errors than counting the classes issued by a classifier. Among such new quantification methods, one particular family stands out due to its accuracy, simplicity, and ability to operate with imbalanced training samples: Mixture Models (MM). Despite these desirable traits, MM, as a class of algorithms, lacks a more in-depth understanding concerning the influence of internal parameters on its performance. In this paper, we generalize MM with a base framework called DyS: Distribution y-Similarity. With this framework, we perform a thorough evaluation of the most critical design decisions of MM models. For instance, we assess 15 dissimilarity functions to compare histograms with varying numbers of bins from 2 to 110 and, for the first time, make a connection between quantification accuracy and test sample size, with experiments covering 24 public benchmark datasets. We conclude that, when tuned, Topsøe is the histogram distance function that consistently leads to smaller quantification errors and, therefore, is recommended to general use, notwithstanding Hellinger Distance's popularity. To rid MM models of the dependency on a choice for the number of histogram bins, we introduce two dissimilarity functions that can operate directly on observations. We show that SORD, one of such measures, presents performance that is slightly inferior to the tuned Topsøe, while not requiring the sensible parameterization of the number of bins.
Conference Paper
Full-text available
Early approaches to detect concept drifts in data streams without actual class labels aim at minimizing external labeling costs. However, their functionality is dubious when presented with changes in the proportion of the classes over time, as such methods keep reporting concept drifts that would not damage the performance of the current classification model. In this paper, we present an approach that can detect changes in the distribution of the features that is insensitive to changes in the distribution of the classes. The method also provides an estimate of the current class ratio and use it to adapt the threshold of a classification model trained with a balanced data. We show that the classification performance achieved by such a modified classifier is greater than that of a classifier trained with the same class distribution as the current imbalanced data.
Conference Paper
Full-text available
This paper proposes one-class quantification, a new Machine Learning task. Quantification estimates the class distribution of an unlabeled sample of instances. Similarly to one-class classification, we assume that only a sample of examples of a single class is available for learning, and we are interested in counting the cases of such class in a test set. We formulate, for the first time, one-class quantification methods and assess them in a comprehensible open-set evaluation. In an open-set problem, several "subclasses" represent the negative class, and we cannot assume to have enough observations for all of them at training time. Therefore, new classes may appear after deployment, making this a challenging setup for existing quantification methods. We show that our proposals are simple and more accurate than the state-of-the-art in quantification. Finally, the approaches are very efficient, fitting batch and stream applications .
Conference Paper
Full-text available
Many real-world applications in the batch and data stream settings with data shift pose restrictions to the access to class labels after the deployment of a classification or quantification model. However, a significant portion of the data stream literature assumes that actual labels are instantaneously available after issuing their corresponding classifications. In this paper, we explore a different set of assumptions without relying on the availability of class labels. We assume that, although the distribution of the data may change over time, it will switch between one of a handful of well-known distributions. Still, we allow the proportions of the classes to vary. In these conditions, we propose the first method that can accurately identify the correct context of data samples and simultaneously estimate the proportion of the positive class. This estimate can be further used to adjust a classification decision threshold and improve classification accuracy. Finally, the method is very efficient regarding time and memory requirements, fitting data stream applications.
Article
Full-text available
The task of quantification consists in providing an aggregate estimation (e.g., the class distribution in a classification problem) for unseen test sets, applying a model that is trained using a training set with a different data distribution. Several real-world applications demand this kind of method that does not require predictions for individual examples and just focuses on obtaining accurate estimates at an aggregate level. During the past few years, several quantification methods have been proposed from different perspectives and with different goals. This article presents a unified review of the main approaches with the aim of serving as an introductory tutorial for newcomers in the field.
Conference Paper
Learning expressive representations is always crucial for well-performed policies in deep reinforcement learning (DRL). Different from supervised learning, in DRL, accurate targets are not always available, and some inputs with different actions only have tiny differences, which stimulates the demand for learning expressive representations. In this paper, firstly, we empirically compare the representations of DRL models with different performances. We observe that the representations of a better state extractor (SE) are more scattered than a worse one when they are visualized. Thus, we investigate the singular values of representation matrix, and find that, better SEs always correspond to smaller differences among these singular values. Next, based on such observations, we define an indicator of the representations for DRL model, which is the Number of Significant Singular Values (NSSV) of a representation matrix. Then, we propose I4R algorithm, to improve DRL algorithms by adding the corresponding regularization term to enhance the NSSV. Finally, we apply I4R to both policy gradient and value based algorithms on Atari games, and the results show the superiority of our proposed method.
Conference Paper
Learning accurate probabilistic models from data is crucial in many practical tasks in data mining. In this paper we present a new non-parametric calibration method called ensemble of near isotonic regression (ENIR). The method can be considered as an extension of BBQ [20], a recently proposed calibration method, as well as the commonly used calibration method based on isotonic regression (IsoRegC) [27]. ENIR is designed to address the key limitation of IsoRegC which is the monotonicity assumption of the predictions. Similar to BBQ, the method post-processes the output of a binary classifier to obtain calibrated probabilities. Thus it can be used with many existing classification models to generate accurate probabilistic predictions. We demonstrate the performance of ENIR on synthetic and real datasets for commonly applied binary classification models. Experimental results show that the method outperforms several common binary classifier calibration methods. In particular on the real data, ENIR commonly performs statistically significantly better than the other methods, and never worse. It is able to improve the calibration power of classifiers, while retaining their discrimination power. The method is also computationally tractable for large scale datasets, as it is O(N log N) time, where N is the number of samples.