Accurately Quantifying a Billion Instances per Second*

Waqar Hassan, ICMC-USP, São Carlos, Brazil (waqar@usp.br)
André Maletzke, UNIOESTE/ICMC-USP, Foz do Iguaçu, Brazil (andre.maletzke@unioeste.br)
Gustavo Batista, CSE-UNSW, Sydney, Australia (g.batista@unsw.edu.au)

*This work has been partially funded by CNPq-TWAS (139467/2017-3).
Abstract—Quantification is a thriving research area that develops methods to estimate the class prior probabilities in an unlabelled set of observations. Quantification and classification share several similarities. For instance, the most straightforward quantification method, Classify & Count (CC), directly counts the output of a classifier. However, CC has a systematic bias that makes it increasingly misestimate the counts as the class distribution drifts away from a distribution it perfectly quantifies. This issue has motivated the development of more reliable quantification methods. Such newer methods can consistently outperform CC at the cost of a significant increase in processing requirements. Yet, for a large number of applications, quantification speed is an additional criterion that must be considered. Frequently, quantification methods need to deal with large amounts of data or fast-paced streams, as is the case of news feeds, tweets and sensor data. In this paper, we propose Sample Mean Matching (SMM), a highly efficient algorithm able to quantify billions of data instances per second. We compare SMM to a set of 14 established and state-of-the-art quantifiers in an empirical analysis comprising 25 benchmark and real-world datasets. We show that SMM is competitive with state-of-the-art methods with no statistical difference in counting accuracy, and it is orders of magnitude faster than the vast majority of the algorithms.

Index Terms—Machine Learning, Quantification, Mixture Methods
I. INTRODUCTION
Quantification is the research area that develops methods to estimate the class distribution in an unlabelled sample. This area finds applications in tasks concerned with understanding the behaviour of groups rather than predicting the class of individual observations.
For instance, Forman [1] uses quantification methods to
estimate the number of terrorism-related news in the last
month. Milli and colleagues [2] determine the approximate
percentage of unemployed for a given period or according to
different geographical regions. Gao and Sebastiani [3] count
tweets as positive, negative or neutral about a particular topic
under debate in society. Finally, Silva and colleagues [4]
estimate the number of disease-carrying mosquitoes captured
by an insect trap.
Quantification shares a series of similarities with classifica-
tion. The simplest quantification method is a direct application
of classification. Such an approach, known as Classify & Count (CC), merely counts the output of a classifier. However,
CC has a systematic bias that makes it increasingly misesti-
mate the counts as the class distribution drifts away from a
distribution that CC perfectly quantifies [5], [6]. Such a flaw
has motivated a growing number of researchers to propose
more reliable quantification methods.
Therefore, most of the research in quantification has fo-
cused on counting accuracy. However, for a large number
of applications, quantification speed is an additional criterion
that must be considered. Frequently, quantification methods
need to deal with large amounts of data or fast-paced streams,
as it is the case of news feeding, tweets and sensor data.
Those applications demand approaches that are both fast and
accurate.
In this paper, we propose Sample Mean Matching (SMM), a highly efficient algorithm able to quantify billions of data instances per second. SMM is orders of magnitude faster than the state-of-the-art, yet it provides estimates of comparable precision. Our proposal is inspired by the mixture methods of the DyS family [7], such as HDy [8]. However, our approach is significantly more straightforward and, consequently, more efficient than other methods of this family.
We compare SMM to a set of 14 established and state-
of-the-art quantifiers in an empirical analysis comprising 25
benchmark and real-world datasets. We show that SMM is
competitive with state-of-the-art methods with no statistical
difference in performance, and it is much more efficient than
the vast majority of algorithms in terms of time complexity.
Fig. 1 provides an overview of our empirical evaluation.
This article is organized as follows: Section II presents
background concepts regarding classification and quantifica-
tion tasks and the primary difference between these tasks.
Section III summarizes the relevant literature in quantification,
describing the counting methods assessed in this paper. Section
IV analyzes the time complexity of the training and test phases
of the quantification methods reviewed in the previous section.
Section V presents our proposal based on the idea of mixture
models. Section VI describes the experimental setup proposed
in this paper. Section VII discusses and analyses the empirical
results. Finally, Section VIII concludes this work and presents
directions for future research.
Fig. 1: A time and accuracy comparison plot encompassing 15 quantifiers assessed on 25 datasets. The axes contrast counting-accuracy rank with the number of instances quantified per second (log scale), dividing the plane into fast/slow and accurate/inaccurate regions; the literature has no methods lying in the gap region of the plot. Our proposal, SMM, stands out as one of the most efficient methods while remaining highly competitive in terms of counting accuracy. The counting accuracy is expressed as a rank to avoid averaging this measure over different datasets.
II. BACKGROUND
Classification is a task that induces a predictive model using a training set $D = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$, where $\mathbf{x}_i \in \mathcal{X}$ is a vector with $m$ attributes in the feature space $\mathcal{X}$, and $y_i \in \mathcal{Y} = \{c_1, \dots, c_l\}$ is its respective class label.
The objective of classification is to correctly predict the class labels of individual observations in an unlabeled test set using their feature values. Therefore, a classifier is a model $h$ induced from $D$ such that
$$h: \mathcal{X} \to \{c_1, \dots, c_l\}$$
A. Scorer
Classifiers employ different mechanisms to decide which class will be assigned to any given observation. In binary classification, one of the two classes is denominated the positive class ($c_1 = \oplus$), while the other is denominated the negative class ($c_2 = \ominus$). In this setting, one can induce a scorer $h_S(\mathbf{x})$. A scorer is a model induced from $D$ such that
$$h_S: \mathcal{X} \to \mathbb{R}$$
A scorer produces a numerical value called score that correlates with the posterior probability of the positive class, that is, $P(Y = \oplus \mid \mathbf{x})$. Consequently, the greater the score is, the higher is the chance of $\mathbf{x}$ belonging to the positive class.
For classification purposes, if such a score is greater than
a certain threshold, th, the observation is classified as pos-
itive. Otherwise, it is classified as negative [9]. We refer to
scores of negative observations simply as negative scores and
analogously refer to scores of positive observations as positive
scores.
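To make the scorer concrete, the following minimal sketch (an illustration only, assuming scikit-learn and binary labels encoded as 0/1, with 1 as the positive class) derives a scorer from a random forest, the same learner we use in the experiments of Section VI, by taking the predicted probability of the positive class; `classify` then recovers crisp predictions from a threshold.

```python
# Minimal sketch of a scorer h_S and its thresholded classifier.
# Assumptions: scikit-learn is available; labels are encoded as 0 (negative) / 1 (positive).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_scorer(X_train, y_train, n_trees=200, seed=0):
    """Induce a scorer from labelled data: maps instances to scores in [0, 1]."""
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    clf.fit(X_train, y_train)
    pos_col = list(clf.classes_).index(1)            # column of the positive class
    return lambda X: clf.predict_proba(X)[:, pos_col]

def classify(scores, threshold=0.5):
    """Turn scores into crisp positive (1) / negative (0) predictions."""
    return (np.asarray(scores) >= threshold).astype(int)
```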
B. Quantification
Quantification and classification have common characteristics, such as the representation of data. However, their objectives differ. A classifier provides an outcome for each input instance. Conversely, a quantifier assesses the overall quantity of observations that belong to a specific class or a set of classes [10]. Therefore, a quantifier is a model induced from $D$ that predicts the prevalence of each class in a sample, such that
$$q: S_{\mathcal{X}} \to [0,1]^l$$
$S_{\mathcal{X}}$ denotes the universe of possible samples from $\mathcal{X}$. For a given test sample $S \in S_{\mathcal{X}}$, the quantifier outputs a vector $\hat{\mathbf{p}} = [\hat{p}_1, \dots, \hat{p}_l]$, where $\hat{p}_i$ estimates the prior probability for class $c_i$, such that $\sum_{j=1}^{l} \hat{p}_j = 1$. The objective is for $[\hat{p}_1, \dots, \hat{p}_l]$ to be as close as possible to the true prior ratios $[P(c_1), \dots, P(c_l)]$ of the probability distribution from which $S$ was sampled.
In classification, we usually assume that the data are in-
dependent and identically distributed (i.i.d). “Identically dis-
tributed” means that all observations, from either the training
or test sets, share the same underlying distribution. “Indepen-
dently distributed” means that the observations are indepen-
dent of each other. In other words, the occurrence of one
observation does not affect the probability of the appearance
of any other particular observation.
Similarly to classification, in quantification, we still assume that observations are independent. However, the training and test instances do not come from the same underlying distribution. As the main task is to measure the prior probabilities of the classes in $S$, the class distribution can change significantly from the training set (which supports the induction of $q$) to the test sample $S$ [11].
III. RELATED WORK
Quantification has been explored and has evolved over the last decades, resulting in several proposals under different names for the quantification task, such as prevalence estimation [12], class probability re-estimation [13], class prior estimation [14], and class distribution estimation [8]. González and colleagues [10] organized most of the prior work according to similarities between algorithms, resulting in a taxonomy of quantification methods consisting of three groups:
I. Classify, count & correct: methods that first classify each instance and then count how many belong to each class. Methods that apply any correction to their predictions are included in this group as well;
II. Adapting traditional classification algorithms: algorithms that modify the mechanics of traditional classification learning methods so that they become quantifiers;
III. Distribution matching: algorithms that parametrically model the training distribution and later search for the parameters that produce the best match against the test set.
A. Classify, Count & Correct
The most straightforward approach is Classify & Count
(CC). It is a naive adaptation of classifiers to quantification
problems. Forman [5] has demonstrated that CC has a systematic error that monotonically increases as we move away from a distribution for which CC provides optimal counting.
CC uses a classifier to label each instance in the test sample. Afterwards, it counts the number of examples belonging to each class. CC provides optimal quantification results with a perfect classifier. However, a classifier with balanced errors, for instance, a binary classifier that commits an equal number of false-positive and false-negative errors, is also optimal for CC. Intuitively, in these situations, CC benefits from the fact that opposite mistakes nullify each other.
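As a reference point for the corrected methods discussed below, here is a minimal sketch of CC; `train_scorer` and `classify` are the hypothetical helpers from the scorer sketch in Section II-A, so the names are assumptions rather than part of the original method description.

```python
def classify_and_count(scorer, X_test, threshold=0.5):
    """CC: classify every test instance and report the fraction predicted positive."""
    predictions = classify(scorer(X_test), threshold)
    return float(predictions.mean())   # estimated prevalence of the positive class
```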
Fig. 2 illustrates the probability density functions of the
scores for a binary classification problem for a hypothetical
test set. We chose the threshold so that the number of
false-positives matches the false-negatives. Therefore, the CC
quantifier provides perfect quantification, albeit the underlying
classifier is not perfect.
Fig. 2: Illustration of a scenario in which a CC quantifier
provides flawless quantification even though the underlying
classifier is not perfect [15].
The CC outcome is the count of every observation with
score above the threshold. In other words, it is the sum of
true-positives and false-positives. However, the actual count
is the sum of true-positives and false-negatives. Fig. 2 helps
us to understand the motivation behind several quantification
methods that we can separate into two sub-groups. The first
sub-group corrects the counts by estimating false-positive
and false-negative errors. The second sub-group searches for
suitable threshold values, such as the ones that provide more
reliable estimates for the false-positive and false-negative rates.
A well-known approach of the first sub-group is Adjusted
Classify & Count (ACC) [1]. In absolute numbers, ACC’s
correction factor adds the false-negatives to CC’s output and
then subtracts the false-positives. However, ACC is more
commonly expressed as frequencies, in the following manner:
$$\hat{P}_{ACC}(\oplus) = \frac{\hat{P}_{CC}(\oplus) - P(\oplus \mid \ominus)}{P(\oplus \mid \oplus) - P(\oplus \mid \ominus)} \quad (1)$$
where $\hat{P}_{CC}(\oplus)$ is the positive class prevalence provided by CC in the test set, $P(\oplus \mid \ominus)$ is the false-positive rate, and $P(\oplus \mid \oplus)$ is the true-positive rate.
If we knew the true-positive and false-positive rates in the
test set, then ACC would be a perfect quantifier. However, as
the test set is unlabelled, the best we can do is to estimate these
quantities in the training set. As Fig. 1 suggests, estimating
these values in the training set makes ACC far from being
perfect and not as accurate as the state-of-the-art.
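A minimal sketch of the ACC correction in Eq. 1 follows, reusing the hypothetical `classify_and_count` helper above; the tpr and fpr arguments are assumed to have been estimated on the training set via cross-validation, as described in the text.

```python
import numpy as np

def adjusted_classify_and_count(scorer, X_test, tpr, fpr, threshold=0.5):
    """ACC: correct the CC estimate with training-set tpr/fpr estimates (Eq. 1)."""
    cc = classify_and_count(scorer, X_test, threshold)
    if tpr - fpr == 0:
        return cc                          # degenerate classifier: no correction possible
    acc = (cc - fpr) / (tpr - fpr)
    return float(np.clip(acc, 0.0, 1.0))   # prevalences must lie in [0, 1]
```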
Probabilistic Classify & Count (PCC) and Probabilistic Adjusted Classify & Count (PACC) [16] assume that probabilities in the form of calibrated scores carry richer information than crisp predictions. Instead of counting the positive predictions as in CC, PCC averages the scores provided by the scorer $h_S$ to estimate the class proportion. PACC is similar to PCC but uses the same correction factor as ACC. Similar to CC, when the class distribution changes, PCC also overestimates or underestimates the actual class proportion [17]. Moreover, these probabilistic methods suffer from a chicken-and-egg problem since the calibration of classifiers depends on the class distribution [5].
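For illustration, a sketch of PCC is shown below under the assumption that `calibrated_scorer` already returns calibrated posterior probabilities (e.g., after the isotonic calibration used in our setup); PACC then applies the same correction factor as in the ACC sketch, but to this average.

```python
import numpy as np

def probabilistic_classify_and_count(calibrated_scorer, X_test):
    """PCC: average the calibrated positive-class probabilities over the test sample."""
    return float(np.mean(calibrated_scorer(X_test)))
```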
The second sub-group selects the threshold value for a classifier using the training data [18]. There are several strategies to choose a proper threshold value, such as the one that equates the false-negative and false-positive rates, or values that can possibly provide more reliable error estimates in imbalanced datasets. Some of the most used strategies are the following.
X: sets the threshold to a value that equates the false-negative and false-positive rates;
Max: maximizes the denominator of ACC in Eq. 1 by finding the threshold value which maximizes the difference between the true-positive and false-positive rates;
T50: adjusts the threshold so that the true-positive rate is 50%;
Median sweep (MS): returns the median of several applications of the ACC method for a predefined range of thresholds. It uses the true-positive and false-positive rates estimated from the training set using cross-validation (a sketch of MS is given after this list). Forman [18] also proposes a variant of MS named MS2, which considers only thresholds for which the denominator of Eq. 1, i.e., the difference between the true-positive and false-positive rates, is greater than 0.25.
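The sketch below illustrates Median Sweep under our reading of the method; the threshold-indexed tpr/fpr dictionaries are hypothetical inputs standing in for the rates estimated on the training set via cross-validation, and setting `min_gap=0.25` yields the MS2 variant.

```python
import numpy as np

def median_sweep(scorer, X_test, thresholds, tpr_by_th, fpr_by_th, min_gap=0.0):
    """MS: median of ACC estimates over a range of thresholds.
    tpr_by_th / fpr_by_th map each threshold to rates estimated on the training set;
    min_gap=0.25 gives the MS2 variant, which skips unreliable denominators."""
    estimates = []
    for th in thresholds:
        tpr, fpr = tpr_by_th[th], fpr_by_th[th]
        if tpr - fpr <= min_gap:
            continue                       # denominator too small to be reliable
        cc = classify_and_count(scorer, X_test, threshold=th)
        estimates.append(np.clip((cc - fpr) / (tpr - fpr), 0.0, 1.0))
    # fall back to the uncorrected count at the default threshold if all were skipped
    return float(np.median(estimates)) if estimates else classify_and_count(scorer, X_test)
```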
B. Adapting Traditional Classification Algorithms
Quantification Trees (QT) is a decision tree learning algo-
rithm for quantification [2]. Similar to other decision trees,
QT greedily selects the splitting-feature and splitting-threshold
at a decision node that optimizes a given criterion. However,
instead of using typical classification measures, such as the
ones based on information theory, the proposal uses alternative
measures tailored for quantification problems. Two evaluation
metrics proposed in [2] are the following:
Classification Error Balancing: for each possible split and class $c_i$, the difference between false-negatives ($FN_i$) and false-positives ($FP_i$) is computed as:
$$E_{c_i} = |FP_i - FN_i|$$
where the optimum value for quantification is achieved when $FP_i$ equals $FN_i$, i.e., $E_{c_i} = 0$;
Classification-Quantification Balancing: this approach is an improvement over the previous one. The authors made a trade-off between quantification and classification accuracy. Thus, for each possible split and class $c_i$, it computes the following equation:
$$E'_{c_i} = |FP_i - FN_i| \times |FP_i + FN_i|$$
where the right side of the multiplication is a measure of classification performance; perfect classification is achieved if $FP_i = FN_i = 0$.
For a possible split, the values of the chosen evaluation metric are aggregated in a new vector, $E = [E_{c_1}, \dots, E_{c_l}]$ or $E' = [E'_{c_1}, \dots, E'_{c_l}]$. The final score is given by the L2-norm of $E$. To calculate the goodness of the split, the quantification accuracy of the parent node (before splitting) and child node (after splitting) are compared:
$$\Delta = \|E_{parent}\|_2 - \|E_{child}\|_2$$
where $\Delta$ denotes the goodness of the split: the larger the delta, the better the split. The growing process is terminated if $\Delta \le 0$. Additionally, the authors performed experiments with a simple decision tree and with Random Forest.
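The rough sketch below shows how the Classification Error Balancing criterion could be evaluated for a candidate split; it assumes that each node predicts its majority class for every instance it receives and that the error vectors of the two children are summed, which are simplifications on our part rather than details given in [2].

```python
import numpy as np

def node_error_vector(y_node, classes):
    """E = [E_c1, ..., E_cl] with E_ci = |FP_i - FN_i|, assuming the node predicts
    its majority class for every instance that reaches it."""
    y_node = np.asarray(y_node)
    values, counts = np.unique(y_node, return_counts=True)
    majority = values[np.argmax(counts)]
    errors = []
    for c in classes:
        fp = np.sum((majority == c) & (y_node != c))   # predicted c, true class differs
        fn = np.sum((majority != c) & (y_node == c))   # true class c, predicted otherwise
        errors.append(abs(int(fp) - int(fn)))
    return np.array(errors, dtype=float)

def split_goodness(y_parent, y_left, y_right, classes):
    """Delta = ||E_parent||_2 - ||E_child||_2; larger is better, stop growing if <= 0."""
    e_parent = node_error_vector(y_parent, classes)
    e_child = node_error_vector(y_left, classes) + node_error_vector(y_right, classes)
    return float(np.linalg.norm(e_parent) - np.linalg.norm(e_child))
```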
C. Distribution Matching
We can separate the distribution matching methods into two classes of approaches. The first one uses a variation of the well-known Expectation-Maximization (EM) algorithm. The second class comprises algorithms that mix training-set distributions to match the test-set distribution.
Expectation-Maximization Quantifier (EMQ) [19] is an iterative approach to estimate the class prevalence in imbalanced class distributions based on the Expectation-Maximization algorithm. EMQ is initialized with the class prevalence in the training set. Then, at each iteration, it updates the estimated class ratios to better approximate the class distribution in the test set. Formally:
$$\hat{P}^{(0)}_{EMQ}(c_i) = \hat{P}_{Tr}(c_i)$$
$$\hat{P}^{(t)}_{EMQ}(c_i \mid \mathbf{x}_k) = \frac{\frac{\hat{P}^{(t)}_{EMQ}(c_i)}{\hat{P}_{Tr}(c_i)}\,\hat{P}_{Tr}(c_i \mid \mathbf{x}_k)}{\sum_{j=1}^{l} \frac{\hat{P}^{(t)}_{EMQ}(c_j)}{\hat{P}_{Tr}(c_j)}\,\hat{P}_{Tr}(c_j \mid \mathbf{x}_k)}$$
$$\hat{P}^{(t+1)}_{EMQ}(c_i) = \frac{1}{n}\sum_{k=1}^{n} \hat{P}^{(t)}_{EMQ}(c_i \mid \mathbf{x}_k)$$
where $\hat{P}_{Tr}(c_i)$ is the prior probability of class $c_i$ estimated in the training set $Tr$, and $\hat{P}_{Tr}(c_i \mid \mathbf{x}_k)$ is the posterior probability for the class $c_i$ given a test instance $\mathbf{x}_k$. Such a quantity is estimated by a probabilistic classifier learned over the training set. For each iteration $t$, $\hat{P}^{(t)}_{EMQ}(c_i \mid \mathbf{x}_k)$ and $\hat{P}^{(t+1)}_{EMQ}(c_i)$ are re-estimated sequentially for each instance $\mathbf{x}_k$ in the test set and each class $c_i \in \mathcal{Y}$.
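A compact sketch of the EMQ iteration is given below; it is a vectorized reading of the equations above, where `posteriors` is an n-by-l matrix of classifier posteriors for the test instances, and the iteration cap and tolerance are assumptions of ours.

```python
import numpy as np

def emq(posteriors, train_priors, max_iters=100, tol=1e-6):
    """EMQ sketch: iteratively re-estimate class priors from test-set posteriors."""
    posteriors = np.asarray(posteriors, dtype=float)      # shape (n, l)
    train_priors = np.asarray(train_priors, dtype=float)  # shape (l,)
    priors = train_priors.copy()                          # P^(0)_EMQ(c_i)
    for _ in range(max_iters):
        # E-step: rescale each posterior by the ratio of current to training priors
        adjusted = posteriors * (priors / train_priors)
        adjusted /= adjusted.sum(axis=1, keepdims=True)
        # M-step: the new priors are the average of the adjusted posteriors
        new_priors = adjusted.mean(axis=0)
        if np.abs(new_priors - priors).max() < tol:       # simple convergence check
            priors = new_priors
            break
        priors = new_priors
    return priors
```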
Other distribution matching methods consider the scores
obtained on an unlabeled set to follow a parametric mixture
between two known distributions (one for the positive and
another for the negative class). In general, these methods use
a search mechanism to find the parameters that best match a
mixture of positive and negative training set score distributions
with the unlabelled score distribution of the test set. The
computation of the parameters of this mixture leads to the
quantification estimate.
The first distribution matching method uses the Kolmogorov-Smirnov statistic and PP-Area to measure the difference between the positive ($S^{\oplus}$) and the negative ($S^{\ominus}$) score distributions [1]. A more recent proposal, the HDy algorithm, represents each score distribution as a histogram [8]. A weighted sum of these histograms gives the mixture between the positive and negative score distributions, where the weights sum up to 1. The weights that minimize the Hellinger Distance (HD) between the mixture and the unlabeled (test) score distribution ($S$) are considered to be the proportion of the corresponding classes in the unlabeled sample. The next equation details this computation:
$$\hat{P}_{HDy}(\oplus) = \arg\min_{0 \le \alpha \le 1} HD\big(\alpha H[S^{\oplus}] + (1 - \alpha) H[S^{\ominus}],\; H[S]\big)$$
where $HD$ represents the Hellinger distance and $H[\cdot]$ indicates an operation that converts a set of scores into a histogram. Fig. 3 illustrates this process.
Fig. 3: HDy searches for an $\alpha$ that minimizes the Hellinger Distance [20].
HDy uses histograms to represent the positive, negative and unlabelled score distributions. A histogram is a discrete representation that has a relevant parameter, the number of bins¹. The HDy authors recommend applying the method over a range of bins from 10 to 110 with an increment of 10. The final output is the median of the estimated positive class proportions across all bin values.
The original HDy method uses a linear search to find the alpha that minimizes the Hellinger distance. Some minor improvements to HDy are the use of ternary search to make HDy more efficient and the use of Laplace smoothing to compute the bin values [21].
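The sketch below illustrates the original HDy procedure as we understand it: histograms over a range of bin counts, a linear search over alpha (a step of 0.01 is our assumption), and the median of the per-bin estimates as output.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete (histogram) distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def hdy(pos_scores, neg_scores, test_scores, bins_range=range(10, 111, 10)):
    """HDy sketch: for each bin count, linearly search for the alpha whose mixture of
    training histograms best matches the test histogram; report the median estimate."""
    alphas = np.linspace(0.0, 1.0, 101)      # linear search with step 0.01 (assumed)
    estimates = []
    for b in bins_range:
        edges = np.linspace(0.0, 1.0, b + 1)
        h_pos = np.histogram(pos_scores, bins=edges)[0] / len(pos_scores)
        h_neg = np.histogram(neg_scores, bins=edges)[0] / len(neg_scores)
        h_test = np.histogram(test_scores, bins=edges)[0] / len(test_scores)
        dists = [hellinger(a * h_pos + (1 - a) * h_neg, h_test) for a in alphas]
        estimates.append(alphas[int(np.argmin(dists))])
    return float(np.median(estimates))
```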
The HDy method inspired a recently proposed framework named DyS [7] that supports the use of different distance measures besides HD:
$$\hat{P}_{DyS}(\oplus) = \arg\min_{0 \le \alpha \le 1} DS\big(\alpha R[S^{\oplus}] + (1 - \alpha) R[S^{\ominus}],\; R[S]\big)$$
where $DS$ is a dissimilarity measure to estimate the match between the distributions of training scores and test scores, and $R[\cdot]$ is an operation that converts a set of scores into a representation, such as a histogram. In [7], the authors also propose SORD as part of the DyS framework. SORD uses an efficient variation of the Earth Mover's Distance to operate directly over the score values, instead of using an intermediate representation.
Moreover, the authors of [7] also analyze the significance of the number of bins on quantification performance and provide recommendations for this parameter. Histograms with many bins incur sparseness and demand additional training and test observations to measure the distributions, whereas histograms with a small number of bins may not convey all the information necessary to characterize the data distribution. Based on the findings in [7], in this paper, we vary the number of bins from 2 to 20 with a step size of 2 and report the final positive class prevalence as the median of the estimated class ratios over all bins. In addition, as the DyS article compared distinct similarity functions and provides a ranking of these functions, we included the best-ranked distance, Topsøe, in our experiments.
Until now, the mixture model methods proposed in the literature depend on four main factors that influence their performance: the number of bins required to represent a distribution; the distance function to compare two distributions; the approach to represent a distribution; and the search procedure for the alpha that minimizes the distance between the distributions. This work investigates a method that is independent of these parameters and proposes a simplified and highly efficient version of mixture model methods.
¹ Bins divide the entire range of score values into a series of intervals, so we can count how many values fall into each interval.
IV. TIME COMPLEXITY
In this section, we discuss the time complexity of quantifi-
cation methods reviewed in Section III. Table I summarizes
the time complexity of each quantifier for the training and
test phases.
For the training phase, consider $Tr$ as the training set, $M(n)$ the cost of training either a scorer or a classifier, and $C(n)$ the time for scoring $n$ observations. Suppose that we also apply, depending on the quantifier, $k$-fold cross-validation to obtain the positive ($S^{\oplus}$) and negative ($S^{\ominus}$) scores, as well as the true-positive ($tpr$) and false-positive ($fpr$) rates, for either a given threshold ($th$) or a set of thresholds ($rth$). In this case, the time of the $k$-fold cross-validation step is defined by $O\big(k\,(M(|Tr| - |Tr|/k) + C(|Tr|/k))\big)$. All quantifiers except CC, PCC, QT, and EMQ have their time complexity impacted by the cross-validation step in the training phase. Moreover, quantifiers that count by averaging the scores need an extra step to convert the scores into calibrated probabilities. We use an isotonic regression that runs in $O(|Tr| \log |Tr|)$ [22].
For the testing phase, all methods are impacted by the cost of scoring each test instance, $O(C(|Te|))$. For CC, ACC, MAX, X, T50, and QT, $C(|Te|)$ represents the entire running cost. Those are followed by MS and MS2, which apply ACC for a set of thresholds ($rth$) and then report the median of the ACC results. MS2 is slightly more efficient than MS due to its reduced number of threshold possibilities.
Although PCC and PACC are very simple methods, they require a calibration of the test scores. The aforementioned isotonic regression method requires $O(\log |Te|)$ in the testing phase [22]. The use of different calibration methods can impact the cost of both the PCC and PACC quantifiers. We point the interested reader to the work of Naeini and Cooper [22] on binary calibration.
calibration. Next, in terms of test efficiency, appears the
mixture model methods. DySand HDy both are impacted by
the dissimilarity function cost, the number of bins (b), and the
range of αvalues being searched. In our experiments, we use
DySwith Topsøe dissimilarity that is recommended by the
DySauthors [7] whereas HDy uses Hellinger distance. Both
distances run in O(b). However, DySis less expensive than
HDy due to the Ternary Search mechanism used for searching
the best αvalue. Ternary search runs in O(log 1/E), where
Eis the accepted error for the ternary search (e.g. 105).
Summing up, beyond the cost of scoring, DyScosts is given
by the Topsøe distance cost multiplied by the number of bins,
and the ternary search cost, i.e., O(blog 1/E). Differently,
apart from scoring cost, HDy makes a sequential search to
find the best αgiven a range of values, and thus it runs in
O(b|α|).
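For reference, a minimal sketch of the ternary search used to locate alpha is shown below; `distance_of_alpha` is a hypothetical callable that returns, e.g., the Topsøe distance between the alpha-mixture of training histograms and the test histogram, and is assumed to be unimodal in alpha.

```python
def ternary_search_alpha(distance_of_alpha, tol=1e-5):
    """Ternary search for the alpha in [0, 1] minimising a unimodal distance function.
    Runs in O(log 1/E) evaluations, where E is the accepted error (tol)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if distance_of_alpha(m1) < distance_of_alpha(m2):
            hi = m2        # the minimum cannot lie in (m2, hi]
        else:
            lo = m1        # the minimum cannot lie in [lo, m1)
    return (lo + hi) / 2.0
```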
Finally, SORD and EMQ are the most expensive methods. The former is based on the Earth Mover's Distance (EMD), which works directly on the score values, requiring the cost of scoring the test set plus the cost of the EMD computation, totalling $O(C(|Te|) + (|Tr| + |Te|)\log(|Tr| + |Te|))$. The latter is based on the Expectation-Maximization algorithm, which is an iterative process. Its cost is determined by the number of test instances ($|Te|$), the number of iterations ($t$), and the number of classes ($l$). Thus, the overall running cost is $O(C(|Te|) + |Te|\,t\,l)$.
V. PROPOSED METHOD
In this section, we introduce our simple proposal to quan-
tify the class distribution. The distribution matching methods
(Section III-C) usually employ histograms to represent the
distribution of scores for the training and test sets [7], [8].
The time complexity of these methods depends directly on the number of bins. Our hypothesis is that these methods can achieve competitive or even better accuracy by replacing the entire histogram with the mean value of the scores for the training and test sets. Moreover, employing the mean of the scores directly, instead of going through the process of constructing histogram distributions for several bin counts, makes the method more straightforward and efficient.
The main inspiration for our method comes from the fact that if we split a set of values into groups, the mean of the entire set is equal to the weighted sum of the group means, where the weights are the fractions of the number of elements in each group [23]. In our case, we have the scores divided into two groups, the positive and the negative scores. The mean score calculated over all unlabelled scores in the test set is equal to a weighted sum of the mean of the positive scores and the mean of the negative scores. As we do not know the actual means of these scores in the test set, we use, as a surrogate, the means of the scores in the training set. Thus, SMM relies on the following equation:
$$\hat{P}_{SMM}(\oplus) = \arg\min_{0 \le \alpha \le 1} \big|\alpha\,\mu[S^{\oplus}] + (1 - \alpha)\,\mu[S^{\ominus}] - \mu[S]\big|$$
where $\mu[\cdot]$ represents the operation that computes the average of a set of scores.
Therefore, SMM is a member of the DyS framework that uses simple means to represent the score distributions of the positive, negative and unlabelled scores.
TABLE I: Training and testing time complexities for the assessed quantifiers. $T_{cv}$ abbreviates the cross-validated training cost $O\big(M(|Tr|) + k\,(M(|Tr| - |Tr|/k) + C(|Tr|/k))\big)$.
CC: training $O(M(|Tr|))$; test $O(C(|Te|))$
ACC: training $T_{cv}$; test $O(C(|Te|))$
PCC: training $O(M(|Tr|) + |Tr|\log|Tr|)$; test $O(C(|Te|) + \log|Te|)$
PACC: training $T_{cv} + O(|Tr|\log|Tr|)$; test $O(C(|Te|) + \log|Te|)$
X: training $T_{cv}$; test $O(C(|Te|))$
T50: training $T_{cv}$; test $O(C(|Te|))$
MAX: training $T_{cv}$; test $O(C(|Te|))$
MS: training $T_{cv}$; test $O(C(|Te|) + |rth|)$
MS2: training $T_{cv}$; test $O(C(|Te|) + |rth|)$
QT: training $O(M(|Tr|))$; test $O(C(|Te|))$
EMQ: training $O(M(|Tr|))$; test $O(C(|Te|) + |Te|\,t\,l)$
DyS-TS: training $T_{cv}$; test $O(C(|Te|) + b\log 1/E)$
SORD: training $T_{cv}$; test $O(C(|Te|) + (|Tr| + |Te|)\log(|Tr| + |Te|))$
HDy: training $T_{cv}$; test $O(C(|Te|) + b\,|\alpha|)$
Since SMM works with point-wise estimates, we do not need to apply a search mechanism to determine $\alpha$. Instead, we can obtain $\alpha$ using a closed-form equation:
$$\alpha = \frac{\mu[S] - \mu[S^{\ominus}]}{\mu[S^{\oplus}] - \mu[S^{\ominus}]} \quad (2)$$
where $\alpha$ is the final quantification for the positive class. Such a value should be clipped into the $[0, 1]$ range. This equation has a very simple derivation, presented in Appendix A.
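As a purely illustrative example with made-up numbers: if the mean positive and negative training scores are $\mu[S^{\oplus}] = 0.8$ and $\mu[S^{\ominus}] = 0.2$, and the unlabelled test scores average $\mu[S] = 0.65$, then Equation 2 gives $\alpha = (0.65 - 0.2)/(0.8 - 0.2) = 0.75$, i.e., an estimated positive prevalence of 75%.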
Algorithm 1 formalizes the training phase of the SMM approach. Since classifying examples seen during training generates biased (overly optimistic) scores, we recommend the use of 10-fold cross-validation (line 3). In each fold, we train a classifier and generate scores for the examples in the validation set (lines 5-9). After 10 runs, we have scores for the entire training set. In line 10, the algorithm computes the average scores for the positive and negative classes. Finally, SMM induces a classifier $h$ over the entire training set $Tr$.

Algorithm 1 Sample Mean Matching (SMM) training phase.
1: Input: Training set, $Tr$
2: Output: The means of the positive and negative training scores, $\mu[S^{\oplus}]$, $\mu[S^{\ominus}]$; classification model, $h$
3: $[T_r^{1 \dots k}, T_v^{1 \dots k}] \leftarrow$ cross-validation($k$, $Tr$)  {$k$ is the number of folds}
4: $S_{Tr} \leftarrow \emptyset$
5: for $i = 1 \dots k$ do
6:   $h_i \leftarrow$ train($T_r^i$)
7:   $S_i \leftarrow$ score($T_v^i$, $h_i$)
8:   $S_{Tr} \leftarrow S_{Tr} \cup S_i$
9: end for
10: $[\mu[S^{\oplus}], \mu[S^{\ominus}]] \leftarrow$ mean-by-class($S_{Tr}$)
11: $h \leftarrow$ train($Tr$)  {creates a model over all training data}
12: return $[\mu[S^{\oplus}], \mu[S^{\ominus}], h]$
Algorithm 2 presents the test phase of SMM. When a test set becomes available, SMM computes the score of each test set example using $h$ (line 3). Afterwards, it computes the mean of these scores (line 4) and the $\alpha$ parameter (line 5) according to Equation 2. Lines 6 and 7 ensure that $\alpha$ is contained in the $[0, 1]$ range.

Algorithm 2 Sample Mean Matching (SMM) test phase.
1: Input: The means of the positive and negative training scores, $\mu[S^{\oplus}]$, $\mu[S^{\ominus}]$; classification model, $h$; test set, $Te$
2: Output: Positive class prevalence $\alpha$
3: $S \leftarrow$ score($Te$, $h$)
4: $\mu[S] \leftarrow$ mean($S$)
5: $\alpha \leftarrow \dfrac{\mu[S] - \mu[S^{\ominus}]}{\mu[S^{\oplus}] - \mu[S^{\ominus}]}$
6: $\alpha \leftarrow \max(\alpha, 0)$
7: $\alpha \leftarrow \min(\alpha, 1)$
8: return $\alpha$
SMM is an essentially parameter-free method. Unlike other distribution matching methods, the performance of SMM is independent of the four parameters discussed earlier, i.e., the method to represent the distribution of scores, the number of bins required to represent a distribution, the dissimilarity function to compare the distributions, and the approach required to search for the $\alpha$ that minimizes the distance. Instead, SMM employs the mean value as a summary of the entire distribution. Such a feature makes it very simple to train and to reproduce its results. SMM also has very low memory and processing requirements. The time complexity for training is $O\big(M(|Tr|) + k\,(M(|Tr| - |Tr|/k) + C(|Tr|/k))\big)$, therefore similar to other quantification methods that rely on cross-validation to compute unbiased scores.
The SMM test phase takes as input just two scalar values ($\mu[S^{\oplus}]$, $\mu[S^{\ominus}]$) and a classification model ($h$). The time complexity of the test phase is $O(C(|Te|))$.
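To make the two algorithms concrete, the sketch below is a compact Python rendering of them; it is not the authors' R implementation, and it assumes scikit-learn, binary labels encoded as 0/1 NumPy arrays (1 being the positive class), and the 200-tree random forest scorer described in Section VI.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def smm_train(X_train, y_train, n_trees=200, folds=10, seed=0):
    """Training phase (Algorithm 1): unbiased scores via cross-validation, then the
    per-class score means and a final model fitted on all training data."""
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    scores = cross_val_predict(clf, X_train, y_train, cv=folds,
                               method="predict_proba")[:, 1]  # positive-class column
    mu_pos = scores[y_train == 1].mean()
    mu_neg = scores[y_train == 0].mean()
    clf.fit(X_train, y_train)            # model over the entire training set
    return mu_pos, mu_neg, clf

def smm_test(mu_pos, mu_neg, clf, X_test):
    """Test phase (Algorithm 2): closed-form alpha from the mean test score (Eq. 2)."""
    mu_test = clf.predict_proba(X_test)[:, 1].mean()
    alpha = (mu_test - mu_neg) / (mu_pos - mu_neg)
    return float(np.clip(alpha, 0.0, 1.0))
```

As in Algorithm 2, the test phase only needs the two stored means and one pass over the test scores, which is what keeps SMM's memory and processing requirements so low.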
VI. EXPERIMENTAL SETUP
We conduct a comprehensive experimental evaluation involving 15 state-of-the-art and established quantifiers. Table II summarizes the methods assessed in this paper.
TABLE II: Assessed Quantifiers.

| Quantifier | Brief description | Reference |
|---|---|---|
| CC | Classify & Count | [1] |
| ACC | Adjusted Classify & Count | [1] |
| PCC | Probabilistic Classify & Count | [16] |
| PACC | Probabilistic Adjusted Classify & Count | [16] |
| X | Classifier with decision threshold set to fnr = fpr | [18] |
| T50 | Classifier with decision threshold set to tpr = 50% | [18] |
| MAX | Classifier with decision threshold such that tpr - fpr is maximum | [18] |
| MS | Median Sweep | [18] |
| MS2 | Median Sweep with a subset of thresholds such that tpr - fpr > 0.25 | [18] |
| QT | Quantification Tree | [2] |
| EMQ | Expectation-Maximization Quantifier | [19] |
| DyS-TS | Mixture model of the DyS family with Topsøe distance | [7] |
| SORD | Sample-ORD method | [7] |
| HDy | Mixture model with Hellinger distance | [8] |
| SMM | Sample Mean Matching (proposal) | |
Due to the large number of methods included in the ex-
periments, we decided to divide the results into two parts.
The first part compares our proposal, SMM, with some of the
most accurate quantification methods. Therefore, we compare
SMM with DyS-TS, SORD and HDy. According to our
experiments, DyS-TS and SORD are the two most accurate
quantifiers considering all datasets. Although HDy is not as
competitive as those methods, we decided to include it since
SMM is a simplification of HDy. While HDy represents the
distribution using a histogram, SMM employs a single scalar,
the distribution mean.
In the second evaluation part, we extend our assessment
to include all 15 methods listed in Table II. This second
part gives an overall perspective of the accuracy and runtime
performance of SMM in the light of a larger portion of the
literature.
We use different experimental setups to assess runtime and accuracy performance. For the runtime analysis, we simulate the classification scores to estimate the cost of each method, varying the test set sizes from $10^4$ to $10^9$ with increments of $10^4$. For each size variation, we run each quantifier 1,000 times under the same software and hardware conditions². We used artificial values as scores due to the simplicity of running a large-scale simulation and the fact that we only measure the running time of the quantifiers without considering their accuracy. Inspired by the models used in [24] to simulate the variability in the feature distribution, we simplified that idea, sampling synthetic scores from a continuous uniform distribution over $[0, 1]$.
² All the experiments were performed on the same computer, a 32-core Intel Xeon CPU E5-2620 v4 @ 2.10GHz with 98 GB of RAM, with no processes other than operating-system-related ones running in parallel.
For the counting accuracy experiments, we use 25 real datasets. Table III briefly describes the main features of the datasets, obtained from the UCI [25], OpenML [26], PROMISE [27], and Reis [21] repositories³. We used binary quantification datasets. We converted the datasets that contain multiple classes into binary problems by fixing one class as positive and all other classes as negative.
TABLE III: Datasets Description.

| ID | Dataset | Size | Features | (%) Pos. instances | Repository |
|---|---|---|---|---|---|
| A | AedesSex | 24,000 | 27 | 50 | Reis |
| B | AedesQuinx | 24,000 | 27 | 50 | Reis |
| C | Anuran Calls | 6,585 | 22 | 33 | UCI |
| D | ArabicDigit | 8,800 | 27 | 50 | UCI |
| E | BNG (vote) | 39,366 | 9 | 34 | OpenML |
| F | Bank Marketing | 45,211 | 16 | 12 | UCI |
| G | Click Prediction | 39,948 | 11 | 17 | OpenML |
| H | CMC | 1,473 | 9 | 43 | UCI |
| I | Covertype-reduced | 8,715 | 54 | 46 | UCI |
| J | Credit Card | 30,000 | 23 | 22 | UCI |
| K | EEG Eye State | 14,980 | 14 | 45 | OpenML |
| L | Enc. Stock Market | 96,320 | 22 | 49 | OpenML |
| M | Handwritten-QG | 4,014 | 63 | 50 | Reis |
| N | HTRU2 | 17,898 | 8 | 9 | UCI |
| O | JM1 | 10,880 | 21 | 19 | PROMISE |
| P | Letter Recognition | 20,000 | 16 | 19 | UCI |
| Q | Mozilla4 | 15,545 | 5 | 33 | OpenML |
| R | MAGIC Gamma | 19,020 | 10 | 35 | UCI |
| S | Nomao | 34,465 | 118 | 29 | OpenML |
| T | Occupancy Detec. | 20,560 | 5 | 23 | UCI |
| U | Phoneme | 5,404 | 5 | 29 | OpenML |
| V | Pollen | 3,848 | 6 | 50 | OpenML |
| W | Spambase | 4,601 | 57 | 39 | UCI |
| X | Wine Type | 6,497 | 12 | 37 | UCI |
| Y | Wine Quality | 6,497 | 12 | 37 | UCI |
For training and evaluation of the models, each dataset is
divided into two subsets using stratified sampling, resulting
in training and test sets. From the training part, we estimate
scores with 10-fold cross-validation. These scores are used
by the distribution matching methods [7], [8], including the
proposal SMM.
Similarly, true-positive rate (tpr) and false-positive rate (fpr)
are also estimated using cross-validation for the ACC and
threshold selection methods [5], [18]. According to [16], PCC
and PACC require calibrated scores, which we obtained using
an isotonic regression calibration based on pair-adjacent vio-
lators [22]. For QT, we use the Classification Error Balancing
criterion to define the splits. The source code for QT algorithm
was provided by its authors [2]. The remaining quantifiers
used in our experiments, as well as the auxiliary functions,
are available from our R package4.
In all experiments, we use a random forest classifier with
200 trees to obtain scores and tpr and fpr rates.
We trained the random forest classifiers with balanced
training sets, i.e., an equal number of positive and negative
instances. Conversely, we vary the distribution and size of the
3Specific citations are requested for Credit Card [28], HTRU2 [29],
Mozilla4 [30], Bank Marketing [31], Nomao [32], and Occupancy Detection
[33]. Besides, we note that Jock A. Blackard and Colorado State University
preserve copyright over Covertype.
4https://github.com/andregustavom/mlquantify
7
test sets. Unlike classification, quantification methods require,
as input, a sample of observations. For a comprehensive
assessment, we extract several samples from the test set with
different class distributions. We vary class distribution for each
sample size from 0% to 100% with an increment of 1%.
The number of observations in the test set sample is a rele-
vant aspect of the assessment of quantification methods [34].
Therefore, we generate test samples with different sizes, from
10 instances to 100 instances with an increment of 10 and
from 100 to 500 instances with an increment of 100.
We use the Mean Absolute Error (MAE) [35] to evaluate the accuracy and the mean runtime to assess the efficiency of the quantifiers. MAE is the average absolute difference between the actual ($p$) and predicted ($\hat{p}$) class proportions. The online supplementary material contains all code and datasets⁵.
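The sketch below outlines the evaluation loop described above; the sampling with replacement, the 0/1 label encoding and the helper names are our assumptions for illustration, not details fixed by the protocol.

```python
import numpy as np

def sample_at_prevalence(X, y, size, positive_ratio, rng):
    """Draw a test sample of the given size with a prescribed positive-class ratio."""
    n_pos = int(round(size * positive_ratio))
    pos_idx = rng.choice(np.where(y == 1)[0], n_pos, replace=True)
    neg_idx = rng.choice(np.where(y == 0)[0], size - n_pos, replace=True)
    idx = np.concatenate([pos_idx, neg_idx])
    return X[idx], n_pos / size          # sample and its true positive prevalence

def mean_absolute_error(true_prevalences, predicted_prevalences):
    """MAE: average absolute difference between actual and predicted prevalences."""
    return float(np.mean(np.abs(np.asarray(true_prevalences) -
                                np.asarray(predicted_prevalences))))
```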
VII. EXPERIMENTAL RESULTS
We open this section by discussing the experimental comparison between SMM and three state-of-the-art quantifiers: DyS-TS, SORD and HDy. Fig. 4 summarizes the results by presenting a critical difference diagram [36] regarding quantification performance (measured with MAE) across all 25 datasets.
Fig. 4: Nemenyi's post hoc test for mean absolute quantification error. Groups of methods which are not significantly different at $p < 0.05$ are connected.
According to Fig. 4, SMM is less accurate than DyS-TS and SORD, but with no statistical difference. Conversely, SMM is statistically more accurate than HDy at the $p < 0.05$ significance level. Interestingly, all four methods are mixture models. The differences among them lie mostly in the manner in which the score distributions are represented and compared. SMM is the simplest among them, representing the score distribution by a single scalar value. Even though SMM uses less information than its counterparts, it is still able to achieve competitive results.
Such simplicity makes SMM a very efficient quantifier. We compared SMM with the same three state-of-the-art quantifiers. Fig. 5 shows each method's runtime as a function of the test set size. SMM is, on average, three orders of magnitude faster than DyS-TS, SORD, and HDy. The efficiency advantage of SMM over the other methods remains constant across the entire spectrum of test set sizes.
We conclude this first part of our analysis by emphasizing that although SMM is slightly less accurate than DyS-TS and SORD (with no statistical difference), it is orders of magnitude faster than these methods.
⁵ https://github.com/andregustavom/dsaa_smm
Fig. 5: Runtime of SMM and state-of-the-art quantifiers across different test set sizes.
Let us turn our attention to all 15 quantification methods.
Fig. 1 provides an overview of the results considering both
accuracy and efficiency. Since Fig. 1 presents average runtime
of algorithms over test sets of different sizes, we report the
number of instances processed per second. In contrast, Fig. 5
reports the number of samples processed per second.
Fig. 1 provides the overall ranking of quantifiers considering
all datasets. For a more detailed analysis, Table IV reveals
the MAE numerical results for each combination of quantifier
and dataset, averaged across all class distributions and test
set sizes. The values in bold indicate the best quantification
result for each dataset. According to Table IV, DyS-TS is
the most accurate quantifier with minimal MAE for 15 out
of 25 datasets. Similarly, SORD is the second-best quantifier
and holds the best MAE for six datasets (three of them tied
with DyS-TS). SMM ranks in third place and it does not
produce minimum MAE in any of the datasets. However, for
the majority of the datasets, errors introduced by SMM are
very close to the best results. Across all datasets, the average
MAE difference of SMM compared with the DyS-TS and
SORD is 2 percentual points.
To analyse the differences between SMM and the best
result, we underlined in Table IV values that represent a non-
statistically significant difference regarding the best quantifier
in each dataset. DyS-TS and SORD have an impressive perfor-
mance with no statistically significant differences against the
best quantifier for 23 and 24 of the datasets, respectively. SMM
also performs well, and it is outperformed with statistical
significance in only five of the datasets.
In the 20 out of 25 datasets where the SMM results show no statistically significant difference, SMM is, on average, three orders of magnitude faster than the best quantifier. The only exception is when MAX is the best quantifier, which happens in only two datasets (X and Y).
Fig. 6 shows the ranking of all quantifiers for all datasets
varying the test set size and class distribution. A relevant
aspect of the assessment of quantifiers is the size of the test
set [34]. Quantification methods receive as input a sample
of observations. Thus, these methods can estimate several
statistics over these test samples. Intuitively, some of these
statistics may require more data to be estimated reliably than
others. Therefore, we can expect that not every quantifier
will perform equally for different test set sizes. In particular, small test sets frequently pose a very challenging scenario for quantification methods.
TABLE IV: Mean absolute error for each dataset and quantifier, averaged over all test set sizes and class distributions. The values in bold indicate the best quantification result for each dataset. Underlined values indicate a non-statistically significant difference concerning the best quantifier in each dataset.

| Dataset | DyS-TS | SORD | SMM | MAX | EMQ | ACC | X | MS2 | MS | PACC | CC | PCC | HDy | T50 | QT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 0.010 | 0.010 | 0.012 | 0.012 | 0.010 | 0.012 | 0.014 | 0.025 | 0.025 | 0.012 | 0.013 | 0.015 | 0.020 | 0.128 | 0.137 |
| B | 0.045 | 0.045 | 0.046 | 0.052 | 0.074 | 0.053 | 0.054 | 0.052 | 0.053 | 0.071 | 0.093 | 0.120 | 0.056 | 0.071 | 0.185 |
| C | 0.012 | 0.011 | 0.013 | 0.012 | 0.011 | 0.012 | 0.015 | 0.025 | 0.025 | 0.012 | 0.012 | 0.018 | 0.040 | 0.068 | 0.129 |
| D | 0.008 | 0.008 | 0.014 | 0.008 | 0.013 | 0.008 | 0.008 | 0.024 | 0.024 | 0.007 | 0.006 | 0.009 | 0.021 | 0.081 | 0.174 |
| E | 0.010 | 0.010 | 0.011 | 0.011 | 0.009 | 0.011 | 0.011 | 0.025 | 0.025 | 0.012 | 0.012 | 0.016 | 0.023 | 0.025 | 0.070 |
| F | 0.044 | 0.050 | 0.050 | 0.045 | 0.110 | 0.088 | 0.051 | 0.058 | 0.065 | 0.061 | 0.286 | 0.250 | 0.282 | 0.067 | 0.314 |
| G | 0.125 | 0.133 | 0.131 | 0.154 | 0.269 | 0.213 | 0.230 | 0.151 | 0.156 | 0.508 | 0.468 | 0.329 | 0.296 | 0.154 | 0.451 |
| H | 0.088 | 0.096 | 0.097 | 0.109 | 0.155 | 0.112 | 0.207 | 0.104 | 0.096 | 0.240 | 0.217 | 0.204 | 0.261 | 0.117 | 0.240 |
| I | 0.052 | 0.056 | 0.055 | 0.064 | 0.096 | 0.063 | 0.064 | 0.061 | 0.061 | 0.091 | 0.107 | 0.142 | 0.065 | 0.084 | 0.225 |
| J | 0.087 | 0.092 | 0.089 | 0.099 | 0.190 | 0.106 | 0.165 | 0.093 | 0.097 | 0.235 | 0.301 | 0.259 | 0.280 | 0.098 | 0.247 |
| K | 0.027 | 0.028 | 0.032 | 0.032 | 0.052 | 0.034 | 0.033 | 0.038 | 0.039 | 0.040 | 0.050 | 0.068 | 0.060 | 0.064 | 0.230 |
| L | 0.415 | 0.381 | 0.415 | 0.454 | 0.251 | 0.459 | 0.451 | 0.343 | 0.342 | 0.508 | 0.252 | 0.252 | 0.415 | 0.465 | 0.251 |
| M | 0.006 | 0.004 | 0.007 | 0.005 | 0.008 | 0.004 | 0.005 | 0.023 | 0.023 | 0.004 | 0.003 | 0.004 | 0.007 | 0.088 | 0.130 |
| N | 0.024 | 0.027 | 0.027 | 0.027 | 0.026 | 0.029 | 0.028 | 0.037 | 0.037 | 0.028 | 0.076 | 0.078 | 0.096 | 0.066 | 0.124 |
| O | 0.111 | 0.104 | 0.107 | 0.126 | 0.251 | 0.153 | 0.146 | 0.112 | 0.121 | 0.466 | 0.430 | 0.299 | 0.144 | 0.119 | 0.312 |
| P | 0.016 | 0.019 | 0.022 | 0.017 | 0.019 | 0.027 | 0.019 | 0.032 | 0.032 | 0.047 | 0.026 | 0.037 | 0.065 | 0.062 | 0.353 |
| Q | 0.040 | 0.042 | 0.045 | 0.048 | 0.060 | 0.047 | 0.051 | 0.049 | 0.048 | 0.069 | 0.095 | 0.109 | 0.060 | 0.076 | 0.203 |
| R | 0.023 | 0.026 | 0.025 | 0.027 | 0.026 | 0.027 | 0.034 | 0.036 | 0.035 | 0.041 | 0.062 | 0.055 | 0.030 | 0.069 | 0.120 |
| S | 0.016 | 0.020 | 0.019 | 0.019 | 0.016 | 0.022 | 0.021 | 0.029 | 0.029 | 0.020 | 0.034 | 0.039 | 0.044 | 0.064 | 0.163 |
| T | 0.007 | 0.007 | 0.008 | 0.008 | 0.006 | 0.008 | 0.008 | 0.024 | 0.024 | 0.012 | 0.008 | 0.014 | 0.039 | 0.060 | 0.050 |
| U | 0.034 | 0.035 | 0.037 | 0.040 | 0.057 | 0.046 | 0.066 | 0.043 | 0.044 | 0.053 | 0.095 | 0.117 | 0.174 | 0.068 | 0.230 |
| V | 0.394 | 0.444 | 0.440 | 0.477 | 0.253 | 0.467 | 0.469 | 0.389 | 0.389 | 0.355 | 0.259 | 0.253 | 0.374 | 0.438 | 0.253 |
| W | 0.022 | 0.022 | 0.023 | 0.026 | 0.025 | 0.025 | 0.027 | 0.032 | 0.032 | 0.029 | 0.038 | 0.047 | 0.056 | 0.060 | 0.161 |
| X | 0.009 | 0.008 | 0.011 | 0.007 | 0.009 | 0.009 | 0.011 | 0.024 | 0.024 | 0.008 | 0.008 | 0.011 | 0.051 | 0.096 | 0.125 |
| Y | 0.010 | 0.008 | 0.010 | 0.008 | 0.009 | 0.009 | 0.012 | 0.024 | 0.024 | 0.009 | 0.009 | 0.011 | 0.060 | 0.076 | 0.127 |
| Std. dev. | 0.107 | 0.110 | 0.113 | 0.124 | 0.092 | 0.126 | 0.128 | 0.094 | 0.094 | 0.166 | 0.138 | 0.107 | 0.122 | 0.106 | 0.092 |
| Avg. rank | 3.00 | 3.30 | 5.10 | 5.46 | 6.38 | 6.94 | 7.96 | 8.36 | 8.60 | 8.60 | 9.14 | 10.52 | 11.40 | 11.68 | 13.56 |
Fig. 6: Aggregated rank positions for all 15 quantification algorithms and the 25 datasets.
Fig. 7 shows the average MAE for all 25 datasets split by
test set size. SMM performs similarly to DyS-TS and SORD,
the two best-ranked quantifiers. SMM performance is slightly
worse than these methods considering tiny test sets of size ten.
As the test set sizes increase, the MAE difference between
SMM and DyS-TS and SORD increases. However, even for
the largest test set sizes, the difference is still quite small.
The performance of the quantifiers varies significantly ac-
cording to their quantification strategies. Methods such as CC,
PCC and QT that do not obtain any information from the test
sample have an almost constant performance across all test
set sizes. EMQ performs surprisingly well for small test sets
sizes, but rather poorly for large ones. MS and MS2 slightly
outperform DyS-TS, SORD and SMM for small test sets but
are surpassed for large ones.
Fig. 7: Mean absolute quantification error segregated by test set size.
VIII. CONCLUSION
The trade-off between efficiency and efficacy is a recurrent topic in several areas of Computer Science, including Machine Learning. Frequently, to improve effectiveness, such as counting accuracy, researchers need to propose more sophisticated algorithms that, in turn, affect efficiency.
Our proposal, SMM, is a highly efficient quantifier that provides quantification accuracy only slightly lower than the state-of-the-art. This proposal is relevant for several application domains that need to process large quantities of data with limited time or processing power.
Mixture models for quantification are the main inspiration for our proposal. However, instead of using a multi-dimensional representation of the distributions, SMM uses a simple mean to summarize such information. Such a simplification leads to an algorithm with low requirements in terms of memory and processing. A comparison with other state-of-the-art mixture models reveals that SMM is three orders of magnitude faster. Although SMM's accuracy is lower than that of SORD and DyS-TS, the difference is not statistically significant.
As part of our intentions for future work, we will continue
to develop mixture models, improving their performance for
different sizes of test samples. Our experimental results show
that small test samples are particularly challenging for this
class of methods.
REFERENCES
[1] G. Forman, "Counting positives accurately despite inaccurate classification," in European Conference on Machine Learning. Springer, 2005, pp. 564–575.
[2] L. Milli, A. Monreale, G. Rossetti, F. Giannotti, D. Pedreschi, and F. Sebastiani, "Quantification trees," in 2013 IEEE 13th International Conference on Data Mining. IEEE, 2013, pp. 528–536.
[3] W. Gao and F. Sebastiani, "From classification to quantification in tweet sentiment analysis," Social Network Analysis and Mining, vol. 6, no. 1, 2016.
[4] D. Silva, V. Souza, D. Ellis, E. Keogh, and G. Batista, "Exploring low cost laser sensors to identify flying insect species," Journal of Intelligent & Robotic Systems, vol. 80, no. 1, pp. 313–330, 2015.
[5] G. Forman, "Quantifying counts and costs via classification," Data Mining and Knowledge Discovery, vol. 17, no. 2, pp. 164–206, 2008. [Online]. Available: http://dx.doi.org/10.1007/s10618-008-0097-y
[6] P. González, J. Díez, N. Chawla, and J. J. del Coz, "Why is quantification an interesting learning problem?" Progress in Artificial Intelligence, vol. 6, no. 1, pp. 53–58, 2017.
[7] A. Maletzke, D. dos Reis, E. Cherman, and G. E. A. P. A. Batista, "DyS: a framework for mixture models in quantification," in AAAI Conference on Artificial Intelligence, ser. AAAI '19, 2019.
[8] V. González-Castro, R. Alaiz-Rodríguez, and E. Alegre, "Class distribution estimation based on the Hellinger distance," Information Sciences, vol. 218, pp. 146–164, 2013.
[9] P. Flach, Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press, 2012.
[10] P. González, A. Castaño, N. V. Chawla, and J. J. Del Coz, "A review on quantification learning," ACM Computing Surveys (CSUR), vol. 50, no. 5, p. 74, 2017.
[11] D. d. Reis, M. de Souto, E. de Sousa, and G. Batista, "Quantifying with only positive training data," arXiv preprint arXiv:2004.10356, 2020.
[12] J. Barranquero, P. González, J. Díez, and J. J. del Coz, "On the study of nearest neighbor algorithms for prevalence estimation in binary problems," Pattern Recognition, vol. 46, no. 2, pp. 472–482, Feb. 2013.
[13] R. Alaiz-Rodríguez, A. Guerrero-Curieses, and J. Cid-Sueiro, "Class and subclass probability re-estimation to adapt a classifier in the presence of concept drift," Neurocomputing, vol. 74, no. 16, pp. 2614–2623, 2011.
[14] Y. S. Chan and H. T. Ng, "Estimating class priors in domain adaptation for word sense disambiguation," in Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ser. ACL-44. Stroudsburg, PA, USA: Association for Computational Linguistics, 2006, pp. 89–96.
[15] D. dos Reis, A. Maletzke, E. Cherman, and G. Batista, "One-class quantification," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2018, pp. 273–289.
[16] A. Bella, C. Ferri, J. Hernández-Orallo, and M. J. Ramirez-Quintana, "Quantification via probability estimators," in IEEE International Conference on Data Mining. IEEE, 2010, pp. 737–742.
[17] D. Tasche, "Exact fit of simple finite mixture models," Journal of Risk and Financial Management, vol. 7, no. 4, pp. 150–164, 2014.
[18] G. Forman, "Quantifying trends accurately despite classifier error and class imbalance," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '06. ACM, 2006, pp. 157–166.
[19] M. Saerens, P. Latinne, and C. Decaestecker, "Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure," Neural Computation, vol. 14, no. 1, pp. 21–41, 2002.
[20] A. Maletzke, D. dos Reis, E. Cherman, and G. Batista, "On the need of class ratio insensitive drift tests for data streams," in Proceedings of the Second International Workshop on Learning with Imbalanced Domains: Theory and Applications, ECML-PKDD, vol. 94. Dublin, Ireland: PMLR, 10 Sep 2018, pp. 110–124.
[21] D. dos Reis, A. Maletzke, D. F. Silva, and G. E. A. P. A. Batista, "Classifying and counting with recurrent contexts," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '18. ACM, 2018, pp. 1983–1992.
[22] M. P. Naeini and G. F. Cooper, "Binary classifier calibration using an ensemble of near isotonic regression models," in 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016, pp. 360–369.
[23] B. Everitt and A. Skrondal, The Cambridge Dictionary of Statistics. Cambridge University Press, 2002, vol. 106.
[24] D. Tasche, "Confidence intervals for class prevalences under prior probability shift," arXiv preprint arXiv:1906.04119, 2019.
[25] D. Dheeru and E. Karra Taniskidou, "UCI machine learning repository," 2017. [Online]. Available: http://archive.ics.uci.edu/ml
[26] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo, "OpenML: networked science in machine learning," ACM SIGKDD Explorations Newsletter, vol. 15, no. 2, pp. 49–60, 2013.
[27] J. Sayyad Shirabad and T. Menzies, "The PROMISE repository of software engineering databases," School of Information Technology and Engineering, University of Ottawa, Canada, 2005. [Online]. Available: http://promise.site.uottawa.ca/SERepository
[28] I.-C. Yeh and C.-h. Lien, "The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients," Expert Systems with Applications, vol. 36, no. 2, pp. 2473–2480, 2009.
[29] R. J. Lyon, B. Stappers, S. Cooper, J. Brooke, and J. Knowles, "Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach," Monthly Notices of the Royal Astronomical Society, vol. 459, no. 1, pp. 1104–1123, 2016.
[30] A. G. Koru, D. Zhang, and H. Liu, "Modeling the effect of size on defect proneness for open-source software," in Predictor Models in Software Engineering (PROMISE'07: ICSE Workshops 2007). IEEE, 2007, pp. 10–10.
[31] S. Moro, P. Cortez, and P. Rita, "A data-driven approach to predict the success of bank telemarketing," Decision Support Systems, vol. 62, pp. 22–31, 2014.
[32] L. Candillier and V. Lemaire, "Design and analysis of the Nomao challenge: active learning in the real-world," in Active Learning in Real-world Applications, Workshop ECML-PKDD, 2012.
[33] L. M. Candanedo and V. Feldheim, "Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models," Energy and Buildings, vol. 112, pp. 28–39, 2016.
[34] A. Maletzke, W. Hassan, D. d. Reis, and G. Batista, "The importance of the test set size in quantification assessment," in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, 2020, pp. 2640–2646. [Online]. Available: https://doi.org/10.24963/ijcai.2020/366
[35] G. Da San Martino, W. Gao, and F. Sebastiani, "Ordinal text quantification," in Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2016, pp. 937–940.
[36] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," J. Mach. Learn. Res., vol. 7, pp. 1–30, Dec. 2006.
APPENDIX A
The SMM $\alpha$ parameter can be computed in closed form. Equation 2 has a straightforward derivation:
$$\alpha\,\mu[S^{\oplus}] + (1 - \alpha)\,\mu[S^{\ominus}] = \mu[S]$$
$$\alpha\,\mu[S^{\oplus}] + \mu[S^{\ominus}] - \alpha\,\mu[S^{\ominus}] = \mu[S]$$
$$\alpha\,(\mu[S^{\oplus}] - \mu[S^{\ominus}]) = \mu[S] - \mu[S^{\ominus}]$$
$$\alpha = \frac{\mu[S] - \mu[S^{\ominus}]}{\mu[S^{\oplus}] - \mu[S^{\ominus}]} \quad (3)$$
Article
Quantification, variously called supervised prevalence estimation or learning to quantify , is the supervised learning task of generating predictors of the relative frequencies (a.k.a. prevalence values ) of the classes of interest in unlabelled data samples. While many quantification methods have been proposed in the past for binary problems and, to a lesser extent, single-label multiclass problems, the multi-label setting (i.e., the scenario in which the classes of interest are not mutually exclusive) remains by and large unexplored. A straightforward solution to the multi-label quantification problem could simply consist of recasting the problem as a set of independent binary quantification problems. Such a solution is simple but naïve, since the independence assumption upon which it rests is, in most cases, not satisfied. In these cases, knowing the relative frequency of one class could be of help in determining the prevalence of other related classes. We propose the first truly multi-label quantification methods, i.e., methods for inferring estimators of class prevalence values that strive to leverage the stochastic dependencies among the classes of interest in order to predict their relative frequencies more accurately. We show empirical evidence that natively multi-label solutions outperform the naïve approaches by a large margin. The code to reproduce all our experiments is available online.
Article
Quantification (or prevalence estimation) algorithms aim at predicting the class distribution of unseen sets (or bags) of examples. These methods are useful for two main tasks: 1) quantification applications, for instance when we need to track the proportions of several groups of interest over time, and 2) domain adaptation problems, in which we usually need to adapt a previously trained classifier to a different --albeit related-- target distribution according to the estimated prevalences. This paper analyzes several binary quantification algorithms showing that not only do they share a common framework but are, in fact, equivalent. Inspired by this study, we propose a new method that extends one of the approaches analyzed. After an empirical evaluation of all these methods using synthetic and benchmark datasets, the paper concludes recommending three of them due to their precision, efficiency, and diversity.
Chapter
Full-text available
This chapter concludes the book, discussing possible future developments in the quantification arena.
Chapter
Full-text available
This chapter looks at other aspects of the “quantification landscape” that have not been covered in the previous chapters, and discusses the evolution of quantification research, from its beginnings to the most recent quantification-based “shared tasks”; the landscape of quantification-based, publicly available software libraries; visualization tools specifically oriented to displaying the results of quantification-based experiments; and other tasks in data science that present important similarities with quantification. This chapter also presents the results of experiments, that we have carried out ourselves, in which we compare many of the methods discussed in Chapter 2 on a common testing infrastructure.
Chapter
Full-text available
This chapter provides the motivation for what is to come in the rest of the book by describing the applications that quantification has been put at, ranging from improving classification accuracy in domain adaptation, to measuring and improving the fairness of classification systems with respect to a sensitive attribute, to supporting research and development in fields that are usually more concerned with aggregate data than with individual data, such as the social sciences, political science, epidemiology, market research, ecological modelling, and others.
Chapter
Full-text available
In this chapter we discuss the experimental evaluation of quantification systems. We look at evaluation measures for the various types of quantification systems (binary, single-label multiclass, multi-label multiclass, ordinal), but also at evaluation protocols for quantification, that essentially consist in ways to extract multiple testing samples for use in quantification evaluation from a single classification test set. The chapter ends with a discussion on how to perform model selection (i.e., hyperparameter optimization) in a quantification-specific way.
Conference Paper
Full-text available
Quantification is a task similar to classification in the sense that it learns from a labeled training set. However, quantification is not interested in predicting the class of each observation, but rather measure the class distribution in the test set. The community has developed performance measures and experimental setups tailored to quantification tasks. Nonetheless, we argue that a critical variable, the size of the test sets, remains ignored. Such disregard has three main detrimental effects. First, it implicitly assumes that quantifiers will perform equally well for different test set sizes. Second, it increases the risk of cherry-picking by selecting a test set size for which a particular proposal performs best. Finally, it disregards the importance of designing methods that are suitable for different test set sizes. We discuss these issues with the support of one of the broadest experimental evaluations ever performed, with three main outcomes. (i) We empirically demonstrate the importance of the test set size to assess quantifiers. (ii) We show that current quantifiers generally have a mediocre performance on the smallest test sets. (iii) We propose a metalearning scheme to select the best quantifier based on the test size that can outperform the best single quantification method.
Article
Full-text available
Quantification is an expanding research topic in Machine Learning literature. While in classification we are interested in obtaining the class of individual observations, in quantification we want to estimate the total number of instances that belong to each class. This subtle difference allows the development of several algorithms that incur smaller and more consistent errors than counting the classes issued by a classifier. Among such new quantification methods, one particular family stands out due to its accuracy, simplicity, and ability to operate with imbalanced training samples: Mixture Models (MM). Despite these desirable traits, MM, as a class of algorithms, lacks a more in-depth understanding concerning the influence of internal parameters on its performance. In this paper, we generalize MM with a base framework called DyS: Distribution y-Similarity. With this framework, we perform a thorough evaluation of the most critical design decisions of MM models. For instance, we assess 15 dissimilarity functions to compare histograms with varying numbers of bins from 2 to 110 and, for the first time, make a connection between quantification accuracy and test sample size, with experiments covering 24 public benchmark datasets. We conclude that, when tuned, Topsøe is the histogram distance function that consistently leads to smaller quantification errors and, therefore, is recommended to general use, notwithstanding Hellinger Distance’s popularity. To rid MM models of the dependency on a choice for the number of histogram bins, we introduce two dissimilarity functions that can operate directly on observations. We show that SORD, one of such measures, presents performance that is slightly inferior to the tuned Topsøe, while not requiring the sensible parameterization of the number of bins.
Article
Full-text available
Point estimation of class prevalences in the presence of dataset shift has been a popular research topic for more than two decades. Less attention has been paid to the construction of confidence and prediction intervals for estimates of class prevalences. One little considered question is whether or not it is necessary for practical purposes to distinguish confidence and prediction intervals. Another question so far not yet conclusively answered is whether or not the discriminatory power of the classifier or score at the basis of an estimation method matters for the accuracy of the estimates of the class prevalences. This paper presents a simulation study aimed at shedding some light on these and other related questions.
Conference Paper
Full-text available
Quantification is an expanding research topic in Machine Learning literature. While in classification we are interested in obtaining the class of individual observations, in quantification we want to estimate the total number of instances that belong to each class. This subtle difference allows the development of several algorithms that incur smaller and more consistent errors than counting the classes issued by a classifier. Among such new quantification methods, one particular family stands out due to its accuracy, simplicity, and ability to operate with imbalanced training samples: Mixture Models (MM). Despite these desirable traits, MM, as a class of algorithms, lacks a more in-depth understanding concerning the influence of internal parameters on its performance. In this paper, we generalize MM with a base framework called DyS: Distribution y-Similarity. With this framework, we perform a thorough evaluation of the most critical design decisions of MM models. For instance, we assess 15 dissimilarity functions to compare histograms with varying numbers of bins from 2 to 110 and, for the first time, make a connection between quantification accuracy and test sample size, with experiments covering 24 public benchmark datasets. We conclude that, when tuned, Topsøe is the histogram distance function that consistently leads to smaller quantification errors and, therefore, is recommended to general use, notwithstanding Hellinger Distance's popularity. To rid MM models of the dependency on a choice for the number of histogram bins, we introduce two dissimilarity functions that can operate directly on observations. We show that SORD, one of such measures, presents performance that is slightly inferior to the tuned Topsøe, while not requiring the sensible parameterization of the number of bins.
Conference Paper
Full-text available
Early approaches to detect concept drifts in data streams without actual class labels aim at minimizing external labeling costs. However, their functionality is dubious when presented with changes in the proportion of the classes over time, as such methods keep reporting concept drifts that would not damage the performance of the current classification model. In this paper, we present an approach that can detect changes in the distribution of the features that is insensitive to changes in the distribution of the classes. The method also provides an estimate of the current class ratio and use it to adapt the threshold of a classification model trained with a balanced data. We show that the classification performance achieved by such a modified classifier is greater than that of a classifier trained with the same class distribution as the current imbalanced data.
Conference Paper
Full-text available
This paper proposes one-class quantification, a new Machine Learning task. Quantification estimates the class distribution of an unlabeled sample of instances. Similarly to one-class classification, we assume that only a sample of examples of a single class is available for learning, and we are interested in counting the cases of such class in a test set. We formulate, for the first time, one-class quantification methods and assess them in a comprehensible open-set evaluation. In an open-set problem, several "subclasses" represent the negative class, and we cannot assume to have enough observations for all of them at training time. Therefore, new classes may appear after deployment, making this a challenging setup for existing quantification methods. We show that our proposals are simple and more accurate than the state-of-the-art in quantification. Finally, the approaches are very efficient, fitting batch and stream applications .
Conference Paper
Full-text available
Many real-world applications in the batch and data stream settings with data shift pose restrictions to the access to class labels after the deployment of a classification or quantification model. However, a significant portion of the data stream literature assumes that actual labels are instantaneously available after issuing their corresponding classifications. In this paper, we explore a different set of assumptions without relying on the availability of class labels. We assume that, although the distribution of the data may change over time, it will switch between one of a handful of well-known distributions. Still, we allow the proportions of the classes to vary. In these conditions, we propose the first method that can accurately identify the correct context of data samples and simultaneously estimate the proportion of the positive class. This estimate can be further used to adjust a classification decision threshold and improve classification accuracy. Finally, the method is very efficient regarding time and memory requirements, fitting data stream applications.
Article
Full-text available
The task of quantification consists in providing an aggregate estimation (e.g., the class distribution in a classification problem) for unseen test sets, applying a model that is trained using a training set with a different data distribution. Several real-world applications demand this kind of method that does not require predictions for individual examples and just focuses on obtaining accurate estimates at an aggregate level. During the past few years, several quantification methods have been proposed from different perspectives and with different goals. This article presents a unified review of the main approaches with the aim of serving as an introductory tutorial for newcomers in the field.
Conference Paper
Learning expressive representations is always crucial for well-performed policies in deep reinforcement learning (DRL). Different from supervised learning, in DRL, accurate targets are not always available, and some inputs with different actions only have tiny differences, which stimulates the demand for learning expressive representations. In this paper, firstly, we empirically compare the representations of DRL models with different performances. We observe that the representations of a better state extractor (SE) are more scattered than a worse one when they are visualized. Thus, we investigate the singular values of representation matrix, and find that, better SEs always correspond to smaller differences among these singular values. Next, based on such observations, we define an indicator of the representations for DRL model, which is the Number of Significant Singular Values (NSSV) of a representation matrix. Then, we propose I4R algorithm, to improve DRL algorithms by adding the corresponding regularization term to enhance the NSSV. Finally, we apply I4R to both policy gradient and value based algorithms on Atari games, and the results show the superiority of our proposed method.
Conference Paper
Learning accurate probabilistic models from data is crucial in many practical tasks in data mining. In this paper we present a new non-parametric calibration method called ensemble of near isotonic regression (ENIR). The method can be considered as an extension of BBQ [20], a recently proposed calibration method, as well as the commonly used calibration method based on isotonic regression (IsoRegC) [27]. ENIR is designed to address the key limitation of IsoRegC which is the monotonicity assumption of the predictions. Similar to BBQ, the method post-processes the output of a binary classifier to obtain calibrated probabilities. Thus it can be used with many existing classification models to generate accurate probabilistic predictions. We demonstrate the performance of ENIR on synthetic and real datasets for commonly applied binary classification models. Experimental results show that the method outperforms several common binary classifier calibration methods. In particular on the real data, ENIR commonly performs statistically significantly better than the other methods, and never worse. It is able to improve the calibration power of classifiers, while retaining their discrimination power. The method is also computationally tractable for large scale datasets, as it is O(N log N) time, where N is the number of samples.