Comparing cost sensitive classifiers by the false-positive to false-negative ratio in diagnostic studies

Expert Systems With Applications 227 (2023) 120303
Available online 2 May 2023
0957-4174/© 2023 Elsevier Ltd. All rights reserved.
A. Kumaravel (a), T. Vijayan (b)
(a) Department of Information Technology, Bharath Institute of Higher Education and Research, India
(b) Department of Electronics and Communication Engineering, Bharath Institute of Higher Education and Research, India
ARTICLE INFO
Keywords:
Cost ratio
Confusion matrix
Cost matrix
Total cost
False positive
False negative
Cost sensitive learning
In vitro fertilization
ABSTRACT
Researchers are increasingly cautious about the cost of building models that can generate false
positives and false negatives in unexpected ways, and they keep searching for measures that control
such behavior for the dataset at hand. Cost sensitive classifiers are models that account for the
total cost due to misclassifications. In this article, cost sensitive classifiers are, for the first
time, paired with a new measure, the 'cost ratio', to monitor such misclassifications in sensitive
diagnostic studies. A scheme for varying this ratio is introduced and its influence on the loss is
investigated. The cost ratio ρ is a rational number whose numerator is the integer cost of false
positives, weighted by their frequency of occurrence, and whose denominator is the corresponding
cost of false negatives. We apply this novel cost monitoring measure to a sensitive in vitro
fertilization (IVF) dataset that records the success or failure of fertilization depending on
attributes such as Age, Anti-Müllerian hormone (AMH), Right Ovary (RO), Left Ovary (LO), Number of
eggs, No of Inseminations, No of fertilized and Egg quality. The article focuses on varying the cost
ratio ρ over different ranges and establishes the possibility of reducing errors in the predictions
made.
1. Introduction
Some sensitive decisions naturally behave very differently from others and strongly influence the
outputs. IVF decisions by clinicians are prone to error if the frequency of false positives is not
taken care of. Hence, in this article we propose a new measure, a cost ratio based on false-positive
to false-negative occurrences.
In vitro fertilization (IVF), one type of assisted reproductive technology (ART), comprises the
procedures for achieving pregnancy through fertilization, embryo development, and implantation.
IVF supports patients in these procedures by combining prescribed medication and surgery. In the
first step, medication is used to mature several eggs and make them ready for fertilization. In
the second step, the eggs are removed from the body and mixed with sperm for fertilization. Some of
these fertilized eggs, called embryos, are then implanted in the uterus. The annual reports of the
Indian forum for fertility clinics offer less than those produced across the United States (CDC,
2018; Sadecki et al., 2022): they do not include information on the female population, treatment
and clinical locations. In contrast, the Danish project highlights not only these missing details
but also the possible influencing factors (Baldur-Felskov et al., 2012; Bungum et al., 2019;
Thorsted et al., 2019) on infertility and its future consequences.
There are also cautions about the long-term health consequences of infertility (Murugappan et al.,
2019; Pisarska, 2017), while the tools for proper evaluation of results are not sufficient. Genetic
causes that affect conception also contribute to infertility, in addition to multiple factors from
both the male and female sides, usually through disruption of ER stress and cell death. Moreover,
the associated obstetrical outcomes are influenced by the treatment of infertility (Vander Borght &
Wyns, 2018).
The problem of finding effective and efficient classifiers is a central topic in the field of
knowledge discovery. Many techniques, methods and principles are applied in data mining research to
find more effective, efficient and accurate classifiers. It is also important to evaluate and choose
thoroughly the preprocessing procedures applied to a given data set in order to construct the best
learning model. There are contexts where cost sensitivity plays a major role. In most cases cost
sensitive models are accepted for their potential to produce accurate, or minimal error, results.
But the rates of false positives and false negatives are not controlled. Many applications demand
separate cost sensitive measures, which may be the most fitting. To check this hypothesis, in this
article we propose a measure defined as the ratio of false positives to false negatives and obtain
the results through the total cost for training and testing. Researchers working on similar themes
either used only one type of classifier to measure the total cost or considered many types of data
sets, as found in the related works below. Hence, most of the time, the occurrence of false
negatives is controlled as if this had no effect on false positives, and vice versa. The notion of a
cost ratio defined as a metric for measuring its influence on evaluation measures such as accuracy,
precision, recall, total cost and likelihood ratios of the classifiers is rarely found in the
literature. Hence we present a novel framework that addresses this issue, the first of its kind.

https://doi.org/10.1016/j.eswa.2023.120303
Received 31 January 2023; Received in revised form 11 April 2023; Accepted 27 April 2023

The main objective of this paper is to produce a mapping between the cost matrices (the input to
the cost sensitive classifiers) and the confusion matrices (the output from which the occurrences
of false positives and false negatives are extracted). The paper is divided into seven sections.
Sections 2 and 3 cover related works and methods and materials. Sections 4 and 5 describe the data
set and the proposed algorithm. Sections 6 and 7 present the design of experiments and performance,
followed by the conclusion.
2. Related works
Thakkar et al. (2022), while predicting customer churn rate, applied the AdaBoost ensemble with the
help of cost sensitive classifiers to reduce the false-negative error and the misclassification
cost more significantly inside an error-based framework. Mienye and Sun (2021) investigated the
strength of cost-sensitive learning approaches in the context of imbalanced data sets using only
conventional machine learning methods. Telikani et al. (2022) investigated cost sensitive
classification by deep learning, based on partitioning the dataset and the corresponding cost
matrix of the components by defining a separate cost function layer. Thai-Nghe et al. (2010)
presented two methods for cost sensitive learning on imbalanced data, using sampling techniques and
optimizing the cost ratio locally. However, in many contexts of imbalanced datasets the
misclassification costs cannot be determined completely. The cost-sensitive learning technique
takes misclassification costs into account during model construction and does not modify the
imbalanced data distribution directly. Assigning distinct costs to the training examples seems to
be the most effective approach to the problem of class imbalanced data. Weiss et al. (2007)
demonstrated the dependence of total cost on the cost ratio by varying the ratio uniformly.
Domingos (1999) shows that the MetaCost procedure helps cost reduction. Here we present a framework
that differs from earlier work by allowing multiple classifiers instead of a single classifier: an
SVM classifier is used in (Thai-Nghe et al., 2010), and decisions are found in (Peter., 2001). The
cost oriented classifiers built so far help to reduce the risk associated with the distribution of
false positives in the predictions. In general, methods for measuring the loss due to wrong
predictions are of interest, and the loss varies significantly from application to application.
The effect of making false positives varies from one context to another. In catalogue mailing for
business promotion, a non-respondent may incur a small negative cost, whereas the cost may be
relatively higher when a potential respondent is missed. Many researchers deal with a variety of
algorithms (Kubat & Matwin, 1997; Pes & Lai, 2021; Peter., 2001) to increase the accuracy of the
evaluated models or to reduce the probability of making wrong predictions. Sampling of training
data directly influences the distribution of classes. Learning models obtained from highly biased
datasets are incapable of producing fair predictions. Hence either oversampling or undersampling
can be used to alter the class distribution (Weiss et al., 2007) of the training data, as found in
(Abe et al., 2004; Breiman et al., 2017; Chan & Stolfo, 1998).
In (Weiss et al., 2013) the authors applied a heuristic technique for the relationship between the
cost of a false positive and the cost of a false negative, only for attribute selection. This
heuristic method works with the cooperation of domain experts; in this case the domain expert is
the physician for coronary artery disease. They also generated results based on the cost of a
subset of features and even the cost of an individual feature (Khan et al., 2018). In our proposed
work we consider the ratio of the false positive cost to the false negative cost to establish the
selection of appropriate cost sensitive classifiers.
The relationship (equivalence in the nature of distributions) between class frequencies and
misclassification based on cost ratio was established differently in (Ioannidis et al., 2011;
Peter, 2001). The frequencies of positive and negative examples may be monitored to make the
learning algorithms cost sensitive. The authors in (Kubat & Matwin, 1997) suggested that
approximate equality of the different classes must be adopted for better performance. The
epidemiological studies in (Ioannidis et al., 2011) consider the ratio of false negatives to false
positives for identifying risk factors contributing to causes and effects in preventive health
care. The 'Blackstone ratio' emphasizes the weight of false negatives and false positives in the
criminal justice system, to strike an acceptable tradeoff between their costs in terms of reward
and punishment. 'Sentimental exaggerations', as made in the criminal justice system or in medical
diagnosis, are reflected as the cost of false positives and false negatives in many forms.
The approach addresses the challenge of handling class-imbalanced data, where the minority class
holds greater significance than the majority class, a problem that standard machine learning
classifiers typically struggle with. To tackle this, correlation based feature selection is
utilized as a preprocessing technique to eliminate noisy features and extract the most relevant
ones, and it gives superior geometric performance (Elkarami et al., 2016).
3. Methods and materials
Cost-sensitive learning, the construction of such classifiers and their parameters are described
in the following subsections.
3.1. Cost sensitive classifier
A cost-sensitive classifier refers to a mechanism in machine learning that factors in the costs
linked with various forms of classification errors. While conventional classification treats all
types of errors (false positives and false negatives) uniformly, the idea in cost-sensitive
classification is to assign distinct costs to each type of error. There are several approaches to
introducing cost-sensitivity in machine learning models. One way is to adjust the weights of
training instances based on the assigned cost of each class. Another approach is to predict the
class that minimizes the expected misclassification cost, instead of the most likely class. Using a
bagged classifier can enhance the accuracy of the probability estimates from the base classifier,
leading to improved performance. In cases where the base classifier is unable to handle instance
weights and the weights are non-uniform, the data can be re-sampled with replacement based on the
weights prior to being fed into the base classifier.
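
The second approach above, predicting the class with the minimum expected misclassification cost
rather than the most probable class, can be sketched as follows. This is a minimal illustration and
not the Weka implementation used in this work: the function name `min_expected_cost_class`, the
probabilities and the cost values are assumptions chosen for the example. The cost matrix is
indexed as row = actual class, column = predicted class.

```python
from typing import Sequence


def min_expected_cost_class(probs: Sequence[float],
                            cost: Sequence[Sequence[float]]) -> int:
    """Return the index of the class with the lowest expected cost.

    probs[i]   : estimated probability that the true class is i
    cost[i][j] : cost of predicting class j when the true class is i
    """
    n_classes = len(probs)
    # Expected cost of predicting j, averaged over the possible true classes.
    expected = [
        sum(probs[i] * cost[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]
    return min(range(n_classes), key=expected.__getitem__)


# Hypothetical example: class 0 = negative, class 1 = positive.
probs = [0.7, 0.3]                # the most likely class is negative
equal_costs = [[0, 1], [1, 0]]    # false positive and false negative cost the same
fn_heavy = [[0, 1], [5, 0]]       # a false negative costs five times more

print(min_expected_cost_class(probs, equal_costs))
print(min_expected_cost_class(probs, fn_heavy))
```

With equal costs the rule agrees with the most likely class; once a false negative is made five
times as costly, the same probabilities lead to a positive prediction.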
3.2. Cost-sensitive learning (CSL)
Most classifiers assume that all misclassification costs are the same, but in reality this
assumption does not always hold. For example, in the diagnosis of cancer a missed case is far more
serious than a false alarm, because the patient could lose their life due to delayed diagnosis and
late treatment (Ioannidis et al., 2011). In our process we have revised our equations and cost
calculation parameters in terms of the materials shown below.
The cost values are tabulated in a cost matrix, as in Table 1, which has the same structure as the
confusion matrix. The main diagonal holds the entries where prediction and ground truth agree (true
as true and false as false) during training and testing respectively, and the off-diagonal
positions hold the entries where they disagree. Table 1 shows the template of the cost matrix
according to the confusion matrix structure.

Table 1
Template for Cost Matrix based on confusion matrix.
                   Predicted Negative   Predicted Positive
Actual Negative    $C_{11}$             $C_{12}$
Actual Positive    $C_{21}$             $C_{22}$

The above cost matrix helps us to calculate the total cost, and we can vary the off-diagonal
elements to study the nature of misclassification. If the misclassification cost, in terms of false
positives or false negatives, is known, total cost is chosen in this proposed model as the best
metric to evaluate classifier performance, as shown below. We report only total cost as the
performance metric for the four cost sensitive learning methods using 'tree' type base classifiers.
The cost ratio ρ is defined as the ratio

$$\rho = \frac{\text{Number of false positives}}{\text{Number of false negatives}} \tag{1}$$

The equation below shows the applied Total Cost formula.

$$\text{Total Cost} = (\mathrm{FN} \times \mathrm{CFN}) + (\mathrm{FP} \times \mathrm{CFP}) \tag{2}$$

where FN = number of false negatives and FP = number of false positives, obtained from the
confusion matrix generated as output while training and testing are carried out;
CFP is the cost of a false positive, denoted by $C_{12}$ in Table 1;
CFN is the cost of a false negative, denoted by $C_{21}$ in Table 1.
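
A short worked example may make Eq. (2) concrete. The sketch below is illustrative only: the
function name `total_cost` and the counts FP = 12 and FN = 8 are hypothetical and are not taken
from the experiments reported here.

```python
def total_cost(fp: int, fn: int, c_fp: int, c_fn: int) -> int:
    """Total Cost = (FN x CFN) + (FP x CFP), as in Eq. (2)."""
    return fn * c_fn + fp * c_fp


# Hypothetical confusion-matrix counts, not taken from the paper.
fp, fn = 12, 8

# Keep CFP = 1 and raise CFN: the total grows by FN for each unit step.
for c_fn in (1, 2, 3):
    print(f"CFP:CFN = 1:{c_fn} -> total cost {total_cost(fp, fn, 1, c_fn)}")
```

When the confusion matrix is held fixed, raising one cost entry by one unit changes the total by
exactly the count of the corresponding error type.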
3.3. Cost-Sensitive classifiers construction
This section explains the assumptions made in constructing an algorithm that uses cost sensitive
classifiers to keep the false positives (FP) and false negatives (FN) at a minimum level. Four
different cases are described below, along with how the generated values are used in the
subsequent processes.
The ratios of false positives to false negatives are used as inputs to the main algorithms, and the
output is extracted from the confusion matrix. The challenge in training a model through the cost
sensitive classifier is to obtain the misclassification cost for the expected classification
results.
Here C(i, j), where i and j take the values 1 or 2, denotes the cost of predicting class j for an
instance whose ground truth is class i, following the layout of Table 1. The main algorithm
revolves around the ratio C(1,2)/C(2,1) and its inverse, varied either by uniform increments or by
relatively prime steps, which amounts to four different styles. The main objective of this process
is to find an acceptable ratio as it varies across different values in these four different styles.
The objective is to construct the mapping ζ from Q to R, where the domain is the set of rational
numbers and the range is the set of real numbers indicating the associated cost. ζ(x) = c indicates
that a ratio FP:FN, x in Q, takes the cost c, where FP and FN are two integers (McCrimmon, 1960;
Sagher, 1989; Yu-Ting, 1980). We have adopted variations for x in Q distinguished by the four cases
described below.
1. Case 1. x is of the form 1/y, where y is a positive integer.
2. Case 2. x is of the form p/q, where p and q are positive integers and gcd(p, q) = 1.
3. Case 3. Reciprocal of x in Case 1.
4. Case 4. Reciprocal of x in Case 2.
These four cases allow all possible positive x in Q, i.e. values of x (the FP:FN ratio), which is
considered based on number theory results discussed in (Domingos, 1999; Thai-Nghe et al., 2010;
Weiss et al., 2007). Using these four cases, the main algorithm is formed from the four components
listed in Table 2.
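
The four cases can be told apart mechanically once a ratio is reduced to lowest terms. The sketch
below is one possible reading rather than the authors' procedure: the function name `ratio_case` is
hypothetical, and treating ratios below 1 as Case 2 and ratios above 1 as Case 4 is an assumption
suggested by the ratios listed in Tables 5 and 6.

```python
from fractions import Fraction


def ratio_case(x: Fraction) -> str:
    """Name the case (1 to 4) that a positive rational cost ratio falls into."""
    if x <= 0:
        raise ValueError("the cost ratio must be positive")
    if x.numerator == 1:
        return "Case 1"      # x = 1/y (includes 1/1)
    if x.denominator == 1:
        return "Case 3"      # x = y/1, reciprocal of a Case 1 ratio
    if x < 1:
        return "Case 2"      # x = p/q in lowest terms, assumed p < q
    return "Case 4"          # reciprocal of a Case 2 ratio


# Fraction reduces to lowest terms, so gcd(p, q) = 1 holds automatically.
for x in (Fraction(1, 5), Fraction(2, 3), Fraction(7, 1), Fraction(9, 4)):
    print(x, ratio_case(x))
```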
4. Data collection and preprocessing
The underlying data set was collected from IVF procedures carried out at Prasanth Fertility Chennai
Centre, India, and the collection was approved by the Centre's Review Board. The data were
collected between 2016 and 2018 and amount to a sample of 327 patient records with the class
distribution 118 negative and 209 positive instances (Hari Priya, 2021). Hence the class ratio of
negative to positive is approximately 1:2.
Exclusion criteria were as follows:
a) Women who had undergone ovarian surgery or had any endocrine disorder were not included in the
study.
b) Women whose ovarian stimulation results could not be controlled within 150 IU/day with
pituitary suppression of FSH 100 IU/day (Recagon), as sensed by TVS scan along with serum estradiol
measurement as standard methods (Bas-Lando et al., 2017; Muttukrishna et al., 2005; Uyar et al.,
2014), were also excluded.

Table 2
Algorithm components based on Cost Ratio.
Ratio Pattern (CFP:CFN)           Uniform    Inverse
Normal                            CSC-U      CSC-UI
Non Uniform (Relatively Prime)    CSC-NU     CSC-NUI

Fig. 1. Attribute distribution chart.

This dataset contains 8 attributes as columns and 327 patients' records collected in real time. The
distribution of each attribute is shown in Fig. 1.
The attributes Age, Egg Quality, Anti-Müllerian hormone (AMH), Right Ovary (RO), Left Ovary (LO),
No of Insemination, No of fertilized and No of eggs are discretized as follows.
Age: The range is divided into five sectors s1, s2, s3, s4 and s5 as
s1: 20–25, s2: 26–30, s3: 30–35, s4: 36–40 and s5: 41–46.
Egg Quality: The range is divided into five sectors s1, s2, s3, s4 and s5
as s1: 0.01, s2: 0.25, s3: 0.5, s4: 0.75 and s5: 1.
Anti-Müllerian hormone (AMH): The range is divided into five sectors s1, s2, s3, s4 and s5 as
s1: 0–2.0, s2: 2.1–4.0, s3: 4.1–6.0, s4: 6.1–8.0 and s5: 8.1–10.6.
Right Ovary (RO), Left Ovary (LO), No of Insemination and No of fertilized: The range is divided
into five sectors s1, s2, s3, s4 and s5 as s1: 0–5, s2: 6–10, s3: 11–15, s4: 16–20 and s5: 21–38.
No of eggs: The range is divided into five sectors s1, s2, s3, s4 and s5
as s1: 1–5, s2: 6–10, s3: 11–15, s4: 16–20 and s5: 21–43.
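
As an illustration of this discretization step, the sketch below maps an AMH reading to its sector.
It assumes the AMH sector boundaries are 0 to 2.0, 2.1 to 4.0, 4.1 to 6.0, 6.1 to 8.0 and 8.1 to
10.6, and it assigns a value that falls between two listed sectors (for example 2.05) to the upper
one; both the function name `amh_sector` and that tie-breaking choice are assumptions, not part of
the original preprocessing.

```python
from bisect import bisect_left

# Assumed upper bounds of the five AMH sectors s1..s5.
AMH_UPPER_BOUNDS = [2.0, 4.0, 6.0, 8.0, 10.6]


def amh_sector(value: float) -> str:
    """Map an AMH reading to its sector label s1..s5."""
    if not 0.0 <= value <= AMH_UPPER_BOUNDS[-1]:
        raise ValueError(f"AMH value {value} is outside the range 0 to 10.6")
    # bisect_left finds the first upper bound that is >= value.
    return f"s{bisect_left(AMH_UPPER_BOUNDS, value) + 1}"


for reading in (0.4, 2.0, 2.1, 7.3, 10.6):
    print(reading, amh_sector(reading))
```

The other attributes could be binned the same way by swapping in their own list of upper bounds.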
5. Proposed main algorithm
In this article, cost sensitive classifiers are constructed with the help of the algorithm in
Fig. 3, which implements the algorithm components described above. The cost sensitive classifiers
are built with the Weka tool (Weka, 2021) by tuning the ratio ρ as planned in Section 3.2. Table 2
describes the four interrelated components of the main algorithm.
The prefix CSC stands for Cost Sensitive Classifier, and the suffixes U, UI, NU and NUI denote the
uniform, uniform inverse, non-uniform and non-uniform inverse components respectively, as stated in
Table 2. The main process and the steps involved in the algorithm are shown in Fig. 2.
Fig. 2. Cost Sensitive classifier based on cost ratio.
Fig. 3. Proposed Algorithm for Cost Ratio based Cost Sensitive Learning.
5.1. Pseudo code for cost sensitive classifier by cost ratio
In Fig. 3 the steps specific to each algorithm component are annotated in '[ ]', and the other
steps, common to all four cases, are described in the rest of this section.
Dataset D contains the underlying data instances, here the instances of patient attributes
described earlier, each contributing a class value of either positive or negative. This dataset
contains 327 records and 8 features in each record. Because the context is treated as
cost-sensitive, preprocessing such as normalization and attribute selection is not considered, as
all attributes carry equal significance. The data collection does not contain any missing or blank
data.
In line 2, $b_i$ denotes any tree classifier in {J48, LMT, ADT, Decision Stump}.
Of the two loops, the outer loop iterates over the four tree classifiers and the inner loop
iterates over the index i that varies the cost ratio extracted from the cost matrix. After the loop
variables are fixed for the current iteration, the TC procedure call takes care of training,
testing and classifying, and finally computes the total cost involved. The final output is the
optimal value among the set of total costs generated by the above iterations.
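
The two nested loops can be sketched in a few lines. This is not a transcription of Fig. 3: the
names `search_min_total_cost` and `fake_evaluate` are hypothetical, "optimal" is assumed to mean
the minimum total cost, and the evaluator is a stand-in with made-up counts where a real run would
train and test the cost sensitive learner.

```python
from typing import Callable, Iterable, Tuple

CostRatio = Tuple[int, int]                               # (CFP, CFN)
Evaluator = Callable[[str, CostRatio], Tuple[int, int]]   # returns (FP, FN)


def search_min_total_cost(classifiers: Iterable[str],
                          ratios: Iterable[CostRatio],
                          evaluate: Evaluator):
    """Return (total_cost, classifier, ratio) with the smallest total cost."""
    ratios = list(ratios)
    best = None
    for clf in classifiers:                       # outer loop: base tree classifiers
        for c_fp, c_fn in ratios:                 # inner loop: cost ratio index
            fp, fn = evaluate(clf, (c_fp, c_fn))  # train, test, read confusion matrix
            cost = fn * c_fn + fp * c_fp          # Eq. (2)
            if best is None or cost < best[0]:
                best = (cost, clf, (c_fp, c_fn))
    return best


# Stand-in evaluator with made-up counts; a real run would call the learner.
def fake_evaluate(clf: str, ratio: CostRatio) -> Tuple[int, int]:
    counts = {"J48": (10, 20), "Decision Stump": (3, 40)}
    return counts[clf]


ratios = [(1, k) for k in range(1, 4)]
print(search_min_total_cost(["J48", "Decision Stump"], ratios, fake_evaluate))
```

Passing the evaluator in as a function keeps the loop structure separate from whichever toolkit
actually builds the classifier.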
5.1.1. Component for CSC-U:
To obtain the cost value as discussed earlier, the numerator of the ratio ρ is incremented
uniformly from 1 to 10 and the denominator is fixed.
5.1.2. Component for CSC-UI:
This component is obtained by inverting the component for CSC-U, so that CFN = $c_{21}$ = i and
$c_{12}$ = 1, with i ∈ {1, 2, ..., 10}.
5.1.3. Component for CSC-NU:
The above algorithm is applied with the ratio ρ as m:n, where m and n are relatively prime and
their values lie within the principal range 1 to 10. The only extra complexity of the non-uniform
cost algorithm is testing the relative primality condition with the greatest common divisor, that
is, gcd(m, n) = 1.
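
The coprimality test can be sketched with the standard library's `math.gcd`. The function name
`coprime_ratios` and the restriction to m < n are assumptions made for this illustration; the
tables in this article list fewer rows than the enumeration produces, so the authors' exact
selection of ratios is not reproduced here.

```python
from math import gcd


def coprime_ratios(limit: int = 10):
    """List ratios m:n with 1 <= m < n <= limit and gcd(m, n) = 1."""
    return [(m, n)
            for m in range(1, limit + 1)
            for n in range(m + 1, limit + 1)
            if gcd(m, n) == 1]


pairs = coprime_ratios(10)
print(len(pairs))      # number of candidate non-uniform ratios
print(pairs[:5])       # the first few ratios in enumeration order
```

Swapping each pair to (n, m) gives the corresponding inverse ratios.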
5.1.4. Component for CSC-NUI:
From the ratio ρ calculated as CFP/CFN, we consider its inverse and repeat the classification
process to get new results for both the uniform and non-uniform cases, with the index in
non-decreasing order. Hence we obtain the non-uniform inverse cost algorithm, namely algorithm NUI,
by simply inverting the loop indices in the algorithm for NU.
Fig. 4. Uniform Variation of Total Cost.
Fig. 5. Uniform Inverted Variation of Total Cost.
5.2. Interpretation of main algorithm
To obtain the list of classifier performance measures such as accuracy, precision, recall, total
cost and likelihood ratio, we apply the main algorithm, and each component works according to the
range type of the cost ratio. The innermost part is common to all types of components and consists
of the steps for classifying and generating the confusion matrix. The index of the outermost loop
varies over the base classifiers (for these experiments we consider only decision trees). The index
type of the next inner loop varies for each component. In the first component, CSC-U, the index
(cost ratio) varies uniformly by unit increments, whereas in CSC-UI the index is inverted. The two
components in the non-uniform case, CSC-NU and CSC-NUI, are similar except that the index is
determined by a cost ratio whose numerator and denominator are coprime. The inputs of the main
algorithm are the dataset and the cost matrix. The entries in the cost matrix are assumed to be
integers for the sake of simplicity. The values of the off-diagonal elements $c_{21}$ and $c_{12}$
are restricted to the principal values 1 to 10, and we construct the four types of ratios based on
these values. This range was selected for two reasons: first, most of the classifiers behave stably
in this range; second, even where that is not the case, larger values lead to extremely long
learning times.
The comparisons of the above-mentioned four components of the algorithm are shown through the
tables and graphs below.
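
The performance measures named above all follow from the four cells of the confusion matrix. The
sketch below uses the standard textbook definitions; the function name `diagnostic_metrics` and the
counts are hypothetical, and no guard is included for empty cells (for example, FP = 0 makes the
positive likelihood ratio undefined).

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard measures derived from a 2x2 confusion matrix."""
    recall = tp / (tp + fn)                 # sensitivity
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": recall,
        "lr_positive": recall / (1 - specificity),
        "lr_negative": (1 - recall) / specificity,
    }


# Hypothetical counts for illustration only.
metrics = diagnostic_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```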
6. Experimental results
We applied the data from the clinical IVF records in the database to tree classifiers used as cost
sensitive learners on the Weka platform (Weka, 2021). The best four classifiers, namely J48, ADT,
LMT and Decision Stump, were adopted for this test.
Comparing the graphs of CSC-U and CSC-UI in Figs. 4 and 5 shows that the magnitude of the total
cost is high when false positives are increased in the main algorithm, and that the total cost
increases gradually. Choosing the right tree classifier is also important: in CSC-U, J48 produces
more total cost and Decision Stump shows a low total cost, but in CSC-UI the reverse is seen, where
J48 produces less total cost compared to Decision Stump (see Tables 3 and 4).
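
The gradual increase noted above is linear in the reported numbers. The sketch below takes the J48
columns as printed in Tables 3 and 4 and checks that consecutive rows differ by a constant step;
under Eq. (2) this is what one would expect if the confusion matrix stayed the same while only one
cost entry changed. That reading is an inference from the tables and not a statement made in this
article, and the helper name `steps` is illustrative.

```python
# J48 column as printed in Table 3 (ratios 1:1 to 1:10) and Table 4 (1:1 to 10:1).
table3_j48 = [121, 191, 261, 331, 401, 471, 541, 611, 681, 751]
table4_j48 = [121, 172, 223, 274, 325, 376, 427, 478, 529, 580]


def steps(costs):
    """Differences between consecutive total costs."""
    return [b - a for a, b in zip(costs, costs[1:])]


print(set(steps(table3_j48)))   # a single value means the growth is linear
print(set(steps(table4_j48)))
```

The two step sizes add up to the 1:1 total cost of 121, which is consistent with them being the
two error counts of a single fixed confusion matrix.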
Comparing the graphs of CSC-NU and CSC-NUI in Figs. 6 and 7 shows that the magnitude of the total
cost is high when false positives are increased in the main algorithm, and that the total cost
increases gradually. Choosing the right tree classifier is again important: in CSC-NU, LMT produces
more total cost and Decision Stump shows a low total cost, but in CSC-NUI the reverse is seen (see
Tables 5 and 6).
7. Conclusion
The cost sensitive model for the IVF data set is processed for four different ranges of the cost
ratio. The results show that the influence of the cost ratio of false positives to false negatives
varies across the types of tree classifier used here, namely J48, ADT, LMT and Decision Stump, for
the IVF dataset. Moreover, the magnitude of the total cost increases gradually as the ratio
changes, and based on the obtained results the best classifier can be chosen in future from a
better understanding of the classifiers. The limitations are primarily visible in the number of
iterations, since the index varies only over the initial segment of integers 1–10. Though this
serves the purpose of establishing and demonstrating the existence of a mapping on the cost ratio,
the restriction can be relaxed to a higher number of iterations given computing facilities with
more time and space for generating the cost models. Future work can extend the study to other types
of cost sensitive meta classifiers to measure the error cost as discussed in this work.

Table 3
Performance by Total cost based on the ratio (false positive: false negative) in
Cost Sensitive Classifiers applying CSC-U.
Total Cost for IVF Cost sensitive Classifiers CSC-U
Cost Ratio J48 LMT ADT Decision Stump
1:1 121 131 107 123
1:2 191 214 184 237
1:3 261 297 261 351
1:4 331 380 338 465
1:5 401 463 415 579
1:6 471 546 492 693
1:7 541 629 569 807
1:8 611 712 646 921
1:9 681 795 723 1035
1:10 751 876 800 1149

Table 4
Performance by Total cost based on the ratio (false positive: false negative) in
Cost Sensitive Classifiers applying CSC-UI.
Total Cost for IVF Cost sensitive Classifiers CSC-UI
Cost Ratio J48 LMT ADT Decision Stump
1:1 121 131 107 123
2:1 172 179 137 132
3:1 223 227 167 141
4:1 274 275 197 150
5:1 325 323 228 159
6:1 376 371 257 168
7:1 427 419 287 177
8:1 478 467 317 186
9:1 529 515 347 195
10:1 580 563 377 204

Fig. 6. Non Uniform Variation of Total Cost.

CRediT authorship contribution statement
A. Kumaravel: Conceptualization, Methodology, Writing – original draft, Visualization,
Supervision, Writing – review & editing. T. Vijayan: Software, Data curation, Investigation,
Validation, Project administration.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal
relationships that could have appeared to influence the work reported in this paper.
Data availability
Data will be made available on request.
Acknowledgements
We would like to thank authorities of Prasanth Fertility Hospital,
Chennai, India for allowing the real time data used for this research
work.
Fig. 7. Non Uniform Inverted Variation of Total Cost.
Table 5
Performance by Total cost based on the ratio (false positive: false negative) in
Cost Sensitive Classifiers applying CSC-NU.
Total Cost for IVF Cost sensitive Classifiers CSC-NU
Cost Ratio J48 LMT ADT Decision Stump
1:1 121 131 107 123
1:3 223 227 167 141
1:4 274 275 197 150
2:3 293 310 244 255
2:5 244 406 304 273
2:7 497 502 364 291
2:9 599 327 424 309
3:4 414 441 351 378
3:7 567 585 441 405
4:5 535 572 458 501
4:7 637 668 518 519
4:9 739 764 578 537
5:6 657 703 565 624
5:7 707 751 595 633
5:8 758 799 625 642
5:9 809 847 655 651
6:7 777 834 672 747
7:8 898 965 779 870
7:9 949 1013 809 879
8:9 1019 1096 886 993
Table 6
Performance by Total cost based on the ratio (false positive: false negative) in
Cost Sensitive Classifiers applying CSC-NUI.
Total Cost for IVF Cost sensitive Classifiers CSC-NUI
Cost Ratio J48 LMT ADT Decision Stump
1: 1 121 131 107 123
3: 1 261 297 261 351
3: 2 312 345 291 360
4: 1 331 380 338 465
4: 3 433 476 398 483
5: 2 452 511 445 588
5: 4 554 607 505 606
6: 5 675 738 612 729
7: 2 592 677 599 816
7: 3 643 725 629 825
7: 4 694 773 659 834
7: 5 745 821 689 843
7: 6 796 869 719 852
8: 5 815 904 766 957
8: 7 917 1000 826 975
9: 2 732 843 753 1044
9: 4 834 939 813 1062
9: 5 885 987 843 1071
9: 7 987 1083 903 1089
9: 8 1038 1131 933 1098
References
Abe, N., Zadrozny, B., & Langford, J. (2004). An iterative method for multi-class cost-
sensitive learning. Proceedings of the Tenth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. https://doi.org/10.1145/1014052.1014056.
Baldur-Felskov, B., Kjaer, S. K., Albieri, V., Steding-Jessen, M., Kjaer, T., Johansen, C.,
Dalton, S. O., & Jensen, A. (2012). Psychiatric disorders in women with fertility
problems: results from a large Danish register-based cohort study. Human
Reproduction, 28(3), 683–690. https://doi.org/10.1093/humrep/des422
Bas-Lando, M., Rabinowitz, R., Farkash, R., Algur, N., Rubinstein, E., Schonberger, O., &
Eldar-Geva, T. (2017). Prediction value of anti-Mullerian hormone (AMH) serum
levels and antral follicle count (AFC) in hormonal contraceptive (HC) users and non-
HC users undergoing IVF-PGD treatment. Gynecological Endocrinology, 33(10),
797–800. https://doi.org/10.1080/09513590.2017.1320376
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (2017, October 19).
Classification And Regression Trees. https://doi.org/10.1201/9781315139470.
Bungum, A. B., Glazer, C. H., Arendt, L. H., Schmidt, L., Pinborg, A., Bonde, J. P., &
Tøttenborg, S. S. (2019). Risk of hospitalization for early onset of cardiovascular
disease among infertile women: a register-based cohort study. Human Reproduction,
34(11), 2274–2281. https://doi.org/10.1093/humrep/dez154
Chan, P. K., & Stolfo, S. (1998). Toward scalable learning with non-uniform class and
cost distributions: A case study in credit card fraud detection. Knowledge Discovery
and Data Mining.
CDC. (2018). 2017 Fertility Clinic Success Rates | Assisted Reproductive Technology
(ART) Report | Reproductive Health | CDC.
https://www.cdc.gov/art/reports/2017/fertility-clinic.html.
Domingos, P. (1999). MetaCost. Proceedings of the Fifth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining.
https://doi.org/10.1145/312129.312220.
Elkarami, B., Alkhateeb, A., & Rueda, L. (2016, May). Cost-sensitive classification on
class-balanced ensembles for imbalanced non-coding RNA data. 2016 IEEE EMBS
International Student Conference (ISC).
https://doi.org/10.1109/embsisc.2016.7508607.
Hari Priya, G., et al. (2021). Classifiers with synthetic oversampling pre-process for In
Vitro Fertilization predictions. Indian Journal of Computer Science and Engineering,
12(6), 1532–1541. https://doi.org/10.21817/indjcse/2021/v12i6/211206061.
Ioannidis, J. P. A., Tarone, R., & McLaughlin, J. K. (2011). The False-positive to False-
negative Ratio in Epidemiologic Studies. Epidemiology, 22(4), 450–456.
https://doi.org/10.1097/ede.0b013e31821b506e
McCrimmon, K. (1960). Enumeration of the positive rationals. The American
Mathematical Monthly, 67(9), 868.
Khan, S. H., Hayat, M., Bennamoun, M., Sohel, F. A., & Togneri, R. (2018). Cost-Sensitive
Learning of Deep Feature Representations From Imbalanced Data. IEEE Transactions
on Neural Networks and Learning Systems, 29(8), 3573–3587.
https://doi.org/10.1109/tnnls.2017.2732482
Kubat, M., & Matwin, S. (1997). Addressing the curse of imbalanced training sets: One-
sided selection. Proceedings of the 14th International Conference in Machine
Learning, Nashville, 179–186.
Mienye, I. D., & Sun, Y. (2021). Performance analysis of cost-sensitive learning methods
with application to imbalanced medical data. Informatics in Medicine Unlocked, 25,
Article 100690. https://doi.org/10.1016/j.imu.2021.100690
Murugappan, G., Li, S., Lathi, R. B., Baker, V. L., & Eisenberg, M. L. (2019). Increased risk
of incident chronic medical conditions in infertile women: analysis of US claims data.
American Journal of Obstetrics and Gynecology, 220(5), 473.e1–473.e14.
https://doi.org/10.1016/j.ajog.2019.01.214
Muttukrishna, S., McGarrigle, H., Wakim, R., Khadum, I., Ranieri, D., & Serhal, P. (2005).
Antral follicle count, anti-mullerian hormone and inhibin B: predictors of ovarian
response in assisted reproductive technology? BJOG: An International Journal of
Obstetrics & Gynaecology, 112(10), 1384–1390.
https://doi.org/10.1111/j.1471-0528.2005.00670.x
Pes, B., & Lai, G. (2021). Cost-sensitive learning strategies for high-dimensional and
imbalanced data: a comparative study. PeerJ Computer Science, 7.
https://doi.org/10.7717/peerj-cs.832
Peter. (2001, August). The foundations of cost-sensitive learning. IJCAI'01: Proceedings of
the 17th International Joint Conference on Artificial Intelligence, 2, 973–978.
https://doi.org/10.5555/1642194.1642224.
Pisarska, M. D. (2017, June 28). Fertility Status and Overall Health. PubMed Central
(PMC). https://doi.org/10.1055/s-0037-1603728.
Sadecki, E., Weaver, A., Zhao, Y., Stewart, E. A., & Ainsworth, A. J. (2022). Fertility
trends and comparisons in a historical cohort of US women with primary infertility.
Reproductive Health, 19(1). https://doi.org/10.1186/s12978-021-01313-6
Telikani, A., Gandomi, A. H., Choo, K. K. R., & Shen, J. (2022). A cost-sensitive deep
learning-based approach for network traffic classification. IEEE Transactions on
Network and Service Management, 19(1), 661–670.
https://doi.org/10.1109/tnsm.2021.3112283
Thai-Nghe, N., Gantner, Z., & Schmidt-Thieme, L. (2010, July). Cost-sensitive learning
methods for imbalanced data. The 2010 International Joint Conference on Neural
Networks (IJCNN). https://doi.org/10.1109/ijcnn.2010.5596486.
Thakkar, H. K., Desai, A., Ghosh, S., Singh, P., & Sharma, G. (2022, January 22).
Clairvoyant: AdaBoost with Cost-Enabled Cost-Sensitive Classifier for Customer
Churn Prediction. Computational Intelligence and Neuroscience, 2022, 1–11.
https://doi.org/10.1155/2022/9028580.
Thorsted, A., Lauridsen, J., Høyer, B., Arendt, L. H., Bech, B., Toft, G., Hougaard, K.,
Olsen, J., Bonde, J. P., & Ramlau-Hansen, C. (2019). Birth weight for gestational age
and the risk of infertility: a Danish cohort study. Human Reproduction, 35(1),
195–202. https://doi.org/10.1093/humrep/dez232
Uyar, A., Bener, A., & Ciray, H. N. (2014). Predictive modeling of implantation outcome
in an in vitro fertilization setting. Medical Decision Making, 35(6), 714–725.
https://doi.org/10.1177/0272989x14535984
Vander Borght, M., & Wyns, C. (2018). Fertility and infertility: Definition and
epidemiology. Clinical Biochemistry, 62, 2–10.
https://doi.org/10.1016/j.clinbiochem.2018.03.012
Weiss, G. M., McCarthy, K., & Zabar, B. (2007). Cost-sensitive learning vs. sampling:
Which is best for handling unbalanced classes with unequal error costs? DMIN,
7(35–41), 24.
Weiss, Y., Elovici, Y., & Rokach, L. (2013, February). The CASH algorithm-cost-sensitive
attribute selection using histograms. Information Sciences, 222, 247–268.
https://doi.org/10.1016/j.ins.2011.01.035
Weka (2021). Department of Computer Science: University of Waikato.
http://www.cs.waikato.ac.nz.
Sagher, Y. (1989). Counting the rationals. The American Mathematical Monthly, 96(9), 823.
Yu-Ting, S. (1980). A "Natural" enumeration of non-negative rational numbers - an
informal discussion. The American Mathematical Monthly, 87(1), 25.
https://doi.org/10.2307/2320374