Comparing cost sensitive classifiers by the false-positive to false-negative ratio in diagnostic studies

Expert Systems With Applications 227 (2023) 120303

Available online 2 May 2023

A. Kumaravel, T. Vijayan

Department of Information Technology, Bharath Institute of Higher Education and Research, India

Department of Electronics and Communication Engineering, Bharath Institute of Higher Education and Research, India

ARTICLE INFO

Keywords:

Cost ratio

Confusion matrix

Cost matrix

Total cost

False positive

False negative

Cost sensitive learning

In vitro fertilization

ABSTRACT

Researchers are increasingly cautious about the cost of building models that generate false positives and false negatives in unexpected ways, and they keep searching for measures to control such behavior depending on the underlying datasets. Cost sensitive classifiers are models that keep the total cost due to misclassifications in check. In this article, cost sensitive classifiers are, for the first time, endowed with a new measure, the ‘cost ratio’, to monitor such misclassifications in sensitive diagnostic studies. A scheme for varying this ratio is introduced and its influence on the loss is investigated. The cost ratio, a rational number, has in its numerator the integer cost of a false positive weighted by its frequency of occurrence, and in its denominator the corresponding cost of a false negative. We apply this novel cost monitoring measure to learning from a sample dataset of a sensitive nature, an in vitro fertilization (IVF) dataset indicating the success or failure of fertilization depending on attributes such as Age, Anti-Müllerian hormone (AMH), Right Ovary (RO), Left Ovary (LO), Number of eggs, No of Inseminations, No of fertilized and Egg quality. This article focuses mainly on variations over different ranges of the cost ratio and establishes the possibility of reducing errors in the predictions made.

1. Introduction

It is natural for some sensitive decisions to behave significantly differently and to influence the outputs. IVF decisions by clinicians are prone to error if the frequency of false positives is not taken care of. Hence, in this article we propose a new measure, the cost ratio, based on false-positive to false-negative occurrences.

In vitro fertilization (IVF), one type of assisted reproductive technology (ART), comprises the procedures for achieving pregnancy through fertilization, embryo development, and implantation. Combining prescriptions of medicines and surgeries, IVF supports patients in the above-mentioned procedures. The main process of IVF uses medication to make several eggs mature and ready for fertilization. In the second step, the eggs are removed from the body and mixed with sperm for fertilization. Some of these fertilized eggs, called embryos, are then implanted in the uterus. The annual reports of the Indian forum for fertility clinics are less comprehensive than those produced across the United States (CDC, 2018; Sadecki et al., 2022). They do not include information on the women population, treatment and clinical locations. In contrast, the Danish project highlights not only these missing details but also the possible factors influencing infertility and its future consequences (Baldur-Felskov et al., 2012; Bungum et al., 2019; Thorsted et al., 2019).

Cautions about the long-term health consequences of infertility have also been presented (Murugappan et al., 2019; Pisarska, 2017), while the tools for proper evaluation of results are not sufficient. Genetic causes that affect conception also contribute to infertility, in addition to the multiple factors on both the male and female sides, usually through disruption of ER stress and cell death. Moreover, the associated obstetrical outcomes are influenced by the treatment of infertility (Vander Borght & Wyns, 2018).

The problem of finding effective and efficient classifiers is the central topic of the knowledge discovery field. Many techniques, methods and principles are applied in data mining research to find more effective, efficient and accurate classifiers. It is also important to evaluate and choose thoroughly the preprocessing procedures applied to the given data set in order to construct the best learning model. There exist contexts where cost sensitivity plays a major role. In most cases cost sensitive models are accepted because of their potential to produce accurate or minimal-error results, but the rates of false positives and false negatives are not controlled. Many applications demand separate cost sensitive measures, which may be the most fitting. In order to check this hypothesis, in this article we propose a measure in terms of the ratio of false positives to false negatives and obtain results through the total cost for training and testing. Researchers working on similar themes either used only one type of classifier to measure the total cost or considered many types of data sets, as found in the following related works. Hence most of the time one sees the occurrence of false negatives being controlled as if this had no effect on false positives, and vice versa. The notion of a cost ratio defined as a metric, and of measuring its influence on evaluation measures such as accuracy, precision, recall, total cost and likelihood ratios for the classifiers, is rarely found in the literature. Hence we present a novel framework that addresses this issue, the first of its kind.

journal homepage: www.elsevier.com/locate/eswa

https://doi.org/10.1016/j.eswa.2023.120303

Received 31 January 2023; Received in revised form 11 April 2023; Accepted 27 April 2023

The main objective of this paper is to produce a mapping between the cost matrices (as input to the cost sensitive classifiers) and the confusion matrices (as output for extracting the occurrences of false positives and false negatives). The contents below are divided into seven sections. Sections 2 and 3 consist of related works and materials. Sections 4 and 5 describe the data set and the proposed algorithm. Sections 6 and 7 present the design of experiment and performance, followed by the conclusion.

2. Related works

(Thakkar et al., 2022), while predicting customer churn rate, applied the AdaBoost ensemble with cost sensitive classifiers to reduce the false-negative error and the misclassification cost more significantly within an error-based framework. (Mienye & Sun, 2021) investigated the strength of cost-sensitive learning approaches in the context of imbalanced data sets using only conventional machine learning methods. (Telikani et al., 2022) investigated cost sensitive classification by deep learning, partitioning the dataset and the corresponding cost matrix of the components by defining a separate cost function layer. (Thai-Nghe et al., 2010) presented two methods for cost sensitive learning on imbalanced data, using sampling techniques and optimizing the cost ratio locally. However, in many contexts of imbalanced datasets, the misclassification costs cannot be determined completely. The cost-sensitive learning technique takes misclassification costs into account during model construction and does not modify the imbalanced data distribution directly. Assigning distinct costs to the training examples seems to be the most effective approach to the problem of class imbalanced data. (Weiss et al., 2007) showed the dependence of total cost on the cost ratio by varying the ratio uniformly. (Domingos, 1999) shows that the MetaCost procedure helps cost reduction. Here we present a framework that differs from earlier work by allowing multiple classifiers instead of a single classifier: an SVM classifier is used in (Thai-Nghe et al., 2010), and decisions are found in (Peter, 2001). The cost oriented classifiers built so far help to reduce the risk associated with the distribution of false positives in the predictions. In general, methods for measuring the loss due to wrong predictions are of interest, and the loss varies significantly from application to application.

The effect of making false positives varies from one context to another. Catalogue mailing for business promotion may incur a small negative cost in the case of a non-respondent, whereas the cost may be relatively higher when a potential respondent is missed. Many researchers deal with a variety of algorithms (Kubat & Matwin, 1997; Pes & Lai, 2021; Peter, 2001) to increase the accuracy of the evaluated models or to reduce the probability of making wrong predictions. Sampling of training data directly influences the distribution of classes. Learning models obtained from highly biased datasets are incapable of producing fair predictions. Hence either oversampling or undersampling can be used to alter the class distribution (Weiss et al., 2007) of the training data, as found in (Abe et al., 2004; Breiman et al., 2017; Chan & Stolfo, 1998).

In (Weiss et al., 2013) the authors applied a heuristic technique for the relationship between the cost of a false positive and the cost of a false negative, only for attribute selection. This heuristic method works with cooperation from domain experts; in this case the domain expert is the physician for coronary artery disease. They also generated results based on the cost for a subset of features and even the cost for an individual feature (Khan et al., 2018). In our proposed work we consider the cost ratio of false positive to false negative to establish the selection of appropriate cost sensitive classifiers.

Here the relationship (equivalence in the nature of distributions) between class frequencies and misclassification based on cost ratio was established differently in (Ioannidis et al., 2011; Peter, 2001). The frequencies of positive and negative examples may be monitored to make the learning algorithms cost sensitive. The authors in (Kubat & Matwin, 1997) suggested that approximate equality of the different classes must be adopted for better performance. Epidemiology studies in (Ioannidis et al., 2011) consider the ratio of false negatives to false positives for identifying risk factors contributing to causes and effects in preventive health care. The ‘Blackstone’ ratio emphasizes the weight of false negatives and false positives in the criminal justice system, striking an acceptable tradeoff between their costs in terms of reward and punishment. ‘Sentimental exaggerations’, as made in the criminal justice system or in medical diagnosis systems, are reflected as costs of false positives and false negatives in many forms.

The approach addresses the challenge of handling class-imbalanced data, where the minority class holds greater significance than the majority class, a problem that standard machine learning classifiers typically struggle with. To tackle this, correlation based feature selection is utilized as a preprocessing technique to eliminate noisy features and extract the most relevant ones, which gives superior geometric performance (Elkarami et al., 2016).

3. Methods and materials

Cost-sensitive learning, the construction of such classifiers and their parameters are described in the following subsections.

3.1. Cost sensitive classifier

A cost-sensitive classifier refers to a mechanism in machine learning that factors in the costs linked with various forms of classification errors. While conventional classification treats all types of errors (false positives and false negatives) uniformly, the idea in cost-sensitive classification is to assign distinct costs to each type of error. There are several approaches to introducing cost-sensitivity in machine learning models. One way is to adjust the weights of training instances based on the assigned cost of each class. Another approach is to predict the class that minimizes the expected misclassification cost, instead of the most likely class. Using a bagged classifier can enhance the accuracy of probability estimates from the base classifier, leading to improved performance. In cases where the base classifier is unable to handle instance weights and the weights are non-uniform, the data can be re-sampled with replacement based on the weights prior to being fed into the base classifier.
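The second approach described above, predicting the class with the minimum expected misclassification cost, can be sketched as follows. This is an illustrative sketch and not the Weka implementation used in this study; the function name `min_cost_class`, the zero cost assumed for correct predictions, and the example probabilities are assumptions made here.

```python
def min_cost_class(p_positive, cost_fp, cost_fn):
    """Return the label with the lower expected misclassification cost.

    p_positive: estimated probability that the instance is positive.
    cost_fp:    cost of predicting positive when the truth is negative.
    cost_fn:    cost of predicting negative when the truth is positive.
    Correct predictions are assumed to cost nothing (an assumption).
    """
    expected_cost_if_positive = (1.0 - p_positive) * cost_fp
    expected_cost_if_negative = p_positive * cost_fn
    if expected_cost_if_positive < expected_cost_if_negative:
        return "positive"
    return "negative"


# With equal costs the more likely class wins.
print(min_cost_class(0.4, cost_fp=1, cost_fn=1))  # negative
# When a false negative costs four times as much, the same instance is flagged.
print(min_cost_class(0.4, cost_fp=1, cost_fn=4))  # positive
```

Under these assumptions the rule is equivalent to predicting positive whenever the estimated probability exceeds cost_fp / (cost_fp + cost_fn), which is why raising the cost of a false negative lowers the decision threshold.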

3.2. Cost-sensitive learning (CSL)

It is often assumed that the misclassification costs of a classifier are the same for all errors, but in reality this assumption is not always true. For example, in the diagnosis of cancer a missed diagnosis is far more serious than a false alarm, because the patient could lose his or her life due to delayed diagnosis and late treatment (Ioannidis et al., 2011). In our process we have revised the equations and parameters of cost calculation in terms of the materials shown below.

The cost values are tabulated in a cost matrix as in Table 1, which has the same structure as the confusion matrix: a table whose main diagonal entries correspond to instances classified correctly (true as true and false as false) during training and testing respectively. The misclassified entries are in the off-diagonal positions. Table 1 shows the template of the cost matrix according to the confusion matrix structure.

Table 1
Template for Cost Matrix based on confusion matrix.

Actual Class   Predicted Negative         Predicted Positive
Negative       (correct)                  C (false positive, CFP)
Positive       C (false negative, CFN)    (correct)

The above cost matrix helps us to calculate the total cost, and we can vary the non-diagonal elements to study the nature of misclassification. If the misclassification cost, in terms of false positives or false negatives, is known, total cost is chosen in the proposed model as the best metric to evaluate classifier performance, as shown below. We report only total cost as the performance metric for the four cost sensitive learning methods using ‘tree’ type base classifiers.

The cost ratio is defined as

$$\text{Cost ratio} = \frac{\text{Number of false positives}}{\text{Number of false negatives}} \tag{1}$$

The equation below shows the applied Total Cost formula.

$$\text{Total Cost} = (FN \times \mathrm{CFN}) + (FP \times \mathrm{CFP}) \tag{2}$$

where FN is the number of false negatives and FP is the number of false positives, obtained from the confusion matrix generated as output while training and testing are carried out.

CFP: cost of a false positive, denoted by C in the Actual Negative row of Table 1.

CFN: cost of a false negative, denoted by C in the Actual Positive row of Table 1.
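Equations (1) and (2) can be illustrated with a short calculation. The sketch below is a minimal illustration; the function names `cost_ratio` and `total_cost` and the counts used in the example are hypothetical and are not taken from the experiments.

```python
from fractions import Fraction


def cost_ratio(fp, fn):
    """Eq. (1): number of false positives over number of false negatives.

    Returned as an exact rational; fn must be non-zero.
    """
    return Fraction(fp, fn)


def total_cost(fp, fn, cost_fp, cost_fn):
    """Eq. (2): Total Cost = (FN x CFN) + (FP x CFP)."""
    return fn * cost_fn + fp * cost_fp


# Hypothetical confusion-matrix counts, for illustration only.
fp, fn = 30, 20
print(cost_ratio(fp, fn))                        # 3/2
print(total_cost(fp, fn, cost_fp=1, cost_fn=1))  # 50
print(total_cost(fp, fn, cost_fp=1, cost_fn=3))  # 90
```

With unit costs the total cost is simply the number of misclassified instances; raising one of the two costs scales only the corresponding error count.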

3.3. Cost-Sensitive classifiers construction

This section explains the assumptions made to construct an algorithm that keeps FP (false positives) and FN (false negatives) at a minimum level using cost sensitive classifiers. Four different cases are explained below, together with how the generated values are used in the subsequent processes.

The ratios of false positive to false negative are used as inputs for the main algorithm, and the output is extracted from the confusion matrix. The challenge in training through the cost sensitive classifier is to obtain the misclassification cost for the expected classification results.

Here we consider C(i, j), where i and j take the values 1 or 2, which indicates the cost of predicting an instance as belonging to class i while the ground truth is that it belongs to class j. The main algorithm revolves around the ratio C(1,2)/C(2,1) and its inverse, varied by either uniform increments or relatively prime steps, amounting to four different styles. The main objective of this process is to find an acceptable ratio as it varies across different values in these four styles.

The objective is to construct the mapping ζ from Q to R, where the domain is the set of rational numbers and the range is the set of real numbers indicating the associated cost. ζ(x) = c indicates that a ratio FP:FN, x in Q, takes the cost c, where FP and FN are two integers (McCrimmon, 1960; Sagher, 1989; Yu-Ting, 1980). Here we have adopted variations of x in Q distinguished by the four cases described below.

1. Case 1. x is of the form 1/y, where y is a positive integer.

2. Case 2. x is of the form p/q, where p and q are positive integers and gcd(p, q) = 1.

3. Case 3. Reciprocal of x in case 1.

4. Case 4. Reciprocal of x in case 2.

These four cases cover all possible positive x in Q, i.e. values of x (the FP:FN ratio), which is considered based on the number theory results discussed in (Domingos, 1999; Thai-Nghe et al., 2010; Weiss et al., 2007). Using these four cases, the main algorithm is formed from the four components listed in Table 2.
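The four cases can be enumerated with a few lines of code. The sketch below is one possible reading of the scheme: ratios are kept as (numerator, denominator) pairs, the bound of 10 follows the range used later in the text, and restricting case 2 to p < q is an assumption made here so that cases 2 and 4 do not overlap. The function names are hypothetical.

```python
from math import gcd


def uniform_ratios(limit=10):
    """Case 1: ratios of the form 1/y, as (numerator, denominator) pairs."""
    return [(1, y) for y in range(1, limit + 1)]


def coprime_ratios(limit=10):
    """Case 2: ratios p/q with gcd(p, q) = 1.

    Only p < q is generated here (an assumption of this sketch).
    """
    return [(p, q)
            for p in range(1, limit + 1)
            for q in range(p + 1, limit + 1)
            if gcd(p, q) == 1]


def invert(ratios):
    """Cases 3 and 4: the reciprocal of every ratio in the given list."""
    return [(q, p) for p, q in ratios]


print(uniform_ratios()[:3])          # [(1, 1), (1, 2), (1, 3)]
print(invert(uniform_ratios())[:3])  # [(1, 1), (2, 1), (3, 1)]
print((2, 3) in coprime_ratios())    # True
print((2, 4) in coprime_ratios())    # False, since gcd(2, 4) = 2
```

Keeping the pairs unreduced makes the coprimality test explicit: a pair such as 2:4 is rejected rather than silently collapsed to 1:2.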

4. Data collection and preprocessing

The underlying data set was collected from IVF procedures carried out at Prasanth Fertility Chennai Centre, India, and the collection was approved by the Centre’s Review Board. The data were collected between 2016 and 2018 and amount to a sample of 327 patient records with the class distribution 118 negative and 209 positive instances (Hari Priya, 2021). Hence the class ratio of negative to positive is roughly 1:2.

Exclusion criteria are as follows:

a) Women who had gone through ovarian surgery or had any endocrine disorders were not included in the study.

b) Those who could not show ovarian stimulation results controlled within 150 IU/day, with pituitary suppression of FSH 100 IU/day (Recagon), sensed by TVS scan along with serum estradiol measurement as standard methods (Bas-Lando et al., 2017; Muttukrishna et al., 2005; Uyar et al., 2014).

Table 2
Algorithm components based on Cost Ratio.

Ratio Pattern (CFP:CFN)          Uniform   Inverse
Normal                           CSC-U     CSC-UI
Non Uniform (Relatively Prime)   CSC-NU    CSC-NUI

Fig. 1. Attribute distribution chart.

This dataset contains 8 attributes as columns and 327 patients’ records in real time. The distribution of each attribute is shown in Fig. 1.

The attributes Age, Egg Quality, Anti-Müllerian hormone (AMH), Right Ovary (RO), Left Ovary (LO), No of Insemination, No of fertilized and No of eggs are discretized as follows.

Age: The range is divided into five sectors s1, s2, s3, s4 and s5 as s1:20–25, s2:26–30, s3:30–35, s4:36–40 and s5:41–46.

Egg Quality: The range is divided into five sectors s1, s2, s3, s4 and s5 as s1:0.01, s2:0.25, s3:0.5, s4:0.75 and s5:1.

Anti-Müllerian hormone (AMH): The range is divided into five sectors s1, s2, s3, s4 and s5 as s1:0–2.0, s2:2.1–4.0, s3:4.1–6.0, s4:6.1–8.0 and s5:8.1–10.6.

Right Ovary (RO), Left Ovary (LO), No of Insemination and No of fertilized: The range is divided into five sectors s1, s2, s3, s4 and s5 as s1:0–5, s2:6–10, s3:11–15, s4:16–20 and s5:21–38.

No of eggs: The range is divided into five sectors s1, s2, s3, s4 and s5 as s1:1–5, s2:6–10, s3:11–15, s4:16–20 and s5:21–43.
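The discretization of an integer-valued attribute can be sketched as a simple range lookup. The example below uses the sector boundaries stated for No of eggs; the names `EGG_COUNT_SECTORS` and `sector` are hypothetical, and returning `None` for values outside every sector is a choice made for this sketch rather than something the text specifies.

```python
# Sector boundaries for "No of eggs" as listed in the text (inclusive).
EGG_COUNT_SECTORS = [
    ("s1", 1, 5),
    ("s2", 6, 10),
    ("s3", 11, 15),
    ("s4", 16, 20),
    ("s5", 21, 43),
]


def sector(value, sectors):
    """Return the label of the sector containing value, or None if none does."""
    for label, low, high in sectors:
        if low <= value <= high:
            return label
    return None


print(sector(8, EGG_COUNT_SECTORS))   # s2
print(sector(43, EGG_COUNT_SECTORS))  # s5
print(sector(0, EGG_COUNT_SECTORS))   # None
```

The same helper could be reused for the other attributes by passing a different list of boundaries.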

5. Proposed main algorithm

In this article, cost sensitive classifiers are constructed with the help of the algorithm in Fig. 3 to implement the algorithm components described above. To build the cost sensitive classifiers, the Weka tool (Weka, 2021) is used, tuning the cost ratio as planned in Section 3.2. Table 2 describes the four interrelated components of the main algorithm. The prefix CSC stands for Cost Sensitive Classifier, and the suffixes U, UI, NU and NUI denote the uniform, uniform inverse, non-uniform and non-uniform inverse components respectively, as stated in Table 2. The main process and the steps involved in the algorithm are shown in Fig. 2.

Fig. 2. Cost Sensitive classifier based on cost ratio.

Fig. 3. Proposed Algorithm for Cost Ratio based Cost Sensitive Learning.

5.1. Pseudo code for cost sensitive classifier by cost ratio

In Fig. 3 the steps specific to each algorithm component are annotated in ‘[…]’, and the remaining steps are common to all four cases.

Dataset D contains the underlying data instances, here the instances of patient attributes described in Section 4, each carrying a class value of either positive or negative. This dataset contains 327 records and 8 features in each record. Because the context is treated as cost-sensitive, preprocessing such as normalization and attribute selection is not considered, as all attributes carry equal significance. The data collection does not contain any missing or blank data.

In line 2, b denotes any tree classifier, b ∈ {J48, LMT, ADT, Decision Stump}.

Of the two loops, the outer loop iterates over the four tree classifiers and the inner loop iterates over the index i that varies the cost ratio extracted from the cost matrix. After fixing the loop variables for the current iteration, the “TC” procedure call takes care of training, testing and classifying, and finally of computing the total cost involved. The final output is the optimal value among the set of total costs generated by the above iterations.
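The two nested loops and the “TC” call described above can be sketched as follows. This is a hedged outline and not the algorithm of Fig. 3 itself: `run_component`, `evaluate` and `placeholder_evaluate` are hypothetical names, the optimum is taken to be the minimum total cost, and the counts returned by the placeholder are invented purely so that the example runs without a classifier library.

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Eq. (2): Total Cost = (FN x CFN) + (FP x CFP)."""
    return fn * cost_fn + fp * cost_fp


def run_component(classifiers, cost_pairs, evaluate):
    """Outer loop over base classifiers, inner loop over (CFP, CFN) pairs.

    evaluate(name, cost_fp, cost_fn) stands in for training and testing a
    cost sensitive classifier and must return the (FP, FN) counts.
    Returns every total cost and the key of the lowest one.
    """
    results = {}
    for name in classifiers:
        for cost_fp, cost_fn in cost_pairs:
            fp, fn = evaluate(name, cost_fp, cost_fn)
            results[(name, cost_fp, cost_fn)] = total_cost(fp, fn, cost_fp, cost_fn)
    best = min(results, key=results.get)
    return results, best


# Placeholder counts, not results from the study.
PLACEHOLDER_COUNTS = {"J48": (12, 5), "LMT": (9, 7)}


def placeholder_evaluate(name, cost_fp, cost_fn):
    return PLACEHOLDER_COUNTS[name]


results, best = run_component(["J48", "LMT"], [(1, 1), (2, 1)], placeholder_evaluate)
print(best, results[best])  # ('LMT', 1, 1) 16
```

In a real run, `evaluate` would wrap the cost sensitive learner and return counts read from its confusion matrix; only that function changes between base classifiers.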

5.1.1. Component for CSC-U:

To obtain the cost value as discussed earlier, the numerator of the cost ratio is incremented uniformly from 1 to 10 while the denominator is fixed.

5.1.2. Component for CSC-UI:

This is similar to CSC-U, obtained by simply inverting the component for CSC-U, where CFN = i and CFP = 1, i.e. i ∈ {1, 2, …, 10}.

5.1.3. Component for CSC-NU:

The above algorithm is applied with the cost ratio m:n, where m and n are relatively prime and their values lie within the principal range 1 to 10. The only extra complexity of the non-uniform cost algorithm is testing the relative primality condition with the greatest common divisor, that is, gcd(m, n) = 1.

5.1.4. Component for CSC-NUI:

From the ratio calculated as CFP/CFN, we consider its inverse and repeat the classification process to get new results for both the uniform and non-uniform cases, with their indices in non-decreasing order. Hence we obtain the non-uniform inverse cost algorithm, NUI, by simply inverting the loop indices in the algorithm for NU.

Fig. 4. Uniform Variation of Total Cost.

Fig. 5. Uniform Inverted Variation of Total Cost.

5.2. Interpretation of main algorithm

To obtain the list of classifier performance measures such as accuracy, precision, recall, total cost and likelihood ratio, we apply the main algorithm, and each component works according to the range type of the cost ratio. The innermost part is common to all components and consists of the steps for classifying and generating the confusion matrix. The index of the outermost loop varies over the base classifiers (for these experiments we consider only decision trees). The index type of the next inner loop varies for each component. In the first component, CSC-U, the index (cost ratio) varies uniformly by unit increments, whereas in CSC-UI the index is inverted. The two components in the non-uniform case, CSC-NU and CSC-NUI, are similar except that the index is determined by a cost ratio whose numerator and denominator are chosen as coprimes. The inputs of the main algorithm are the dataset and the cost matrix. The entries in the cost matrix are assumed to be integers for the sake of simplicity. The values of the two off-diagonal elements of the cost matrix are restricted to the principal values 1 to 10, and we construct the four types of ratios based on these values. The principal range is chosen firstly because most of the classifiers behave stably in this range, and secondly because, even where that is not the case, larger values lead to extremely long learning times.

The comparisons of the four components of the algorithm are shown through the tables and graphs below.
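The performance measures named above can all be derived from the four cells of a confusion matrix. The sketch below uses the standard diagnostic definitions of the likelihood ratios; the text does not state which definitions were used, so this is an assumption, and the function name `performance` and the example counts are hypothetical.

```python
def performance(tp, fp, tn, fn):
    """Standard measures computed from confusion-matrix counts.

    Assumes every denominator is non-zero (for example, at least one
    false positive, otherwise the positive likelihood ratio is undefined).
    """
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": sensitivity,
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }


# Hypothetical counts, for illustration only.
measures = performance(tp=80, fp=20, tn=60, fn=40)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

For these hypothetical counts the accuracy is 0.700 and the precision is 0.800, while the positive likelihood ratio is about 2.667, which shows that the measures can rank the same classifier quite differently.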

6. Experimental results

We applied tree classifiers as cost sensitive learners on the Weka platform (Weka, 2021) to data from the clinical records of IVF in the database. The best four classifiers, namely J48, ADT, LMT and Decision Stump, were adopted for this test.

Comparing the graphs of CSC-U and CSC-UI in Figs. 4 and 5 shows that the magnitude of the total cost is high when false positives increase in the main algorithm, and that the total cost increases gradually. Choosing the right tree classifier is also important: in CSC-U, J48 produces a higher total cost and Decision Stump a lower total cost, but in CSC-UI the reverse is seen, with J48 producing a lower total cost than Decision Stump (see Tables 3 and 4).

Comparing the graphs of CSC-NU and CSC-NUI in Figs. 6 and 7 likewise shows that the magnitude of the total cost is high when false positives increase in the main algorithm, and that the total cost increases gradually. Again the choice of tree classifier matters: in CSC-NU, LMT produces a higher total cost and Decision Stump a lower total cost, but in CSC-NUI the reverse is seen, with LMT producing a lower total cost than Decision Stump (see Tables 5 and 6).

Table 3
Performance by Total cost based on the ratio (false positive: false negative) in Cost Sensitive Classifiers applying CSC-U.

Total Cost for IVF Cost sensitive Classifiers CSC-U

Cost Ratio   J48   LMT   ADT   Decision Stump
1:1          121   131   107   123
1:2          191   214   184   237
1:3          261   297   261   351
1:4          331   380   338   465
1:5          401   463   415   579
1:6          471   546   492   693
1:7          541   629   569   807
1:8          611   712   646   921
1:9          681   795   723   1035
1:10         751   876   800   1149

Table 4
Performance by Total cost based on the ratio (false positive: false negative) in Cost Sensitive Classifiers applying CSC-UI.

Total Cost for IVF Cost sensitive Classifiers CSC-UI

Cost Ratio   J48   LMT   ADT   Decision Stump
1:1          121   131   107   123
2:1          172   179   137   132
3:1          223   227   167   141
4:1          274   275   197   150
5:1          325   323   228   159
6:1          376   371   257   168
7:1          427   419   287   177
8:1          478   467   317   186
9:1          529   515   347   195
10:1         580   563   377   204

Fig. 6. Non-Uniform Variation of Total Cost.

7. Conclusion

The cost sensitive model for the IVF data set is processed for four different ranges of cost ratio. The results show that the influence of the cost ratio of false positive to false negative varies across the types of tree classifier used here, namely J48, ADT, LMT and Decision Stump, for the IVF dataset. Moreover, the magnitude of the total cost increases gradually as the ratio changes, and based on the obtained results the best classifier can be chosen in future from a better understanding of the classifiers. The limitations are primarily visible in the number of iterations, since the index varies only over the initial segment of integers 1–10. Though this serves the purpose of establishing the existence of the mapping on the cost ratio and demonstrating it, the restriction can be relaxed to a higher number of iterations given computing facilities with more time and space for generating the cost models. Future work can extend the study to other types of cost sensitive meta classifiers to measure the error cost as discussed in this work.

CRediT authorship contribution statement

A. Kumaravel: Conceptualization, Methodology, Writing – original

draft, Visualization, Supervision, Writing – review & editing. T.

Vijayan: Software, Data curation, Investigation, Validation, Project

administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgements

We would like to thank the authorities of Prasanth Fertility Hospital, Chennai, India for allowing the use of the real time data in this research work.

Fig. 7. Non – Uniform Inverted Variation of Total Cost.

Table 5

Performance by Total cost based on the ratio (false positive: false negative) in Cost Sensitive Classifiers applying CSC-NU.

Total Cost for IVF Cost sensitive Classifiers CSC-NU

Cost Ratio J48 LMT ADT Decision Stump

1:1 121 131 107 123

1:3 223 227 167 141

1:4 274 275 197 150

2:3 293 310 244 255

2:5 244 406 304 273

2:7 497 502 364 291

2:9 599 327 424 309

3:4 414 441 351 378

3:7 567 585 441 405

4:5 535 572 458 501

4:7 637 668 518 519

4:9 739 764 578 537

5:6 657 703 565 624

5:7 707 751 595 633

5:8 758 799 625 642

5:9 809 847 655 651

6:7 777 834 672 747

7:8 898 965 779 870

7:9 949 1013 809 879

8:9 1019 1096 886 993

Table 6

Performance by Total cost based on the ratio (false positive: false negative) in Cost Sensitive Classifiers applying CSC-NUI.

Total Cost for IVF Cost sensitive Classifiers CSC-NUI

Cost Ratio J48 LMT ADT Decision Stump

1: 1 121 131 107 123

3: 1 261 297 261 351

3: 2 312 345 291 360

4: 1 331 380 338 465

4: 3 433 476 398 483

5: 2 452 511 445 588

5: 4 554 607 505 606

6: 5 675 738 612 729

7: 2 592 677 599 816

7: 3 643 725 629 825

7: 4 694 773 659 834

7: 5 745 821 689 843

7: 6 796 869 719 852

8: 5 815 904 766 957

8: 7 917 1000 826 975

9: 2 732 843 753 1044

9: 4 834 939 813 1062

9: 5 885 987 843 1071

9: 7 987 1083 903 1089

9: 8 1038 1131 933 1098

References

Abe, N., Zadrozny, B., & Langford, J. (2004). An iterative method for multi-class cost-

sensitive learning. Proceedings of the Tenth ACM SIGKDD International Conference on

Knowledge Discovery and Data Mining. https://doi.org/10.1145/1014052.1014056.

Baldur-Felskov, B., Kjaer, S. K., Albieri, V., Steding-Jessen, M., Kjaer, T., Johansen, C.,

Dalton, S. O., & Jensen, A. (2012). Psychiatric disorders in women with fertility

problems: results from a large Danish register-based cohort study. Human

Reproduction, 28(3), 683–690. https://doi.org/10.1093/humrep/des422

Bas-Lando, M., Rabinowitz, R., Farkash, R., Algur, N., Rubinstein, E., Schonberger, O., &

Eldar-Geva, T. (2017). Prediction value of anti-Mullerian hormone (AMH) serum

levels and antral follicle count (AFC) in hormonal contraceptive (HC) users and non-

HC users undergoing IVF-PGD treatment. Gynecological Endocrinology, 33(10),

797–800. https://doi.org/10.1080/09513590.2017.1320376

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (2017, October 19). Classification And Regression Trees. https://doi.org/10.1201/9781315139470.

Bungum, A. B., Glazer, C. H., Arendt, L. H., Schmidt, L., Pinborg, A., Bonde, J. P., &

Tøttenborg, S. S. (2019). Risk of hospitalization for early onset of cardiovascular

disease among infertile women: a register-based cohort study. Human Reproduction,

34(11), 2274–2281. https://doi.org/10.1093/humrep/dez154

Chan, P. K., & Stolfo, S. (1998). Toward scalable learning with non-uniform class and

cost distributions: A case study in credit card fraud detection. Knowledge Discovery

and Data Mining.

CDC. (2018). 2017 Fertility Clinic Success Rates | Assisted Reproductive Technology (ART) Report | Reproductive Health | CDC. https://www.cdc.gov/art/reports/2017/fertility-clinic.html.

Domingos, P. (1999). MetaCost. Proceedings of the Fifth ACM SIGKDD International

Conference on Knowledge Discovery and Data Mining. https://doi.org/10.1145/

312129.312220.

Elkarami, B., Alkhateeb, A., & Rueda, L. (2016, May). Cost-sensitive classication on

class-balanced ensembles for imbalanced non-coding RNA data. 2016 IEEE EMBS

International Student Conference (ISC). https://doi.org/10.1109/

embsisc.2016.7508607.

Hari Priya, G., et al. (2021). Classiers with synthetic oversampling pre-process for In

Vitro Fertilization predictions. Indian Journal of Computer Science and Engineering, 12

(6), 1532–1541. https://doi.org/10.21817/indjcse/2021/v12i6/211206061.

Ioannidis, J. P. A., Tarone, R., & McLaughlin, J. K. (2011). The False-positive to False-

negative Ratio in Epidemiologic Studies. Epidemiology, 22(4), 450–456. https://doi.

org/10.1097/ede.0b013e31821b506e

McCrimmon, K. (1960). Enumeration of the positive rationals. The American

Mathematical Monthly, 67(9), 868.

Khan, S. H., Hayat, M., Bennamoun, M., Sohel, F. A., & Togneri, R. (2018). Cost-Sensitive

Learning of Deep Feature Representations From Imbalanced Data. IEEE Transactions

on Neural Networks and Learning Systems, 29(8), 3573–3587. https://doi.org/

10.1109/tnnls.2017.2732482

Kubat, M. and Matwin, S. (1997). Addressing the curse of imbalanced training sets: One-

sided selection. Proceedings of the 14th International Conference in Ma chine

Learning, Nashville, 179-186.

Mienye, I. D., & Sun, Y. (2021). Performance analysis of cost-sensitive learning methods

with application to imbalanced medical data. Informatics in Medicine Unlocked, 25,

Article 100690. https://doi.org/10.1016/j.imu.2021.100690

Murugappan, G., Li, S., Lathi, R. B., Baker, V. L., & Eisenberg, M. L. (2019). Increased risk

of incident chronic medical conditions in infertile women: analysis of US claims data.

American Journal of Obstetrics and Gynecology, 220(5), 473.e1–473.e14. https://doi.

org/10.1016/j.ajog.2019.01.214

Muttukrishna, S., McGarrigle, H., Wakim, R., Khadum, I., Ranieri, D., & Serhal, P. (2005).

Antral follicle count, anti-mullerian hormone and inhibin B: predictors of ovarian

response in assisted reproductive technology? BJOG: An International Journal of

Obstetrics & Gynaecology, 112(10), 1384–1390. https://doi.org/10.1111/j.1471-

0528.2005.00670.x

Pes, B., & Lai, G. (2021). Cost-sensitive learning strategies for high-dimensional and

imbalanced data: a comparative study. Peer J Computer Science, 7. https://doi.org/

10.7717/peerj-cs.832

Peter. (2001, August). The foundations of cost-sensitive learning. IJCAI’01: Proceedings of

the 17th International Joint Conference on Articial Intelligence, 2, 973–978. https://

doi.org/10.5555/1642194.1642224.

Pisarska, M. D. (2017, June 28). Fertility Status and Overall Health. PubMed Central

(PMC). https://doi.org/10.1055/s-0037-1603728.

Sadecki, E., Weaver, A., Zhao, Y., Stewart, E. A., & Ainsworth, A. J. (2022). Fertility

trends and comparisons in a historical cohort of US women with primary infertility.

Reproductive Health, 19(1). https://doi.org/10.1186/s12978-021-01313-6

Telikani, A., Gandomi, A. H., Choo, K. K. R., & Shen, J. (2022). A cost-sensitive deep

learning-based approach for network trafc classication. IEEE Transactions on

Network and Service Management, 19(1), 661–670. https://doi.org/10.1109/

tnsm.2021.3112283

Thai-Nghe, N., Gantner, Z., & Schmidt-Thieme, L. (2010, July). Cost-sensitive learning

methods for imbalanced data. The 2010 International Joint Conference on Neural

Networks (IJCNN). https://doi.org/10.1109/ijcnn.2010.5596486.

Thakkar, H. K., Desai, A., Ghosh, S., Singh, P., & Sharma, G. (2022, January 22).

Clairvoyant: AdaBoost with Cost-Enabled Cost-Sensitive Classier for Customer

Churn Prediction. Computational Intelligence and Neuroscience, 2022, 1–11. https://

doi.org/10.1155/2022/9028580.

Thorsted, A., Lauridsen, J., Høyer, B., Arendt, L. H., Bech, B., Toft, G., Hougaard, K.,

Olsen, J., Bonde, J. P., & Ramlau-Hansen, C. (2019). Birth weight for gestational age

and the risk of infertility: a Danish cohort study. Human Reproduction, 35(1),

195–202. https://doi.org/10.1093/humrep/dez232

Uyar, A., Bener, A., & Ciray, H. N. (2014). Predictive modeling of implantation outcome

in an in vitro fertilization setting. Medical Decision Making, 35(6), 714–725. https://

doi.org/10.1177/0272989x14535984

Vander Borght, M., & Wyns, C. (2018). Fertility and infertility: Denition and

epidemiology. Clinical Biochemistry, 62, 2–10. https://doi.org/10.1016/j.

clinbiochem.2018.03.012

Weiss, G. M., McCarthy, K., & Zabar, B. (2007). Cost-sensitive learning vs. sampling:

Which is best for handling unbalanced classes with unequal error costs? DMIN, 7

(35–41), 24.

Weiss, Y., Elovici, Y., & Rokach, L. (2013). February). The CASH algorithm-cost-sensitive

attribute selection using histograms. Information Sciences, 222, 247–268. https://doi.

org/10.1016/j.ins.2011.01.035

Weka (2021). Department of Computer Science: University of Waikato. (n.d.). Department of

Computer Science: University of Waikato. http://www.cs.waikato.ac.nz.

Sagher, Y. (1989). Counting the rationals. Amer. Math. Monthly, 96(9), 823.

Yu-Ting, S. (1980). A “Natural” enumeration of non-negative rational numbers–an

informal discussion. The American Mathematical Monthly, 87(1), 25. https://doi.org/

10.2307/2320374

