
Training deep neural networks on imbalanced data sets

July 2016


DOI:10.1109/IJCNN.2016.7727770

Conference: 2016 International Joint Conference on Neural Networks (IJCNN)

Authors:

Shoujin Wang

University of Technology Sydney


Our proposed MFE and MSFE methods always achieve higher F-measure values than the conventional MSE method under the same loss values on the Household data set with an Imb. level of 10% (only the parts of the three curves under the common loss values are shown).

…

Our proposed MFE and MSFE approaches achieve higher AUC than the MSE approach under the same loss values on the Household data set with an Imb. level of 10% (only the parts of the three curves under the common loss values are shown).

…


Training Deep Neural Networks on Imbalanced Data Sets

Shoujin Wang, Wei Liu, Jia Wu, Longbing Cao, Qinxue Meng, Paul J. Kennedy

Advanced Analytics Institute, University of Technology Sydney, Sydney, Australia

Centre for Quantum Computation & Intelligent Systems, University of Technology Sydney, Sydney, Australia

Email: Shoujin.Wang@student.uts.edu.au, {Wei.Liu, Longbing.Cao}@uts.edu.au, {Jia.Wu, Qinxue.Meng, Paul.Kennedy}@uts.edu.au

Abstract—Deep learning has become increasingly popular in

both academic and industrial areas in the past years. Various

domains including pattern recognition, computer vision, and

natural language processing have witnessed the great power of

deep networks. However, current studies on deep learning

mainly focus on data sets with balanced class labels, while its

performance on imbalanced data is not well examined.

Imbalanced data sets exist widely in the real world and they pose great challenges for classification tasks. In this paper, we focus on the problem of classification using deep networks on imbalanced data sets. Specifically, a novel loss

function called mean false error together with its improved

version mean squared false error are proposed for the training of

deep networks on imbalanced data sets. The proposed method

can effectively capture classification errors from both majority

class and minority class equally. Experiments and comparisons

demonstrate the superiority of the proposed approach compared

with conventional methods in classifying imbalanced data sets on

deep neural networks.

Keywords—deep neural network; loss function; data imbalance

I. INTRODUCTION

Recently, rapid developments in science and technology

have promoted the growth and availability of data at an

explosive rate in various domains. The ever-increasingly large

amount of data and more and more complex data structure lead

us to the so called “big data era”. This brings a great

opportunity for data mining and knowledge discovery and

many challenges as well. A noteworthy challenge is data imbalance. Although more and more raw data is becoming easy to access, much of it has an imbalanced distribution: a few object classes are abundant while others have only limited representation. This is termed the “class-imbalance” problem in the data mining community, and it is inherent in almost all collected data sets [1]. For instance, in clinical diagnostic data, most of the people are healthy while only a small proportion of them are unhealthy. In classification tasks, data sets are usually divided into binary-class data sets and multi-class data sets according to the number of classes. Accordingly, classification can be categorized as binary classification and multi-class classification [2], [3]. This paper mainly focuses on the binary classification problem, and the experimental data sets are binary-class ones (a multi-class problem can generally be transformed into a binary-class one by binarization). For a binary-class data set, we call the data set imbalanced if the minority class is under-represented compared to the majority class, e.g., when the majority class severely outnumbers the minority class [4].

Data imbalance can lead to unexpected mistakes and even

serious consequences in data analysis especially in

classification tasks. This is because the skewed distribution of class instances biases classification algorithms towards the majority class. Therefore, the concepts of the minority class are not learned adequately. As a result, standard classifiers (classifiers that do not consider data imbalance) tend to misclassify minority samples as majority samples when the data is imbalanced, which results in quite poor classification performance. This may carry a heavy price in real life. Considering the above diagnostic data example, the patients constitute the minority class while the healthy persons constitute the majority one. If a patient were misdiagnosed as a healthy person, it would delay the best treatment time and cause significant consequences.

Though data imbalance has been shown to be a serious problem, it is not addressed well in the standard classification algorithms. Most classifiers were designed under the assumption that the data is balanced and evenly distributed across classes. Many efforts have been made to adapt some well-studied classification algorithms to this problem. For example, sampling techniques and cost sensitive methods are broadly applied in SVMs, neural networks and other classifiers to solve the problem of class imbalance from different perspectives. Sampling aims to transform the imbalanced data into balanced data by various sampling techniques, while cost sensitive methods try to make the standard classifiers more sensitive to the minority class by adding different cost factors into the algorithms.

However, in the field of deep learning, very limited work

has been done on this issue to the best of our knowledge. Most

of the existing deep learning algorithms do not take the data

imbalance problem into consideration. As a result, these

algorithms can perform well on the balanced data sets while

their performance cannot be guaranteed on imbalanced data

sets.

In this work, we aim to address the problem of class

imbalance in deep learning. Specifically, different forms of

loss functions are proposed to make the learning algorithms

more sensitive to the minority class and then achieve higher

classification accuracy. Besides that, we also illustrate why we


propose this kind of loss function and how it can outperform

the commonly used loss function in deep learning algorithms.

Currently, mean squared error (MSE) is the most

commonly used loss function in the standard deep learning

algorithms. It works well on balanced data sets while it fails to deal with imbalanced ones. The reason is that MSE captures the errors from an overall perspective: it calculates the loss by first summing up all the errors from the whole data set and then taking the average. This can capture the errors from the majority and minority classes equally when the binary-class data set is balanced.

However, when the data set is imbalanced, the error from the

majority class contributes much more to the loss value than the

error from the minority class. In this way, this loss function is

biased towards majority class and fails to capture the errors

from two classes equally. Further, the algorithms are very

likely to learn biased representative features from the majority

class and then achieve biased classification results.

To make up for this shortcoming of the mean squared error loss function used in deep learning, a new loss function called mean false error (MFE), together with its improved version mean squared false error (MSFE), is proposed. Unlike the MSE loss function, our proposed loss functions can capture the errors from both the majority class and the minority class equally. Specifically, our proposed loss functions first calculate the average error in each class separately and then add them together, which is demonstrated in detail in part III. In this way, each class can contribute to the final loss value equally.

TABLE I. AN EXAMPLE OF CONFUSION MATRIX

                  Predicted P   Predicted N   Total
True class P’     86            4             90
True class N’     5             5             10
Total             91            9

Let’s take the binary classification problem shown in Table I as an example. For the classification problem in Table I, we compute the loss value using MSE, MFE and MSFE respectively as follows. Note that this is just an example of how the three loss values are calculated; the formal definitions of these loss functions are given in part III. Note also that in this binary classification problem, the error of a sample is 0 if the sample is predicted correctly and 1 otherwise.

$$l_{MSE} = \frac{4+5}{90+10} = 0.09 \tag{1.1}$$

$$l_{MFE} = \frac{5}{10} + \frac{4}{90} = 0.54 \tag{1.2}$$

$$l_{MSFE} = \left(\frac{5}{10}\right)^{2} + \left(\frac{4}{90}\right)^{2} = 0.25 \tag{1.3}$$

From Table I, it is quite clear that the overall classification accuracy is (86+5)/(90+10)=91%. However, different loss values are obtained when different kinds of loss functions are used, as shown in Eq. (1.1) to Eq. (1.3). In addition, the loss values computed using our proposed MFE and MSFE loss functions are much larger than that of MSE. This means that a higher loss value is obtained when MFE (MSFE) is used as the loss function instead of MSE under the same classification accuracy. In other words, under the same loss value, a higher classification accuracy can be achieved on imbalanced data sets when MFE (MSFE) is used as the loss function rather than MSE. This example illustrates how our proposed loss functions can outperform the commonly used MSE on imbalanced data sets. It should be noted that only the advantages of MFE and MSFE over MSE are illustrated here; the reason why MSFE is introduced as an improved version of MFE will be given in part III.
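The arithmetic behind Eq. (1.1) to Eq. (1.3) can be reproduced with a short Python sketch. The function name `losses_from_confusion` and its argument names are illustrative and do not come from the paper; the sketch assumes, as in the example, that every misclassified sample contributes an error of 1 and every correct one an error of 0.

```python
def losses_from_confusion(fn_count, fp_count, n_pos, n_neg):
    """Return (MSE, MFE, MSFE) for 0/1 per-sample errors.

    fn_count: misclassified samples of the true class P'
    fp_count: misclassified samples of the true class N'
    """
    mse = (fn_count + fp_count) / (n_pos + n_neg)  # one average over all samples
    fpe = fp_count / n_neg                         # mean error inside class N'
    fne = fn_count / n_pos                         # mean error inside class P'
    return mse, fpe + fne, fpe ** 2 + fne ** 2


# Counts read from the confusion matrix example: 4 of 90 and 5 of 10 are wrong.
mse, mfe, msfe = losses_from_confusion(fn_count=4, fp_count=5, n_pos=90, n_neg=10)
print(round(mse, 2), round(mfe, 2), round(msfe, 2))  # 0.09 0.54 0.25
```

The per-class averages are what make the five errors in the small class weigh as much as they do: they account for 0.5 of the MFE value, while in MSE they are diluted across all 100 samples.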

The contributions of this paper are summarized as:

(1). Two novel loss functions are proposed to solve the data imbalance problem in deep networks.

(2). The advantages of these proposed loss functions over

the commonly used MSE are analyzed theoretically.

(3). The effect of these proposed loss functions on the back-

propagation process of deep learning is analyzed by examining

relations for propagated gradients.

(4). Empirical study on real world data sets is conducted to

validate the effectiveness of our proposed loss functions.

The remaining parts of this paper are organized as follows. In part II, we review the previous studies addressing data imbalance. Problem formulation and statement follow in part III, and a brief introduction of DNN is given in part IV. Part V describes the experiments of applying our proposed loss functions on real world data sets. Finally, the paper concludes in part VI.

II. RELATED WORK

How to deal with imbalanced data sets is a key issue in classification and it has been well explored during the past decades. Until now, this issue has been addressed mainly in three ways: sampling techniques, cost sensitive methods and hybrid methods combining these two. This section reviews these three mainstream methods and then gives a special review of the imbalance problem in the neural network field, which is generally thought to be the ancestor of deep learning and deep neural networks.

A. Sampling technique

Sampling is regarded as a pre-processing technique, as it deals with the data imbalance problem from the perspective of the data itself. Specifically, it tries to provide a balanced distribution by transforming the imbalanced data into balanced data and then works with classification algorithms to get results. Various sampling techniques have been proposed from different perspectives. Random oversampling [5], [6] is one of the simplest sampling methods. It randomly duplicates a certain number of samples from the minority class and then adds them to the original data set. Conversely, random under-sampling removes a certain number of instances from the majority class to achieve a balanced data set. Although these sampling techniques are easy to implement and effective, they may bring some problems. For example, random oversampling may lead to overfitting while random under-sampling may lose some important information. To


avoid these potential issues, more sophisticated sampling methods have been proposed. Specifically, the synthetic minority oversampling technique (SMOTE) has proven to be quite powerful and has achieved a great deal of success in various applications [7], [8]. SMOTE creates artificial data based on the similarities between existing minority samples. Although SMOTE has shown many promising benefits, some drawbacks still exist, such as over-generalization and variance [9], [10].
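As a rough illustration of the three sampling ideas reviewed above, the following Python sketch uses only the standard library. All function names are hypothetical. The `interpolate` helper shows only the interpolation step that SMOTE-style methods build on; it assumes the two minority samples have already been chosen and omits the nearest-neighbour search of the full technique.

```python
import random


def random_oversample(minority, majority, seed=0):
    """Duplicate randomly chosen minority samples until both classes match in size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, majority


def random_undersample(minority, majority, seed=0):
    """Keep a random subset of the majority class, as large as the minority class."""
    rng = random.Random(seed)
    return minority, rng.sample(majority, len(minority))


def interpolate(a, b, rng):
    """Create a synthetic point on the segment between minority samples a and b."""
    t = rng.random()
    return [x + t * (y - x) for x, y in zip(a, b)]


minority = [[0.0, 0.0], [1.0, 1.0]]
majority = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0], [7.0, 5.0], [5.0, 7.0]]
over_min, over_maj = random_oversample(minority, majority)
under_min, under_maj = random_undersample(minority, majority)
synthetic = interpolate(minority[0], minority[1], random.Random(0))
print(len(over_min), len(over_maj), len(under_min), len(under_maj))  # 6 6 2 2
```

Oversampling repeats existing points exactly, which is where the overfitting risk comes from, while under-sampling here discards four of the six majority points.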

B. Cost sensitive learning

In addition to sampling techniques, another way to deal with the data imbalance problem is cost sensitive learning, which targets the problem from the algorithm perspective. In contrast to sampling methods, cost sensitive learning methods solve the data imbalance problem by considering the cost associated with misclassifying samples [11]. In particular, they assign different cost values to different misclassifications. For instance, the cost of misclassifying a patient as a healthy person would be much higher than the opposite. This is because the former may lose the chance of the best treatment, or even the patient's life, while the latter just leads to more examinations. Typically, in binary classification, the cost is zero for correct classification of either class, and the cost of misclassifying the minority class is higher than that of misclassifying the majority class. An objective function for cost sensitive learning can be constructed by aggregating the overall cost on the whole training set. An optimal classifier can be learned by minimizing the objective function [12], [13], [23]. Though cost sensitive algorithms can significantly improve the classification performance, they are only applicable when the specific cost values of misclassification are known. Unfortunately, in many cases an explicit description of the cost is hard to define; instead, only an informal assertion is known, like “the cost of misclassification of minority samples is higher than the contrary situation” [14]. In addition, it would be quite challenging and even impossible to determine the cost of misclassification in some particular domains [15].
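A minimal sketch of such an aggregated objective is given below. The cost values are invented purely for illustration (the paper does not specify any), and `total_cost` is a hypothetical helper name; label 1 stands for the minority class.

```python
def total_cost(true_labels, predicted_labels, cost):
    """Sum the misclassification cost over a labelled set.

    cost maps (true_label, predicted_label) to a non-negative number.
    """
    return sum(cost[(t, p)] for t, p in zip(true_labels, predicted_labels))


# Hypothetical cost matrix: correct predictions cost nothing, and missing a
# minority sample (true 1, predicted 0) costs more than the opposite mistake.
cost = {(0, 0): 0.0, (1, 1): 0.0, (1, 0): 5.0, (0, 1): 1.0}

true_labels = [1, 1, 0, 0]
predicted = [1, 0, 0, 1]
print(total_cost(true_labels, predicted, cost))  # 6.0
```

The sketch also makes the limitation discussed above concrete: the objective is only defined once the numbers in `cost` have been chosen.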

C. Imbalance problem in neural network

In the area of neural networks, many efforts have been made to address the data imbalance problem. Nearly all of the work falls into the three main streams of solutions to the imbalance problem mentioned above. In particular, it consists of specific implementations of sampling methods, cost sensitive methods or their combinations on neural networks, though the details may differ. Kukar presented a few different approaches for cost-sensitive modifications of the back-propagation learning algorithm for multilayered feed-forward neural networks. He described four approaches to learning cost sensitive neural networks by adding cost factors to different parts of the original back-propagation algorithm. As a result, cost-sensitive classification, adapting the output of the network, adapting the learning rate and minimizing the misclassification costs are proposed [16]. Zhou empirically

studied the effect of sampling and threshold-moving in training

cost-sensitive neural networks. Both oversampling and under

sampling techniques are used to modify the distribution of the

training data set. Threshold-moving tries to move the output

threshold toward inexpensive classes such that examples with

higher costs become harder to misclassify [17]. Other similar work on this issue includes [18], [19], [20], [24]. Although some work has been done to solve the data imbalance problem in neural networks, very little literature on the imbalance problem in deep networks has appeared so far. How to tackle the data imbalance problem in the learning process of deep neural networks is of great value to explore. In particular, it can broaden the range of applications of the powerful deep neural network, making it work well not only on balanced data but also on imbalanced data.

III. PROBLEM FORMULATION

We address the data imbalance problem during the training of deep neural networks (DNNs). Specifically, we mainly focus on the loss function.

Generally, an error function expressed as the loss over the

whole training set is introduced in the training process of the

DNN. A set of optimal parameters for a DNN is achieved by

minimizing errors in training the DNN iteratively. A general

form of error function is given in equation (3.1):

$$E(\theta) = l\left(\mathbf{d}^{(i)}, \mathbf{y}_{\theta}^{(i)}\right), \tag{3.1}$$

where the predicted output of the $i$-th object, $\mathbf{y}_{\theta}^{(i)}$, is parameterized by the weights and biases $\theta$ of the network. For simplicity, we will just denote $\mathbf{y}_{\theta}^{(i)}$ as $\mathbf{y}^{(i)}$ or $\mathbf{y}$ in the following discussions. $l$ denotes a kind of loss function. $\mathbf{d}^{(i)} \in \{0,1\}^{1 \times n}$ is the desired output with the constraint $\sum_{n} d_{n} := 1$, and $n$ is the total number of neurons in the output layer, which is equal to the number of classes. In this work, we only consider the binary classification problem, so $n=2$. Note that the value of $E(\theta)$ is higher when the model performs poorly on the training data set. The learning algorithm aims to find the optimal parameter ($\theta^{*}$) which brings the minimum possible error value $E^{*}(\theta)$. Therefore, the optimization objective is expressed as:

$$\theta^{*} = \arg\min_{\theta} E(\theta). \tag{3.2}$$

The loss function ݈ሺήሻ in Eq. (3.1) can be in many different

forms, such as the Mean Squared Error (MSE) or Cross

Entropy (CE) loss. Out of the various forms of loss function,

MSE is the most widely used in the literature. Next, we will

first give a brief introduction to the commonly used loss function and then propose two novel loss functions which target imbalanced data sets.

A. MSE loss:

This kind of loss function minimizes the squared error between the predicted output and the ground-truth and can be expressed as follows:

$$l = \frac{1}{M} \sum_{i} \sum_{n} \frac{1}{2} \left(d_{n}^{(i)} - y_{n}^{(i)}\right)^{2} \tag{3.3}$$

where $M$ is the total number of samples and $d_{n}^{(i)}$ represents the desired value of the $i$-th sample on the $n$-th neuron while $y_{n}^{(i)}$ is the corresponding predicted value. For instance, in the scenario of binary classification, if the 4th sample actually belongs to the second class while it is incorrectly predicted as the first class, then the label vector and prediction vector for this sample are $\mathbf{d}^{(4)} = [0,1]^{T}$ and $\mathbf{y}^{(4)} = [1,0]^{T}$ respectively. Further, we have $d_{1}^{(4)} = 0$ and $d_{2}^{(4)} = 1$ while $y_{1}^{(4)} = 1$ and $y_{2}^{(4)} = 0$. So the error of this sample is 1/2*((0-1)^2+(1-0)^2)=1. Consequently, the total error of a collection of samples predicted by a classifier is the number of incorrectly predicted samples in a binary classification problem, which can be seen from Eq. (1.1).
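The worked example for the 4th sample can be checked with a few lines of Python. This is a sketch with illustrative names (`sample_error`, `mse_loss`), assuming hard 0/1 prediction vectors as in the example.

```python
def sample_error(d, y):
    """Per-sample squared error: the inner term of Eq. (3.3)."""
    return 0.5 * sum((dn - yn) ** 2 for dn, yn in zip(d, y))


def mse_loss(D, Y):
    """Average of the per-sample errors over all M samples, as in Eq. (3.3)."""
    return sum(sample_error(d, y) for d, y in zip(D, Y)) / len(D)


# Label [0, 1] predicted as [1, 0]: a misclassified sample has error 1.
print(sample_error([0, 1], [1, 0]))  # 1.0
# A correctly classified sample has error 0.
print(sample_error([0, 1], [0, 1]))  # 0.0
```

With hard predictions, `mse_loss` therefore reduces to the fraction of misclassified samples, which is exactly how Eq. (1.1) was computed.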

In addition, $y_{n}^{(i)}$ can be expressed as a function of the output of the previous layer $o_{n}^{(i)}$ using the logistic function [16]:

$$y_{n}^{(i)} = \frac{1}{1 + \exp\left(-o_{n}^{(i)}\right)} \tag{3.4}$$

B. MFE loss:

Now we introduce the Mean False Error loss function proposed by our research. The concept “false error” is inspired by the concepts “false positive rate” and “false negative rate” in the confusion matrix and it is defined with the consideration of both false positive error and false negative error. This kind of loss function is designed to improve the classification performance on imbalanced data sets. Specifically, it makes the loss more sensitive to the errors from the minority class compared with the commonly used MSE loss by computing the errors on different classes separately. Formally,

$$FPE = \frac{1}{N} \sum_{i=1}^{N} \sum_{n} \frac{1}{2} \left(d_{n}^{(i)} - y_{n}^{(i)}\right)^{2} \tag{3.5}$$

$$FNE = \frac{1}{P} \sum_{i=1}^{P} \sum_{n} \frac{1}{2} \left(d_{n}^{(i)} - y_{n}^{(i)}\right)^{2} \tag{3.6}$$

$$l' = FPE + FNE \tag{3.7}$$

where FPE and FNE are the mean false positive error and the mean false negative error respectively; they capture the error on the negative class and the positive class correspondingly. The loss $l'$ is defined as the sum of the mean errors from the two classes, as given in Eq. (3.7). $N$ and $P$ are the numbers of samples in the negative class and the positive class respectively. A specific example of calculating $l'$ is given in Eq. (1.2), where the term $\sum_{n} \frac{1}{2} \left(d_{n}^{(i)} - y_{n}^{(i)}\right)^{2}$ is simplified to 1 for each misclassified sample, based on the computation in part III.A. In imbalanced classification problems, researchers usually care more about the classification accuracy of the minority class. Thereby the minority class is treated as the positive class in most works. Without loss of generality, we also let the minority class be the positive class in this work.

Note that only the form of the loss function is redefined in our work compared to the traditional deep network. Therefore, $d_{n}^{(i)}$ and $y_{n}^{(i)}$ have the same meanings as they do in the MSE scenario, and $y_{n}^{(i)}$ here is still computed using Eq. (3.4).
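The MFE definition in Eq. (3.5) to Eq. (3.7) can be sketched in plain Python as follows. The function names and the `is_positive` flag list are illustrative choices, not part of the paper, and the sample data simply re-encodes the confusion matrix of Table I with hard 0/1 outputs.

```python
def sample_error(d, y):
    """Per-sample squared error shared by MSE, FPE and FNE."""
    return 0.5 * sum((dn - yn) ** 2 for dn, yn in zip(d, y))


def mfe_loss(D, Y, is_positive):
    """Mean false error: mean error on negatives (FPE) plus mean error on positives (FNE)."""
    neg = [sample_error(d, y) for d, y, p in zip(D, Y, is_positive) if not p]
    pos = [sample_error(d, y) for d, y, p in zip(D, Y, is_positive) if p]
    fpe = sum(neg) / len(neg)  # Eq. (3.5), averaged over the N negative samples
    fne = sum(pos) / len(pos)  # Eq. (3.6), averaged over the P positive samples
    return fpe + fne           # Eq. (3.7)


# Assumed encoding: class P' is [1, 0] and class N' is [0, 1].
POS_VEC, NEG_VEC = [1, 0], [0, 1]
D = [POS_VEC] * 90 + [NEG_VEC] * 10
Y = [POS_VEC] * 86 + [NEG_VEC] * 4 + [NEG_VEC] * 5 + [POS_VEC] * 5
is_positive = [True] * 90 + [False] * 10
print(round(mfe_loss(D, Y, is_positive), 2))  # 0.54
```

The result matches Eq. (1.2). The sketch requires at least one sample in each class; an empty class would cause a division by zero.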

C. MSFE loss:

The Mean Squared False Error loss function is designed to improve the performance of the MFE loss defined before. Firstly, it calculates the FPE and FNE values using Eq. (3.5) and Eq. (3.6). Then another function rather than Eq. (3.7) is used to integrate FPE and FNE. Formally,

$$l'' = FPE^{2} + FNE^{2} \tag{3.8}$$

A specific example of calculating $l''$ is given in Eq. (1.3). The reason why MSFE can improve on the performance of MFE is explained here. In the MFE scenario, minimizing the loss only guarantees the minimization of the sum of FPE and FNE, which is not enough to guarantee high classification accuracy on the positive class. To achieve high accuracy on the positive class, the false negative error should be quite low. However, on imbalanced data sets, FPE tends to contribute much more than FNE to their sum (the MFE loss), because the negative class has many more samples than the positive class. As a result, the MFE loss is not sensitive to the error on the positive class, and the minimization of MFE cannot guarantee high accuracy on the positive class. MSFE can solve this problem effectively. Importantly, the loss function in MSFE can be expressed as follows:

$$l'' = FPE^{2} + FNE^{2} = \frac{1}{2}\left((FPE + FNE)^{2} + (FPE - FNE)^{2}\right) \tag{3.9}$$

So minimizing MSFE actually minimizes $(FPE + FNE)^{2}$ and $(FPE - FNE)^{2}$ at the same time. In this way, the minimization operation in the algorithm is able to find a minimal sum of FPE and FNE and a minimal difference between them. In other words, the errors on both the positive class and the negative class are minimized at the same time, which balances the accuracy on the two classes while keeping high accuracy on the positive class.
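For completeness, the identity in Eq. (3.9) follows from expanding the two squares; the cross terms cancel. The numerical check at the end uses the values from the example in Eq. (1.3), rounded to four decimal places.

```latex
\begin{align*}
(FPE + FNE)^{2} + (FPE - FNE)^{2}
  &= \left(FPE^{2} + 2\,FPE \cdot FNE + FNE^{2}\right)
   + \left(FPE^{2} - 2\,FPE \cdot FNE + FNE^{2}\right) \\
  &= 2\,FPE^{2} + 2\,FNE^{2}, \\
\tfrac{1}{2}\left((FPE + FNE)^{2} + (FPE - FNE)^{2}\right)
  &= FPE^{2} + FNE^{2}. \\[6pt]
\text{With } FPE = \tfrac{5}{10},\ FNE = \tfrac{4}{90}: \quad
\tfrac{1}{2}\left(0.5444^{2} + 0.4556^{2}\right)
  &\approx \tfrac{1}{2}\left(0.2964 + 0.2075\right) \approx 0.2520 .
\end{align*}
```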

Different loss functions lead to different gradient

computations in the back-propagation algorithms. Next, we

discuss the gradient computations when using the above

different loss functions.

D. MSE loss back-propagation:

During the supervised training process, the loss function minimizes the difference between the predicted outputs $\mathbf{y}^{(i)}$ and the ground-truth labels $\mathbf{d}^{(i)}$ across the entire training data set (Eq. (3.5)). For the MSE loss, the gradient at each neuron in the output layer can be derived from Eq. (3.3) as follows:

$$\frac{\partial l\left(\mathbf{d}^{(i)}, \mathbf{y}^{(i)}\right)}{\partial o_{n}^{(i)}} = -\left(d_{n}^{(i)} - y_{n}^{(i)}\right) \frac{\partial y_{n}^{(i)}}{\partial o_{n}^{(i)}} \tag{3.10}$$

Based on Eq. (3.4), the derivative of $y_{n}^{(i)}$ with respect to $o_{n}^{(i)}$ is:

$$\frac{\partial y_{n}^{(i)}}{\partial o_{n}^{(i)}} = y_{n}^{(i)} \left(1 - y_{n}^{(i)}\right) \tag{3.11}$$

The derivative of the loss function in the output layer with respect to the output of the previous layer is therefore given by Eq. (3.12):

$$\frac{\partial l\left(\mathbf{d}^{(i)}, \mathbf{y}^{(i)}\right)}{\partial o_{n}^{(i)}} = -\left(d_{n}^{(i)} - y_{n}^{(i)}\right) y_{n}^{(i)} \left(1 - y_{n}^{(i)}\right) \tag{3.12}$$

E. MFE loss back-propagation:

For the MFE loss given in Eq. (3.5) to Eq. (3.7), the derivative can be calculated at each neuron in the output layer as follows:

$$\frac{\partial l\left(\mathbf{d}^{(i)}, \mathbf{y}^{(i)}\right)}{\partial o_{n}^{(i)}} = \frac{\partial FPE}{\partial o_{n}^{(i)}} + \frac{\partial FNE}{\partial o_{n}^{(i)}} = -\frac{1}{N}\left(d_{n}^{(i)} - y_{n}^{(i)}\right) \frac{\partial y_{n}^{(i)}}{\partial o_{n}^{(i)}} - \frac{1}{P}\left(d_{n}^{(i)} - y_{n}^{(i)}\right) \frac{\partial y_{n}^{(i)}}{\partial o_{n}^{(i)}} \tag{3.13}$$

Substituting Eq. (3.11) into Eq. (3.13), we get the derivative of the MFE loss with respect to the output of the previous layer:

$$\frac{\partial l\left(\mathbf{d}^{(i)}, \mathbf{y}^{(i)}\right)}{\partial o_{n}^{(i)}} = -\frac{1}{N}\left(d_{n}^{(i)} - y_{n}^{(i)}\right) y_{n}^{(i)} \left(1 - y_{n}^{(i)}\right), \quad (i \in \mathbf{N}) \tag{3.14}$$

$$\frac{\partial l\left(\mathbf{d}^{(i)}, \mathbf{y}^{(i)}\right)}{\partial o_{n}^{(i)}} = -\frac{1}{P}\left(d_{n}^{(i)} - y_{n}^{(i)}\right) y_{n}^{(i)} \left(1 - y_{n}^{(i)}\right), \quad (i \in \mathbf{P}) \tag{3.15}$$

where $N$ and $P$ are the numbers of samples in the negative class and the positive class respectively. $\mathbf{N}$ and $\mathbf{P}$ are the negative sample set and the positive sample set respectively. Specifically, we use different derivatives for samples from each class. Eq. (3.14) is used when the sample is from the negative class while Eq. (3.15) is used when it belongs to the positive class.
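The class-dependent derivatives in Eq. (3.14) and Eq. (3.15) can be checked numerically. The sketch below is not the authors' implementation: the function names, the `is_positive` flags and the toy pre-activation values `O` are assumptions made for illustration. It computes the MFE loss from the pre-activations through the logistic function of Eq. (3.4), and the analytic gradient can then be compared with a finite-difference estimate.

```python
import math


def sigmoid(o):
    """Logistic function of Eq. (3.4)."""
    return 1.0 / (1.0 + math.exp(-o))


def mfe(O, D, is_positive):
    """MFE loss (FPE + FNE) as a function of the pre-activations O."""
    n_pos = is_positive.count(True)
    n_neg = is_positive.count(False)
    loss = 0.0
    for o, d, p in zip(O, D, is_positive):
        err = 0.5 * sum((dn - sigmoid(on)) ** 2 for on, dn in zip(o, d))
        loss += err / (n_pos if p else n_neg)
    return loss


def mfe_grad(O, D, is_positive):
    """Analytic gradient: Eq. (3.14) for negative samples, Eq. (3.15) for positive ones."""
    n_pos = is_positive.count(True)
    n_neg = is_positive.count(False)
    grads = []
    for o, d, p in zip(O, D, is_positive):
        size = n_pos if p else n_neg
        row = []
        for on, dn in zip(o, d):
            y = sigmoid(on)
            row.append(-(dn - y) * y * (1.0 - y) / size)
        grads.append(row)
    return grads


# Toy data: two negative samples and one positive sample, two output neurons each.
O = [[0.3, -0.2], [1.0, 0.5], [-0.7, 0.1]]
D = [[1, 0], [1, 0], [0, 1]]
is_positive = [False, False, True]
grad = mfe_grad(O, D, is_positive)
```

Because the positive class here has a single sample, its gradient is scaled by 1/1 while each negative sample's gradient is scaled by 1/2; this is the mechanism by which samples of the smaller class receive larger updates.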

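F. MSFE loss back-propagation:

For the MSFE loss given in Eq. (3.8), the derivative can be calculated at each neuron in the output layer as follows:

$$\frac{\partial l\left(\mathbf{d}^{(i)}, \mathbf{y}^{(i)}\right)}{\partial o_{n}^{(i)}} = 2\,FPE \cdot \frac{\partial FPE}{\partial o_{n}^{(i)}} + 2\,FNE \cdot \frac{\partial FNE}{\partial o_{n}^{(i)}} \tag{3.16}$$

where $\frac{\partial FPE}{\partial o_{n}^{(i)}}$ and $\frac{\partial FNE}{\partial o_{n}^{(i)}}$ have been computed in Eq. (3.13). Substituting them into Eq. (3.16), the derivatives at each neuron for the different classes can be given as:

$$\frac{\partial l\left(\mathbf{d}^{(i)}, \mathbf{y}^{(i)}\right)}{\partial o_{n}^{(i)}} = -\frac{2\,FPE}{N}\left(d_{n}^{(i)} - y_{n}^{(i)}\right) y_{n}^{(i)} \left(1 - y_{n}^{(i)}\right), \quad (i \in \mathbf{N}) \tag{3.17}$$

$$\frac{\partial l\left(\mathbf{d}^{(i)}, \mathbf{y}^{(i)}\right)}{\partial o_{n}^{(i)}} = -\frac{2\,FNE}{P}\left(d_{n}^{(i)} - y_{n}^{(i)}\right) y_{n}^{(i)} \left(1 - y_{n}^{(i)}\right), \quad (i \in \mathbf{P}) \tag{3.18}$$

where $N$ and $P$ together with $\mathbf{N}$ and $\mathbf{P}$ have the same meanings as in the MFE loss. Similarly, for the samples from different classes, different derivatives are used in the training process.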
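The MSFE derivatives in Eq. (3.17) and Eq. (3.18) differ from the MFE case only by the factors 2·FPE and 2·FNE, which can be seen in the following sketch. As before, all names and the toy values are illustrative assumptions rather than the authors' code.

```python
import math


def sigmoid(o):
    """Logistic function of Eq. (3.4)."""
    return 1.0 / (1.0 + math.exp(-o))


def false_errors(O, D, is_positive):
    """Return (FPE, FNE) computed from the pre-activations O."""
    n_pos = is_positive.count(True)
    n_neg = is_positive.count(False)
    fpe = fne = 0.0
    for o, d, p in zip(O, D, is_positive):
        err = 0.5 * sum((dn - sigmoid(on)) ** 2 for on, dn in zip(o, d))
        if p:
            fne += err / n_pos
        else:
            fpe += err / n_neg
    return fpe, fne


def msfe(O, D, is_positive):
    """MSFE loss of Eq. (3.8)."""
    fpe, fne = false_errors(O, D, is_positive)
    return fpe ** 2 + fne ** 2


def msfe_grad(O, D, is_positive):
    """Analytic gradient: Eq. (3.17) for negative samples, Eq. (3.18) for positive ones."""
    fpe, fne = false_errors(O, D, is_positive)
    n_pos = is_positive.count(True)
    n_neg = is_positive.count(False)
    grads = []
    for o, d, p in zip(O, D, is_positive):
        scale = 2.0 * fne / n_pos if p else 2.0 * fpe / n_neg
        row = []
        for on, dn in zip(o, d):
            y = sigmoid(on)
            row.append(-scale * (dn - y) * y * (1.0 - y))
        grads.append(row)
    return grads


# Toy data: two negative samples and one positive sample, two output neurons each.
O = [[0.3, -0.2], [1.0, 0.5], [-0.7, 0.1]]
D = [[1, 0], [1, 0], [0, 1]]
is_positive = [False, False, True]
grad = msfe_grad(O, D, is_positive)
```

Since each class's gradient is multiplied by that class's current mean error, the class that is currently classified worse receives proportionally larger updates.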

IV. DEEP NEURAL NETWORK

We use deep neural network (DNN) to learn the feature

representation from the imbalanced and high dimensional data

sets for classification tasks. Specifically, here DNN refers to neural networks with multiple hidden layers. With multiple layers, a DNN has a strong generalization and feature extraction ability, especially for high dimensional data sets.

The structure of the network used in this work is similar to the

classical deep neural network illustrated in [21] except that the

proposed loss layer is more sensitive to imbalanced data sets

using our proposed loss functions. Note that the DNN in our

work is trained with MFE loss and MSFE loss proposed by us

in Eq. (3.5) to Eq. (3.8) while DNN trained with MSE loss will

be used as a baseline in our experiment.

How to determine network structure parameters such as the number of layers and the number of neurons in each layer is a difficult problem in the training of deep networks, and it is out of the scope of this work. In our work, different numbers of layers and neurons for the DNN (using the MSE loss function) are tried on each data set. The parameters which make the network achieve the best classification performance are chosen to build the network. For example, for the Household data set used in our experiment, a DNN with MSE as the loss function is built to decide the network structure. Specifically, we first use one hidden layer to test the classification performance of the DNN on that data set and then increase to two hidden layers, three hidden layers or more. Similarly, once the number of hidden layers is chosen, different numbers of neurons on those hidden layers are examined on the same data set until the best classification performance is obtained. Using this heuristic approach, the structure of the DNN with the best performance is chosen for each specific data set in our experiment. It should be noted that, when the number of layers increases, the classification performance first increases to a peak point and then decreases. The same holds for the number of neurons. The specific settings are shown in Table II.

TABLE II. DNN PARAMETER SETTING

Data set         Number of Hidden Layers   Number of Neurons on Hidden Layers (from bottom to up)
Household        3                         1000, 300, 100
Tree 1           3                         1000, 100, 10
Tree 2           3                         1000, 100, 10
Doc. 1, Doc. 2                             3000, 1000, 300, 100, 30, 10
Doc. 3, Doc. 4                             3000, 1000, 300, 100, 30, 10; 3000, 1500, 800, 400, 200, 50
Doc. 5           6                         3000, 1500, 800, 400, 200, 50

V. EXPERIMENTS AND RESULTS

In this section, we evaluate the effectiveness of our proposed loss functions on 8 imbalanced data sets, of which three are image data sets extracted from the CIFAR-100 data set and five are document data sets extracted from the 20 Newsgroups data set. All of them are high dimensional: the image data sets have 3072 dimensions, while the document data sets have 11669 dimensions extracted from the original 66861 dimensions. Each data set contains a different number of samples and is split into a training set and a testing set. The deep neural networks (DNNs) are first trained on the training set and then tested on the testing set in terms of their classification performance. To test the classification performance of our proposed methods under different imbalance degrees, the DNNs are trained and tested with each data set at different levels of imbalance. The details of the data sets and experimental settings are explained in the next section.

A. Data sets and experimental settings

Image Classification: CIFAR-100 contains 60,000 images

belonging to 100 classes (600 images/class) which are further

divided into 20 superclasses. The standard train/test split for

each class is 500/100 images. To evaluate our algorithm on data sets of various scales, three data sets of different sizes are extracted from this data set. The first one is relatively large; it is the mixture of the two superclasses household furniture and household electrical devices, and is denoted as Household in the experiment. The other two are smaller and of similar size; each is the combination of two classes randomly selected from the superclass trees. Specifically, one is the mixture of maple tree and oak tree and the other is the mixture of maple tree and palm tree. These two data sets are denoted as Tree 1 and Tree 2 respectively in the experiment. To imbalance the data distribution to different degrees, we reduce the representation of one of the two classes in each extracted data set to 20%, 10% and 5% of the images respectively.

Document classification: 20 Newsgroups is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups, with around 600 documents contained in each newsgroup. We extract five data sets from this collection, each containing two randomly selected newsgroups. Specifically, the five extracted data sets are the mixtures of alt.atheism and rec.sport.baseball, alt.atheism and rec.sport.hockey, talk.politics.misc and rec.sport.baseball, talk.politics.misc and rec.sport.hockey, and talk.religion.misc and soc.religion.christian, which are denoted as Doc. 1 to Doc. 5 respectively in the experiments. To produce different imbalance levels, we reduce the representation of one of the two classes in each data set to 20%, 10% and 5% of the documents respectively.

B. Experimental results

To evaluate our proposed algorithm, we compare the classification performance of DNNs trained using our proposed MFE and MSFE loss functions with that of a DNN trained using the conventional MSE loss function. More specifically, a DNN is trained on each data set with one of the three loss functions at a time. As a result, three DNNs with different parameters (weights and biases) are obtained and used for prediction in the subsequent testing procedure. To characterize the classification performance on the imbalanced data sets more effectively, two metrics that are commonly used for imbalanced data, F-measure and AUC [22], are chosen as the evaluation metrics in our experiments. In general, people focus more on the classification accuracy of the minority class than on that of the majority class when the data is imbalanced. Without loss of generality, we mainly focus on the classification performance on the minority class, which is treated as the positive class in the experiments. We conduct experiments on the three image data sets and five document data sets described above, and the corresponding results are shown in Table III and Table IV respectively. In the two tables, Imb. level denotes the imbalance level of the data sets. For instance, an Imb. level of 20% in the Household data set means that the number of samples in the minority class equals twenty percent of the number in the majority class.
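As a reminder of how the two metrics behave when the minority class is the positive class, the sketch below computes them on a toy batch. It is an illustration under assumptions rather than the evaluation code used in the experiments: the function names, the 0.5 decision threshold and the toy scores are hypothetical. F-measure is taken as the harmonic mean of precision and recall, and AUC as the fraction of positive/negative pairs in which the positive sample receives the higher score (ties count one half).

```python
def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive (minority) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no positive sample was recovered
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy batch: 2 positive (minority) and 8 negative (majority) samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print("F-measure:", f_measure(y_true, y_pred))
print("AUC:", auc(y_true, scores))
print("F-measure when everything is predicted negative:", f_measure(y_true, [0] * 10))
```

A classifier that labels every sample as negative reaches 80% accuracy on this batch but an F-measure of 0, which is why accuracy alone is not reported.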

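TABLE III. EXPERIMENTAL RESULTS ON THREE IMAGE DATA SETS

| Data set | Imb. level | F-measure, DNN (MSE) | F-measure, DNN (MFE) | F-measure, DNN (MSFE) | AUC, DNN (MSE) | AUC, DNN (MFE) | AUC, DNN (MSFE) |
|---|---|---|---|---|---|---|---|
| Household | 20% | 0.3913 | 0.4138 | 0.4271 | 0.7142 | 0.7397 | 0.7354 |
| Household | 10% | 0.2778 | 0.2797 | 0.3151 | 0.7125 | 0.7179 | 0.7193 |
| Household | 5% | 0.1143 | 0.1905 | 0.2353 | 0.6714 | 0.695 | 0.697 |
| Tree 1 | 20% | 0.55 | 0.55 | 0.5366 | 0.81 | 0.814 | 0.8185 |
| Tree 1 | 10% | 0.4211 | 0.4211 | 0.4211 | 0.796 | 0.799 | 0.799 |
| Tree 1 | 5% | 0.1667 | 0.2353 | 0.2353 | 0.792 | 0.8 | 0.8 |
| Tree 2 | 20% | 0.4348 | 0.4255 | 0.4255 | 0.848 | 0.845 | 0.844 |
| Tree 2 | 10% | 0.1818 | 0.2609 | 0.25 | 0.805 | 0.805 | 0.806 |
| Tree 2 | 5% | 0 | 0.1071 | 0.1481 | 0.548 | 0.652 | 0.7 |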

TABLE IV. EXPERIMENTAL RESULTS ON FIVE DOCUMENT DATA SETS

| Data set | Imb. level | F-measure, DNN (MSE) | F-measure, DNN (MFE) | F-measure, DNN (MSFE) | AUC, DNN (MSE) | AUC, DNN (MFE) | AUC, DNN (MSFE) |
|---|---|---|---|---|---|---|---|
| Doc. 1 | 20% | 0.2341 | 0.2574 | 0.2549 | 0.5948 | 0.5995 | 0.5987 |
| Doc. 1 | 10% | 0.1781 | 0.1854 | 0.1961 | 0.5349 | 0.5462 | 0.5469 |
| Doc. 1 | 5% | 0.1356 | 0.1456 | 0.1456 | 0.5336 | 0.5436 | 0.5436 |
| Doc. 2 | 20% | 0.3408 | 0.3393 | 0.3393 | 0.6462 | 0.6464 | 0.6464 |
| Doc. 2 | 10% | 0.2094 | 0.2 | 0.2 | 0.631 | 0.6319 | 0.6322 |
| Doc. 2 | 5% | 0.1256 | 0.1171 | 0.1262 | 0.6273 | 0.6377 | 0.6431 |
| Doc. 3 | 20% | 0.2929 | 0.2957 | 0.2957 | 0.5862 | 0.587 | 0.587 |
| Doc. 3 | 10% | 0.1596 | 0.1627 | 0.1698 | 0.5577 | 0.5756 | 0.5865 |
| Doc. 3 | 5% | 0.0941 | 0.1118 | 0.1084 | 0.5314 | 0.5399 | 0.5346 |
| Doc. 4 | 20% | 0.3723 | 0.3843 | 0.3668 | 0.6922 | 0.7031 | 0.7054 |
| Doc. 4 | 10% | 0.1159 | 0.2537 | 0.2574 | 0.5623 | 0.6802 | 0.6816 |
| Doc. 4 | 5% | 0.1287 | 0.172 | 0.172 | 0.6041 | 0.609 | 0.609 |
| Doc. 5 | 20% | 0.3103 | 0.3222 | 0.3222 | 0.6011 | 0.5925 | 0.5925 |
| Doc. 5 | 10% | 0.1829 | 0.1808 | 0.1839 | 0.5777 | 0.5836 | 0.5837 |
| Doc. 5 | 5% | 0.0946 | 0.1053 | 0.1053 | 0.5682 | 0.573 | 0.573 |

The classification performance of DNNs trained using different loss functions on different data sets is shown in Table III and Table IV. For each data set, the more imbalanced the data, the worse the classification performance, as illustrated by the general downward trend of both F-measure and AUC as the imbalance degree increases (the smaller the Imb. level, the more imbalanced the data set). More importantly, for most of the data sets, the DNNs trained using the MFE or MSFE loss function achieve equal or better performance than the DNNs trained using the MSE loss function on the same data set at the same imbalance level (results from our algorithms that are better than those from the conventional algorithm are in bold face in Table III and Table IV). These results empirically verify the theoretical analysis in part III. Interestingly, our proposed methods lift the F-measure and AUC more noticeably on the extremely imbalanced data sets, such as those with an Imb. level of 5%, which suggests that the more imbalanced the data, the more effective our methods are. For example, on the Tree 2 data set with an Imb. level of 5%, replacing MSE with MFE improves the F-measure and AUC by 0.1071 and 0.104 respectively, while the improvements are only 0.0791 and 0 at an Imb. level of 10%.
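The improvements quoted for Tree 2 can be re-derived from the Table III entries. The short check below uses the reported MSE and MFE values; the dictionary layout and the function name `improvements` are only illustrative.

```python
# (MSE, MFE) result pairs copied from Table III for the Tree 2 data set.
tree2 = {
    "10%": {"F-measure": (0.1818, 0.2609), "AUC": (0.805, 0.805)},
    "5%": {"F-measure": (0.0, 0.1071), "AUC": (0.548, 0.652)},
}


def improvements(results):
    """Return MFE minus MSE for every imbalance level and metric."""
    return {
        level: {metric: round(mfe - mse, 4) for metric, (mse, mfe) in metrics.items()}
        for level, metrics in results.items()
    }


for level, gain in improvements(tree2).items():
    print(level, gain)
```

The differences come out as 0.0791 (F-measure) and 0 (AUC) at the 10% level, and 0.1071 and 0.104 at the 5% level, matching the figures in the text.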

In addition to the optimal classification performance of the algorithms on each data set shown in Table III and Table IV, we also test the performance of these algorithms under the same loss values on some data sets. Specifically, the F-measure and AUC values of our proposed MFE and MSFE algorithms and the baseline MSE algorithm are plotted in Fig. 1 and Fig. 2 as the loss values decrease on the Household data set. It can be clearly seen that both the F-measure and the AUC resulting from MFE and MSFE are much higher than those resulting from MSE under all the loss values. This empirically verifies the theoretical analysis outlined in the introduction, namely that higher classification accuracy can be achieved on imbalanced data sets when MFE (MSFE) is used as the loss function rather than MSE. Another advantage of our methods is that their performance is more stable than the heavily fluctuating performance of the MSE method, as shown by the relatively smooth curves obtained by our methods compared with the jumping curve obtained by the MSE-based approach. This benefits the gradient descent optimization during the training of the DNN: with a relatively stable trend, the optimal point with the best performance can be found more easily.
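To see why the same network can look very different under the three losses, the sketch below evaluates them on a toy imbalanced batch. It assumes the definitions given earlier in the paper, namely that MFE is the sum of the mean error on negative samples (FPE) and the mean error on positive samples (FNE), and that MSFE is the sum of their squares. The single-output setup, the factor of one half in the per-sample error, and all function names are assumptions made for illustration.

```python
def half_squared_error(target, output):
    """Per-sample error 0.5 * (d - y)^2 for a single output unit (assumed form)."""
    return 0.5 * (target - output) ** 2


def mse_loss(targets, outputs):
    """Mean error over all samples, regardless of class."""
    errors = [half_squared_error(d, y) for d, y in zip(targets, outputs)]
    return sum(errors) / len(errors)


def class_wise_errors(targets, outputs):
    """Return (FPE, FNE): mean error on negative and on positive samples."""
    neg = [half_squared_error(d, y) for d, y in zip(targets, outputs) if d == 0]
    pos = [half_squared_error(d, y) for d, y in zip(targets, outputs) if d == 1]
    return sum(neg) / len(neg), sum(pos) / len(pos)


def mfe_loss(targets, outputs):
    """Mean false error: FPE + FNE."""
    fpe, fne = class_wise_errors(targets, outputs)
    return fpe + fne


def msfe_loss(targets, outputs):
    """Mean squared false error: FPE^2 + FNE^2."""
    fpe, fne = class_wise_errors(targets, outputs)
    return fpe ** 2 + fne ** 2


# Toy batch: 95 negative and 5 positive samples; the network outputs 0.1 for all,
# so every minority sample is effectively missed.
targets = [0] * 95 + [1] * 5
outputs = [0.1] * 100

print("MSE :", round(mse_loss(targets, outputs), 5))
print("MFE :", round(mfe_loss(targets, outputs), 5))
print("MSFE:", round(msfe_loss(targets, outputs), 5))
```

On this batch the MSE is small (about 0.025) even though no positive sample is recognized, because the 95 negatives dominate the average. MFE (about 0.41) and MSFE (about 0.164) stay large because the minority-class error enters with equal weight.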

Fig. 1. Our proposed MFE and MSFE methods always achieve higher F-measure values than the conventional MSE method under the same loss values on the Household data set with an Imb. level of 10% (only the parts of the three curves under the common loss values are shown).

Fig. 2. Our proposed MFE and MSFE approaches achieve higher AUC than the MSE approach under the same loss values on the Household data set with an Imb. level of 10% (only the parts of the three curves under the common loss values are shown).

VI. CONCLUSIONS

Although deep neural networks have been widely explored and proven to be effective on a wide variety of balanced data sets, few studies have paid attention to the data imbalance problem. To address this issue, we proposed a novel loss function, MFE, and its improved version, MSFE, for training deep neural networks (DNNs) on class-imbalanced data. We demonstrated their advantages over the conventional MSE loss function from a theoretical perspective, as well as their effects on the back-propagation procedure in DNN training. Experimental results on both image and document data sets show that our proposed loss functions outperform the commonly used MSE on imbalanced data sets, especially on extremely imbalanced ones. In future work, we will explore the effectiveness of our proposed loss functions on different network structures such as DBN and CNN.

REFERENCES

[1] N. V. Chawla, N. Japkowicz and A. Kotcz, “Editorial: special issue on learning from imbalanced data sets,” ACM SIGKDD Explorations Newsletter, vol. 6 (1), pp. 1-6, 2004.

[2] J. Wu, S. Pan, X. Zhu and Z. Cai, “Boosting For Multi-Graph Classification,” IEEE Trans. Cybernetics, vol. 45 (3), pp. 430-443, 2015.

[3] J. Wu, X. Zhu, C. Zhang and P. S. Yu, “Bag Constrained Structure Pattern Mining for Multi-Graph Classification,” IEEE Trans. Knowl. Data Eng., vol. 26 (10), pp. 2382-2396, 2014.

[4] H. He and X. Shen, “A Ranked Subspace Learning Method for Gene Expression Data Classification,” IJCAI 2007, pp. 358-364.

[5] J. C. Candy and G. C. Temes, “Oversampling delta-sigma data converters: theory, and simulation,” University of Texas Press, 1962.

[6] H. Li, J. Li, P. C. Chang and J. Sun, “Parametric prediction on default risk of Chinese listed tourism companies by using random oversampling, isomap, and locally linear embeddings on imbalanced samples,” International Journal of Hospitality Management, vol. 35, pp. 141-151, 2013.

[7] N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-Sampling Technique,” J. Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.

[8] J. Mathew, M. Luo, C. K. Pang and H. L. Chan, “Kernel-based SMOTE for SVM classification of imbalanced datasets,” IECON 2015, pp. 1127-1132.

[9] B. X. Wang and N. Japkowicz, “Imbalanced Data Set Learning with Synthetic Samples,” Proc. IRIS Machine Learning Workshop, 2004.

[10] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Transactions on Knowledge and Data Engineering, vol. 21 (9), pp. 1263-1284, 2009.

[11] N. Thai-Nghe, Z. Gantner and L. Schmidt-Thieme, “Cost-sensitive learning methods for imbalanced data,” IJCNN 2010, pp. 1-8.

[12] P. Domingos, “MetaCost: A General Method for Making Classifiers Cost-Sensitive,” ICDM 1999, pp. 155-164.

[13] C. Elkan, “The Foundations of Cost-Sensitive Learning,” IJCAI 2001, pp. 973-978.

[14] M. A. Maloof, “Learning When Data Sets Are Imbalanced and When Costs Are Unequal and Unknown,” ICML’03 Workshop on Learning from Imbalanced Data Sets, 2003.

[15] M. Maloof, P. Langley, S. Sage and T. Binford, “Learning to Detect Rooftops in Aerial Images,” Proc. Image Understanding Workshop, pp. 835-845, 1997.

[16] M. Z. Kukar and I. Kononenko, “Cost-Sensitive Learning with Neural Networks,” ECAI 1998, pp. 445-449.

[17] Z. H. Zhou and X. Y. Liu, “Training Cost-Sensitive Neural Networks with Methods Addressing the Class Imbalance Problem,” IEEE Trans. Knowledge and Data Eng., vol. 18 (1), pp. 63-77, 2006.

[18] C. H. Tsai, L. C. Chang and H. C. Chiang, “Forecasting of ozone episode days by cost-sensitive neural network methods,” Science of the Total Environment, vol. 407 (6), pp. 2124-2135, 2009.

[19] M. Lin, K. Tang and X. Yao, “Dynamic sampling approach to training neural networks for multiclass imbalance classification,” IEEE TNNLS, vol. 24 (4), pp. 647-660, 2013.

[20] B. Krawczyk and M. Wozniak, “Cost-Sensitive Neural Network with ROC-Based Moving Threshold for Imbalanced Classification,” IDEAL 2015, pp. 45-52.

[21] H. Larochelle, Y. Bengio, J. Louradour and P. Lamblin, “Exploring strategies for training deep neural networks,” Journal of Machine Learning Research, vol. 10, pp. 1-40, 2009.

[22] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27 (8), pp. 861-874, 2006.

[23] W. Liu, J. Chan, J. Bailey, C. Leckie and R. Kotagiri, “Mining labelled tensors by discovering both their common and discriminative subspaces,” Proc. 2013 SIAM International Conference on Data Mining.

[24] W. Liu, A. Kan, J. Chan, J. Bailey, C. Leckie, J. Pei and R. Kotagiri, “On compressing weighted time-evolving graphs,” Proc. 21st ACM International Conference on Information and Knowledge Management.

[Fig. 1 plot: F-measure (vertical axis, 0.05 to 0.35) against loss values (horizontal axis, decreasing from 0.11 to 0.059) for the MSE, MFE and MSFE curves.]

[Fig. 2 plot: AUC (vertical axis, 0.5 to 0.75) against loss values (horizontal axis, decreasing from 0.11 to 0.059) for the MSE, MFE and MSFE curves.]

4374 2016 International Joint Conference on Neural Networks (IJCNN)

Is the use of Adam optimiser and label smoothing adequate for optimizing YOLOv7 and YOLOv7-E6E to attain a high-quality automated analysis and differential diagnostic evaluation of bronchoalveolar lavage fluid?

Preprint

Full-text available

Jul 2024

Background: In a world where lower respiratory tract infections rank among the leading causes of death and disability-adjusted life years (DALYs), precise and timely diagnosis is crucial. Bronchoalveolar lavage (BAL) fluid analysis is a pivotal diagnostic tool in pneumology and intensive care medicine, but its effectiveness relies on individual expertise. Our research focuses on the "You Only Look Once" (YOLO) algorithm, aiming to improve the precision and efficiency of BAL cell detection. Methods: We assess various YOLOv7 iterations, including YOLOv7, YOLOv7 with Adam and label smoothing, YOLOv7-E6E, and YOLOv7-E6E with Adam and label smoothing focusing on the detection of four key cell types of diagnostic importance in BAL fluid: macrophages, lymphocytes, neutrophils, and eosinophils. This study utilized cytospin preparations of BAL fluid, employing May-Grunwald-Giemsa staining, and analyzed a dataset comprising 2,032 images with 42,221 annotations. Classification performance was evaluated using recall, precision, F1 score, mAP@.5 and mAP@.5;.95 along with a confusion matrix. Results: The comparison of four algorithmic approaches revealed minor distinctions in mean results, falling short of statistical significance (p < 0.01; p < 0.05). YOLOv7, with an inference time of 13.5 ms for 640 x 640 px images, achieved commendable performance across all cell types, boasting an average F1 metric of 0.922, precision of 0.916, recall of 0.928, and mAP@.5 of 0.966. Remarkably, all cell classifications exhibited consistent outcomes, with no significant disparities among classes. Notably, YOLOv7 demonstrated marginally superior class value dispersion when compared to YOLOv7-adam-label-smoothing, YOLOv7-E6E, and YOLOv7-adam-label-smoothing, albeit without statistical significance. Conclusion: Consequently, there is limited justification for deploying the more computationally intensive YOLOv7-E6E and YOLOv7-E6E-adam-label-smoothing models. 
This investigation indicates that the default YOLOv7 variant is the preferred choice for differential cytology due to its accessibility, lower computational demands, and overall more consistent results than comparative studies.

Surface defect identification method for hot-rolled steel plates based on random data balancing and lightweight convolutional neural network

Article

Full-text available

May 2024

Hot-rolled strip steel is an extremely important industrial foundational material. The rapid and precise identification of surface defects in hot-rolled strip steel is beneficial for enhancing the quality of steel materials and reducing economic losses. Current research primarily focuses on using convolutional neural networks (CNNs) for strip steel surface defect identification. Although the accuracy of identification has remarkably improved in comparison with traditional machine learning methods, it has overlooked issues related to dataset preprocessing and the problem of nonlightweight CNN models with large model parameters and high computational complexity. To address the abovementioned issues, this study proposes a hot-rolled steel strip surface defect identification method based on random data balancing and the lightweight CNN MobileNet-Pro. Random data balancing employs image augmentation to eliminate the differences in the quantity of categories between the hot-rolled strip steel surface defect data, providing diverse images to alleviate overfitting during model training. MobileNet-Pro is used to increase the model’s effective receptive field. Building upon MobileNetV1, it introduces large convolutional kernels and improves depth-wise separable convolution. Experiments show that the new MobileNet-Pro, after random data balancing on the X-SDD dataset, achieves an accuracy of 96.47%, surpassing RepVGG + SA (95.10% accuracy, nonlightweight) and ResNet50 (93.86% accuracy, nonlightweight). Additionally, MobileNet-Pro outperforms mainstream lightweight networks from the MobileNet series, ShuffleNetV2, and GhostnetV2 in terms of performance on the CIFAR-100 and PASCAL VOC 2007 datasets, demonstrating excellent generalization capabilities. All our code and models are available on GitHub: https://github.com/OnlyForWW/MobileNet-Pro.

Class imbalance in multi-resident activity recognition: an evaluative study on explainability of deep learning approaches

Article

Full-text available

Jun 2024
Univers Access Inform Soc

Recognizing multiple residents’ activities is a pivotal domain within active and assisted living technologies, where the diversity of actions in a multi-occupant home poses a challenge due to their uneven distribution. Frequent activities contrast with those occurring sporadically, necessitating adept handling of class imbalance to ensure the integrity of activity recognition systems based on raw sensor data. While deep learning has proven its merit in identifying activities for solitary residents within balanced datasets, its application to multi-resident scenarios requires careful consideration. This study provides a comprehensive survey on the issue of class imbalance and explores the efficacy of Long Short-Term Memory and Bidirectional Long Short-Term Memory networks in discerning activities of multiple residents, considering both individual and aggregate labeling of actions. Through rigorous experimentation with data-level and algorithmic strategies to address class imbalances, this research scrutinizes the explicability of deep learning models, enhancing their transparency and reliability. Performance metrics are drawn from a series of evaluations on three distinct, highly imbalanced smart home datasets, offering insights into the models’ behavior and contributing to the advancement of trustworthy multi-resident activity recognition systems.

Evaluation of resampling techniques for deep learning based identification of promising genotypes in sugarcane varietal trials

Article

Full-text available

Apr 2024

Deep learning is a class of machine learning algorithms that extract high-level features from the raw input for making intelligent decisions. Identification of promising genotypes in varietal trials is one of many agriculture domain applications requiring implementation of deep learning to perform intelligent decision using varietal trial data. However, it has been found that varietal trial data to be used for identification is highly imbalanced one providing great challenges for classification tasks in deep learning. For example, only 33 genotypes were identified as promising in zonal varietal trials of All India Coordinated Research Project (AICRP) on Sugarcane during 2016-21, while those of non-promising class are 148. Balancing an imbalanced class is crucial as the classification model, which is trained using the imbalanced class dataset will tend to exhibit the prediction accuracy according to the highest class of the dataset. One way to address this issue is to use resampling, which adjusts the ratio between the different classes, making the data more balanced. Study was conducted to implement and evaluate four resampling techniques viz. random undersampling, random oversampling, ensemble, SMOTE to balance varietal trial dataset in order to build deep learning model to identify promising genotypes in sugarcane. Paper describes the methodology used in our approach for building deep learning model using resampling techniques and then presented comparative performance of these approaches in identifying promising genotypes. Results indicate that SMOTE and random oversampling performed well for balancing imbalanced dataset for developing deep learning model in comparison to no-resampling of imbalanced dataset. SMOTE outperformed all resampling techniques by achieving high values of precision, recall and F1 score for both positive and negative classes. 
However, ensemble and random undersampling methods did not showed good results in comparison to SMOTE and random oversampling technique. Studies conducted will be useful in developing artificial intelligence based tools for automatic identification of promising genotypes in varietals trials of sugarcane in particular, as well as other crops in general.

Beyond 5G Network Failure Classification for Network Digital Twin Using Graph Neural Network

Preprint

Jun 2024

Fifth-generation (5G) core networks in network digital twins (NDTs) are complex systems with numerous components, generating considerable data. Analyzing these data can be challenging due to rare failure types, leading to imbalanced classes in multiclass classification. To address this problem, we propose a novel method of integrating a graph Fourier transform (GFT) into a message-passing neural network (MPNN) designed for NDTs. This approach transforms the data into a graph using the GFT to address class imbalance, whereas the MPNN extracts features and models dependencies between network components. This combined approach identifies failure types in real and simulated NDT environments, demonstrating its potential for accurate failure classification in 5G and beyond (B5G) networks. Moreover, the MPNN is adept at learning complex local structures among neighbors in an end-to-end setting. Extensive experiments have demonstrated that the proposed approach can identify failure types in three multiclass domain datasets at multiple failure points in real networks and NDT environments. The results demonstrate that the proposed GFT-MPNN can accurately classify network failures in B5G networks, especially when employed within NDTs to detect failure types.

A convolutional neural network with image and numerical data to improve farming of edible crickets as a source of food-A decision support system

Article

Full-text available

May 2024

Crickets (Gryllus bimaculatus) produce sounds as a natural means to communicate and convey various behaviors and activities, including mating, feeding, aggression, distress, and more. These vocalizations are intricately linked to prevailing environmental conditions such as temperature and humidity. By accurately monitoring, identifying, and appropriately addressing these behaviors and activities, the farming and production of crickets can be enhanced. This research implemented a decision support system that leverages machine learning (ML) algorithms to decode and classify cricket songs, along with their associated key weather variables (temperature and humidity). Videos capturing cricket behavior and weather variables were recorded. From these videos, sound signals were extracted and classified such as calling, aggression, and courtship. Numerical and image features were extracted from the sound signals and combined with the weather variables. The extracted numerical features, i.e., Mel-Frequency Cepstral Coefficients (MFCC), Linear Frequency Cepstral Coefficients, and chroma, were used to train shallow (support vector machine, k-nearest neighbors, and random forest (RF)) ML algorithms. While image features, i.e., spectrograms, were used to train different state-of-the-art deep ML models, i,e., convolutional neural network architectures (ResNet152V2, VGG16, and EfficientNetB4). In the deep ML category, ResNet152V2 had the best accuracy of 99.42%. The RF algorithm had the best accuracy of 95.63% in the shallow ML category when trained with a combination of MFCC+chroma and after feature selection. In descending order of importance, the top 6 ranked features in the RF algorithm were, namely humidity, temperature, C#, mfcc11, mfcc10, and D. From the selected features, it is notable that temperature and humidity are necessary for growth and metabolic activities in insects. 
Moreover, the songs produced by certain cricket species naturally align to musical tones such as C# and D as ranked by the algorithm. Using this knowledge, a decision support system was built to guide farmers about the optimal temperature and humidity ranges and interpret the songs (calling, aggression, and courtship) in relation to weather variables. With this information, farmers can put in place suitable measures such as temperature regulation, humidity control, addressing aggressors, and other relevant interventions to minimize or eliminate losses and enhance cricket production.

Convolutional neural networks combined with conventional filtering to semantically segment plant roots in rapidly scanned X-ray computed tomography volumes with high noise levels

Article

Full-text available

May 2024
PLANT METHODS

Background X-ray computed tomography (CT) is a powerful tool for measuring plant root growth in soil. However, a rapid scan with larger pots, which is required for throughput-prioritized crop breeding, results in high noise levels, low resolution, and blurred root segments in the CT volumes. Moreover, while plant root segmentation is essential for root quantification, detailed conditional studies on segmenting noisy root segments are scarce. The present study aimed to investigate the effects of scanning time and deep learning-based restoration of image quality on semantic segmentation of blurry rice (Oryza sativa) root segments in CT volumes. Results VoxResNet, a convolutional neural network-based voxel-wise residual network, was used as the segmentation model. The training efficiency of the model was compared using CT volumes obtained at scan times of 33, 66, 150, 300, and 600 s. The learning efficiencies of the samples were similar, except for scan times of 33 and 66 s. In addition, The noise levels of predicted volumes differd among scanning conditions, indicating that the noise level of a scan time ≥ 150 s does not affect the model training efficiency. Conventional filtering methods, such as median filtering and edge detection, increased the training efficiency by approximately 10% under any conditions. However, the training efficiency of 33 and 66 s-scanned samples remained relatively low. We concluded that scan time must be at least 150 s to not affect segmentation. Finally, we constructed a semantic segmentation model for 150 s-scanned CT volumes, for which the Dice loss reached 0.093. This model could not predict the lateral roots, which were not included in the training data. This limitation will be addressed by preparing appropriate training data. Conclusions A semantic segmentation model can be constructed even with rapidly scanned CT volumes with high noise levels. 
Given that scanning times ≥ 150 s did not affect the segmentation results, this technique holds promise for rapid and low-dose scanning. This study offers insights into images other than CT volumes with high noise levels that are challenging to determine when annotating.

A fast balanced two-branch network with model uncertainty for long-tailed data

Conference Paper

May 2024

Revolutionizing Agriculture: Embracing Modern Strategies for the Management of Coffee Leaf Rust Disease

Chapter

Jul 2024

Millions of people throughout the world enjoy coffee every day, and coffee-growing regions rely on this product for their economic growth. Coffee is one of the most valuable commodities in the world. Unfortunately, the sustainability of this popular drink is threatened by a disease called coffee leaf rust. Coffee leaf rust is a devasting disease caused by fungus that can damage coffee plantations and affect coffee production globally. Due to the persistence and adaptability of coffee leaf rust, coffee farmers are forced to address this disease on a continual basis. Thus, the use of modern pest and disease management methods to ensure the continued production of coffee and protect the livelihoods of those who depend on this crop is essential. This chapter highlights the transformative potential of deep learning in detecting coffee leaf rust in the fight against coffee leaf rust disease, which can offer a ray of hope for the advanced management of coffee leaf rust in the future.

Prediction of developmental toxic effects of fine particulate matter (PM2.5) water-soluble components via machine learning through observation of PM2.5 from diverse urban areas

Article

Jun 2024
SCI TOTAL ENVIRON

Cost-Sensitive Neural Network with ROC-Based Moving Threshold for Imbalanced Classification

Conference Paper

Jan 2015

Pattern classification algorithms usually assume, that the distribution of examples in classes is roughly balanced. However, in many cases one of the classes is dominant in comparison with others. Here, the classifier will become biased towards the majority class. This scenario is known as imbalanced classification. As the minority class is usually the one more valuable, we need to counter the imbalance effect by using one of several dedicated techniques. Cost-sensitive methods assume a penalty factor for misclassifying the minority objects. This way, by assuming a higher cost to minority objects we boost their importance for the classification process. In this paper, we propose a model of cost-sensitive neural network with moving threshold. It relies on scaling the output of the classifier with a given cost function. This way, we adjust our support functions towards the minority class. We propose a novel method for automatically determining the cost, based on the Receiver Operating Characteristic (ROC) curve analysis. It allows us to select the most efficient cost factor for a given dataset. Experimental comparison with state-of-the-art methods for imbalanced classification and backed-up by a statistical analysis prove the effectiveness of our proposal.

Kernel-based SMOTE for SVM classification of imbalanced datasets

Conference Paper

Nov 2015

Mining Labelled Tensors by Discovering both their Common and Discriminative Subspaces

Chapter

May 2013

Conventional non-negative tensor factorization (NTF) methods assume there is only one tensor that needs to be decomposed to low-rank factors. However, in practice data are usually generated from different time periods or by different class labels, which are represented by a sequence of multiple tensors associated with different labels. This raises the problem that when one needs to analyze and compare multiple tensors, existing NTF is unsuitable for discovering all potentially useful patterns: 1) if one factorizes each tensor separately, the common information shared by the tensors is lost in the factors, and 2) if one concatenates these tensors together and forms a larger tensor to factorize, the intrinsic discriminative subspaces that are unique to each tensor are not captured. The cause of such an issue is from the fact that conventional factorization methods handle data observations in an unsupervised way, which only considers features and not labels of the data. To tackle this problem, in this paper we design a novel factorization algorithm called CDNTF (common and discriminative subspace non-negative tensor factorization), which takes both features and class labels into account in the factorization process. CDNTF uses a set of labelled tensors as input and computes both their common and discriminative subspaces simultaneously as output. We design an iterative algorithm that solves the common and discriminative subspace factorization problem with a proof of convergence. Experiment results on solving graph classification problems demonstrate the power and the effectiveness of the subspaces discovered by our method. Conventional non-negative tensor factorization (NTF) methods assume there is only one tensor that needs to be decomposed to low-rank factors. However, in practice data are usually generated from different time periods or by different class labels, which are represented by a sequence of multiple tensors associated with different labels. 
This raises the problem that when one needs to analyze and compare multiple tensors, existing NTF is unsuitable for discovering all potentially useful patterns: 1) if one factorizes each tensor separately, the common information shared by the tensors is lost in the factors, and 2) if one concatenates these tensors together and forms a larger tensor to factorize, the intrinsic discriminative subspaces that are unique to each tensor are not captured. The cause of such an issue is from the fact that conventional factorization methods handle data observations in an unsupervised way, which only considers features and not labels of the data. To tackle this problem, in this paper we design a novel factorization algorithm called CDNTF (common and discriminative subspace non-negative tensor factorization), which takes both features and class labels into account in the factorization process. CDNTF uses a set of labelled tensors as input and computes both their common and discriminative subspaces simultaneously as output. We design an iterative algorithm that solves the common and discriminative subspace factorization problem with a proof of convergence. Experiment results on solving graph classification problems demonstrate the power and the effectiveness of the subspaces discovered by our method.

Parametric prediction on default risk of Chinese listed tourism companies by using random oversampling, isomap, and locally linear embeddings on imbalanced samples

Article

Dec 2013
Int J Hospit Manag

This research pioneers the default risk parametric prediction of Chinese tourism companies with random oversampling and manifold learning for parametric modelling on imbalanced samples to relax the requirement on sample availability. Four specific approaches were employed: standardization; standardization -> random oversampling; standardization -> isomap + locally linear embeddings; and standardization -> random oversampling -> isomap + locally linear embeddings. Empirical results indicate that: random oversampling successfully improved the tourism default risk prediction; the integration of isomap and locally linear embeddings is beneficial in default risk prediction using highly skewed tourism data with absolute minority samples; and after the use of random oversampling on initial data, the integrated approach improved in forecasting tourism default risk prior to two years versus one year. (C) 2013 Elsevier Ltd. All rights reserved.
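
The random oversampling step named above can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so this is the generic textbook version (duplicate randomly chosen minority samples until the classes are balanced); the function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of each smaller class until
    every class has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        # Pool of original samples belonging to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels

# Toy data (assumed): 6 non-default companies (0) and 2 defaults (1).
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [2.0], [2.2]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_oversample(X, y)
```

Only existing samples are duplicated, so the balanced set contains no synthetic feature values; any manifold learning step such as isomap would be applied afterwards.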

Bag Constrained Structure Pattern Mining for Multi-Graph Classification

Article

Oct 2014

This paper formulates a multi-graph learning task. In our problem setting, a bag contains a number of graphs and a class label. A bag is labeled positive if at least one graph in the bag is positive, and negative otherwise. In addition, the genuine label of each graph in a positive bag is unknown, and all graphs in a negative bag are negative. The aim of multi-graph learning is to build a learning model from a number of labeled training bags to predict previously unseen test bags with maximum accuracy. This problem setting is essentially different from existing multi-instance learning (MIL), where instances in MIL share well-defined feature values, but no features are available to represent graphs in a multi-graph bag. To solve the problem, we propose a Multi-Graph Feature based Learning (gMGFL) algorithm that explores and selects a set of discriminative subgraphs as features to transfer each bag into a single instance, with the bag label being propagated to the transferred instance. As a result, the multi-graph bags form a labeled training instance set, so generic learning algorithms, such as decision trees, can be used to derive learning models for multi-graph classification. Experiments and comparisons on real-world multi-graph tasks demonstrate the algorithm performance.
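
The bag-labeling rule and the idea of turning a bag into a single instance can be sketched as follows. This is an illustration under simplifying assumptions, not the gMGFL algorithm itself: graphs are represented as sets of labelled edges, subgraph occurrence is approximated by an edge-subset test rather than subgraph isomorphism, and the subgraph features are given instead of mined. All names are hypothetical.

```python
def bag_label(graph_labels):
    """A bag is positive (1) if at least one graph in it is positive."""
    return 1 if any(lbl == 1 for lbl in graph_labels) else 0

def contains(graph_edges, subgraph_edges):
    # Simplification (assumption): edge-subset test, not subgraph isomorphism.
    return subgraph_edges <= graph_edges

def bag_to_instance(bag, subgraph_features):
    """Map a bag (list of edge sets) to a binary vector: entry j is 1
    if any graph in the bag contains subgraph feature j."""
    return [int(any(contains(g, f) for g in bag)) for f in subgraph_features]

# Two hypothetical subgraph features and one bag holding two graphs.
features = [
    frozenset({("A", "B")}),
    frozenset({("B", "C"), ("C", "D")}),
]
bag = [
    frozenset({("A", "B"), ("B", "C")}),
    frozenset({("C", "D")}),
]
instance = bag_to_instance(bag, features)
```

Once every bag is mapped to such a vector and paired with its bag label, a generic learner such as a decision tree can be trained on the resulting instance set.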

Boosting for Multi-Graph Classification

Article

Jul 2014

In this paper, we formulate a novel graph-based learning problem, multi-graph classification (MGC), which aims to learn a classifier from a set of labeled bags, each containing a number of graphs. A bag is labeled positive if at least one graph in the bag is positive, and negative otherwise. Such a multi-graph representation can be used for many real-world applications, such as webpage classification, where a webpage can be regarded as a bag with texts and images inside the webpage being represented as graphs. This problem is a generalization of multi-instance learning (MIL) but with vital differences, mainly because instances in MIL share a common feature space whereas no feature is available to represent graphs in a multi-graph bag. To solve the problem, we propose a boosting based multi-graph classification framework (bMGC). Given a set of labeled multi-graph bags, bMGC employs dynamic weight adjustment at both bag- and graph-levels to select one subgraph in each iteration as a weak classifier. In each iteration, bag and graph weights are adjusted such that an incorrectly classified bag will receive a higher weight because its predicted bag label conflicts with the genuine label, whereas an incorrectly classified graph will receive a lower weight value if the graph is in a positive bag (or a higher weight if the graph is in a negative bag). Accordingly, bMGC is able to differentiate graphs in positive and negative bags to derive effective classifiers to form a boosting model for MGC. Experiments and comparisons on real-world multi-graph learning tasks demonstrate the algorithm performance.
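
The direction of the weight adjustments described above can be made concrete with a small sketch. The abstract states only which way each weight moves, so the exponential factor and the `step` parameter below are assumptions borrowed from generic boosting practice, and the function name `reweight` is hypothetical.

```python
import math

def reweight(weight, misclassified, kind, in_positive_bag=False, step=0.5):
    """Return the updated weight of a bag or graph after one iteration.

    kind is "bag" or "graph"; step is a hypothetical update size.
    Correctly classified items keep their weight in this sketch.
    """
    if not misclassified:
        return weight
    if kind == "bag":
        # Misclassified bag: its label is genuine, so raise the weight.
        return weight * math.exp(step)
    if in_positive_bag:
        # Misclassified graph in a positive bag: its own label is
        # uncertain, so lower the weight.
        return weight * math.exp(-step)
    # Misclassified graph in a negative bag: it is certainly negative,
    # so raise the weight.
    return weight * math.exp(step)

w_bag = reweight(1.0, True, "bag")
w_graph_pos = reweight(1.0, True, "graph", in_positive_bag=True)
w_graph_neg = reweight(1.0, True, "graph", in_positive_bag=False)
```

In a full boosting loop the weights would typically be renormalized after each update; that step is omitted here.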

On compressing weighted time-evolving graphs

Conference Paper

Oct 2012

Existing graph compression techniques mostly focus on static graphs. However, for many practical graphs such as social networks, the edge weights frequently change over time. This phenomenon raises the question of how to compress dynamic graphs while maintaining most of their intrinsic structural patterns at each time snapshot. In this paper we show that the encoding cost of a dynamic graph is proportional to the heterogeneity of a three dimensional tensor that represents the dynamic graph. We propose an effective algorithm that compresses a dynamic graph by reducing the heterogeneity of its tensor representation, and at the same time also maintains a maximum lossy compression error at any time stamp of the dynamic graph. The bounded compression error benefits compressed graphs in that they retain good approximations of the original edge weights, and hence properties of the original graph (such as shortest paths) are well preserved. To the best of our knowledge, this is the first work that compresses weighted dynamic graphs with bounded lossy compression error at any time snapshot of the graph.

Dynamic Sampling Approach to Training Neural Networks for Multiclass Imbalance Classification

Article

Apr 2013

Class imbalance learning tackles supervised learning problems where some classes have significantly more examples than others. Most of the existing research focused only on binary-class cases. In this paper, we study multiclass imbalance problems and propose a dynamic sampling method (DyS) for multilayer perceptrons (MLP). In DyS, for each epoch of the training process, every example is fed to the current MLP and then the probability of it being selected for training the MLP is estimated. DyS dynamically selects informative data to train the MLP. In order to evaluate DyS and understand its strength and weakness, comprehensive experimental studies have been carried out. Results on 20 multiclass imbalanced data sets show that DyS can outperform the compared methods, including pre-sample methods, active learning methods, cost-sensitive methods, and boosting-type methods.
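
The per-epoch selection loop described above can be sketched as follows. The abstract does not state how the selection probability is estimated, so the rule used here (one minus the model's confidence in the true class) is an assumption chosen for illustration, and `select_for_epoch` and `toy_model` are hypothetical names.

```python
import random

def select_for_epoch(examples, labels, predict_proba, seed=0):
    """Feed every example to the current model and keep it for training
    with a probability that grows as the model gets it more wrong."""
    rng = random.Random(seed)
    selected = []
    for x, y in zip(examples, labels):
        p_true = predict_proba(x)[y]   # model confidence in the true class
        p_select = 1.0 - p_true        # assumed rule: harder => more likely
        if rng.random() < p_select:
            selected.append((x, y))
    return selected

def toy_model(x):
    """Stand-in for the current MLP: class probabilities for 3 classes."""
    return [0.9, 0.05, 0.05] if x < 1.0 else [0.3, 0.4, 0.3]

examples = [0.2, 0.5, 1.5, 1.8]
labels = [0, 0, 2, 1]
batch = select_for_epoch(examples, labels, toy_model)
```

The selected subset would then be used for one pass of weight updates, and the selection repeated with the updated model in the next epoch.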

Oversampling Delta-Sigma Data Converters: Theory, Design, and Simulation

Article

Jan 1992

Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown

Article

Jul 2003

Marcus A. Maloof

The problem of learning from imbalanced data sets, while not the same problem as learning when misclassification costs are unequal and unknown, can be handled in a similar manner. That is, in both contexts, we can use techniques from ROC analysis to help with classifier design. We present results from two studies in which we dealt with skewed data sets and unequal, but unknown costs of error. We also compare for one domain these results to those obtained by over-sampling and under-sampling the data set. The operations of sampling, moving the decision threshold, and adjusting the cost matrix produced sets of classifiers that fell on the same ROC curve.
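
Moving the decision threshold, one of the operations mentioned above, can be illustrated by computing the ROC point that each threshold yields. This is a minimal sketch on made-up scores; the function name `roc_point` and the data are hypothetical and do not come from the studies described.

```python
def roc_point(scores, labels, threshold):
    """Return (false positive rate, true positive rate) when every
    example with score >= threshold is predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos

# Skewed toy set (assumed): 2 positives among 8 examples.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0]

# Lowering the threshold trades more false positives for more true positives.
curve = [roc_point(scores, labels, t) for t in (0.85, 0.5, 0.0)]
```

Each threshold gives one classifier on the same ROC curve, which is why threshold moving can stand in for resampling or cost adjustment when the true costs are unknown.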
