2016 International Joint Conference on Neural Networks (IJCNN)

Training Deep Neural Networks on Imbalanced Data Sets

Shoujin Wang, Wei Liu, Jia Wu, Longbing Cao, Qinxue Meng, Paul J. Kennedy
Advanced Analytics Institute, University of Technology Sydney, Sydney, Australia
Centre for Quantum Computation & Intelligent Systems, University of Technology Sydney, Sydney, Australia
Email: Shoujin.Wang@student.uts.edu.au, {Wei.Liu, Longbing.Cao}@uts.edu.au, {Jia.Wu, Qinxue.Meng, Paul.Kennedy}@uts.edu.au
Abstract—Deep learning has become increasingly popular in both academia and industry in recent years. Various domains, including pattern recognition, computer vision, and natural language processing, have witnessed the great power of deep networks. However, current studies on deep learning mainly focus on data sets with balanced class labels, while its performance on imbalanced data is not well examined. Imbalanced data sets exist widely in the real world and pose great challenges for classification tasks. In this paper, we focus on the problem of classification using deep networks on imbalanced data sets. Specifically, a novel loss function called mean false error, together with its improved version mean squared false error, is proposed for the training of deep networks on imbalanced data sets. The proposed method can effectively capture classification errors from the majority class and minority class equally. Experiments and comparisons demonstrate the superiority of the proposed approach over conventional methods in classifying imbalanced data sets with deep neural networks.
Keywords—deep neural network; loss function; data imbalance
I. INTRODUCTION
Recently, rapid developments in science and technology have promoted the growth and availability of data at an explosive rate in various domains. The ever-increasing amount of data and its increasingly complex structure have led us to the so-called "big data era", which brings great opportunities for data mining and knowledge discovery as well as many challenges. A noteworthy challenge is data imbalance: although more and more raw data is becoming easily accessible, much of it has an imbalanced distribution, namely a few object classes are abundant while others have only limited representation. This is termed the "class imbalance" problem in the data mining community, and it is inherent in almost all collected data sets [1]. For instance, in clinical diagnostic data, most people are healthy while only a small proportion are unhealthy. In classification tasks, data sets are usually divided into binary-class data sets and multi-class data sets according to the number of classes. Accordingly, classification can be categorized as binary classification and multi-class classification [2], [3]. This paper mainly focuses on the binary classification problem, and the experimental data sets are binary-class ones (a multi-class problem can generally be transformed into a binary-class one by binarization). For a binary-class data set, we call the data set imbalanced if the minority class is under-represented compared to the majority class, e.g., if the majority class severely outnumbers the minority class [4].
Data imbalance can lead to unexpected mistakes and even serious consequences in data analysis, especially in classification tasks. This is because the skewed distribution of class instances forces classification algorithms to be biased toward the majority class, so the concepts of the minority class are not learned adequately. As a result, standard classifiers (classifiers that do not consider data imbalance) tend to misclassify minority samples as majority samples when the data is imbalanced, which results in quite poor classification performance and may incur a heavy price in real life. Considering the clinical diagnosis example above, it is obvious that the patients constitute the minority class while the healthy persons constitute the majority one. If a patient were misdiagnosed as a healthy person, the best treatment window would be missed, with serious consequences.
Though data imbalance has been proven to be a serious problem, it is not addressed well by standard classification algorithms. Most classifiers were designed under the assumption that the data is balanced and evenly distributed over the classes. Many efforts have been made within well-studied classification algorithms to address this problem. For example, sampling techniques and cost sensitive methods are broadly applied to SVMs, neural networks and other classifiers to solve the problem of class imbalance from different perspectives. Sampling aims to transform the imbalanced data into balanced data by various sampling techniques, while cost sensitive methods try to make the standard classifiers more sensitive to the minority class by adding different cost factors into the algorithms.
However, in the field of deep learning, very limited work
has been done on this issue to the best of our knowledge. Most
of the existing deep learning algorithms do not take the data
imbalance problem into consideration. As a result, these
algorithms can perform well on the balanced data sets while
their performance cannot be guaranteed on imbalanced data
sets.
In this work, we aim to address the problem of class
imbalance in deep learning. Specifically, different forms of
loss functions are proposed to make the learning algorithms
more sensitive to the minority class and then achieve higher
classification accuracy. Besides that, we also illustrate why we propose this kind of loss function and how it can outperform the commonly used loss function in deep learning algorithms.

978-1-5090-0620-5/16/$31.00 © 2016 IEEE
Currently, mean squared error (MSE) is the most
commonly used loss function in the standard deep learning
algorithms. It works well on balanced data sets while it fails to
deal with imbalanced ones. The reason is that MSE captures the errors from an overall perspective: it calculates the loss by first summing up all the errors from the whole data set and then computing the average value. This captures the errors from the majority and minority classes equally when the binary-class data set is balanced.
However, when the data set is imbalanced, the error from the
majority class contributes much more to the loss value than the
error from the minority class. In this way, this loss function is
biased towards majority class and fails to capture the errors
from two classes equally. Further, the algorithms are very
likely to learn biased representative features from the majority
class and then achieve biased classification results.
To make up for this shortcoming of the mean squared error loss function used in deep learning, a new loss function called mean false error (MFE), together with its improved version mean squared false error (MSFE), is proposed. Different from the MSE loss function, our proposed loss functions can capture the errors from the majority class and minority class equally. Specifically, our proposed loss functions first calculate the average error in each class separately and then add them together, as demonstrated in detail in Part III. In this way, each class can contribute to the final loss value equally.
TABLE I. AN EXAMPLE OF CONFUSION MATRIX

                          Predicted Class
                          P        N        Total
True Class      P'        86       4        90
                N'        5        5        10
       Total              91       9
Let's take the binary classification problem shown in Table I as an example. For the classification problem in Table I, we compute the loss values using MSE, MFE and MSFE respectively as follows. Please note that this is just an example of calculating the three different loss values; the formal definitions of these loss functions are given in Part III. Also note that in this binary classification problem, the error of a certain sample is 0 if the sample is predicted correctly, and 1 otherwise.

l_{MSE} = \frac{4+5}{90+10} = 0.09    (1.1)

l_{MFE} = \frac{4}{90} + \frac{5}{10} \approx 0.54    (1.2)

l_{MSFE} = \left(\frac{4}{90}\right)^2 + \left(\frac{5}{10}\right)^2 \approx 0.25    (1.3)
From Table I, it is quite clear that the overall classification accuracy is (86+5)/(90+10) = 91%. However, different loss values are obtained when different loss functions are used, as shown in Eq. (1.1) to Eq. (1.3). In addition, the loss values computed using our proposed MFE and MSFE loss functions are much larger than that of MSE. This means a higher loss value is obtained when MFE (MSFE) is used as the loss function instead of MSE under the same classification accuracy. In other words, given the same loss value, a higher classification accuracy can be achieved on imbalanced data sets when MFE (MSFE) is used as the loss function rather than MSE. This empirically demonstrates that our proposed loss functions can outperform the commonly used MSE on imbalanced data sets. It should be noted that only the advantages of MFE and MSFE over MSE are illustrated here; the reason why MSFE is introduced as an improved version of MFE will be given in Part III.
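As a quick check, the three loss values above can be reproduced directly from the confusion matrix counts. The following is an illustrative Python sketch (not the authors' code), using the fact stated above that each misclassified sample contributes an error of 1 and each correct one an error of 0:

```python
# Counts from Table I, following the text's convention that the
# minority class is the positive class.
neg_errors, n_neg = 4, 90   # majority (negative) class: 4 of 90 misclassified
pos_errors, n_pos = 5, 10   # minority (positive) class: 5 of 10 misclassified

# MSE: average error over the whole data set (Eq. 1.1)
l_mse = (neg_errors + pos_errors) / (n_neg + n_pos)

# MFE: per-class mean errors computed separately, then summed (Eq. 1.2)
fpe = neg_errors / n_neg
fne = pos_errors / n_pos
l_mfe = fpe + fne

# MSFE: squared per-class mean errors summed (Eq. 1.3)
l_msfe = fpe ** 2 + fne ** 2

print(round(l_mse, 2), round(l_mfe, 2), round(l_msfe, 2))  # 0.09 0.54 0.25
```

Note how the minority class's high error rate (5/10) dominates MFE and MSFE but is diluted in MSE.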
The contributions of this paper are summarized as follows:
(1). Two novel loss functions are proposed to solve the data imbalance problem in deep networks.
(2). The advantages of these proposed loss functions over
the commonly used MSE are analyzed theoretically.
(3). The effect of these proposed loss functions on the back-
propagation process of deep learning is analyzed by examining
relations for propagated gradients.
(4). Empirical study on real world data sets is conducted to
validate the effectiveness of our proposed loss functions.
The rest of this paper is organized as follows. In Part II, we review previous studies addressing data imbalance. Following the problem formulation and statement in Part III, a brief introduction of DNNs is given in Part IV. Part V describes the experiments applying our proposed loss functions on real-world data sets. Finally, the paper concludes in Part VI.
II. RELATED WORK
How to deal with imbalanced data sets is a key issue in classification, and it has been well explored during the past decades. Until now, this issue has been addressed mainly in three ways: sampling techniques, cost sensitive methods, and hybrid methods combining these two. This section reviews these three mainstream methods and then gives a special review of the imbalance problem in the neural network field, which is generally thought to be the ancestor of deep learning and deep neural networks.
A. Sampling technique
Sampling is considered a pre-processing technique as it deals with the data imbalance problem from the data perspective. Specifically, it tries to provide a balanced distribution by transforming the imbalanced data into balanced data and then works with classification algorithms to get results. Various sampling techniques have been proposed from different perspectives. Random oversampling [5], [6] is one of the simplest sampling methods. It randomly duplicates a certain number of samples from the minority class and adds them to the original data set. On the contrary, random under-sampling removes a certain number of instances from the majority class to achieve a balanced data set. Although these sampling techniques are easy to implement and effective, they may bring some problems. For example, random oversampling may lead to overfitting, while random under-sampling may lose some important information. To avoid these potential issues, more sophisticated sampling methods have been proposed. In particular, the synthetic minority oversampling technique (SMOTE) has proven to be quite powerful and has achieved a great deal of success in various applications [7], [8]. SMOTE creates artificial data based on the similarities between existing minority samples. Although many promising benefits have been shown by SMOTE, some drawbacks still exist, such as over-generalization and high variance [9], [10].
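To illustrate the simplest of these strategies, the following is a minimal sketch of random oversampling in plain Python; the toy data and the `random_oversample` helper are hypothetical, not taken from the cited works:

```python
import random

def random_oversample(majority, minority, seed=0):
    """Randomly duplicate minority samples until both classes match in size.
    A minimal sketch of random oversampling, not a library implementation."""
    rng = random.Random(seed)
    deficit = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(deficit)]  # sample with replacement
    return majority, minority + extra

maj = list(range(90))          # 90 majority samples (toy data)
mino = list(range(100, 110))   # 10 minority samples
maj2, mino2 = random_oversample(maj, mino)
print(len(maj2), len(mino2))   # 90 90
```

Random under-sampling would instead draw `len(minority)` items from `majority` without replacement, trading the overfitting risk for potential information loss, as discussed above.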
B. Cost sensitive learning

In addition to sampling techniques, another way to deal with the data imbalance problem is cost sensitive learning, which targets the data imbalance problem from the algorithm perspective. In contrast to sampling methods, cost sensitive learning methods solve the data imbalance problem based on the consideration of the cost associated with misclassifying samples [11]. In particular, they assign different cost values to the misclassification of different samples. For instance, the cost of misclassifying a patient as a healthy person would be much higher than the opposite, because the former may lose the chance of the best treatment and even lose his or her life, while the latter just leads to more examinations. Typically, in binary classification, the cost is zero for correct classification of either class, and the cost of misclassifying the minority is higher than that of misclassifying the majority. An objective function for cost sensitive learning can be constructed based on the aggregation of the overall cost over the whole training set, and an optimal classifier can be learned by minimizing that objective function [12, 13, 23]. Though cost sensitive algorithms can significantly improve classification performance, they are only applicable when the specific cost values of misclassification are known. Unfortunately, in many cases an explicit description of the cost is hard to define; instead, only an informal assertion is known, such as "the cost of misclassification of minority samples is higher than in the contrary situation" [14]. In addition, it would be quite challenging and even impossible to determine the cost of misclassification in some particular domains [15].
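The cost aggregation described above can be sketched as follows; the cost values and the `total_cost` helper are illustrative assumptions for the diagnosis example, not taken from [11]-[15]:

```python
# A minimal sketch of cost-sensitive evaluation: misclassifying the minority
# (positive) class is charged more than misclassifying the majority class.
# The cost values below are illustrative assumptions, not from the paper.
COST = {
    ("pos", "neg"): 5.0,  # false negative: e.g. patient diagnosed as healthy
    ("neg", "pos"): 1.0,  # false positive: e.g. healthy person re-examined
    ("pos", "pos"): 0.0,  # correct classifications cost nothing
    ("neg", "neg"): 0.0,
}

def total_cost(y_true, y_pred):
    # Aggregate the misclassification cost over the whole (toy) training set.
    return sum(COST[(t, p)] for t, p in zip(y_true, y_pred))

y_true = ["pos", "pos", "neg", "neg", "neg"]
y_pred = ["neg", "pos", "pos", "neg", "neg"]
print(total_cost(y_true, y_pred))  # 6.0
```

A cost-sensitive learner would minimize this aggregate cost instead of plain error count, which is exactly where the difficulty of specifying the cost values arises.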
C. Imbalance problem in neural networks

In the area of neural networks, many efforts have been made to address the data imbalance problem. Nearly all of this work falls into the three main streams of solutions mentioned above; in particular, it consists of specific implementations of either sampling or cost sensitive methods, or their combinations, on neural networks, though the details may differ. Kukar presented several approaches for cost-sensitive modifications of the back-propagation learning algorithm for multilayered feed-forward neural networks. He described four approaches to learning cost sensitive neural networks by adding cost factors to different parts of the original back-propagation algorithm: cost-sensitive classification, adapting the output of the network, adapting the learning rate, and minimizing the misclassification costs [16]. Zhou empirically studied the effect of sampling and threshold-moving in training cost-sensitive neural networks. Both oversampling and under-sampling techniques are used to modify the distribution of the training data set, while threshold-moving tries to move the output threshold toward inexpensive classes such that examples with higher costs become harder to misclassify [17]. Other similar work on this issue includes [18, 19, 20, 24]. Although some work has been done to solve the data imbalance problem for neural networks, very little literature addresses the imbalance problem for deep networks so far. How to tackle the data imbalance problem in the learning process of deep neural networks is of great value to explore. In particular, it can broaden the range of applications of powerful deep neural networks, making them work well not only on balanced data but also on imbalanced data.
III. PROBLEM FORMULATION

We address the data imbalance problem during the training of deep neural networks (DNN). Specifically, we mainly focus on the loss function.
Generally, an error function expressed as the loss over the whole training set is introduced in the training process of the DNN. A set of optimal parameters for a DNN is achieved by minimizing errors in training the DNN iteratively. A general form of the error function is given in equation (3.1):

E(\theta) = l(\boldsymbol{d}, \boldsymbol{y}_{\theta}),    (3.1)

where the predicted output \boldsymbol{y}_{\theta}^{(i)} of the i-th object is parameterized by the weights and biases \theta of the network. For simplicity, we will just denote \boldsymbol{y}_{\theta}^{(i)} as \boldsymbol{y}^{(i)} or \boldsymbol{y} in the following discussions. l(\cdot) denotes a loss function. \boldsymbol{d} \in \{0,1\}^{1 \times n} is the desired output with the constraint \sum_{n} d_n = 1, and n is the total number of neurons in the output layer, which is equal to the number of classes. In this work, we only consider the binary classification problem, so n = 2. Note that the value of E(\theta) is higher when the model performs poorly on the training data set. The learning algorithm aims to find the optimal parameter \theta^{*} which brings the minimum possible error value E(\theta^{*}). Therefore, the optimization objective is expressed as:

\theta^{*} = \arg\min_{\theta} E(\theta).    (3.2)
The loss function l(\cdot) in Eq. (3.1) can take many different forms, such as the mean squared error (MSE) or cross entropy (CE) loss. Among the various forms of loss function, MSE is the most widely used in the literature. Next, we first give a brief introduction of the commonly used loss function and then propose two novel loss functions that target imbalanced data sets.
A. MSE loss

This kind of loss function minimizes the squared error between the predicted output and the ground truth and can be expressed as follows:

l = \frac{1}{M} \sum_{i=1}^{M} \frac{1}{2} \sum_{n} \left(d_n^{(i)} - y_n^{(i)}\right)^2    (3.3)

where M is the total number of samples and d_n^{(i)} represents the desired value of the i-th sample on the n-th neuron, while y_n^{(i)} is the corresponding predicted value. For instance, in the scenario of binary classification, if the 4th sample actually belongs to the second class but is incorrectly predicted as the first class, then the label vector and prediction vector for this sample are \boldsymbol{d}^{(4)} = [0, 1] and \boldsymbol{y}^{(4)} = [1, 0] respectively. Further, we have d_1^{(4)} = 0 and d_2^{(4)} = 1 while y_1^{(4)} = 1 and y_2^{(4)} = 0. So the error of this sample is \frac{1}{2}((0-1)^2 + (1-0)^2) = 1; consequently, the total error over a collection of samples predicted by a classifier is the number of incorrectly predicted samples in a binary classification problem, which can be seen from Eq. (1.1).

In addition, y_n^{(i)} can be expressed as a function of the output o_n^{(i)} of the previous layer using the logistic function [16]:

y_n^{(i)} = \frac{1}{1 + \exp(-o_n^{(i)})}    (3.4)
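The per-sample computation above can be checked with a short sketch; the `mse_loss` and `sigmoid` helpers below are illustrative, not the paper's implementation:

```python
import math

def sigmoid(o):
    # Eq. (3.4): logistic activation mapping a layer output o to a prediction y
    return 1.0 / (1.0 + math.exp(-o))

def mse_loss(labels, preds):
    # Eq. (3.3): mean over samples of (1/2) * sum_n (d_n - y_n)^2
    return sum(0.5 * sum((d - y) ** 2 for d, y in zip(dv, yv))
               for dv, yv in zip(labels, preds)) / len(labels)

# The 4th-sample example from the text: true class 2, predicted class 1.
d = [0, 1]
y = [1, 0]
print(mse_loss([d], [y]))  # 1.0  (one misclassified one-hot sample -> error 1)
```

With hard 0/1 predictions, the loss over a batch is exactly the fraction of misclassified samples, which is how Eq. (1.1) was computed.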
B. MFE loss

Now we introduce the mean false error loss function proposed in our research. The concept of "false error" is inspired by the concepts of "false positive rate" and "false negative rate" in the confusion matrix, and it is defined with consideration of both the false positive error and the false negative error. This kind of loss function is designed to improve classification performance on imbalanced data sets. Specifically, it makes the loss more sensitive to the errors from the minority class compared with the commonly used MSE loss by computing the errors on the different classes separately. Formally,

FPE = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2} \sum_{n} \left(d_n^{(i)} - y_n^{(i)}\right)^2    (3.5)

FNE = \frac{1}{P} \sum_{i=1}^{P} \frac{1}{2} \sum_{n} \left(d_n^{(i)} - y_n^{(i)}\right)^2    (3.6)

l' = FPE + FNE    (3.7)

where FPE and FNE are the mean false positive error and mean false negative error respectively, and they capture the errors on the negative class and positive class correspondingly. The loss l' is defined as the sum of the mean errors from the two different classes, as illustrated in Eq. (3.7). N and P are the numbers of samples in the negative class and positive class respectively. A specific example of calculating l' is illustrated in Eq. (1.2), where the part \frac{1}{2}\sum_{n}(d_n^{(i)} - y_n^{(i)})^2 is simplified to 1 based on the computation result in Part III.A. In imbalanced classification problems, researchers usually care more about the classification accuracy of the minority class, so the minority class is treated as the positive class in most works. Without loss of generality, we also let the minority class be the positive class in this work.

Note that only the form of the loss function is redefined in our work compared to the traditional deep network. Therefore, d_n^{(i)} and y_n^{(i)} carry the same meanings as in the MSE scenario, and y_n^{(i)} here is still computed using Eq. (3.4).
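A direct transcription of Eqs. (3.5)-(3.7) might look as follows. This is a sketch under the assumption that each sample is tagged with its class ("pos"/"neg"), not the authors' implementation:

```python
# A minimal sketch of the MFE loss in Eqs. (3.5)-(3.7): per-class mean
# squared errors are computed separately and then summed.
def half_sq_err(d, y):
    # Per-sample error: (1/2) * sum_n (d_n - y_n)^2
    return 0.5 * sum((dn - yn) ** 2 for dn, yn in zip(d, y))

def mfe_loss(samples):
    """samples: list of (label_vec, pred_vec, cls) with cls in {'pos', 'neg'}."""
    neg = [half_sq_err(d, y) for d, y, c in samples if c == "neg"]
    pos = [half_sq_err(d, y) for d, y, c in samples if c == "pos"]
    fpe = sum(neg) / len(neg)   # Eq. (3.5): mean error over the negative class
    fne = sum(pos) / len(pos)   # Eq. (3.6): mean error over the positive class
    return fpe + fne            # Eq. (3.7)

# Toy imbalanced batch: 4 negatives (1 misclassified), 2 positives (1 misclassified)
batch = [([1, 0], [1, 0], "neg")] * 3 + [([1, 0], [0, 1], "neg")] \
      + [([0, 1], [0, 1], "pos"), ([0, 1], [1, 0], "pos")]
print(mfe_loss(batch))  # 0.25 + 0.5 = 0.75
```

Note that the single positive-class mistake contributes 0.5 to the loss while the negative-class mistake contributes only 0.25, which is exactly the rebalancing effect described above.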
C. MSFE loss

The mean squared false error loss function is designed to improve the performance of the MFE loss defined above. It first calculates the FPE and FNE values using Eq. (3.5) and Eq. (3.6), and then uses a different function from Eq. (3.7) to integrate FPE and FNE. Formally,

l'' = FPE^2 + FNE^2    (3.8)

A specific example of calculating l'' is illustrated in Eq. (1.3). The reason why MSFE can improve the performance of MFE is as follows. In the MFE scenario, minimizing the loss only guarantees the minimization of the sum of FPE and FNE, which is not enough to guarantee high classification accuracy on the positive class. To achieve high accuracy on the positive class, the false negative error should be quite low. On imbalanced data sets, however, FPE tends to contribute much more than FNE to their sum (the MFE loss), because there are many more samples in the negative class than in the positive class. As a result, the MFE loss is not sensitive to the error of the positive class, and the minimization of MFE cannot guarantee a satisfactory accuracy on the positive class. MSFE can solve this problem effectively. Importantly, the loss function in MSFE can be expressed as follows:

l'' = FPE^2 + FNE^2 = \frac{1}{2}\left((FPE + FNE)^2 + (FPE - FNE)^2\right)    (3.9)

So the minimization of MSFE actually minimizes (FPE + FNE)^2 and (FPE - FNE)^2 at the same time. In this way, the minimization operation in the algorithm is able to find a minimal sum of FPE and FNE and a minimal difference between them. In other words, the errors on the positive class and the negative class are minimized simultaneously, which balances the accuracy on the two classes while keeping high accuracy on the positive class.
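The identity in Eq. (3.9) is easy to verify numerically; the sketch below checks it on the Table I values (illustrative code, not from the paper):

```python
# Verify the MSFE identity of Eq. (3.9):
# FPE^2 + FNE^2 = ((FPE + FNE)^2 + (FPE - FNE)^2) / 2,
# so minimizing MSFE jointly shrinks the sum of the two class errors
# and the gap between them.
def msfe(fpe, fne):
    return fpe ** 2 + fne ** 2                        # Eq. (3.8)

def msfe_identity(fpe, fne):
    return ((fpe + fne) ** 2 + (fpe - fne) ** 2) / 2  # Eq. (3.9)

fpe, fne = 4 / 90, 5 / 10                             # the Table I example
print(abs(msfe(fpe, fne) - msfe_identity(fpe, fne)) < 1e-12)  # True
print(round(msfe(fpe, fne), 2))                       # 0.25
```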
Different loss functions lead to different gradient
computations in the back-propagation algorithms. Next, we
discuss the gradient computations when using the above
different loss functions.
D. MSE loss back-propagation

During the supervised training process, the loss function minimizes the difference between the predicted outputs \boldsymbol{y}^{(i)} and the ground-truth labels \boldsymbol{d}^{(i)} across the entire training data set (Eq. (3.1)). For the MSE loss, the gradient at each neuron in the output layer can be derived from Eq. (3.3) as follows:

\frac{\partial l(\boldsymbol{d}, \boldsymbol{y})}{\partial o_n} = -(d_n - y_n)\frac{\partial y_n}{\partial o_n}    (3.10)

Based on Eq. (3.4), the derivative of y_n with respect to o_n is:

\frac{\partial y_n}{\partial o_n} = y_n(1 - y_n)    (3.11)

The derivative of the loss function in the output layer with respect to the output of the previous layer is therefore given by Eq. (3.12):

\frac{\partial l(\boldsymbol{d}, \boldsymbol{y})}{\partial o_n} = -(d_n - y_n)\, y_n(1 - y_n)    (3.12)
E. MFE loss back-propagation

For the MFE loss given in Eq. (3.5) to Eq. (3.7), the derivative can be calculated at each neuron in the output layer as follows:

\frac{\partial l'(\boldsymbol{d}, \boldsymbol{y})}{\partial o_n} = \frac{\partial FPE}{\partial o_n} + \frac{\partial FNE}{\partial o_n} = -\frac{1}{N}\left(d_n^{(i)} - y_n^{(i)}\right)\frac{\partial y_n^{(i)}}{\partial o_n^{(i)}} - \frac{1}{P}\left(d_n^{(i)} - y_n^{(i)}\right)\frac{\partial y_n^{(i)}}{\partial o_n^{(i)}}    (3.13)

Substituting Eq. (3.11) into Eq. (3.13), we get the derivative of the MFE loss with respect to the output of the previous layer:

\frac{\partial l'(\boldsymbol{d}, \boldsymbol{y})}{\partial o_n} = -\frac{1}{N}\left(d_n^{(i)} - y_n^{(i)}\right) y_n^{(i)}\left(1 - y_n^{(i)}\right), \quad i \in \mathbf{N}    (3.14)

\frac{\partial l'(\boldsymbol{d}, \boldsymbol{y})}{\partial o_n} = -\frac{1}{P}\left(d_n^{(i)} - y_n^{(i)}\right) y_n^{(i)}\left(1 - y_n^{(i)}\right), \quad i \in \mathbf{P}    (3.15)

where N and P are the numbers of samples in the negative class and positive class respectively, and \mathbf{N} and \mathbf{P} are the negative sample set and positive sample set respectively. Specifically, we use different derivatives for samples from each class: Eq. (3.14) is used when the sample is from the negative class, while Eq. (3.15) is used when it belongs to the positive class.
F. MSFE loss back-propagation

For the MSFE loss given in Eq. (3.8), the derivative can be calculated at each neuron in the output layer as follows:

\frac{\partial l''(\boldsymbol{d}, \boldsymbol{y})}{\partial o_n} = 2\,FPE \cdot \frac{\partial FPE}{\partial o_n} + 2\,FNE \cdot \frac{\partial FNE}{\partial o_n}    (3.16)

where \partial FPE / \partial o_n and \partial FNE / \partial o_n have been computed in Eq. (3.13). Substituting them into Eq. (3.16), the derivatives at each neuron for the different classes are given by:

\frac{\partial l''(\boldsymbol{d}, \boldsymbol{y})}{\partial o_n} = -\frac{2\,FPE}{N}\left(d_n^{(i)} - y_n^{(i)}\right) y_n^{(i)}\left(1 - y_n^{(i)}\right), \quad i \in \mathbf{N}    (3.17)

\frac{\partial l''(\boldsymbol{d}, \boldsymbol{y})}{\partial o_n} = -\frac{2\,FNE}{P}\left(d_n^{(i)} - y_n^{(i)}\right) y_n^{(i)}\left(1 - y_n^{(i)}\right), \quad i \in \mathbf{P}    (3.18)

where N and P, together with \mathbf{N} and \mathbf{P}, have the same meanings as in the MFE loss. Similarly, for samples from different classes, different derivatives are used in the training process.
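As a sanity check on Eqs. (3.14)-(3.15), the analytic MFE gradient can be compared against a central finite-difference approximation of the loss. The sketch below uses toy data and hypothetical helper names; the same check extends to the MSFE gradients in Eqs. (3.17)-(3.18):

```python
import math

def sigmoid(o):
    return 1.0 / (1.0 + math.exp(-o))

def mfe(os_, ds, cls, N, P):
    # MFE loss (Eqs. 3.5-3.7) as a function of pre-activations os_
    fpe = fne = 0.0
    for o, d, c in zip(os_, ds, cls):
        e = 0.5 * sum((dn - sigmoid(on)) ** 2 for dn, on in zip(d, o))
        if c == "neg":
            fpe += e / N
        else:
            fne += e / P
    return fpe + fne

# Toy data: 2 negative samples, 1 positive sample, 2 output neurons
os_ = [[0.3, -0.2], [-1.0, 0.5], [0.1, 0.7]]
ds = [[1, 0], [1, 0], [0, 1]]
cls = ["neg", "neg", "pos"]
N, P = 2, 1

def grad(i, n):
    # Analytic gradient for sample i, neuron n (Eqs. 3.14/3.15)
    y = sigmoid(os_[i][n])
    c = N if cls[i] == "neg" else P
    return -(ds[i][n] - y) * y * (1 - y) / c

def num_grad(i, n, h=1e-6):
    # Central finite difference of the loss for comparison
    up = [row[:] for row in os_]
    dn = [row[:] for row in os_]
    up[i][n] += h
    dn[i][n] -= h
    return (mfe(up, ds, cls, N, P) - mfe(dn, ds, cls, N, P)) / (2 * h)

for i in range(3):
    for n in range(2):
        assert abs(grad(i, n) - num_grad(i, n)) < 1e-6
print("analytic and numeric MFE gradients agree")
```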
IV. DEEP NEURAL NETWORK
We use deep neural networks (DNN) to learn feature representations from the imbalanced and high-dimensional data sets for classification tasks. Specifically, DNN here refers to neural networks with multiple hidden layers. With multiple layers, a DNN has strong generalization and feature-extraction abilities, especially on high-dimensional data sets. The structure of the network used in this work is similar to the classical deep neural network illustrated in [21], except that the loss layer is made more sensitive to imbalanced data sets by using our proposed loss functions. Note that the DNN in our work is trained with the MFE and MSFE losses proposed in Eq. (3.5) to Eq. (3.8), while a DNN trained with the MSE loss is used as a baseline in our experiments.

How to determine network structure parameters such as the number of layers and the number of neurons in each layer is a difficult problem in the training of deep networks, and it is out of the scope of this work. In our work, different numbers of layers and neurons are tried for a DNN (using the MSE loss function) on each data set, and the parameters that make the network achieve the best classification performance are chosen to build the network. For example, for the Household data set used in our experiment, a DNN with MSE as the loss function is built to decide the network structure. Specifically, we first use one hidden layer to test the classification performance of the DNN on that data set, and then try two hidden layers, three hidden layers, or more. Similarly, once the number of hidden layers is chosen, different numbers of neurons in those hidden layers are examined on the same data set until the best classification performance is gained. Using this heuristic approach, the DNN structure with the best performance is chosen for each specific data set in our experiment. It should be noted that, as the number of layers increases, the classification performance first increases to a peak and then decreases; the same holds for the number of neurons. The specific settings are shown in Table II.
TABLE II. DNN PARAMETER SETTING

Data set     Number of Hidden Layers    Number of Neurons on Hidden Layers (from bottom to top)
Household    3                          1000, 300, 100
Tree 1       3                          1000, 100, 10
Tree 2       3                          1000, 100, 10
Doc. 1       6                          3000, 1000, 300, 100, 30, 10
Doc. 2       6                          3000, 1000, 300, 100, 30, 10
Doc. 3       6                          3000, 1000, 300, 100, 30, 10
Doc. 4       6                          3000, 1500, 800, 400, 200, 50
Doc. 5       6                          3000, 1500, 800, 400, 200, 50
V. EXPERIMENTS AND RESULTS

In this section, we evaluate the effectiveness of our proposed loss functions on 8 imbalanced data sets, of which three are image data sets extracted from the CIFAR-100 data set and five are document data sets extracted from the 20 Newsgroups data set. All of them are of high dimension; specifically, the image data sets have 3072 dimensions while the documents have 11669 dimensions extracted from the original 66861 dimensions. Each data set contains various numbers of samples, and each is split into a training set and a testing set. The deep neural networks (DNNs) are first trained on the training set and then evaluated on the testing set in terms of their classification performance. To test the classification performance of our proposed methods under different imbalance degrees, the DNNs are trained and tested with each data set at different levels of imbalance. The details of the data sets and experimental settings are explained in the next subsection.
A. Data sets and experimental settings

Image classification: CIFAR-100 contains 60,000 images belonging to 100 classes (600 images/class), which are further divided into 20 superclasses. The standard train/test split for each class is 500/100 images. To evaluate our algorithm on data sets of various scales, three data sets of different sizes are extracted from this data set. The first one is relatively large: it is the mixture of the two superclasses household furniture and household electrical devices, and is denoted as Household in the experiment. The other two smaller ones are of approximately equal size, and each is the combination of two classes randomly selected from the superclass trees. Specifically, one is the mixture of maple tree and oak tree, and the other is the mixture of maple tree and palm tree. These two data sets are denoted as Tree 1 and Tree 2 respectively in the experiment. To imbalance the data distribution to different degrees, we reduce the representation of one of the two classes in each extracted data set to 20%, 10% and 5% of its images respectively.
Document classification: 20 Newsgroups is a collection
of approximately 20,000 newsgroup documents, partitioned
(nearly) evenly across 20 different newsgroups with around
600 documents contained in each newsgroup. We extract five
data sets from this data set with two randomly selected
newsgroups contained in each one. To be more specific, the
five extracted data sets are the mixture of alt.atheism and
rec.sport.baseball, alt.atheism and rec.sport.hockey,
talk.politics.misc and rec.sport.baseball, talk.politics.misc and
rec.sport.hockey, talk.religion.misc and soc.religion.christian
respectively, which are denoted as Doc.1 to Doc.5
correspondingly in the experiment. To transform the data
distribution into different imbalance levels, we reduce the
representation of one of the two classes in each data set to 20%,
10% and 5% of documents respectively.
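The subsampling protocol used to create the different imbalance levels can be sketched as follows; `make_imbalanced` is a hypothetical helper matching the Imb. level definition used below (minority size as a fraction of the majority size), not the authors' preprocessing code:

```python
import random

def make_imbalanced(class_a, class_b, level, seed=0):
    """Reduce class_b to `level` (e.g. 0.20, 0.10, 0.05) of class_a's size.
    A sketch of the subsampling protocol described above."""
    rng = random.Random(seed)
    keep = max(1, int(len(class_a) * level))
    return class_a, rng.sample(class_b, keep)  # sample without replacement

a = list(range(500))          # e.g. 500 training images of one class
b = list(range(500, 1000))    # 500 images of the other class
for level in (0.20, 0.10, 0.05):
    _, b_small = make_imbalanced(a, b, level)
    print(level, len(b_small))  # 0.2 100 / 0.1 50 / 0.05 25
```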
B. Experimental results

To evaluate our proposed algorithm, we compare the classification performance of DNNs trained using our proposed MFE and MSFE loss functions with that of a DNN trained using the conventional MSE loss function. To be more specific, a DNN is trained with one of the three loss functions each time on one data set in the training procedure. As a result, three DNNs with different parameters (weights and biases) are obtained to make predictions in the subsequent testing procedure. To characterize classification performance on imbalanced data sets more effectively, two metrics commonly used for imbalanced data sets, F-measure and AUC [22], are chosen as the evaluation metrics in our experiments. In general, people focus more on the classification accuracy of the minority class rather than the majority class when the data is imbalanced. Without loss of generality, we mainly focus on the classification performance of the minority class, which is treated as the positive class in the experiments. We conduct experiments on the three image data sets and five document data sets mentioned before, and the corresponding results are shown in Table III and Table IV respectively. In the two tables, Imb. level means the imbalance level of the data set. For instance, an Imb. level of 20% for the Household data set means that the number of samples in the minority class equals twenty percent of that in the majority class.
TABLE III. EXPERIMENTAL RESULTS ON THREE IMAGE DATA SETS

Data set    Imb.    F-measure                     AUC
            level   MSE     MFE     MSFE          MSE     MFE     MSFE
Household   20%     0.3913  0.4138  0.4271        0.7142  0.7397  0.7354
            10%     0.2778  0.2797  0.3151        0.7125  0.7179  0.7193
            5%      0.1143  0.1905  0.2353        0.6714  0.695   0.697
Tree 1      20%     0.55    0.55    0.5366        0.81    0.814   0.8185
            10%     0.4211  0.4211  0.4211        0.796   0.799   0.799
            5%      0.1667  0.2353  0.2353        0.792   0.8     0.8
Tree 2      20%     0.4348  0.4255  0.4255        0.848   0.845   0.844
            10%     0.1818  0.2609  0.25          0.805   0.805   0.806
            5%      0       0.1071  0.1481        0.548   0.652   0.7
TABLE IV. EXPERIMENTAL RESULTS ON FIVE DOCUMENT DATA SETS

Data set    Imb.    F-measure                     AUC
            level   MSE     MFE     MSFE          MSE     MFE     MSFE
Doc. 1      20%     0.2341  0.2574  0.2549        0.5948  0.5995  0.5987
            10%     0.1781  0.1854  0.1961        0.5349  0.5462  0.5469
            5%      0.1356  0.1456  0.1456        0.5336  0.5436  0.5436
Doc. 2      20%     0.3408  0.3393  0.3393        0.6462  0.6464  0.6464
            10%     0.2094  0.2     0.2           0.631   0.6319  0.6322
            5%      0.1256  0.1171  0.1262        0.6273  0.6377  0.6431
Doc. 3      20%     0.2929  0.2957  0.2957        0.5862  0.587   0.587
            10%     0.1596  0.1627  0.1698        0.5577  0.5756  0.5865
            5%      0.0941  0.1118  0.1084        0.5314  0.5399  0.5346
Doc. 4      20%     0.3723  0.3843  0.3668        0.6922  0.7031  0.7054
            10%     0.1159  0.2537  0.2574        0.5623  0.6802  0.6816
            5%      0.1287  0.172   0.172         0.6041  0.609   0.609
Doc. 5      20%     0.3103  0.3222  0.3222        0.6011  0.5925  0.5925
            10%     0.1829  0.1808  0.1839        0.5777  0.5836  0.5837
            5%      0.0946  0.1053  0.1053        0.5682  0.573   0.573
The classification performance of DNNs trained with different loss
functions on the different data sets is shown in Table III and Table IV.
For each data set, the more imbalanced the data is, the worse the
classification performance becomes, as illustrated by the general downward
trend of both F-measure and AUC as the imbalance degree increases (the
smaller the Imb. level, the more imbalanced the data set). More
importantly, for most of the data sets, the DNNs trained with the MFE or
MSFE loss functions achieve performance equal to or better than the DNNs
trained with the MSE loss function on the same data set at the same
imbalance level. These results empirically verify the theoretical analysis
in part III. Also of interest, our proposed methods lift the F-measure and
AUC more noticeably on extremely imbalanced data sets, such as those with
an Imb. level of 5%, which shows that the more imbalanced the data is, the
more effective our methods are. For example, on the Tree 2 data set with
an Imb. level of 5%, replacing MSE with MFE boosts F-measure and AUC by
0.1071 and 0.104 respectively, whereas the boosts are only 0.0791 and 0 at
an Imb. level of 10%.
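The loss functions compared above can be sketched as follows. This is a minimal NumPy reading of the definitions for a single network output per sample (the paper's formulation sums over multiple output neurons and includes a 1/2 factor); it is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def squared_errors(y_true, y_pred):
    # Per-sample squared error between targets and network outputs.
    return (y_true - y_pred) ** 2

def mfe(y_true, y_pred):
    # Mean False Error: sum of the per-class mean errors, so the minority
    # class contributes on equal footing with the majority class.
    e = squared_errors(y_true, y_pred)
    fpe = e[y_true == 0].mean()  # mean error over negative (majority) samples
    fne = e[y_true == 1].mean()  # mean error over positive (minority) samples
    return fpe + fne

def msfe(y_true, y_pred):
    # Mean Squared False Error: squares each class's mean error, which
    # penalises more heavily whichever class is currently classified worse.
    e = squared_errors(y_true, y_pred)
    fpe = e[y_true == 0].mean()
    fne = e[y_true == 1].mean()
    return fpe ** 2 + fne ** 2
```

In contrast, plain MSE is simply `squared_errors(y_true, y_pred).mean()`, a single average in which majority-class samples dominate.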
In addition to the optimal classification performance of the algorithms on
each data set shown in Table III and Table IV, we also test the
performance of these algorithms under the same loss values on some data
sets. Specifically, the F-measure and AUC values of our proposed MFE and
MSFE algorithms and the baseline MSE algorithm on the Household data set
are illustrated in Fig. 1 and Fig. 2 as the loss value decreases. It can
be clearly seen that both the F-measure and AUC obtained with MFE and
MSFE are much higher than those obtained with MSE under all the loss
values. This empirically verifies the theoretical analysis in the
introduction: higher classification accuracy can be achieved on
imbalanced data sets when MFE (or MSFE) is used as the loss function
instead of MSE. Another advantage of our methods is that their performance
is more stable than the heavily fluctuating performance of the MSE method,
as shown by the relatively smooth curves achieved by our methods in
contrast with the jumping curve obtained from the MSE-based approach. This
greatly benefits the gradient descent optimization during DNN training:
with a relatively stable trend, the optimal point with the best
performance can be found more easily.
2016 International Joint Conference on Neural Networks (IJCNN) 4373
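A toy calculation (ours, not from the paper) illustrates why a global mean can mask minority-class failure while the per-class losses expose it; the error values below are hypothetical.

```python
import numpy as np

# Hypothetical per-sample squared errors: 95 majority samples classified
# almost perfectly, 5 minority samples classified badly.
err_majority = np.full(95, 0.01)
err_minority = np.full(5, 0.90)
errors = np.concatenate([err_majority, err_minority])

mse = errors.mean()                                  # global mean: majority dominates
mfe = err_majority.mean() + err_minority.mean()      # per-class means weigh both classes equally
msfe = err_majority.mean() ** 2 + err_minority.mean() ** 2

print(f"MSE  = {mse:.4f}")   # stays small despite the poor minority performance
print(f"MFE  = {mfe:.4f}")   # large: the minority-class failure is visible
print(f"MSFE = {msfe:.4f}")
```

Here MSE remains small even though every minority sample is badly misclassified, whereas MFE and MSFE report a large loss, so minimizing them forces the minority-class error down.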
Fig. 1. Our proposed MFE and MSFE methods always achieve higher F-measure
values than the conventional MSE method under the same loss values on the
Household data set with Imb. level of 10% (only the parts of the three
curves under the common loss values are shown).
Fig. 2. Our proposed MFE and MSFE approaches achieve higher AUC than the
MSE approach under the same loss values on the Household data set with
Imb. level of 10% (only the parts of the three curves under the common
loss values are shown).
VI. CONCLUSIONS
Although deep neural networks have been widely explored and proven
effective on a wide variety of balanced data sets, few studies have paid
attention to the data imbalance problem. To address this issue, we
proposed a novel loss function, MFE, together with its improved version,
MSFE, for training deep neural networks (DNNs) on class-imbalanced data.
We demonstrated their advantages over the conventional MSE loss function
from a theoretical perspective and analyzed their effects on the
backpropagation procedure in DNN training. Experimental results on both
image and document data sets show that our proposed loss functions
outperform the commonly used MSE on imbalanced data sets, especially on
extremely imbalanced ones. In future work, we will explore the
effectiveness of our proposed loss functions on different network
structures such as DBNs and CNNs.
REFERENCES
[1] N. V. Chawla, N. Japkowicz, and A. Kotcz, “Editorial: special issue on
learning from imbalanced data sets,” ACM SIGKDD Explorations Newsletter,
vol. 6(1), pp. 1-6, 2004.
[2] J. Wu, S. Pan, X. Zhu, and Z. Cai, “Boosting for multi-graph
classification,” IEEE Trans. Cybernetics, vol. 45(3), pp. 430-443, 2015.
[3] J. Wu, X. Zhu, C. Zhang, and P. S. Yu, “Bag constrained structure
pattern mining for multi-graph classification,” IEEE Trans. Knowl. Data
Eng., vol. 26(10), pp. 2382-2396, 2014.
[4] H. He and X. Shen, “A ranked subspace learning method for gene
expression data classification,” IJCAI 2007, pp. 358-364.
[5] J. C. Candy and G. C. Temes, “Oversampling delta-sigma data
converters: theory, design, and simulation,” IEEE Press, 1992.
[6] H. Li, J. Li, P. C. Chang, and J. Sun, “Parametric prediction on
default risk of Chinese listed tourism companies by using random
oversampling and locally linear embeddings on imbalanced samples,”
International Journal of Hospitality Management, vol. 35, pp. 141-151,
2013.
[7] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE:
synthetic minority over-sampling technique,” J. Artificial Intelligence
Research, vol. 16, pp. 321-357, 2002.
[8] J. Mathew, M. Luo, C. K. Pang, and H. L. Chan, “Kernel-based SMOTE for
SVM classification of imbalanced datasets,” IECON 2015, pp. 1127-1132.
[9] B. X. Wang and N. Japkowicz, “Imbalanced data set learning with
synthetic samples,” Proc. IRIS Machine Learning Workshop, 2004.
[10] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE
Transactions on Knowledge and Data Engineering, vol. 21(9), pp. 1263-1284,
2009.
[11] N. Thai-Nghe, Z. Gantner, and L. Schmidt-Thieme, “Cost-sensitive
learning methods for imbalanced data,” IJCNN 2010, pp. 1-8.
[12] P. Domingos, “MetaCost: a general method for making classifiers
cost-sensitive,” KDD 1999, pp. 155-164.
[13] C. Elkan, “The foundations of cost-sensitive learning,” IJCAI 2001,
pp. 973-978.
[14] M. A. Maloof, “Learning when data sets are imbalanced and when costs
are unequal and unknown,” ICML 2003 Workshop on Learning from Imbalanced
Data Sets, 2003.
[15] M. Maloof, P. Langley, S. Sage, and T. Binford, “Learning to detect
rooftops in aerial images,” Proc. Image Understanding Workshop, pp.
835-845, 1997.
[16] M. Z. Kukar and I. Kononenko, “Cost-sensitive learning with neural
networks,” ECAI 1998, pp. 445-449.
[17] Z. H. Zhou and X. Y. Liu, “Training cost-sensitive neural networks
with methods addressing the class imbalance problem,” IEEE Trans.
Knowledge and Data Eng., vol. 18(1), pp. 63-77, 2006.
[18] C. H. Tsai, L. C. Chang, and H. C. Chiang, “Forecasting of ozone
episode days by cost-sensitive neural network methods,” Science of the
Total Environment, vol. 407(6), pp. 2124-2135, 2009.
[19] M. Lin, K. Tang, and X. Yao, “Dynamic sampling approach to training
neural networks for multiclass imbalance classification,” IEEE TNNLS,
vol. 24(4), pp. 647-660, 2013.
[20] B. Krawczyk and M. Wozniak, “Cost-sensitive neural network with
ROC-based moving threshold for imbalanced classification,” IDEAL 2015,
pp. 45-52.
[21] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, “Exploring
strategies for training deep neural networks,” Journal of Machine
Learning Research, vol. 10, pp. 1-40, 2009.
[22] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition
Letters, vol. 27(8), pp. 861-874, 2006.
[23] W. Liu, J. Chan, J. Bailey, C. Leckie, and R. Kotagiri, “Mining
labelled tensors by discovering both their common and discriminative
subspaces,” Proc. 2013 SIAM International Conference on Data Mining.
[24] W. Liu, A. Kan, J. Chan, J. Bailey, C. Leckie, J. Pei, and R.
Kotagiri, “On compressing weighted time-evolving graphs,” Proc. 21st ACM
International Conference on Information and Knowledge Management.
... This selection process allowed us to achieve a fair distribution of the cell classes included. The other samples were excluded as their inclusion was likely to have a negative impact on the performance of the algorithm [22]. ...
... The F1 scores, precision, recall, mAP@.5, and confusion matrix results demonstrate high accuracy, as do the values for each individual cell type. Special care was taken to ensure a similar number of annotations across all cell types to avoid bias introduced by an uneven distribution of cell populations in the training set [22]. For this reason, mast cells were excluded from the data set due to their low abundance. ...
Preprint
Full-text available
Background: In a world where lower respiratory tract infections rank among the leading causes of death and disability-adjusted life years (DALYs), precise and timely diagnosis is crucial. Bronchoalveolar lavage (BAL) fluid analysis is a pivotal diagnostic tool in pneumology and intensive care medicine, but its effectiveness relies on individual expertise. Our research focuses on the "You Only Look Once" (YOLO) algorithm, aiming to improve the precision and efficiency of BAL cell detection. Methods: We assess various YOLOv7 iterations, including YOLOv7, YOLOv7 with Adam and label smoothing, YOLOv7-E6E, and YOLOv7-E6E with Adam and label smoothing focusing on the detection of four key cell types of diagnostic importance in BAL fluid: macrophages, lymphocytes, neutrophils, and eosinophils. This study utilized cytospin preparations of BAL fluid, employing May-Grunwald-Giemsa staining, and analyzed a dataset comprising 2,032 images with 42,221 annotations. Classification performance was evaluated using recall, precision, F1 score, mAP@.5 and mAP@.5;.95 along with a confusion matrix. Results: The comparison of four algorithmic approaches revealed minor distinctions in mean results, falling short of statistical significance (p < 0.01; p < 0.05). YOLOv7, with an inference time of 13.5 ms for 640 x 640 px images, achieved commendable performance across all cell types, boasting an average F1 metric of 0.922, precision of 0.916, recall of 0.928, and mAP@.5 of 0.966. Remarkably, all cell classifications exhibited consistent outcomes, with no significant disparities among classes. Notably, YOLOv7 demonstrated marginally superior class value dispersion when compared to YOLOv7-adam-label-smoothing, YOLOv7-E6E, and YOLOv7-adam-label-smoothing, albeit without statistical significance. Conclusion: Consequently, there is limited justification for deploying the more computationally intensive YOLOv7-E6E and YOLOv7-E6E-adam-label-smoothing models. 
This investigation indicates that the default YOLOv7 variant is the preferred choice for differential cytology due to its accessibility, lower computational demands, and overall more consistent results than comparative studies.
... Convolutional neural networks (CNNs) [6][7][8][9][10][11][12][13][14][15][16][17][18][19] have shown remarkable potential in surface defect identification due to their ability to learn complex patterns from images. CNN-based techniques have achieved notable success in hotrolled strip steel surface defect identification [6][7][8]20]. ...
... Low-parameter and low floating-point operations per second (FLOPs) CNN models are required to achieve fast and precise surface defect identification. Wang et al. [9] empirically demonstrated that imbalanced datasets severely impact the performance of CNNs. Feng et al. [10] introduced a fixed data augmentation method for generating images, reducing the gap in the number of images among different categories and improving CNN performance. ...
Article
Full-text available
Hot-rolled strip steel is an extremely important industrial foundational material. The rapid and precise identification of surface defects in hot-rolled strip steel is beneficial for enhancing the quality of steel materials and reducing economic losses. Current research primarily focuses on using convolutional neural networks (CNNs) for strip steel surface defect identification. Although the accuracy of identification has remarkably improved in comparison with traditional machine learning methods, it has overlooked issues related to dataset preprocessing and the problem of nonlightweight CNN models with large model parameters and high computational complexity. To address the abovementioned issues, this study proposes a hot-rolled steel strip surface defect identification method based on random data balancing and the lightweight CNN MobileNet-Pro. Random data balancing employs image augmentation to eliminate the differences in the quantity of categories between the hot-rolled strip steel surface defect data, providing diverse images to alleviate overfitting during model training. MobileNet-Pro is used to increase the model’s effective receptive field. Building upon MobileNetV1, it introduces large convolutional kernels and improves depth-wise separable convolution. Experiments show that the new MobileNet-Pro, after random data balancing on the X-SDD dataset, achieves an accuracy of 96.47%, surpassing RepVGG + SA (95.10% accuracy, nonlightweight) and ResNet50 (93.86% accuracy, nonlightweight). Additionally, MobileNet-Pro outperforms mainstream lightweight networks from the MobileNet series, ShuffleNetV2, and GhostnetV2 in terms of performance on the CIFAR-100 and PASCAL VOC 2007 datasets, demonstrating excellent generalization capabilities. All our code and models are available on GitHub: https://github.com/OnlyForWW/MobileNet-Pro.
... Empirical results of the proposed DOS framework showed improvement in addressing the class imbalance problem. A new loss function in a deep neural network is proposed in [51] which captures classification errors from both majority and minority classes. Another method was presented in [52] to optimize the network parameters and class sensitive costs. ...
Article
Full-text available
Recognizing multiple residents’ activities is a pivotal domain within active and assisted living technologies, where the diversity of actions in a multi-occupant home poses a challenge due to their uneven distribution. Frequent activities contrast with those occurring sporadically, necessitating adept handling of class imbalance to ensure the integrity of activity recognition systems based on raw sensor data. While deep learning has proven its merit in identifying activities for solitary residents within balanced datasets, its application to multi-resident scenarios requires careful consideration. This study provides a comprehensive survey on the issue of class imbalance and explores the efficacy of Long Short-Term Memory and Bidirectional Long Short-Term Memory networks in discerning activities of multiple residents, considering both individual and aggregate labeling of actions. Through rigorous experimentation with data-level and algorithmic strategies to address class imbalances, this research scrutinizes the explicability of deep learning models, enhancing their transparency and reliability. Performance metrics are drawn from a series of evaluations on three distinct, highly imbalanced smart home datasets, offering insights into the models’ behavior and contributing to the advancement of trustworthy multi-resident activity recognition systems.
... ANN models constructed and trained with imbalanced data cannot recognize minority data well.Model so developed will recognize majority data well but have poor performance on recognizing minority data and pose a major challenge (Bagui and Li 2021). Imbalanced data sets exist widely in real world and they have been providing great challenges for classification tasks (Wang et al. 2016). Fraud detection, churn prediction, spam detection, claim prediction, anomaly detection, and outlier detection are examples of imbalanced data. ...
Article
Full-text available
Deep learning is a class of machine learning algorithms that extract high-level features from the raw input for making intelligent decisions. Identification of promising genotypes in varietal trials is one of many agriculture domain applications requiring implementation of deep learning to perform intelligent decision using varietal trial data. However, it has been found that varietal trial data to be used for identification is highly imbalanced one providing great challenges for classification tasks in deep learning. For example, only 33 genotypes were identified as promising in zonal varietal trials of All India Coordinated Research Project (AICRP) on Sugarcane during 2016-21, while those of non-promising class are 148. Balancing an imbalanced class is crucial as the classification model, which is trained using the imbalanced class dataset will tend to exhibit the prediction accuracy according to the highest class of the dataset. One way to address this issue is to use resampling, which adjusts the ratio between the different classes, making the data more balanced. Study was conducted to implement and evaluate four resampling techniques viz. random undersampling, random oversampling, ensemble, SMOTE to balance varietal trial dataset in order to build deep learning model to identify promising genotypes in sugarcane. Paper describes the methodology used in our approach for building deep learning model using resampling techniques and then presented comparative performance of these approaches in identifying promising genotypes. Results indicate that SMOTE and random oversampling performed well for balancing imbalanced dataset for developing deep learning model in comparison to no-resampling of imbalanced dataset. SMOTE outperformed all resampling techniques by achieving high values of precision, recall and F1 score for both positive and negative classes. 
However, ensemble and random undersampling methods did not showed good results in comparison to SMOTE and random oversampling technique. Studies conducted will be useful in developing artificial intelligence based tools for automatic identification of promising genotypes in varietals trials of sugarcane in particular, as well as other crops in general.
... For example, minority samples can contribute more to the loss thanks to the development of Focal Loss in [52]. Deep networks were trained on imbalanced datasets using a novel loss function termed Mean Squared False Error (MSFE), which was proposed in [53]. In order to particularly handle the class-imbalanced problem on graph data, two new models have been developed recently: Dual-Regularized Graph Convolutional Network (DRGCN) [54] and GraphSMOTE [55]. ...
Preprint
Fifth-generation (5G) core networks in network digital twins (NDTs) are complex systems with numerous components, generating considerable data. Analyzing these data can be challenging due to rare failure types, leading to imbalanced classes in multiclass classification. To address this problem, we propose a novel method of integrating a graph Fourier transform (GFT) into a message-passing neural network (MPNN) designed for NDTs. This approach transforms the data into a graph using the GFT to address class imbalance, whereas the MPNN extracts features and models dependencies between network components. This combined approach identifies failure types in real and simulated NDT environments, demonstrating its potential for accurate failure classification in 5G and beyond (B5G) networks. Moreover, the MPNN is adept at learning complex local structures among neighbors in an end-to-end setting. Extensive experiments have demonstrated that the proposed approach can identify failure types in three multiclass domain datasets at multiple failure points in real networks and NDT environments. The results demonstrate that the proposed GFT-MPNN can accurately classify network failures in B5G networks, especially when employed within NDTs to detect failure types.
... Based on the distribution of those classes, Sections 2.2 (Step 8) and 3.1 identified and treated (using SMOTE) the class imbalance problem to prevent engineering a biased machine learning model that could understand and interpret the majority class more than the minority one. After resolving the class imbalance problem, we anticipate an equal/unbiased interpretation of the classes leading to a better performance of the model (Wang et al., 2016;Deng et al., 2022). Besides recording more calling songs compared to courtship and aggression songs, the study by Doherty (1985) also recorded more calling songs of the Gryllus bimaculatus de Geer cricket species and noted that the calling songs are more important than others since they trigger recognition and elicit phonotaxis (movement toward males) in female crickets. ...
Article
Full-text available
Crickets (Gryllus bimaculatus) produce sounds as a natural means to communicate and convey various behaviors and activities, including mating, feeding, aggression, distress, and more. These vocalizations are intricately linked to prevailing environmental conditions such as temperature and humidity. By accurately monitoring, identifying, and appropriately addressing these behaviors and activities, the farming and production of crickets can be enhanced. This research implemented a decision support system that leverages machine learning (ML) algorithms to decode and classify cricket songs, along with their associated key weather variables (temperature and humidity). Videos capturing cricket behavior and weather variables were recorded. From these videos, sound signals were extracted and classified such as calling, aggression, and courtship. Numerical and image features were extracted from the sound signals and combined with the weather variables. The extracted numerical features, i.e., Mel-Frequency Cepstral Coefficients (MFCC), Linear Frequency Cepstral Coefficients, and chroma, were used to train shallow (support vector machine, k-nearest neighbors, and random forest (RF)) ML algorithms. While image features, i.e., spectrograms, were used to train different state-of-the-art deep ML models, i,e., convolutional neural network architectures (ResNet152V2, VGG16, and EfficientNetB4). In the deep ML category, ResNet152V2 had the best accuracy of 99.42%. The RF algorithm had the best accuracy of 95.63% in the shallow ML category when trained with a combination of MFCC+chroma and after feature selection. In descending order of importance, the top 6 ranked features in the RF algorithm were, namely humidity, temperature, C#, mfcc11, mfcc10, and D. From the selected features, it is notable that temperature and humidity are necessary for growth and metabolic activities in insects. 
Moreover, the songs produced by certain cricket species naturally align to musical tones such as C# and D as ranked by the algorithm. Using this knowledge, a decision support system was built to guide farmers about the optimal temperature and humidity ranges and interpret the songs (calling, aggression, and courtship) in relation to weather variables. With this information, farmers can put in place suitable measures such as temperature regulation, humidity control, addressing aggressors, and other relevant interventions to minimize or eliminate losses and enhance cricket production.
... Therefore, at least two classes, root and nodule, are required. Generally, the proportion of nodule to root voxels is considerably less than 1 (possibly zero for individuals without nodules), indicating imbalanced training data, which is a limitation for proper training [63]. GTs should be created from multiple CT volumes to ensure adequate training data for the nodules, and the extremely imbalanced training data set should be adjusted to address these issues. ...
Article
Full-text available
Background X-ray computed tomography (CT) is a powerful tool for measuring plant root growth in soil. However, a rapid scan with larger pots, which is required for throughput-prioritized crop breeding, results in high noise levels, low resolution, and blurred root segments in the CT volumes. Moreover, while plant root segmentation is essential for root quantification, detailed conditional studies on segmenting noisy root segments are scarce. The present study aimed to investigate the effects of scanning time and deep learning-based restoration of image quality on semantic segmentation of blurry rice (Oryza sativa) root segments in CT volumes. Results VoxResNet, a convolutional neural network-based voxel-wise residual network, was used as the segmentation model. The training efficiency of the model was compared using CT volumes obtained at scan times of 33, 66, 150, 300, and 600 s. The learning efficiencies of the samples were similar, except for scan times of 33 and 66 s. In addition, The noise levels of predicted volumes differd among scanning conditions, indicating that the noise level of a scan time ≥ 150 s does not affect the model training efficiency. Conventional filtering methods, such as median filtering and edge detection, increased the training efficiency by approximately 10% under any conditions. However, the training efficiency of 33 and 66 s-scanned samples remained relatively low. We concluded that scan time must be at least 150 s to not affect segmentation. Finally, we constructed a semantic segmentation model for 150 s-scanned CT volumes, for which the Dice loss reached 0.093. This model could not predict the lateral roots, which were not included in the training data. This limitation will be addressed by preparing appropriate training data. Conclusions A semantic segmentation model can be constructed even with rapidly scanned CT volumes with high noise levels. 
Given that scanning times ≥ 150 s did not affect the segmentation results, this technique holds promise for rapid and low-dose scanning. This study offers insights into images other than CT volumes with high noise levels that are challenging to determine when annotating.
Chapter
Millions of people throughout the world enjoy coffee every day, and coffee-growing regions rely on this product for their economic growth. Coffee is one of the most valuable commodities in the world. Unfortunately, the sustainability of this popular drink is threatened by a disease called coffee leaf rust. Coffee leaf rust is a devasting disease caused by fungus that can damage coffee plantations and affect coffee production globally. Due to the persistence and adaptability of coffee leaf rust, coffee farmers are forced to address this disease on a continual basis. Thus, the use of modern pest and disease management methods to ensure the continued production of coffee and protect the livelihoods of those who depend on this crop is essential. This chapter highlights the transformative potential of deep learning in detecting coffee leaf rust in the fight against coffee leaf rust disease, which can offer a ray of hope for the advanced management of coffee leaf rust in the future.
Conference Paper
Pattern classification algorithms usually assume, that the distribution of examples in classes is roughly balanced. However, in many cases one of the classes is dominant in comparison with others. Here, the classifier will become biased towards the majority class. This scenario is known as imbalanced classification. As the minority class is usually the one more valuable, we need to counter the imbalance effect by using one of several dedicated techniques. Cost-sensitive methods assume a penalty factor for misclassifying the minority objects. This way, by assuming a higher cost to minority objects we boost their importance for the classification process. In this paper, we propose a model of cost-sensitive neural network with moving threshold. It relies on scaling the output of the classifier with a given cost function. This way, we adjust our support functions towards the minority class. We propose a novel method for automatically determining the cost, based on the Receiver Operating Characteristic (ROC) curve analysis. It allows us to select the most efficient cost factor for a given dataset. Experimental comparison with state-of-the-art methods for imbalanced classification and backed-up by a statistical analysis prove the effectiveness of our proposal.
Chapter
Conventional non-negative tensor factorization (NTF) methods assume there is only one tensor that needs to be decomposed to low-rank factors. However, in practice data are usually generated from different time periods or by different class labels, which are represented by a sequence of multiple tensors associated with different labels. This raises the problem that when one needs to analyze and compare multiple tensors, existing NTF is unsuitable for discovering all potentially useful patterns: 1) if one factorizes each tensor separately, the common information shared by the tensors is lost in the factors, and 2) if one concatenates these tensors together and forms a larger tensor to factorize, the intrinsic discriminative subspaces that are unique to each tensor are not captured. The cause of such an issue is from the fact that conventional factorization methods handle data observations in an unsupervised way, which only considers features and not labels of the data. To tackle this problem, in this paper we design a novel factorization algorithm called CDNTF (common and discriminative subspace non-negative tensor factorization), which takes both features and class labels into account in the factorization process. CDNTF uses a set of labelled tensors as input and computes both their common and discriminative subspaces simultaneously as output. We design an iterative algorithm that solves the common and discriminative subspace factorization problem with a proof of convergence. Experiment results on solving graph classification problems demonstrate the power and the effectiveness of the subspaces discovered by our method. Conventional non-negative tensor factorization (NTF) methods assume there is only one tensor that needs to be decomposed to low-rank factors. However, in practice data are usually generated from different time periods or by different class labels, which are represented by a sequence of multiple tensors associated with different labels. 
This raises the problem that when one needs to analyze and compare multiple tensors, existing NTF is unsuitable for discovering all potentially useful patterns: 1) if one factorizes each tensor separately, the common information shared by the tensors is lost in the factors, and 2) if one concatenates these tensors together and forms a larger tensor to factorize, the intrinsic discriminative subspaces that are unique to each tensor are not captured. The cause of such an issue is from the fact that conventional factorization methods handle data observations in an unsupervised way, which only considers features and not labels of the data. To tackle this problem, in this paper we design a novel factorization algorithm called CDNTF (common and discriminative subspace non-negative tensor factorization), which takes both features and class labels into account in the factorization process. CDNTF uses a set of labelled tensors as input and computes both their common and discriminative subspaces simultaneously as output. We design an iterative algorithm that solves the common and discriminative subspace factorization problem with a proof of convergence. Experiment results on solving graph classification problems demonstrate the power and the effectiveness of the subspaces discovered by our method.
Article
This research pioneers the parametric prediction of default risk for Chinese tourism companies, using random oversampling and manifold learning for parametric modelling on imbalanced samples to relax the requirement on sample availability. Four specific approaches were employed: standardization; standardization -> random oversampling; standardization -> isomap + locally linear embeddings; and standardization -> random oversampling -> isomap + locally linear embeddings. Empirical results indicate that: random oversampling successfully improved tourism default risk prediction; the integration of isomap and locally linear embeddings is beneficial in default risk prediction using highly skewed tourism data with an absolute minority of positive samples; and, after applying random oversampling to the initial data, the integrated approach yielded a larger improvement when forecasting tourism default risk two years ahead than one year ahead.
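The study above builds on random oversampling, i.e. duplicating minority-class examples until the class distribution is balanced. A minimal generic sketch of that idea (function and parameter names are illustrative, not taken from the paper):

```python
import random

def random_oversample(X, y, minority_label=1, seed=0):
    """Duplicate minority-class samples at random until both
    classes have the same number of examples."""
    rng = random.Random(seed)
    majority = [(x, t) for x, t in zip(X, y) if t != minority_label]
    minority = [(x, t) for x, t in zip(X, y) if t == minority_label]
    # Sample (with replacement) from the minority class until balanced.
    while len(minority) < len(majority):
        minority.append(rng.choice(minority))
    balanced = majority + minority
    rng.shuffle(balanced)
    Xb, yb = zip(*balanced)
    return list(Xb), list(yb)
```

Because duplicates are exact copies, oversampling raises the minority class's weight in training without adding new information, which is why the paper pairs it with manifold learning.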
Article
This paper formulates a multi-graph learning task. In our problem setting, a bag contains a number of graphs and a class label. A bag is labeled positive if at least one graph in the bag is positive, and negative otherwise. In addition, the genuine label of each graph in a positive bag is unknown, and all graphs in a negative bag are negative. The aim of multi-graph learning is to build a learning model from a number of labeled training bags to predict previously unseen test bags with maximum accuracy. This problem setting is essentially different from existing multi-instance learning (MIL), where instances in MIL share well-defined feature values, but no features are available to represent graphs in a multi-graph bag. To solve the problem, we propose a Multi-Graph Feature based Learning (gMGFL) algorithm that explores and selects a set of discriminative subgraphs as features to transfer each bag into a single instance, with the bag label being propagated to the transferred instance. As a result, the multi-graph bags form a labeled training instance set, so generic learning algorithms, such as decision trees, can be used to derive learning models for multi-graph classification. Experiments and comparisons on real-world multi-graph tasks demonstrate the algorithm's performance.
Article
In this paper, we formulate a novel graph-based learning problem, multi-graph classification (MGC), which aims to learn a classifier from a set of labeled bags each containing a number of graphs inside the bag. A bag is labeled positive if at least one graph in the bag is positive, and negative otherwise. Such a multi-graph representation can be used for many real-world applications, such as webpage classification, where a webpage can be regarded as a bag with texts and images inside the webpage being represented as graphs. This problem is a generalization of multi-instance learning (MIL) but with vital differences, mainly because instances in MIL share a common feature space whereas no feature is available to represent graphs in a multi-graph bag. To solve the problem, we propose a boosting based multi-graph classification framework (bMGC). Given a set of labeled multi-graph bags, bMGC employs dynamic weight adjustment at both bag- and graph-levels to select one subgraph in each iteration as a weak classifier. In each iteration, bag and graph weights are adjusted such that an incorrectly classified bag receives a higher weight because its predicted bag label conflicts with the genuine label, whereas an incorrectly classified graph receives a lower weight if the graph is in a positive bag (or a higher weight if the graph is in a negative bag). Accordingly, bMGC is able to differentiate graphs in positive and negative bags to derive effective classifiers that form a boosting model for MGC. Experiments and comparisons on real-world multi-graph learning tasks demonstrate the algorithm's performance.
Conference Paper
Existing graph compression techniques mostly focus on static graphs. However, for many practical graphs, such as social networks, the edge weights frequently change over time. This phenomenon raises the question of how to compress dynamic graphs while maintaining most of their intrinsic structural patterns at each time snapshot. In this paper we show that the encoding cost of a dynamic graph is proportional to the heterogeneity of a three dimensional tensor that represents the dynamic graph. We propose an effective algorithm that compresses a dynamic graph by reducing the heterogeneity of its tensor representation, and at the same time also maintains a maximum lossy compression error at any time stamp of the dynamic graph. The bounded compression error benefits compressed graphs in that they retain good approximations of the original edge weights, and hence properties of the original graph (such as shortest paths) are well preserved. To the best of our knowledge, this is the first work that compresses weighted dynamic graphs with bounded lossy compression error at any time snapshot of the graph.
Article
Class imbalance learning tackles supervised learning problems where some classes have significantly more examples than others. Most of the existing research has focused only on binary-class cases. In this paper, we study multiclass imbalance problems and propose a dynamic sampling method (DyS) for multilayer perceptrons (MLP). In DyS, for each epoch of the training process, every example is fed to the current MLP and the probability of it being selected for training the MLP is then estimated. DyS dynamically selects informative data to train the MLP. In order to evaluate DyS and understand its strengths and weaknesses, comprehensive experimental studies have been carried out. Results on 20 multiclass imbalanced data sets show that DyS can outperform the compared methods, including pre-sampling methods, active learning methods, cost-sensitive methods, and boosting-type methods.
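The core of the DyS idea described above is that, in every epoch, each example's chance of being used for training depends on how the current model handles it. A simplified sketch of one such epoch (the actual DyS selection probability is derived from the MLP's outputs; the `1 - p` rule and names below are illustrative assumptions, not the paper's exact formula):

```python
import random

def dynamic_sample(examples, predict_proba, rng=random.Random(0)):
    """One epoch of probability-based dynamic sampling (simplified).

    `examples` is a list of (x, y) pairs; `predict_proba(x, y)` returns
    the current model's estimated probability for the true class y.
    Examples the model is less confident about are selected more often.
    """
    selected = []
    for x, y in examples:
        p_correct = predict_proba(x, y)
        # Hard examples (low p_correct) are selected with high probability.
        if rng.random() < (1.0 - p_correct):
            selected.append((x, y))
    return selected
```

Under this rule, confidently classified majority-class examples are mostly skipped, so each epoch's effective training set leans toward hard and minority-class examples.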
Article
The problem of learning from imbalanced data sets, while not the same problem as learning when misclassification costs are unequal and unknown, can be handled in a similar manner. That is, in both contexts, we can use techniques from ROC analysis to help with classifier design. We present results from two studies in which we dealt with skewed data sets and unequal, but unknown, costs of error. We also compare, for one domain, these results to those obtained by over-sampling and under-sampling the data set. The operations of sampling, moving the decision threshold, and adjusting the cost matrix produced sets of classifiers that fell on the same ROC curve.
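One of the operations named above, moving the decision threshold, can be sketched as a search over candidate thresholds for the one minimizing total misclassification cost. The cost values and function name below are illustrative assumptions, not taken from the paper:

```python
def best_threshold(scores, labels, cost_fp=1.0, cost_fn=5.0):
    """Pick the decision threshold on classifier scores that minimizes
    total misclassification cost (false positives cost cost_fp,
    false negatives cost cost_fn)."""
    candidates = sorted(set(scores)) + [float("inf")]
    best_t, best_cost = None, float("inf")
    for t in candidates:
        cost = 0.0
        for s, y in zip(scores, labels):
            pred = 1 if s >= t else 0
            if pred == 1 and y == 0:
                cost += cost_fp      # false positive
            elif pred == 0 and y == 1:
                cost += cost_fn      # false negative
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost
```

Varying the cost ratio and re-running this search traces out operating points along the classifier's ROC curve, which is the sense in which sampling, threshold moving, and cost-matrix adjustment produce classifiers on the same curve.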