ANALYZING THE IMPACT OF RESAMPLING METHOD
FOR IMBALANCED DATA TEXT IN INDONESIAN
SCIENTIFIC ARTICLES CATEGORIZATION
Ariani Indrawati*, Hendro Subagyo, Andre Sihombing, Wagiyah, Sjaeful Afandi
Indonesian Institute of Sciences
*Correspondence: indrawati.ariani@gmail.com
ABSTRACT
Extremely skewed data in artificial intelligence, machine learning, and data mining often produce misleading results, because machine learning algorithms are designed to work best with balanced data. In real situations, however, we often encounter imbalanced data. The most popular technique for handling imbalanced data is resampling the dataset, which modifies the number of instances in the majority and minority classes to obtain a balanced dataset. Many resampling techniques, based on oversampling, undersampling, or a combination of both, have been proposed and continue to be developed. Resampling techniques may increase or decrease classifier performance. Comparative research on resampling methods for structured data has been carried out widely, but studies comparing resampling methods on unstructured data are rare. This raises the question of whether these methods can be applied to unstructured data such as text, which has large dimensionality and very diverse characteristics. To understand how different resampling techniques affect the learning of classifiers on imbalanced text data, we perform an experimental analysis using various resampling methods with several classification algorithms to classify articles in the Indonesian Scientific Journal Database (ISJD). From this experiment, we find that resampling techniques on imbalanced text data generally improve classifier performance, but the improvement is not significant because text data is highly diverse and high-dimensional.
Keywords: Imbalanced data; Resampling techniques; Machine learning; Classification; Journal; ISJD
1. INTRODUCTION
The problem of imbalanced data has become an increasingly hot topic in recent years. Imbalanced data is the condition where the number of instances in one class is significantly lower than in the other classes. It is a challenging problem in artificial intelligence, machine learning, and data mining. Most machine learning algorithms are designed to work best with balanced data, in which the target classes have similar prior probabilities. In real situations, however, the ratios of prior probabilities between classes are often extremely skewed, and the data are high-dimensional and extremely sparse.
Typically, in an imbalanced dataset it is more difficult to classify members of the minority class than members of the majority class. This happens because machine learning algorithms do not consider the class distribution; they are usually designed to improve accuracy by reducing the overall error. Many studies have reported that data mining with an imbalanced class distribution often gives misleading results, for example in the diagnosis of rare diseases, fraud detection, network intrusion detection, detection of oil spills from radar images, text classification, and marketing.
The most popular technique for handling imbalanced data is resampling the training dataset in order to balance the class distribution before the data are used as input to the machine learning process. Resampling is a process that modifies the number of instances in the majority and minority classes to obtain a balanced dataset. It is much easier for machine learning algorithms to process balanced data.
There are three resampling approaches: under-sampling, which removes samples from the majority class; over-sampling, which adds samples to the minority class; and a combination of both. Many resampling methods have been proposed: Random Over Sampling (ROS), the Synthetic Minority Oversampling Technique (SMOTE) (Chawla, 2002), Borderline SMOTE (Han, 2005), KMeans SMOTE (Last, 2017), Support Vector Machine SMOTE (SVM-SMOTE) (Zhang, 2018), Adaptive Synthetic sampling (ADASYN) (He, 2008), Random Under Sampling (RUS), Tomek Links (Tomek, 1976), Edited Nearest Neighbors (ENN) (Wilson, 1972), and others. Several previous studies have applied these resampling methods to their own cases. Batista, Prati, and Monard analyzed the behavior of several over-sampling and under-sampling methods for learning from imbalanced data on thirteen UCI datasets, each collapsed into two classes (positive and negative), using C4.5 as the classifier (Batista, 2004). Xie, Hao, Liu, and Lin proposed fused case-control screening to balance the p53 mutant dataset before detecting transcriptional activity (active or inactive) (Xie, 2019). Padurariu and Breaban also dealt with imbalanced text data using oversampling methods (Padurariu, 2019). Al-Azani and El-Alfy applied SMOTE to highly imbalanced sentiment analysis of short Arabic texts (Al-Azani, 2017). Suh et al. compared oversampling methods on imbalanced topic classification of Korean news articles (Suh, 2017). Fernández, del Río, Chawla, and Herrera compared RUS, ROS, and SMOTE using MapReduce on two subsets of the Evolutionary Computation for Big Data and Big Learning (ECBDL'14) dataset (Fernández, 2017).
Loyola-González, Martínez-Trinidad, Carrasco-Ochoa, and García-Borroto studied the use of resampling methods combined with contrast-pattern-based classifiers for data mining and classification tasks on imbalanced databases (Loyola-González, 2016). Krawczyk analyzed different aspects of imbalanced learning such as classification, clustering, regression, data mining, and big data analytics (Krawczyk, 2016). Blagus and Lusa used SMOTE to balance three breast cancer gene expression datasets and classified each of them with kNN (Blagus, 2013). Li, Sun, and Zhu studied the data imbalance problem in text classification in several forms, such as text distribution, class size, and overlapping classes (Li, 2010). Yanminsun, Wong, and Kamel provided a review of the classification of imbalanced data regarding the application domains; the nature of the problem; the learning difficulties with standard classifier learning algorithms; the learning objectives and evaluation measures; the reported research solutions; and the class imbalance problem in the presence of multiple classes (Yanminsun, 2011).
In this research, we investigate the impact of six oversampling techniques (ROS, SMOTE, Borderline SMOTE, KMeans SMOTE, SVM-SMOTE, and ADASYN) and three undersampling methods (RUS, Tomek Links, and ENN). However, both oversampling and undersampling have flaws: oversampling can lead to model overfitting, since it duplicates instances from the minority class, while undersampling can end up discarding instances that capture important differences in the majority class. We therefore also try two combined oversampling and undersampling methods, SMOTEENN and SMOTETomek. Each resampling method is paired with Gaussian Naïve Bayes, Multinomial Naïve Bayes, SVM with a linear kernel, SVM with an RBF kernel, and k-NN to handle highly imbalanced text data in Indonesian scientific article categorization.
2. LITERATURE REVIEW
In this section, we briefly describe the basic idea behind each resampling method and how it balances imbalanced data. An illustration of the data before and after resampling can be seen in Figure 1.
Figure 1. Illustration before and after resampling (panels: original; after oversampling; after undersampling; after combined oversampling and undersampling)
Random Over Sampling (ROS). ROS simply duplicates data samples in the minority classes and adds them to the training dataset. It increases the size of the training set by repeating original samples until the class distribution is balanced.
Synthetic Minority Oversampling Technique (SMOTE). SMOTE was introduced by Chawla in 2002. Like ROS, SMOTE increases the size of the training dataset, but it also increases its variety by generating artificial samples, interpolating between existing data points of the minority class that are close to each other. The SMOTE algorithm is described in (Chawla, 2002).
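To make the interpolation concrete, the following is a minimal sketch of SMOTE's core step; it is illustrative only, not the reference implementation, and the neighbor count k is an assumption.

```python
import numpy as np

def smote_sketch(X_min, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between a
    minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        # distances from sample i to every other minority sample
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(dist)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbors)
        gap = rng.random()                      # uniform in [0, 1)
        # new point lies on the segment between X_min[i] and X_min[j]
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```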
Borderline SMOTE. Borderline SMOTE is a variant of the original SMOTE, proposed by Han, Wen-Yuan, and Bing-Huan in 2005. It generates its synthetic samples along the borderline between the minority and majority classes. For example, if class 0 has 100 samples and class 1 has only 10 before resampling, both classes have 100 samples after resampling.
KMeans SMOTE. Felix Last, Georgios Douzas, and Fernando Bacao applied KMeans clustering to SMOTE in their research (Last, 2017). KMeans SMOTE generates minority class samples in safe and crucial areas of the input space.
SVM-SMOTE. This algorithm is a variant of SMOTE that uses an SVM to locate the decision boundary defined by the support vectors; minority-class examples close to the support vectors become the focus for generating synthetic examples (Zhang, 2018).
Adaptive Synthetic (ADASYN). ADASYN was proposed by Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li in 2008. Its essential idea is to reduce bias by adaptively generating synthetic data samples for the minority classes, based on a dynamic adjustment of weights according to the data distribution. The ADASYN algorithm is described in (He, 2008).
Random Under Sampling (RUS). RUS does the opposite of ROS: it removes samples from the majority class to balance it with the minority class.
Tomek Links. Tomek (1976) proposed a resampling algorithm named Tomek Links. It detects pairs of nearest-neighbor instances from opposite classes, which mark the borderline between the majority and minority classes.
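A minimal sketch of Tomek-link detection follows; it assumes small dense data and brute-force distances, whereas real implementations use efficient nearest-neighbor indexes.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) from opposite classes where each point
    is the other's nearest neighbor (i.e., a Tomek link)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)        # a point is not its own neighbor
    nn = dist.argmin(axis=1)              # nearest neighbor of every point
    return [(i, int(j)) for i, j in enumerate(nn)
            if nn[j] == i and y[i] != y[j] and i < j]
```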
Edited Nearest Neighbors (ENN). This method, proposed by Wilson (1972), uses the edited nearest neighbor rule to select samples from the majority class to remove, balancing it with the minority class.
SMOTEENN. SMOTEENN is a combined method: oversampling with SMOTE and undersampling with ENN.
SMOTETomek. SMOTETomek is a combined method: oversampling with SMOTE and undersampling with Tomek Links.
3. METHOD
Figure 2 illustrates the three main stages of this research: text processing, resampling and categorization, and evaluation.
Figure 2. Methodology
a) Text Processing. This stage consists of two tasks: text pre-processing and feature weighting.
Text Pre-Processing
Raw textual data is mostly unstructured. So, before carrying out the categorization process, it is necessary to process the abstracts into a structured form, which can enhance the classifier's performance significantly (Haddi, 2013). The text pre-processing phase takes the following steps (see the sketch below):
Case folding: the entire text of the abstract is converted to lowercase letters.
Stopword removal: stopwords are removed so that only words considered important are kept; words such as conjunctions are discarded.
Stemming: each word is reduced to the basic word form that builds it. In Indonesian texts, both suffixes and prefixes are removed.
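A minimal sketch of these steps, assuming the Sastrawi library for Indonesian stemming and stopword removal (the paper does not name its tools, so this choice is an assumption):

```python
# pip install PySastrawi
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

stemmer = StemmerFactory().create_stemmer()
stopword_remover = StopWordRemoverFactory().create_stop_word_remover()

def preprocess(abstract: str) -> str:
    text = abstract.lower()               # case folding
    text = stopword_remover.remove(text)  # drop stopwords such as conjunctions
    return stemmer.stem(text)             # strip Indonesian prefixes/suffixes
```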
Feature Extraction
When dealing with text, we must represent each document as a vector of word frequencies. At this stage we use Term Frequency - Inverse Document Frequency (TF-IDF) with unigrams and bigrams. We also use Chi-Square feature selection to select the important features in each class.
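A sketch of this stage with scikit-learn; the number of selected features k is a hypothetical value, as the paper does not report it:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
X = vectorizer.fit_transform(abstracts)           # preprocessed abstracts
selector = SelectKBest(chi2, k=10_000)            # hypothetical k
X_selected = selector.fit_transform(X, labels)    # labels: category per article
```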
b) Resampling and Categorization
In this research, we use three resampling approaches: oversampling, undersampling, and a combination of both.
Oversampling: ROS, SMOTE, Borderline SMOTE, KMeans SMOTE, SVM SMOTE, and ADASYN.
Undersampling: RUS, TomekLinks, and ENN.
Combined: SMOTEENN and SMOTETomek.
The resampled output of each method is used to train classification models (see the sketch below). As classifiers, we use SVM with linear and RBF kernels, Gaussian and Multinomial Naïve Bayes, and k-NN.
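A sketch of this stage using the imbalanced-learn library; all resampler and classifier parameters shown are library defaults, which the paper may not have used:

```python
from imblearn.over_sampling import (RandomOverSampler, SMOTE, BorderlineSMOTE,
                                    KMeansSMOTE, SVMSMOTE, ADASYN)
from imblearn.under_sampling import (RandomUnderSampler, TomekLinks,
                                     EditedNearestNeighbours)
from imblearn.combine import SMOTEENN, SMOTETomek
from sklearn.base import clone
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

resamplers = {
    "ROS": RandomOverSampler(), "SMOTE": SMOTE(),
    "Borderline SMOTE": BorderlineSMOTE(), "KMeans SMOTE": KMeansSMOTE(),
    "SVM SMOTE": SVMSMOTE(), "ADASYN": ADASYN(),
    "RUS": RandomUnderSampler(), "TomekLinks": TomekLinks(),
    "ENN": EditedNearestNeighbours(),
    "SMOTEENN": SMOTEENN(), "SMOTETomek": SMOTETomek(),
}
classifiers = {
    "SVM-Linear": SVC(kernel="linear"), "SVM-RBF": SVC(kernel="rbf"),
    "GNB": GaussianNB(),   # needs dense input: use X.toarray() for sparse TF-IDF
    "MNB": MultinomialNB(), "kNN": KNeighborsClassifier(),
}

models = {}
for r_name, resampler in resamplers.items():
    # rebalance the training split only, never the test split
    X_res, y_res = resampler.fit_resample(X_train, y_train)
    for c_name, clf in classifiers.items():
        models[(r_name, c_name)] = clone(clf).fit(X_res, y_res)
```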
c) Evaluation
We use precision, recall, and F1 score, defined in terms of true positives (TP), false positives (FP), and false negatives (FN):
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 × Precision × Recall / (Precision + Recall)
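An evaluation sketch continuing from the models above; macro averaging over categories is an assumption, since the paper does not state its averaging scheme:

```python
from sklearn.metrics import precision_recall_fscore_support

# score one resampler/classifier pair on the held-out test split
y_pred = models[("SMOTE", "SVM-Linear")].predict(X_test)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0)
```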
4. RESULTS AND DISCUSSIONS
The data in this research were retrieved from the Indonesian Scientific Journal Database (ISJD) for 2013 through 2019. ISJD is a database containing journals published by journal publishers in Indonesia. We retrieved the Indonesian-language abstract and the category label of each article. After cleansing, the dataset contains 26,708 records. The data are far from balanced; the distribution of the dataset can be seen in Figure 3.
Figure 3. Data distribution
In this research, we investigated over-sampling and under-sampling techniques on imbalanced text data. However, both have flaws: oversampling can lead to model overfitting, since it duplicates instances from the minority class, while undersampling can end up discarding instances that capture important differences in the majority class. We therefore also tried combined oversampling and undersampling methods to see how they affect classifier performance.
a) Oversampling
The precision, recall, and F-measure are shown in Figure 4. The highest F-measure among the oversampling techniques is obtained by SMOTE with the SVM-Linear classifier. Most oversampling techniques can improve classifier performance: ROS, SMOTE, SVM SMOTE, Borderline SMOTE, and KMeans SMOTE outperformed the original model; only ADASYN decreased classifier performance. In general, the SMOTE techniques improve the results compared to ROS. SMOTE and its modifications and extensions not only increase the size of the minority class but also increase the variety of the data. Variation in the training set helps the machine avoid learning too much from only a few specific examples. However, we sometimes need to check whether the variation generated by SMOTE is valid. Of all the classification methods we used, MNB is the most affected by these oversampling techniques: its F-measure increases by 0.19 to 0.21. Interestingly, the oversampling techniques have a negative impact on kNN, especially SMOTE, SVM SMOTE, Borderline SMOTE, and KMeans SMOTE. This happens because SMOTE generates its synthetic samples by interpolating between existing minority-class data points that are close to each other; the generated data can end up so close to other classes that it becomes difficult for kNN to classify new data after resampling.
Figure 4. Precision, recall, and F-Measure oversampling techniques result
b) Undersampling
The precision, recall, and F-measure are shown in Figure 5. The highest F-measure among the undersampling techniques is obtained by TomekLinks with the SVM-Linear classifier. In contrast to oversampling, the undersampling techniques mostly decrease classifier performance; the exceptions are RUS with SVM-RBF and MNB, and TomekLinks with SVM-RBF, where undersampling increases classifier performance by 0.01 to 0.09. The most affected classifier is SVM-Linear. Undersampling has an advantage over oversampling in time and memory complexity, because it decreases the size of the data; on the other hand, the process may remove potentially important data for the learning process. The other undersampling techniques used in this research mitigate this problem by removing only data identified as redundant or highly similar.
Figure 5. Precision, recall, and F-Measure undersampling techniques result
c) Combined
In this section, we combine the two techniques described above. This is typically a better approach than using oversampling or undersampling alone: first, we remove some redundant data from the majority class, decreasing its size in the hope of improving time and memory complexity; then, for the minority class, we increase the data using an appropriate oversampling technique until all the classes in the dataset are balanced. The precision, recall, and F-measure are shown in Figure 6. The highest F-measure among the combined methods is obtained by SMOTEENN with the SVM-Linear classifier. The combined resampling methods increase classifier performance by 0.01 to 0.21, except for kNN, whose performance decreases by 0.18 to 0.23. As with the oversampling methods, the most affected classifier is MNB.
Figure 6. Precision, recall, and F-Measure combined techniques result
5. CONCLUSION
Resampling is a simple way to handle imbalanced data, either by oversampling or undersampling. It allows us to create a balanced dataset and thus simplify the classification process. However, resampling has flaws: oversampling can lead to model overfitting, since it duplicates instances from the minority class, while undersampling can end up discarding instances that capture important differences in the majority class. Ultimately, there is no one-size-fits-all method for imbalanced problems; we have to try each method and observe its effect on the specific use case and metrics. From this experiment, we find that resampling techniques on imbalanced text data generally improve classifier performance. The best oversampling method is SMOTE, the best undersampling method is TomekLinks, and the best combined resampling method is SMOTETomek. Interestingly, the oversampling techniques have a negative impact on kNN, especially SMOTE, SVM SMOTE, Borderline SMOTE, and KMeans SMOTE. This happens because SMOTE generates its synthetic samples by interpolating between existing minority-class data points that are close to each other; the generated data can end up so close to other classes that it becomes difficult for kNN to classify new data after resampling.
REFERENCES
Al-Azani, S. & El-Alfy, E. 2017. Using Word Embedding and Ensemble Learning for Highly Imbalanced Data Sentiment Analysis in Short Arabic Text. Procedia Computer Science, 109, 359-366. doi: 10.1016/j.procs.2017.05.365.
Batista, G., et al. 2004. A Study of The Behavior of Several Methods for Balancing Machine Learning
Training Data. ACM SIGKDD Explorations, 6(1), 20-29. doi: 10.1145/1007730.1007735.
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
SVM-RBF
SVM-Linear
GNB
MNB
kNN
SVM-RBF
SVM-Linear
GNB
MNB
kNN
SVM-RBF
SVM-Linear
GNB
MNB
kNN
None SMOTEENN SMOTETomek
Combined
Precision Recall F-Measure
Analyzing the Impact of Resampling Method I Ariani Indrawati, dkk
141
Blagus, R. & Lusa, L. 2013. SMOTE for High-Dimensional Class-Imbalanced Data. BMC
Bioinformatics, 14, 106. doi: 10.1186/1471-2105-14-106.
Chawla, et al. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial
Intelligence Research, 16, 321-357. doi: 10.1613/jair.953.
Fernández, A., et al. 2017. An Insight into Imbalanced Big Data Classification: Outcomes and
Challenges. Complex & Intelligent Systems. doi: 10.1007/s40747-017-0037-9.
Han, H., et al. 2005. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets
Learning. Advances in Intelligent Computing, 878-887. doi: 10.1007/11538059_91.
He, H., et al. 2008. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. IEEE International Joint Conference on Neural Networks. doi: 10.1109/IJCNN.2008.4633969.
Krawczyk, B. 2016. Learning from Imbalanced Data: Open Challenges and Future Directions. Progress in Artificial Intelligence, 5, 221-232. doi: 10.1007/s13748-016-0094-0.
Last, F., et al. 2017. Oversampling for Imbalanced Learning Based on K-Means and SMOTE. arXiv preprint arXiv:1711.00837.
Li, Y., et al. 2010. Data Imbalance Problem in Text Classification. Third International Symposium on
Information Processing, 301-305. doi: 10.1109/ISIP.2010.47.
Loyola-González, O. 2016. Study of the Impact of Resampling Methods for Contrast Pattern based
Classifiers in Imbalanced Databases. Neurocomputing, 175, 935-947. doi:
10.1016/j.neucom.2015.04.120.
Padurariu, Cristian & Breaban, Mihaela. 2019. Dealing with Data Imbalance in Text Classification.
Procedia Computer Science, 159, 736-745. doi: 10.1016/j.procs.2019.09.229.
Suh, Y, et al. 2017. A Comparison of Oversampling Methods on Imbalanced Topic Classification of
Korean News Articles. Journal of Cognitive Science, 18. 391-437. doi:
10.17791/jcs.2017.18.4.391.
Tomek, I. 1976. Two Modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics, SMC-6(11), 769-772. doi: 10.1109/TSMC.1976.4309452.
Wilson, D.L. 1972. Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Transactions on Systems, Man, and Cybernetics, SMC-2(3), 408-421. doi: 10.1109/TSMC.1972.4309137.
Xie, J., et al. 2020. Fused Variable Screening for Massive Imbalanced Data. Computational Statistics & Data Analysis, 141. doi: 10.1016/j.csda.2019.06.013.
Yanminsun, Y. 2011. Classification of Imbalanced Data: A Review. International Journal of Pattern
Recognition and Artificial Intelligence, 23. doi: 10.1142/S0218001409007326.
Zhang, C., et al. 2018. A Cost-Sensitive Deep Belief Network for Imbalanced Classification.