Conference Paper

Margin-Based Random Forest for Imbalanced Land Cover Classification

Authors:
To read the full-text of this research, you can request a copy directly from the authors.


... Furthermore, according to Breiman (2001), using more trees than needed may be redundant, although it does not harm the model. Feng et al. (2019) also claimed that RF could obtain accurate results with ntree = 200. ...
... Many studies employ the default value for the mtry parameter, which is mtry = √p, where p is the number of predictor variables (Feng et al., 2019). However, in this work, we implemented the RF model with ntree = 200. ...
Article
Full-text available
Rapid urban land use and land cover changes have become a major environmental issue because of their ecological effects, including loss of green space and urban heat islands. Effective monitoring and management techniques are required. The Saudi Arabian twin city of Abha-Khamis Mushyet was selected as a case study for this research. The current study aimed to statistically and spatially investigate the relationship between land surface temperature (LST) and land use land cover based urban biophysical parameters such as normalized difference built-up index (NDBI), normalized difference vegetation index (NDVI), and normalized difference water index (NDWI). This study used random forest (RF) to classify LULC in 1990, 2000, and 2018. We also validated the LULC maps in a novel way. Using mono window algorithm techniques, we extracted LST for three periods. The dynamics of LULC, LST, and biophysical parameters were investigated using standard statistical graphs such as the heat map and the Sankey diagram. The correlation coefficient and the global bivariate Moran’s I approach were used to determine the association between LST and urban biophysical parameters. The relationship was then established in greater detail by categorizing all pixels into percentile classes and employing parallel coordinate plots. Finally, the association was built using GeoDA software and a conditional map. The LULC maps revealed a 334.4 percent increase in urban areas between 1990 and 2018. The built-up region is the most stable LULC class, with a transition probability of 83.6 percent between 1990 and 2018. Meanwhile, 17.9%, 21.8%, 12.4%, and 10.5% of agricultural land, scrubland, exposed rocks, and water bodies, respectively, were converted to built-up areas. The LST has increased rapidly over time because of LULC changes. The link between LST and urban biophysical parameters revealed that NDBI had a positive relationship, whereas NDWI and NDVI had a negative relationship. 
This study could therefore help decision makers determine how to lessen the urban heat island effects caused by changes in LULC.
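For readers unfamiliar with the three biophysical indices named in this abstract, their commonly used band-ratio definitions are given below. This is a reference sketch only: the abstract does not state which formulation the study applied, and NDWI in particular has more than one definition in the literature (the form shown is the green/near-infrared variant). The symbols NIR, Red, Green, and SWIR denote surface reflectance in the near-infrared, red, green, and shortwave-infrared bands.

```latex
\begin{align}
\mathrm{NDVI} &= \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + \mathrm{Red}} \\
\mathrm{NDBI} &= \frac{\mathrm{SWIR} - \mathrm{NIR}}{\mathrm{SWIR} + \mathrm{NIR}} \\
\mathrm{NDWI} &= \frac{\mathrm{Green} - \mathrm{NIR}}{\mathrm{Green} + \mathrm{NIR}}
\end{align}
```

Each index is bounded in [-1, 1], which is what makes them directly comparable when correlated with LST.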
... The semantic segmentation of remote sensing data has been an important topic for decades and applied in many fields [7], such as environmental monitoring [8,9], crop cover and analysis [10][11][12], the detection of land cover and land use changes [13], the inventory and management of natural resources [14,15], etc. The complexity of the geographical scene has considerably affected the accuracy of geographic feature classification [16][17][18][19], and the representativeness and quality of training samples have an important role in the performance of deep learning models for the semantic segmentation of remote sensing images [20][21][22]. ...
Article
Full-text available
Challenges in enhancing the multiclass segmentation of remotely sensed data include expensive and scarce labeled samples, complex geo-surface scenes, and resulting biases. The intricate nature of geographical surfaces, comprising varying elements and features, introduces significant complexity to the task of segmentation. The limited label data used to train segmentation models may exhibit biases due to imbalances or the inadequate representation of certain surface types or features. For applications like land use/cover monitoring, the assumption of evenly distributed simple random sampling may not be satisfied due to spatial stratified heterogeneity, introducing biases that can adversely impact the model’s ability to generalize effectively across diverse geographical areas. We introduced two statistical indicators to encode the complexity of geo-features under multiclass scenes and designed a corresponding optimal sampling scheme to select representative samples to reduce sampling bias during machine learning model training, especially that of deep learning models. The results of the complexity scores showed that the entropy-based and gray-based indicators effectively detected the complexity of geo-surface scenes: the entropy-based indicator was sensitive to the boundaries of different classes and the contours of geographical objects, while the Moran’s I indicator performed better in identifying the spatial structure information of geographical objects in remote sensing images. According to the complexity scores, the optimal sampling methods appropriately adapted the distribution of the training samples to the geo-context and enhanced their representativeness relative to the population. 
The single-score optimal sampling method achieved the highest improvement in DeepLab-V3 (increasing pixel accuracy by 0.3% and MIoU by 5.5%), and the multi-score optimal sampling method achieved the highest improvement in SegFormer (increasing ACC by 0.2% and MIoU by 2.4%). These findings carry significant implications for quantifying the complexity of geo-surface scenes and hence can enhance the semantic segmentation of high-resolution remote sensing images with less sampling bias.
... Therefore, three kinds of nonlinear models were established by RF, SVM and PSO-SVM. Random forest is a supervised machine learning algorithm based on ensemble learning (Feng et al., 2019). It can effectively reduce the risk of overfitting and is more conducive to obtain a robust model. ...
Article
Full-text available
A quantitative structure-activity relationship (QSAR) study was conducted to predict the anti-colon cancer and HDAC inhibition of triazole-containing compounds. The four descriptors with the most obvious effect on the inhibition of histone deacetylase (HDAC) were selected from 579 descriptors. Four QSAR models were constructed using heuristic algorithm (HM), random forest (RF), radial basis kernel function support vector machine (RBF-SVM) and support vector machine optimized by particle swarm optimization (PSO-SVM). Furthermore, the robustness of the four QSAR models was verified by the K-fold cross-validation method, described by Q². In addition, the R² values of the four models are all greater than 0.8, which indicates that the four descriptors selected are reasonable. Among the four models, the model based on the PSO-SVM method has the best prediction ability and robustness, with R² of 0.954, root mean squared error (RMSE) of 0.019 and Q² of 0.916 for the training set and R² of 0.965, RMSE of 0.017 and Q² of 0.907 for the test set. In this study, four key descriptors were discovered, which will help to screen effective new anti-colon cancer drugs in the future.
... An important source of high uncertainty in remote sensing information extraction is sampling bias, where unrepresentative samples are selected for training and/or testing, especially since the majority of the bias usually arises from complex contexts [68,69]. Therefore, we used the average multiscale surface complexity as the stratifying factor and the sampling weight to increase the representativeness of the samples. ...
Article
Full-text available
Recognizing and classifying natural or artificial geo-objects under complex geo-scenes using remotely sensed data remains a significant challenge due to the heterogeneity in their spatial distribution and sampling bias. In this study, we propose a deep learning method of surface complexity analysis based on multiscale entropy. This method can be used to reduce sampling bias and preserve entropy-based invariance in learning for the semantic segmentation of land use and land cover (LULC) images. Our quantitative models effectively identified and extracted local surface complexity scores, demonstrating their broad applicability. We tested our method using the Gaofen-2 image dataset in mainland China and accurately estimated multiscale complexity. A downstream evaluation revealed that our approach achieved similar or better performance compared to several representative state-of-the-art deep learning methods. This highlights the innovative and significant contribution of our entropy-based complexity analysis and its applicability in improving LULC semantic segmentations through optimal stratified sampling and constrained optimization, which can also potentially be used to enhance semantic segmentation under complex geo-scenes using other machine learning methods.
... Different from undersampling, oversampling achieves rational distribution of samples by increasing the number of minority class samples of the imbalanced training set [4,32]. Chawla et al. [33] proposed the synthetic minority oversampling technique (SMOTE), which is a powerful algorithm and has enjoyed great success in various applications [4,34]. Engelmann et al. present a conditional Wasserstein Generative Adversarial Network-based oversampling method for imbalanced learning [35]. ...
Article
Full-text available
The class imbalance problem has been reported to exist in remote sensing and hinders the classification performance of many machine learning algorithms. Several technologies, such as data sampling methods, feature selection-based methods, and ensemble-based methods, have been proposed to solve the class imbalance problem. However, these methods suffer from the loss of useful information or from artificial noise, or result in overfitting. A novel double ensemble algorithm is proposed in this paper to deal with the multi-class imbalance problem of hyperspectral images. This method first computes the feature importance values of the hyperspectral data via an ensemble model, then produces several balanced data sets based on oversampling and builds a number of classifiers. Finally, the classification results of these diverse classifiers are combined according to a specific ensemble rule. In the experiment, different data-handling methods and classification methods including random undersampling (RUS), random oversampling (ROS), Adaboost, Bagging, and random forest are compared with the proposed double random forest method. The experimental results on three imbalanced hyperspectral data sets demonstrate the effectiveness of the proposed algorithm.
... Therefore, they are at low risk of being discarded while the mislabelling problem is alleviated effectively by targeting high margin misclassified instances. Several studies have demonstrated the robustness of the unsupervised ensemble margin to noise through experiments conducted on corrupted versions of the original data (Guo and Boukir, 2013; Mellor et al., 2014; Feng and Boukir, 2015; Saidi et al., 2017; Feng et al., 2018; Feng et al., 2019; Boukir and Feng, 2019). The class labels of fixed percentages of examples selected at random are modified to evaluate the impact of noise on the performance of classifiers. ...
... Although a few studies could effectively handle the class imbalance [61,62], their performances were lower than the proposed method. Additionally, since multiple studies [63,64] did not apply the geometric mean metric in the validation step, the findings of the present work cannot be statistically compared with those of the other works. ...
Article
Full-text available
Timely and accurate Land Cover (LC) information is required for various applications, such as climate change analysis and sustainable development. Although machine learning algorithms are most likely successful in LC mapping tasks, the class imbalance problem is known as a common challenge in this regard. This problem occurs during the training phase and reduces classification accuracy for infrequent and rare LC classes. To address this issue, this study proposes a new method by integrating random under-sampling of majority classes and an ensemble of Support Vector Machines, namely Random Under-sampling Ensemble of Support Vector Machines (RUESVMs). The performance of RUESVMs for LC classification was evaluated in Google Earth Engine (GEE) over two different case studies using Sentinel-2 time-series data and five well-known spectral indices, including the Normalized Difference Vegetation Index (NDVI), Green Normalized Difference Vegetation Index (GNDVI), Soil-Adjusted Vegetation Index (SAVI), Normalized Difference Built-up Index (NDBI), and Normalized Difference Water Index (NDWI). The performance of RUESVMs was also compared with the traditional SVM and combination of SVM with three benchmark data balancing techniques namely the Random Over-Sampling (ROS), Random Under-Sampling (RUS), and Synthetic Minority Over-sampling Technique (SMOTE). It was observed that the proposed method considerably improved the accuracy of LC classification, especially for the minority classes. After adopting RUESVMs, the overall accuracy of the generated LC map increased by approximately 4.95 percentage points, and this amount for the geometric mean of producer's accuracies was almost 3.75 percentage points, in comparison to the most accurate data balancing method (i.e., SVM-SMOTE). Regarding the geometric mean of users' accuracies, RUESVMs also outperformed the SVM-SMOTE method with an average increase of 6.45 percentage points.
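The geometric mean of producer's accuracies reported in this abstract penalizes poor performance on rare classes much more strongly than overall accuracy does. A minimal sketch of the computation follows, assuming a confusion matrix whose rows are reference classes and whose columns are predicted classes; the matrix values and function names are hypothetical and are not taken from the study.

```python
from math import prod


def producers_accuracies(confusion):
    """Per-class producer's accuracy: correct count / reference (row) total.

    Assumes confusion[i][j] counts reference class i predicted as class j.
    """
    accs = []
    for i, row in enumerate(confusion):
        total = sum(row)
        accs.append(row[i] / total if total else 0.0)
    return accs


def geometric_mean(values):
    """Geometric mean; drops to 0 if any single class is never detected."""
    return prod(values) ** (1.0 / len(values))


# Hypothetical 3-class matrix: two frequent classes and one rare class.
confusion = [
    [90, 5, 5],
    [10, 80, 10],
    [4, 4, 2],
]

pa = producers_accuracies(confusion)
n_correct = sum(confusion[i][i] for i in range(len(confusion)))
overall = n_correct / sum(sum(row) for row in confusion)
gmean = geometric_mean(pa)

print("producer's accuracies:", pa)
print("overall accuracy:", round(overall, 3))
print("geometric mean:", round(gmean, 3))
```

In this toy matrix the rare class is mostly missed, so the geometric mean falls well below the overall accuracy even though most samples are classified correctly.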
... Regarding the balancing rules, although it was argued that fully rebalancing original data might lead to a decrease in Overall Accuracy (OA) [25], partial balancing of datasets has been rarely considered by the RS community. Additionally, the role of different balancing ratios (fractions) to balance imbalanced datasets has been ignored in most data balancing studies [26][27][28]. However, this is important because datasets are different in terms of imbalance ratio, number of classes, and number of samples per class [25]. ...
Article
Full-text available
Distribution of Land Cover (LC) classes is mostly imbalanced with some majority LC classes dominating against minority classes in mountainous areas. Although standard Machine Learning (ML) classifiers can achieve high accuracies for majority classes, they largely fail to provide reasonable accuracies for minority classes. This is mainly due to the class imbalance problem. In this study, a hybrid data balancing method, called the Partial Random Over-Sampling and Random Under-Sampling (PROSRUS), was proposed to resolve the class imbalance issue. Unlike most data balancing techniques which seek to fully balance datasets, PROSRUS uses a partial balancing approach with hundreds of fractions for majority and minority classes to balance datasets. For this, time-series of Landsat-8 and SRTM topographic data along with various spectral indices and topographic data were used over three mountainous sites within the Google Earth Engine (GEE) cloud platform. It was observed that PROSRUS had better performance than several other balancing methods and increased the accuracy of minority classes without a reduction in overall classification accuracy. Furthermore, adopting complementary information, particularly topographic data, considerably increased the accuracy of minority classes in mountainous areas. Finally, the obtained results from PROSRUS indicated that every imbalanced dataset requires a specific fraction(s) for addressing the class imbalance problem, because different datasets contain various characteristics.
Article
Accurate estimation of PM2.5 concentrations is critical to understanding and counteracting air pollution. In the past decade, various machine learning models, especially deep learning models, have been widely used in PM2.5 remote sensing estimation and have achieved remarkable performance. However, a typical pitfall of deep learning models is the problem of high-value underestimation, i.e., the models often underestimate high-level PM2.5 concentrations. Alleviating the problem of high-value underestimation and improving the estimation accuracy of high PM2.5 concentrations are crucial. This study developed a new deep learning model that combines data augmentation and a particle size constraint to improve the high-level PM2.5 estimation. Based on the residual neural network (ResNet), this study used random oversampling to construct a data-augmented deep residual learning model (AugResNet). In addition, a deep residual neural network model with a particle size constraint was established and called ConResNet. Then, the data augmentation and particle size constraint were incorporated into the deep residual neural network model (denoted as Aug_ConResNet). We evaluated the above four models across China with the 10-fold site-based cross-validation approach. In terms of the estimation accuracy for high PM2.5 concentrations (>75 μg/m³), AugResNet (R² = 0.813, RMSE = 19.152 μg/m³), ConResNet (R² = 0.796, RMSE = 20.841 μg/m³) and Aug_ConResNet (R² = 0.820, RMSE = 18.810 μg/m³) outperformed ResNet (R² = 0.780, RMSE = 21.628 μg/m³). Results showed that the data augmentation and particle size constraint alleviated the problem of high-value underestimation and improved the estimation accuracy for high PM2.5 concentrations. The accurate estimation of high PM2.5 concentrations has important application potential for remote sensing monitoring of polluted weather.
Article
Full-text available
This paper presents a new unsupervised classification method which aims to effectively and efficiently map remote sensing data. The Mean-Shift (MS) algorithm, a nonparametric density-based clustering technique, is at the core of our method. This powerful clustering algorithm has been successfully used for both the classification and the segmentation of gray scale and color images during the last decade. However, very little work has been reported regarding the performance of this technique on remotely sensed images. The main disadvantage of the MS algorithm lies in its high computational cost. Indeed, it is based on an optimization procedure to determine the modes of the pixel density. To investigate the MS algorithm in the difficult context of very high resolution remote sensing imagery, we use a fast version of this algorithm which has been recently proposed, namely the Path-Assigned Mean Shift (PAMS). This algorithm is up to 5 times faster than other fast MS algorithms while inducing a low loss in quality compared to the original MS version. To compensate for this loss, we propose to use the K modes (cluster centroids) obtained after convergence of the PAMS algorithm as an initialization of a K-means clustering algorithm. The latter converges very quickly to a refined solution to the underlying clustering problem. Furthermore, it does not suffer from the main drawback of the classic K-means algorithm (the number of clusters K needs to be specified) as K is automatically determined via the MS mode-seeking procedure. We demonstrate the effectiveness of this two-stage clustering method in performing automatic classification of aerial forest images. Both individual bands and band combination trials are presented. When compared to the classical PAMS algorithm, our technique is better in terms of classification quality. The improvement in classification is significant both visually and statistically. 
The whole classification process is performed in a few seconds on image tiles of around 1000 x 1000 pixels making this technique a viable alternative to traditional classifiers.
Conference Paper
Full-text available
This work exploits the margin theory to design better ensemble classifiers for remote sensing data. The margin paradigm is at the core of a new bagging algorithm. This method increases the classification accuracy, particularly in case of difficult classes, and significantly reduces the training set size. The same margin framework is used to derive a novel ensemble pruning algorithm. This method not only highly reduces the complexity of ensemble methods but also performs better than complete bagging in handling minority classes. Our techniques have been successfully used for the classification of remote sensing data.
Conference Paper
Full-text available
The main goal of this paper is to investigate the relationship between two theories widely applied to explain the success of classifier fusion: diversity measures and margin theory. To achieve this, we conducted an empirical study which evaluates some classical measures related to these two theories with respect to ensemble accuracy. In particular, this study revealed valuable insights on how these two theories can influence each other, and how the application of margin-based measures can be useful for the evaluation and selection of ensembles of classifiers with majority voting.
Article
Full-text available
An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
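The abstract describes the over-sampling step only as "creating synthetic minority class examples". A minimal sketch of the usual interpolation idea follows: each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the authors' implementation; the function name `smote_like`, its parameters, and the toy data are all hypothetical, and the base sample is drawn at random here rather than by iterating over every minority sample.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Simplified sketch: picks a random base sample, finds its k nearest
    minority neighbours (squared Euclidean distance), and interpolates
    towards one of them by a random fraction in [0, 1).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Hypothetical two-feature minority class with four samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=5)
for point in new_points:
    print(point)
```

Because every synthetic point is a convex combination of two real minority samples, it always lies inside the region already occupied by the minority class, which is what distinguishes this approach from duplicating samples.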
Book
The methodology used to construct tree structured rules is the focus of this monograph. Unlike many other statistical procedures, which moved from pencil and paper to calculators, this text's use of trees was unthinkable before computers. Both the practical and theoretical sides have been developed in the authors' study of tree methods. Classification and Regression Trees reflects these two sides, covering the use of trees as a data analysis method, and in a more mathematical framework, proving some of their fundamental properties.
Thesis
Classification has been widely studied in machine learning. Ensemble methods, which build a classification model by integrating multiple component learners, achieve higher performances than a single classifier. The classification accuracy of an ensemble is directly influenced by the quality of the training data used. However, real-world data often suffers from class noise and class imbalance problems. Ensemble margin is a key concept in ensemble learning. It has been applied to both the theoretical analysis and the design of machine learning algorithms. Several studies have shown that the generalization performance of an ensemble classifier is related to the distribution of its margins on the training examples. This work focuses on exploiting the margin concept to improve the quality of the training set and therefore to increase the classification accuracy of noise sensitive classifiers, and to design effective ensemble classifiers that can handle imbalanced datasets. A novel ensemble margin definition is proposed. It is an unsupervised version of a popular ensemble margin. Indeed, it does not involve the class labels. Mislabeled training data is a challenge to face in order to build a robust classifier whether it is an ensemble or not. To handle the mislabeling problem, we propose an ensemble margin-based class noise identification and elimination method based on an existing margin-based class noise ordering. This method can achieve a high mislabeled instance detection rate while keeping the false detection rate as low as possible. It relies on the margin values of misclassified data, considering four different ensemble margins, including the novel proposed margin. This method is extended to tackle the class noise correction which is a more challenging issue. The instances with low margins are more important than safe samples, which have high margins, for building a reliable classifier. 
A novel bagging algorithm based on a data importance evaluation function, which again relies on the ensemble margin, is proposed to deal with the class imbalance problem. In our algorithm, the emphasis is placed on the lowest margin samples. This method is evaluated, again using four different ensemble margins, in addressing the imbalance problem, especially on multi-class imbalanced data. In remote sensing, where training data are typically ground-based, mislabeled training data is inevitable. Imbalanced training data is another problem frequently encountered in remote sensing. Both proposed ensemble methods involving the best margin definition for handling these two major training data issues are applied to the mapping of land covers.
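The abstract does not give formulas for the ensemble margins it mentions. The sketch below uses the definitions commonly found in the ensemble-margin literature, which are an assumption here: a supervised margin that compares the votes for the true class with the strongest competing class, and an unsupervised margin that compares the two most-voted classes without using the label. Function names and the vote data are hypothetical.

```python
from collections import Counter


def supervised_margin(votes, true_label):
    """(votes for the true class - most votes for any other class) / T.

    Ranges over [-1, 1]; negative values mean the ensemble misclassifies.
    """
    counts = Counter(votes)
    v_true = counts.get(true_label, 0)
    v_other = max(
        (c for label, c in counts.items() if label != true_label), default=0
    )
    return (v_true - v_other) / len(votes)


def unsupervised_margin(votes):
    """(votes for the top class - votes for the runner-up) / T.

    Ranges over [0, 1] and does not use the class label.
    """
    ranked = [c for _, c in Counter(votes).most_common(2)]
    second = ranked[1] if len(ranked) > 1 else 0
    return (ranked[0] - second) / len(votes)


# Hypothetical votes of a 10-tree ensemble for one training sample.
votes = ["forest"] * 6 + ["water"] * 3 + ["urban"] * 1

print(supervised_margin(votes, "forest"))  # label agrees with the majority
print(supervised_margin(votes, "water"))   # label disagrees with the majority
print(unsupervised_margin(votes))
```

Under these definitions, a low margin flags a sample near a class boundary, which is the kind of sample the abstract says the bagging algorithm emphasizes.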
Article
In this letter, we propose a new weight-based rotation forest (WRoF) induction algorithm for the classification of hyperspectral images. The main idea of the new method is to guide the growth of trees adaptively by exploiting the potential of important instances. The importance of a training instance is reflected by a dynamic weight function. The higher the weight of an instance, the more the next tree will have to focus on the instance. Experimental results on two real hyperspectral data sets show that the WRoF algorithm results in significant classification improvement compared with random forests and rotation forest.
Article
Ensemble methods have been successfully used as a classification scheme. This work focuses on exploiting the margin theory to design better ensemble classifiers. We show that low margin instances have a major influence in building reliable classifiers. The margin paradigm is at the core of a new ordering-based mislabeled instance elimination method. The same margin framework, relying on an alternative definition of ensemble margin, is used to derive a novel ensemble diversity measure that has the property of revealing sources of diversity at data level. Our work has been successfully applied to image data.
Article
This work introduces new ensemble margin criteria, to evaluate the performance of Random Forests (RF), in the context of large area land cover classification, using imbalanced and noisy training data. Experiments using binary and multiclass classification problems reveal insights into the behaviour of RF over big data, in which training data contains noise and may not be evenly distributed among classes. The margin-based RF performance evaluation is conducted using remote sensing and ancillary spatial data, across a 7.2 million hectare study area.
Article
Fully Polarimetric Synthetic Aperture Radar (PolSAR) has the advantages of all-weather, day-and-night observation and high-resolution capabilities. The collected data are usually stored in the Sinclair matrix or in coherence or covariance matrices, which are directly related to the physical properties of natural media and the backscattering mechanism. Additional information related to the nature of the scattering medium can be exploited through polarimetric decomposition theorems. Accordingly, PolSAR image classification has gained increasing attention from the remote sensing community in recent years. However, the above polarimetric measurements or parameters cannot provide sufficient information for accurate PolSAR image classification in some scenarios, e.g. in complex urban areas where different scattering media may exhibit similar PolSAR responses for a number of unavoidable reasons. Given that the complementarity between spectral and spatial features has brought remarkable improvements in optical image classification, the complementary information between polarimetric and spatial features may also contribute to PolSAR image classification. Therefore, the roles of textural features such as contrast, dissimilarity, homogeneity and local range, and of morphological profiles (MPs), in PolSAR image classification are investigated using two advanced ensemble learning (EL) classifiers: Random Forest and Rotation Forest. The supervised Wishart classifier and support vector machines (SVMs) are used as benchmark classifiers for evaluation and comparison purposes. Experimental results with three Radarsat-2 images in quad polarization mode indicate that classification accuracies could be significantly increased by integrating spatial and polarimetric features using ensemble learning strategies. Rotation Forest achieves better accuracy than SVM and Random Forest, while Random Forest is much faster than Rotation Forest.
Article
Hyperspectral remote sensing images contain rich spectral information for distinguishing different land-cover classes. Sometimes, some classes have far fewer pixels than other classes. In this case, traditional classification methods are not appropriate because they are prone to assigning all the pixels to the classes with a large number of pixels. For such an imbalanced problem, ensemble learning is a good method, as it partitions the majority classes into different groups of small size. However, the existing ensemble schemes are independent of the classifier, so they do not achieve the best performance for a given classifier. In this letter, the selected classifier, i.e., a support vector machine (SVM), is considered in an ensemble procedure to improve the classification accuracy. Specifically, the criterion of the SVM, i.e., the maximum margin, is adopted to guide the ensemble learning procedure for imbalanced hyperspectral image classification. Experiments show that our method obtains higher classification accuracy than the SVM and several representative imbalanced classification methods for hyperspectral images.
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
Article
Random Forests are considered for classification of multisource remote sensing and geographic data. Various ensemble classification methods have been proposed in recent years. These methods have been proven to improve classification accuracy considerably. The most widely used ensemble methods are boosting and bagging. Boosting is based on sample re-weighting but bagging uses bootstrapping. The Random Forest classifier uses bagging, or bootstrap aggregating, to form an ensemble of classification and regression tree (CART)-like classifiers. In addition, it searches only a random subset of the variables for a split at each CART node, in order to minimize the correlation between the classifiers in the ensemble. This method is not sensitive to noise or overtraining, as the resampling is not based on weighting. Furthermore, it is computationally much lighter than methods based on boosting and somewhat lighter than simple bagging. In the paper, the use of the Random Forest classifier for land cover classification is explored. We compare the accuracy of the Random Forest classifier to other better-known ensemble methods on multisource remote sensing and geographic data.
Conference Paper
Many real-world applications, such as medical diagnosis, fraud detection, and text classification, run into problems when learning from imbalanced data sets. When there are very few minority class instances, they cannot provide sufficient information, and performance degrades greatly. As a good way to improve the classification performance of a weak learner, some ensemble-based algorithms have been proposed to solve the class imbalance problem. However, it is still not clear how diversity, one influential factor of an ensemble, affects classification performance, especially on minority classes. This paper explores the impact of diversity on each class and on overall performance. Accuracy, the other influential factor, is also discussed because of the trade-off between diversity and accuracy. Firstly, three popular re-sampling methods are combined into our ensemble model and evaluated for diversity analysis: under-sampling, over-sampling, and SMOTE, a data generation algorithm. Secondly, we experiment not only on two-class tasks, but also on those with multiple classes. Thirdly, we improve SMOTE in a novel way for handling multi-class data sets in an ensemble model: SMOTEBagging.
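The abstract names SMOTE only as a data generation algorithm. As a rough illustration, the sketch below assumes the commonly described SMOTE step: a synthetic minority sample is placed at a random position on the segment between a minority instance and one of its k nearest minority neighbours. The function name, the toy points, and k are hypothetical, and the SMOTEBagging variant proposed in the paper is not reproduced here.

```python
import math
import random


def smote_sample(minority, k, rng):
    """Create one synthetic point between a minority instance and a near neighbour."""
    base = rng.choice(minority)
    # Rank the remaining minority instances by Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))


rng = random.Random(0)
# Toy minority class: the four corners of the unit square (illustrative data only).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
```

Every synthetic point lies on a segment between two existing minority instances, so this form of over-sampling adds new minority examples rather than duplicating existing ones.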
Article
In machine learning problems, differences in prior class probabilities -- or class imbalances -- have been reported to hinder the performance of some standard classifiers, such as decision trees. This paper presents a systematic study aimed at answering three different questions. First, we attempt to understand the nature of the class imbalance problem by establishing a relationship between concept complexity, size of the training set and class imbalance level. Second, we discuss several basic re-sampling or cost-modifying methods previously proposed to deal with the class imbalance problem and compare their effectiveness. The results obtained by such methods on artificial domains are linked to results in real-world domains. Finally, we investigate the assumption that the class imbalance problem does not only affect decision tree systems but also affects other classification systems such as Neural Networks and Support Vector Machines.
Article
One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero. In this paper, we show that this phenomenon is related to the distribution of margins of the training examples with respect to the generated voting classification rule, where the margin of an example is simply the difference between the number of correct votes and the maximum number of votes received by any incorrect label. We show that techniques used in the analysis of Vapnik's support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error. We also show theoretically and experimentally that boosting is especially effective at increasing the margins of the training examples. Finally, we compare our explanation to those based on the ...
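The margin defined in this abstract can be computed directly from the votes of an ensemble. The sketch below follows that definition with unweighted votes; the function name and the land cover labels are hypothetical and used only for illustration.

```python
from collections import Counter


def vote_margin(votes, true_label):
    """Correct votes minus the largest vote count received by any incorrect label."""
    counts = Counter(votes)
    correct = counts.get(true_label, 0)
    # default=0 covers the case where no incorrect label received any vote.
    best_wrong = max(
        (c for label, c in counts.items() if label != true_label), default=0
    )
    return correct - best_wrong


# Five voters (for example, five trees) classify one example.
votes = ["forest", "forest", "water", "urban", "forest"]
margin_if_forest = vote_margin(votes, "forest")  # 3 correct, 1 for best wrong label
margin_if_water = vote_margin(votes, "water")    # 1 correct, 3 for best wrong label
```

A positive margin means the example is classified correctly by the vote, a negative margin means it is misclassified, and the magnitude indicates how decisive the vote was. Dividing by the number of voters would give a normalized value in [-1, 1]; that normalization is an optional convention and not part of the definition quoted above.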