Machine Learning algorithms: a study on noise sensitivity
Elias Kalapanidas1, Nikolaos Avouris1, Marian Craciun2, and Daniel Neagu3

1 University of Patras, Rio Patras, GR-265 00, Greece
2 University of Dunarea de Jos, Galati, Romania
3 University of Bradford, Bradford, UK
Abstract. In this study, the results of a variety of ML algorithms are tested against datasets artificially polluted with noise. Two noise models are tested, each studied on a range of noise levels from 0 to 50%. The algorithms examined include a k-nn algorithm, a linear regression algorithm, a decision tree, an M5 algorithm, a decision table classifier, a voting feature interval scheme, as well as a hyper pipes classifier. The study is based on an environmental field of application, employing data from two air quality prediction problems, a toxicity classification problem, and four artificially produced datasets. The results contain an evaluation of classification criteria for every algorithm and noise level in the noise sensitivity study. They suggest that the best algorithms per problem, in terms of showing the lowest RMS error, are the decision table and linear regression, for classification and regression problems respectively.
1 Introduction
Noise is a random error or variance in a measured variable [1]. Real datasets coming from the monitoring of environmental problems usually contain noisy data, mainly due to malfunctions or unfortunate calibrations of measurement equipment, or to network problems during the transport of sensor information to a central measurement collection unit. Some type of noise is present in almost any real-world problem, but its characteristics are not always known.
Predictive algorithms have typically used synthetic datasets during their development stage. In order to cope with real-world problems, where the presence of noise in the data is a common fact, the algorithms require a pre-processing module that deters the impact of noise on the data before it is processed. The way this module transforms the data may significantly affect the performance of the constructed model.
In this study, many machine learning algorithms from the machine learning platform Weka [2] are examined in the presence of increasing levels of artificial noise. The gradual impact of noise turns the initial problem from a deterministic one into a stochastic one. The aim of this study is to find machine learning algorithms that exhibit a
good fit to the noise-inflicted datasets as well as a smooth degradation as the noise level increases. The results should be useful for any similar problem facing noise impurities in its data. The algorithms engaged in our study are summarized in table 1.
Three sources of data were exploited in this study: two problems of short-term prediction of daily maximum pollutant concentrations, one problem of classification of the toxicity level of various chemical substances, and four artificial datasets formed the basis of the research conducted in this paper.
Algorithm               | Weka Scheme      | Type*  | Description
------------------------|------------------|--------|------------
Zero Rule               | ZeroR            | R/C    | A very naive algorithm that classifies all cases to the majority class. Used for reference purposes.
K-nn                    | IB-k**           | R/C    | The well-known instance-based k-nearest-neighbors algorithm, implemented in accordance with Aha and Kibler [3].
Linear Regression       | LinearRegression | R      | The linear regression algorithm.
M5                      | M5Prime          | R      | An algorithm exploiting decision trees with regression on the cases at each leaf.
KStar                   | KStar            | R      | An instance-based learner using an entropic distance measure [4].
MLP                     | Neural Network   | R/C*** | An implementation of the classical MLP neural network trained by the feed-forward back-propagation algorithm.
Decision Table          | DecisionTable    | C      | A scheme that produces rules formatted as a table from selected attributes (following a wrapper-type feature selection prior to the training phase).
Hyper Pipes             | HyperPipes       | C      | For each class a HyperPipe is constructed that contains all points of that class. Test instances are classified according to the class that most contains the instance.
C4.5 decision tree      | J48              | C      | An implementation of the C4.5 decision tree [7].
C4.5 Rules              | J48.PART         | C      | A scheme for building rules from partial decision trees.
Voting Feature Interval | VFI              | C      | A simple scheme that calculates the occurrences of feature intervals per class and classifies by voting on new cases [6].

Table 1. A summary of the machine learning algorithms evaluated in the noise sensitivity study.
*: R for regression-type, C for classification-type problems.
**: for this study, 9 neighbors were chosen for the k parameter after a preliminary study with another sample of data not used in the final experiments.
***: an MLP with fixed parameters was used, having 20 hidden neurons, a sigmoid activation function on each neuron, and 500 epochs of training with a learning rate of 0.2 and a momentum of 0.2.
2 Past experience on noise sensitivity of machine learning algorithms
In the past, a small number of studies have been reported on the sensitivity of machine learning algorithms to noise. Noise models were examined on different variants of the TD reinforcement learning algorithm in 1994 [11], as well as on induction learning programs by Chevaleyre and Zucker [5]. Following the tracks of the pioneering idea of Kearns [9] about statistical query models, Teytaud [12] theoretically explains the relation between some noise models and regression algorithms.
Li et al. [10] presented a study of four machine learning algorithms, i.e. a C4.5-type decision tree, a naive Bayes classifier, a decision rules classifier and the OneR one-rule method, under a noise model. They compared the results of the algorithms before and after applying the wavelet denoising technique, for small levels of noise, finding that the technique boosted the efficiency of the algorithms at almost all of the noise levels.
3 Emulating noise: The Noise Model Examined
In the following study, a noise model is applied to the datasets at hand, introducing a white-noise type of deformation on the original data. Two assumptions are considered to be true for all datasets:
1. The variables of the dataset (both the independent and the dependent variables) are normally distributed.
2. Noise is randomly distributed and independent of the data.
Then, for every case (y_i, x_i) in the dataset L, the pair (y_i, x_i) of the dependent variable Y and the matrix of independent variables X is substituted by another pair (y'_i, x'_i) with a probability of n, where n is the noise level. The new pair is calculated by the following formula:

x'_{ij} = \begin{cases} x_{ij} + \sigma_{x_j} z_{ij}, & p_{ij} \ge n \\ x_{ij}, & p_{ij} < n \end{cases}
\qquad
y'_i = \begin{cases} y_i + \sigma_y z_i, & p_i \ge n \\ y_i, & p_i < n \end{cases}
\tag{1}

z_{ij} = \mathrm{norminv}(p_{ij}), \qquad j \in \{1, \dots, k\}
Here σ_{x_j} is the standard deviation of x_j; z_{ij} is a normally distributed random variable with a mean of zero and a standard deviation equal to unity, calculated as the inverse cumulative distribution function of the normal distribution evaluated at p_{ij}; and p_{ij} ∈ (0, 1) is a probability variable produced by a random value generator following a uniform distribution.
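As an illustration, the following Python sketch applies the model of Eq. (1) to a numeric dataset. It is a minimal sketch, not the authors' original implementation; it follows the printed condition (a cell is perturbed when p_ij >= n) and assumes the dependent variable uses its own uniform draw per case.

```python
# Minimal sketch of the noise model of Eq. (1); not the authors' code.
# NOTE: the printed condition perturbs when p >= n (i.e. with probability
# 1 - n); flip the comparison to perturb with probability n instead.
import numpy as np
from scipy.stats import norm

def pollute(X, y, n, seed=None):
    """Return noisy copies of X (cases x features) and y for noise level n."""
    rng = np.random.default_rng(seed)
    Xn, yn = X.astype(float).copy(), y.astype(float).copy()
    sigma_x = Xn.std(axis=0)             # sigma_{x_j}: per-feature std
    sigma_y = yn.std()
    p_x = rng.uniform(size=Xn.shape)     # uniform p_ij in (0, 1)
    p_y = rng.uniform(size=yn.shape)
    z_x = norm.ppf(p_x)                  # z_ij = norminv(p_ij): standard normal
    z_y = norm.ppf(p_y)
    mask_x, mask_y = p_x >= n, p_y >= n  # cells selected for perturbation
    Xn[mask_x] += (sigma_x * z_x)[mask_x]
    yn[mask_y] += (sigma_y * z_y)[mask_y]
    return Xn, yn
```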
4 Experiments and discussion
Our experiments are based on two air quality problems, a toxicity classification problem and four artificial datasets. The dependent variables of the first two problems correspond to the maximum daily concentrations of nitrogen dioxide and of ozone, two harmful aerial pollutants, after 10 o'clock in the morning. The corresponding datasets contain five and eight main input attributes respectively, selected through a feature-selection procedure using a genetic algorithm [8]. The toxicity problem refers to the estimation of the toxicity index of several chemical substances, with 20 features and 1 dependent variable. Finally, 4 artificial problems have been included in the study as described in [13], consisting of four features and one output variable. Of these four datasets, one implements a multivariate problem, another a linear function, while the third and the fourth refer to non-linear functions. Table 2 summarizes this information per problem studied.
Problem Code | Description                                                                              | Dependent Variable
-------------|------------------------------------------------------------------------------------------|--------------------
A1           | Artificial problem                                                                       | Numerical / multivariate
A2           | Artificial problem                                                                       | Numerical / linear
A3           | Artificial problem                                                                       | Numerical / non-linear of the form x^2
A4           | Artificial problem                                                                       | Numerical / non-linear of the form x^2
NO2          | Daily maximum concentration forecasting from sensory data                                | Numerical / non-linear
O3           | Daily maximum concentration forecasting from sensory data                                | Numerical / non-linear
TOX          | Classification of an index of toxicity for various substances from chemical descriptors | Numerical

Table 2. A description of the problems of the noise-sensitivity study.
The artificial problems A1-A4 are of the type y = f(x1, x2, x3, x4) and have been created using the formulas of table 3:

Problem | x1 = | x2 =                 | x3 =           | x4 = | y =
--------|------|----------------------|----------------|------|-------------------------
A1      | z    | x1·0.8 + z·0.6       | x1·0.6 + z·0.8 | z    | (x1·x2·x3 + x4) / 1.47
A2      | z    | (x1^2 + z·0.5) / 1.5 | x1·0.6 + z·0.8 | z    | (x1·x2·x3 + x4) / 1.7
A3      | z    | x1·0.8 + z·0.6       | x1·0.6 + z·0.8 | z    | (x1·x2^2·x3 + x4) / 1.96
A4      | z    | x1·0.8 + z·0.6       | x1·0.6 + z·0.8 | z    | (x1·x2·x3 + x4^2) / 1.76

Table 3. Definition of the variables for the four artificial problems A1-A4.
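For concreteness, a sketch of generating a dataset such as A1 from table 3 follows. It assumes, consistently with the table, that every occurrence of z is an independent standard-normal draw; the function name and sample size are illustrative.

```python
# Sketch of generating the A1 problem of Table 3; each z below is an
# independent standard-normal draw (an assumption about Table 3's notation).
import numpy as np

def make_a1(n_cases=1000, seed=None):
    rng = np.random.default_rng(seed)
    z = lambda: rng.standard_normal(n_cases)  # fresh noise per occurrence
    x1 = z()
    x2 = x1 * 0.8 + z() * 0.6
    x3 = x1 * 0.6 + z() * 0.8
    x4 = z()
    y = (x1 * x2 * x3 + x4) / 1.47
    return np.column_stack([x1, x2, x3, x4]), y
```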
All cases containing missing values have been deleted. Although the resulting datasets may already contain an amount of noise, for this study they are considered clean data; this does not influence the experiments, as the noise models studied refer to additive noise.
Artificial noise was generated at random throughout the whole of each dataset. The reason for "polluting" both the training set and the evaluation set is that, as noise has emerged in the data so far, it will emerge in the future with the same probability and the same patterns.
Repeated experiments have been run on each of the polluted datasets. Eight noise levels have been tested, ranging from 0 to 50%: {0, 0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50}. Five different datasets were produced for every such noise level, and the results of the competing machine learning schemes were averaged over these five datasets. Five-fold cross-validation experiments were carried out for each of these datasets. For each run, eleven machine learning algorithms were trained and tested following the five-fold cross-validation scheme. These are summarized in table 1 along with a short description of each. Half of them are suitable for regression-type problems, while the other half are classification algorithms. Since all the problems were regression problems, the numerical dependent variable of each problem was transformed into a categorical one by dividing the initial range into 5 equally wide intervals. Thus the seven initial regression datasets were transformed into seven classification datasets, ready to be processed by the classification-type algorithms.
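The equal-width discretization step can be sketched as follows; a minimal illustration, assuming the bin boundaries are computed from the observed range of each dataset.

```python
# Equal-width discretization of a numerical target into 5 bins,
# as used to turn the regression datasets into classification ones.
import numpy as np

def discretize(y, n_bins=5):
    # interior cut points of n_bins equal-width intervals over [min, max]
    edges = np.linspace(y.min(), y.max(), n_bins + 1)[1:-1]
    return np.digitize(y, edges)  # class labels 0 .. n_bins-1
```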
From the variety of data collected in this study, the RMS error metric was chosen to judge the fit of each machine learning algorithm to every artificially polluted dataset. Though other metrics, such as prediction accuracy or classification error, are also used in many publications, the RMS error is a stricter and more suitable evaluator of the efficiency of an algorithm for the purposes of a comparison study.
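For reference, the metric reduces to the following in the numeric case (for classifiers, Weka computes the RMS error on predicted class probabilities against 0/1 class indicators):

```python
# Root-mean-squared error between targets and predictions (numeric case).
import numpy as np

def rmse(y_true, y_pred):
    d = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))
```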
[Figure: example noise sensitivity diagram (Problem A1, classification algorithms: Decision Table, ZeroR, HyperPipes, C4.5, C4.5Part, VFI). Horizontal axis: Noise Level %; left vertical axis: RMSE (Area 1, upper/upper-left); right vertical axis: Noise-0 Relative RMSE (Area 2, lower/lower-right).]
Fig. 1. Example of a noise sensitivity diagram: RMSE and Noise-0 Relative RMSE areas.
The diagrams that follow depict the sensitivity of the algorithms to noise in the form of noise curves. Each of the 7 problems is represented by a pair of diagrams, one for the regression-type and one for the classification-type algorithms. In every diagram, the left vertical axis represents the RMS error of the algorithms, while the noise-0 relative RMS error is measured on the right vertical axis. The latter metric is the RMS error after the application of noise to the data minus the RMS error at the zero noise level. The gradually increasing levels of artificial noise reside on the horizontal axis. The algorithms least sensitive to the presence of noise are those whose noise-0 relative RMSE line stays as close to the horizontal axis as possible.
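A minimal sketch of this curve, given one algorithm's per-level RMSE values (the first entry is assumed to correspond to the 0% level):

```python
# Noise-0 relative RMSE: RMSE at each noise level minus RMSE at level 0.
import numpy as np

def noise0_relative(rmse_per_level):
    r = np.asarray(rmse_per_level, dtype=float)
    return r - r[0]  # assumes r[0] is the RMSE at the zero-noise level
```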
For readability and compactness, both the RMSE and noise-0 relative RMSE curves are drawn in the same diagram. The two sets of curves occupy different areas, as the example of fig. 1 indicates: area 1, containing the RMSE curves, always occupies the upper and upper-left part of the diagram, while area 2, where the noise-0 relative RMSE curves fit, is contained in the lower and lower-right part.
4.1 Inquiring the regression results
It is clear from the diagrams of the noise curves of the regression-type algorithms that their behavior varies between the artificial problems and the real-world problems.
In all four artificial problems A1-A4, the best algorithms show a very good fit to the noise-free problems at the 0 noise level, but beyond that level their RMS error jumps up abruptly and then grows almost linearly. This behavior is more visible in the noise-0 relative RMSE curves. For the first two problems A1 and A2, linear regression proves to be the best method, as expected, followed closely by M5. For the two non-linear problems A3 and A4, M5 and IB-9 fit better. Though these algorithms exhibit the better RMSE curves, their corresponding noise-0 relative RMSE curves are among the worst. The conclusion emerging from the four problems A1-A4 is that the weaker algorithms appear to be the least sensitive to noise, and vice versa.
The real-world problems NO2, O3 and TOX present a different picture. All RMSE curves are gathered inside a narrow band, close to each other. In all cases linear regression fits the datasets of these problems best. In disagreement with the conclusion drawn from the artificial problems A1-A4, the noise-0 relative RMSE curves mirror the behavior of their RMSE counterparts.
4.2 Exploring the classification results
For all the problems except TOX, VFI and HyperPipes are the least fit algorithms, having RMSE curves above that of the reference algorithm ZeroR. Since ZeroR is used as an efficiency threshold, all algorithms exhibiting RMSE curves above its own are considered unsuitable for solving the problem at hand. Of the other three algorithms, Decision Table is the dominant one, having the lowest RMSE curve and the lowest noise-0 relative RMSE for all seven problems studied in this report.
Another interesting finding is that the noise curves of the classification algorithms are much less sensitive than those of the regression-type algorithms. This may be a result of the discretization of the dependent (output) variable. We assume that as the number of discrete bins in the discretization process increases, the average slope of the noise curves will also increase, so as to match that of the regression-type algorithms as the number of bins approaches infinity.
[Figure: two panels of noise curves for Problem A1 — one for the classification algorithms (Decision Table, ZeroR, HyperPipes, C4.5, C4.5Part, VFI) and one for the regression algorithms (Lin. Regression, ZeroR, IB-9, M5, Neural Net, KStar). Horizontal axis: Noise Level % (0-50); left axis: RMSE; right axis: Noise-0 Relative RMSE.]
Fig. 2. RMS and noise-0 relative RMS Error for the A1 Problem.
[Figure: two panels of noise curves for Problem A2 — one for the classification algorithms (Decision Table, ZeroR, HyperPipes, C4.5, C4.5Part, VFI) and one for the regression algorithms (Lin. Regression, ZeroR, IB-9, M5, Neural Net, KStar). Horizontal axis: Noise Level % (0-50); left axis: RMSE; right axis: Noise-0 Relative RMSE.]
Fig. 3. RMS and noise-0 relative RMS Error for the A2 Problem.
[Figure: two panels of noise curves for Problem A3 — one for the classification algorithms (Decision Table, ZeroR, HyperPipes, C4.5, C4.5Part, VFI) and one for the regression algorithms (Lin. Regression, ZeroR, IB-9, M5, Neural Net, KStar). Horizontal axis: Noise Level % (0-50); left axis: RMSE; right axis: Noise-0 Relative RMSE.]
Fig. 4. RMS and noise-0 relative RMS Error for the A3 Problem.
[Figure: two panels of noise curves for Problem A4 — one for the classification algorithms (Decision Table, ZeroR, HyperPipes, C4.5, C4.5Part, VFI) and one for the regression algorithms (Lin. Regression, ZeroR, IB-9, M5, Neural Net, KStar). Horizontal axis: Noise Level % (0-50); left axis: RMSE; right axis: Noise-0 Relative RMSE.]
Fig. 5. RMS and noise-0 relative RMS Error for the A4 Problem.
[Figure: two panels of noise curves for Problem NO2 — one for the classification algorithms (Decision Table, ZeroR, HyperPipes, C4.5, C4.5Part, VFI) and one for the regression algorithms (Lin. Regression, ZeroR, IB-9, M5, Neural Net, KStar). Horizontal axis: Noise Level % (0-50); left axis: RMSE; right axis: Noise-0 Relative RMSE.]
Fig. 6. RMS and noise-0 relative RMS Error for the NO2 Problem.
[Figure: two panels of noise curves for Problem O3 — one for the classification algorithms (Decision Table, ZeroR, HyperPipes, C4.5, C4.5Part, VFI) and one for the regression algorithms (Lin. Regression, ZeroR, IB-9, M5, Neural Net, KStar). Horizontal axis: Noise Level % (0-50); left axis: RMSE; right axis: Noise-0 Relative RMSE.]
Fig. 7. RMS and noise-0 relative RMS Error for the O3 Problem.
[Figure: two panels of noise curves for the TOX problem — one for the classification algorithms (Decision Table, ZeroR, HyperPipes, C4.5, C4.5Part, VFI) and one for the regression algorithms (Lin. Regression, ZeroR, IB-9, M5, Neural Net, KStar). Horizontal axis: Noise Level % (0-50); left axis: RMSE; right axis: Noise-0 Relative RMSE.]
Fig. 8. RMS and noise-0 relative RMS Error for the TOX Problem.
5 Conclusions
This paper reports a study of the noise sensitivity of machine learning algorithms, based on four artificial and three real-world problems. A noise model has been tested for noise levels ranging from 0 to 0.5. The dependent variable was transformed from numerical to nominal in order to test the classification algorithms. Thus a range of regression-type and classification-type algorithms has been examined by measuring their sensitivity to noise, as artificial additive noise was applied to the initial datasets.
It has been shown that, among the regression-type algorithms, linear regression adapts best to the gradually increasing noise levels. Also noticeable from the artificial datasets A1-A4 is the fact that the better algorithms in terms of RMSE present the greater noise sensitivity, while the worst seem to be the least sensitive. Decision Table appears to be the method least sensitive to additive noise among the classification learners: not only does it show the best RMSE on all of the datasets, it also exhibits good behavior in terms of the noise-0 relative RMSE.
Future work expanding the reported study includes further experiments on different problems; we believe that the forthcoming results will help in forming general guidelines for selecting the best machine learning algorithm for modeling or prediction in problems prone to noise. Finally, it is worth noting that all data for the examined datasets originate from PERPA, the Greek air quality monitoring authority.
References
1. Han, J. and Kamber, M. (2000), "Data Mining: Concepts and Techniques". Morgan Kaufmann Publishers.
2. Garner, S.R. (1995), "WEKA: The Waikato Environment for Knowledge Analysis". In Proceedings of the New Zealand Computer Science Research Students Conference, pp. 57-64.
3. Aha, D. and Kibler, D. (1991), "Instance-based learning algorithms", Machine Learning, vol. 6, pp. 37-66.
4. Cleary, J.G. and Trigg, L.E. (1995), "K*: An Instance-based Learner Using an Entropic Distance Measure". In Proceedings of the 12th International Conference on Machine Learning, pp. 108-114.
5. Chevaleyre, Y. and Zucker, J.-D. (2000), "Noise-Tolerant Rule Induction for Multi-Instance Data". In Proceedings of the ICML-2000 Workshop on "Attribute-Value and Relational Learning".
6. Demiroz, G. and Guvenir, A. (1997), "Classification by Voting Feature Intervals", ECML-97.
7. Quinlan, J.R. (1993), "C4.5: Programs for Machine Learning", Morgan Kaufmann, San Mateo, California, USA.
8. Kalapanidas, E. and Avouris, N. (2002), "Feature Selection Using a Genetic Algorithm Applied on an Air Quality Forecasting Problem". In Proceedings of the BESAI (Binding Environmental Sciences with AI) Workshop, ECAI 2002, Lyon, France.
9. Kearns, M. (1993), "Efficient Noise-Tolerant Learning from Statistical Queries". In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, pp. 392-401, May 16-18, San Diego, California, United States.
10. Li, Q., Li, T., Zhu, S. and Kambhamettu, C. (2002), "Improving Medical/Biological Data Classification Performance by Wavelet Preprocessing". In Proceedings of ICDM 2002.
11. Pendrith, M. (1994), "On Reinforcement Learning of Control Actions in Noisy and Non-Markovian Domains". Technical Report UNSW-CSE-TR-9410, School of Computer Science and Engineering, The University of New South Wales, Sydney, Australia.
12. Teytaud, O. (2001), "Robust Learning: Regression Noise". In Proceedings of IJCNN 2001, pp. 1787-1792.
13. Sarle, W.S. (1998), "Prediction with Missing Inputs". In Wang, P.P. (ed.), JCIS 98 Proceedings, Vol. II, Research Triangle Park, NC, pp. 399-402, ftp://ftp.sas.com/pub/neural/JCIS98.ps.