Machine Learning algorithms: a study on noise sensitivity
Elias Kalapanidas1, Nikolaos Avouris1, Marian Craciun2, and Daniel Neagu3

1 University of Patras, Rio Patras, GR-265 00, Greece
2 University of Dunarea de Jos, Galati, Romania
3 University of Bradford, Bradford, UK
Abstract. In this study, the results of a variety of ML algorithms are tested against datasets artificially polluted with noise. Two noise models are tested, each studied on a range of noise levels from 0 to 50%. The algorithms examined include a k-nn algorithm, a linear regression algorithm, a decision tree, an M5 algorithm, a decision table classifier, a voting feature interval scheme, as well as a hyper pipes classifier. The study is based on an environmental field of application, employing data from two air quality prediction problems, a toxicity classification problem, and four artificially produced datasets. The results contain an evaluation of classification criteria for every algorithm and noise level in the noise sensitivity study. They suggest that the best algorithms per problem, in terms of showing the lowest RMS error, are the decision table and linear regression, for classification and regression problems respectively.
1 Introduction
Noise is a random error or variance in a measured variable [1]. Real datasets coming from the monitoring of environmental problems usually contain noisy data, mainly due to malfunctions or unfortunate calibrations of measurement equipment, or to network problems during the transport of sensor information to a central measurement collection unit. Some type of noise is present in almost any real-world problem, but its characteristics are not always known.
Predictive algorithms have typically used synthetic datasets during their development stage. In order to cope with real-world problems, where the presence of noise in the data is a common fact, the algorithms require a pre-processing module that deters the impact of noise on the data before it is processed. The way this module transforms the data may significantly affect the performance of the constructed model.
In this study, many machine learning algorithms from the machine learning platform Weka [2] are examined in the presence of increasing levels of artificial noise. The gradual impact of noise turns the initial problem from a deterministic one into a stochastic one. The aim of this study is to find machine learning algorithms that exhibit a
good fit to the noise-inflicted datasets as well as a smooth degradation as the noise level increases. The results should be useful for any similar problem facing noise impurities in its data. The algorithms engaged in our study are summarized in table 1.
Three sources of data were exploited in this study: two problems of short-term prediction of daily maximum pollutant concentrations, one problem of classification of the toxicity level of various chemical substances, and four artificial datasets formed the basis of the research conducted in this paper.
Algorithm               | Weka Scheme      | Type*  | Description
------------------------|------------------|--------|------------
Zero Rule               | ZeroR            | R/C    | A very naive algorithm that classifies all cases to the majority class. Used for reference purposes.
K-nn                    | IB-k**           | R/C    | The well-known instance-based k-nearest-neighbors algorithm, implemented in accordance with Aha and Kibler [3].
Linear Regression       | LinearRegression | R      | The linear regression algorithm.
M5                      | M5Prime          | R      | An algorithm exploiting decision trees with regression on the cases at each leaf.
KStar                   | KStar            | R      | An instance-based learner using an entropic distance measure [4].
MLP                     | Neural Network   | R/C*** | An implementation of the classical MLP neural network trained by the feed-forward back-propagation algorithm.
Decision Table          | DecisionTable    | C      | A scheme that produces rules formatted as a table from selected attributes (following a wrapper-type feature selection prior to the training phase).
Hyper Pipes             | HyperPipes       | C      | For each class a HyperPipe is constructed that contains all points of that class. Test instances are classified according to the class that most contains the instance.
C4.5 decision tree      | J48              | C      | An implementation of the C4.5 decision tree [7].
C4.5 Rules              | J48.PART         | C      | A scheme for building rules from partial decision trees.
Voting Feature Interval | VFI              | C      | A simple scheme that calculates the occurrences of feature intervals per class and classifies by voting on new cases [6].

Table 1. A summary of the machine learning algorithms evaluated in the noise sensitivity study.
*: R for regression-type, C for classification-type problems.
**: for this study, 9 neighbors were chosen for the k parameter after a preliminary study with another sample of data not used in the final experiments.
***: an MLP with fixed parameters was used, having 20 hidden neurons, a sigmoid activation function on each neuron, and 500 epochs of training with a learning rate of 0.2 and a momentum of 0.2.
2 Past experience on noise sensitivity of machine learning algorithms
In the past, a small number of studies have been reported on the sensitivity of machine learning algorithms to noise. Noise models were examined on different variants of the TD reinforcement learning algorithm in 1994 [11], as well as on induction learning programs by Chevaleyre and Zucker [5]. Following the tracks of the pioneering idea of Kearns [9] about statistical query models, Teytaud [12] theoretically explains the relation between some noise models and regression algorithms.
Li et al. [10] presented a study of four machine learning algorithms, i.e. a C4.5-type decision tree, a naive Bayes classifier, a decision rules classifier and the OneR one-rule method, under a noise model. They compared the results of the algorithms before and after applying the wavelet denoising technique, for small levels of noise, finding that the technique boosted the efficiency of the algorithms at almost all of the noise levels.
3 Emulating noise: The Noise Model Examined
In the following study, a noise model is applied to the datasets at hand, introducing a white-noise type of deformation on the original data. Two assumptions are considered to be true for all datasets:
1. The variables of the dataset (both the independent and the dependent variables) are normally distributed.
2. Noise is randomly distributed and independent of the data.
Then, for every case (y_i, x_i) in the dataset L, the pair (y_i, x_i) of the dependent variable Y and the matrix of independent variables X is substituted by another pair (y'_i, x'_i) with a probability of n, where n is the noise level. The new pair is calculated by the following formula:

x'_{ij} = \begin{cases} x_{ij} + \sigma_{x_j} z_{ij}, & p_{ij} \ge n \\ x_{ij}, & p_{ij} < n \end{cases}
\qquad
y'_i = \begin{cases} y_i + \sigma_y z_i, & p_i \ge n \\ y_i, & p_i < n \end{cases}
\tag{1}

z_{ij} = \mathrm{norminv}(p_{ij}), \qquad j \in \{1, \dots, k\}
Here σ_{x_j} is the standard deviation of x_j; z_{ij} is a normally distributed random variable with a mean of zero and a standard deviation equal to unity, calculated as the inverse cumulative distribution function of the normal distribution evaluated at p_{ij}; and p_{ij} ∈ (0, 1) is a probability variable produced by a random value generator following a uniform distribution.
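As an illustration, the following Python sketch applies the model of Eq. (1) to a numeric dataset. It is a minimal sketch, not the authors' original implementation; it follows the printed condition (a cell is perturbed when p_ij >= n) and assumes the dependent variable uses its own uniform draw per case.

```python
# Minimal sketch of the noise model of Eq. (1); not the authors' code.
# NOTE: the printed condition perturbs when p >= n (i.e. with probability
# 1 - n); flip the comparison to perturb with probability n instead.
import numpy as np
from scipy.stats import norm

def pollute(X, y, n, seed=None):
    """Return noisy copies of X (cases x features) and y for noise level n."""
    rng = np.random.default_rng(seed)
    Xn, yn = X.astype(float).copy(), y.astype(float).copy()
    sigma_x = Xn.std(axis=0)             # sigma_{x_j}: per-feature std
    sigma_y = yn.std()
    p_x = rng.uniform(size=Xn.shape)     # uniform p_ij in (0, 1)
    p_y = rng.uniform(size=yn.shape)
    z_x = norm.ppf(p_x)                  # z_ij = norminv(p_ij): standard normal
    z_y = norm.ppf(p_y)
    mask_x, mask_y = p_x >= n, p_y >= n  # cells selected for perturbation
    Xn[mask_x] += (sigma_x * z_x)[mask_x]
    yn[mask_y] += (sigma_y * z_y)[mask_y]
    return Xn, yn
```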
4 Experiments and discussion
Our experiments are based on two air quality problems, a toxicity classification problem and four artificial datasets. The dependent variables of the first two problems correspond to the maximum daily concentrations of nitrogen dioxide and of ozone, two harmful aerial pollutants, after 10 o'clock in the morning. The corresponding datasets contain five and eight main input attributes respectively, selected through a feature-selection procedure using a genetic algorithm [8]. The toxicity problem refers to the estimation of the toxicity index of several chemical substances, with 20 features and 1 dependent variable. Finally, 4 artificial problems have been included in the study as described in [13], consisting of four features and one output variable. Of these four datasets, one implements a multivariate problem, another a linear function, while the third and the fourth refer to non-linear functions. Table 2 summarizes this information per problem studied.
Problem Code | Description                                                                              | Dependent Variable
-------------|------------------------------------------------------------------------------------------|--------------------
A1           | Artificial problem                                                                       | Numerical / multivariate
A2           | Artificial problem                                                                       | Numerical / linear
A3           | Artificial problem                                                                       | Numerical / non-linear of the form x^2
A4           | Artificial problem                                                                       | Numerical / non-linear of the form x^2
NO2          | Daily maximum concentration forecasting from sensory data                                | Numerical / non-linear
O3           | Daily maximum concentration forecasting from sensory data                                | Numerical / non-linear
TOX          | Classification of an index of toxicity for various substances from chemical descriptors | Numerical

Table 2. A description of the problems of the noise-sensitivity study.
The artificial problems A1-A4 are of the type y = f(x1, x2, x3, x4) and have been created using the formulas of table 3:

Problem | x1 = | x2 =                 | x3 =           | x4 = | y =
--------|------|----------------------|----------------|------|-------------------------
A1      | z    | x1·0.8 + z·0.6       | x1·0.6 + z·0.8 | z    | (x1·x2·x3 + x4) / 1.47
A2      | z    | (x1^2 + z·0.5) / 1.5 | x1·0.6 + z·0.8 | z    | (x1·x2·x3 + x4) / 1.7
A3      | z    | x1·0.8 + z·0.6       | x1·0.6 + z·0.8 | z    | (x1·x2^2·x3 + x4) / 1.96
A4      | z    | x1·0.8 + z·0.6       | x1·0.6 + z·0.8 | z    | (x1·x2·x3 + x4^2) / 1.76

Table 3. Definition of the variables for the four artificial problems A1-A4.
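For concreteness, a sketch of generating a dataset such as A1 from table 3 follows. It assumes, consistently with the table, that every occurrence of z is an independent standard-normal draw; the function name and sample size are illustrative.

```python
# Sketch of generating the A1 problem of Table 3; each z below is an
# independent standard-normal draw (an assumption about Table 3's notation).
import numpy as np

def make_a1(n_cases=1000, seed=None):
    rng = np.random.default_rng(seed)
    z = lambda: rng.standard_normal(n_cases)  # fresh noise per occurrence
    x1 = z()
    x2 = x1 * 0.8 + z() * 0.6
    x3 = x1 * 0.6 + z() * 0.8
    x4 = z()
    y = (x1 * x2 * x3 + x4) / 1.47
    return np.column_stack([x1, x2, x3, x4]), y
```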
All cases containing missing values have been deleted. Although the resulting datasets may already contain an amount of noise, for this study they are considered clean data; this does not influence the experiments, as the noise models studied refer to additive noise.
Artificial noise was generated at random throughout the whole of each dataset. The reason for "polluting" both the training set and the evaluation set is that, as noise has emerged in the data so far, it will emerge in the future with the same probability and the same patterns.
Repeated experiments have been run on each of the polluted datasets. Eight noise levels have been tested, ranging from 0 to 50%: {0, 0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50}. Five different datasets were produced for every such noise level, and the results of the competing machine learning schemes were averaged over these five datasets. Five-fold cross-validation experiments were carried out for each of these datasets. For each run, eleven machine learning algorithms were trained and tested following the five-fold cross-validation scheme. These are summarized in table 1 along with a short description of each. Half of them are suitable for regression-type problems, while the other half are classification algorithms. Since all the problems were regression problems, the numerical dependent variable of each problem was transformed into a categorical one by dividing the initial range into 5 equally wide intervals. Thus the seven initial regression datasets were transformed into seven classification datasets, ready to be processed by the classification-type algorithms.
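The equal-width discretization step can be sketched as follows; a minimal illustration, assuming the bin boundaries are computed from the observed range of each dataset.

```python
# Equal-width discretization of a numerical target into 5 bins,
# as used to turn the regression datasets into classification ones.
import numpy as np

def discretize(y, n_bins=5):
    # interior cut points of n_bins equal-width intervals over [min, max]
    edges = np.linspace(y.min(), y.max(), n_bins + 1)[1:-1]
    return np.digitize(y, edges)  # class labels 0 .. n_bins-1
```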
From the variety of data collected in this study, the RMS error metric was chosen to judge the fit of each machine learning algorithm to every artificially polluted dataset. Though other metrics, such as prediction accuracy or classification error, are also used in many publications, the RMS error is a stricter and more suitable evaluator of the efficiency of an algorithm for the purposes of a comparison study.
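For reference, the metric reduces to the following in the numeric case (for classifiers, Weka computes the RMS error on predicted class probabilities against 0/1 class indicators):

```python
# Root-mean-squared error between targets and predictions (numeric case).
import numpy as np

def rmse(y_true, y_pred):
    d = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))
```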
[Figure: example noise sensitivity diagram (Problem A1, classification algorithms: Decision Table, ZeroR, HyperPipes, C4.5, C4.5Part, VFI). Horizontal axis: Noise Level %; left vertical axis: RMSE (Area 1, upper/upper-left); right vertical axis: Noise-0 Relative RMSE (Area 2, lower/lower-right).]
Fig. 1. Example of a noise sensitivity diagram: RMSE and Noise-0 Relative RMSE areas.
The diagrams that follow depict the sensitivity of the algorithms to noise in the form of noise curves. Each of the 7 problems is represented by a pair of diagrams, one for the regression-type and one for the classification-type algorithms. In every diagram, the left vertical axis represents the RMS error of the algorithms, while the noise-0 relative RMS error is measured on the right vertical axis. The latter metric is the RMS error after the application of noise to the data minus the RMS error at the zero noise level. The gradually increasing levels of artificial noise reside on the horizontal axis. The algorithms least sensitive to the presence of noise are those whose noise-0 relative RMSE line stays as close to the horizontal axis as possible.
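A minimal sketch of this curve, given one algorithm's per-level RMSE values (the first entry is assumed to correspond to the 0% level):

```python
# Noise-0 relative RMSE: RMSE at each noise level minus RMSE at level 0.
import numpy as np

def noise0_relative(rmse_per_level):
    r = np.asarray(rmse_per_level, dtype=float)
    return r - r[0]  # assumes r[0] is the RMSE at the zero-noise level
```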
For readability and compactness, both the RMSE and noise-0 relative RMSE curves are drawn in the same diagram. The two sets of curves occupy different areas, as the example of fig. 1 indicates: area 1, containing the RMSE curves, always occupies the upper and upper-left part of the diagram, while area 2, where the noise-0 relative RMSE curves fit, is contained in the lower and lower-right part.
4.1 Inquiring the regression results
It is clear from the diagrams of the noise curves of the regression-type algorithms that their behavior varies between the artificial problems and the real-world problems.
In all four artificial problems A1-A4, the best algorithms show a very good fit to the noise-free problems at the 0 noise level, but beyond that level their RMS error jumps up abruptly and then grows almost linearly. This behavior is more visible in the noise-0 relative RMSE curves. For the first two problems A1 and A2, linear regression proves to be the best method, as expected, followed closely by M5. For the two non-linear problems A3 and A4, M5 and IB-9 fit better. Though these algorithms exhibit the better RMSE curves, their corresponding noise-0 relative RMSE curves are among the worst. The conclusion emerging from the four problems A1-A4 is that the weaker algorithms appear to be the least sensitive to noise, and vice versa.
The real-world problems NO2, O3 and TOX present a different picture. All RMSE curves are gathered inside a narrow band, close to each other. In all cases linear regression fits the datasets of these problems best. In disagreement with the conclusion drawn from the artificial problems A1-A4, the noise-0 relative RMSE curves mirror the behavior of their RMSE counterparts.
4.2 Exploring the classification results
For all the problems except TOX, VFI and HyperPipes are the least fit algorithms, having RMSE curves above that of the reference algorithm ZeroR. Since ZeroR is used as an efficiency threshold, all algorithms exhibiting RMSE curves above its own are considered unsuitable for solving the problem at hand. Of the other three algorithms, Decision Table is the dominant one, having the lowest RMSE curve and the lowest noise-0 relative RMSE for all seven problems studied in this report.
Another interesting finding is that the noise curves of the classification algorithms are much less sensitive than those of the regression-type algorithms. This may be a result of the discretization of the dependent (output) variable. We assume that as the number of discrete bins in the discretization process increases, the average slope of the noise curves will also increase, so as to match that of the regression-type algorithms as the number of bins approaches infinity.
[Figure: two panels of noise curves for Problem A1 — one for the classification algorithms (Decision Table, ZeroR, HyperPipes, C4.5, C4.5Part, VFI) and one for the regression algorithms (Lin. Regression, ZeroR, IB-9, M5, Neural Net, KStar). Horizontal axis: Noise Level % (0-50); left axis: RMSE; right axis: Noise-0 Relative RMSE.]
Fig. 2. RMS and noise-0 relative RMS Error for the A1 Problem.
[Figure: two panels of noise curves for Problem A2 — one for the classification algorithms (Decision Table, ZeroR, HyperPipes, C4.5, C4.5Part, VFI) and one for the regression algorithms (Lin. Regression, ZeroR, IB-9, M5, Neural Net, KStar). Horizontal axis: Noise Level % (0-50); left axis: RMSE; right axis: Noise-0 Relative RMSE.]
Fig. 3. RMS and noise-0 relative RMS Error for the A2 Problem.
[Figure: two panels of noise curves for Problem A3 — one for the classification algorithms (Decision Table, ZeroR, HyperPipes, C4.5, C4.5Part, VFI) and one for the regression algorithms (Lin. Regression, ZeroR, IB-9, M5, Neural Net, KStar). Horizontal axis: Noise Level % (0-50); left axis: RMSE; right axis: Noise-0 Relative RMSE.]
Fig. 4. RMS and noise-0 relative RMS Error for the A3 Problem.
[Figure: two panels of noise curves for Problem A4 — one for the classification algorithms (Decision Table, ZeroR, HyperPipes, C4.5, C4.5Part, VFI) and one for the regression algorithms (Lin. Regression, ZeroR, IB-9, M5, Neural Net, KStar). Horizontal axis: Noise Level % (0-50); left axis: RMSE; right axis: Noise-0 Relative RMSE.]
Fig. 5. RMS and noise-0 relative RMS Error for the A4 Problem.
[Figure: two panels of noise curves for Problem NO2 — one for the classification algorithms (Decision Table, ZeroR, HyperPipes, C4.5, C4.5Part, VFI) and one for the regression algorithms (Lin. Regression, ZeroR, IB-9, M5, Neural Net, KStar). Horizontal axis: Noise Level % (0-50); left axis: RMSE; right axis: Noise-0 Relative RMSE.]
Fig. 6. RMS and noise-0 relative RMS Error for the NO2 Problem.
[Figure: two panels of noise curves for Problem O3 — one for the classification algorithms (Decision Table, ZeroR, HyperPipes, C4.5, C4.5Part, VFI) and one for the regression algorithms (Lin. Regression, ZeroR, IB-9, M5, Neural Net, KStar). Horizontal axis: Noise Level % (0-50); left axis: RMSE; right axis: Noise-0 Relative RMSE.]
Fig. 7. RMS and noise-0 relative RMS Error for the O3 Problem.
[Figure: two panels of noise curves for the TOX problem — one for the classification algorithms (Decision Table, ZeroR, HyperPipes, C4.5, C4.5Part, VFI) and one for the regression algorithms (Lin. Regression, ZeroR, IB-9, M5, Neural Net, KStar). Horizontal axis: Noise Level % (0-50); left axis: RMSE; right axis: Noise-0 Relative RMSE.]
Fig. 8. RMS and noise-0 relative RMS Error for the TOX Problem.
5 Conclusions
This paper reports a study of the noise sensitivity of machine learning algorithms, based on four artificial and three real-world problems. A noise model has been tested for noise levels ranging from 0 to 0.5. The dependent variable was transformed from numerical to nominal in order to test the classification algorithms. Thus a range of regression-type and classification-type algorithms has been examined by measuring their sensitivity to noise, as artificial additive noise was applied to the initial datasets.
It has been shown that, among the regression-type algorithms, linear regression adapts best to the gradually increasing noise levels. Also noticeable from the artificial datasets A1-A4 is the fact that the better algorithms in terms of RMSE present the greater noise sensitivity, while the worst seem to be the least sensitive. Decision Table appears to be the method least sensitive to additive noise among the classification learners: not only does it show the best RMSE on all of the datasets, it also exhibits good behavior in terms of the noise-0 relative RMSE.
Future work expanding the reported study includes further experiments on different problems; we believe that the forthcoming results will help in forming general guidelines for selecting the best machine learning algorithm for modeling or prediction in problems prone to noise. Finally, it is worth noting that all data for the examined datasets originate from PERPA, the Greek air quality monitoring authority.
References
1. Han, J. and Kamber, M. (2000), "Data Mining: Concepts and Techniques". Morgan Kaufmann Publishers.
2. Garner, S.R. (1995), "WEKA: The Waikato Environment for Knowledge Analysis". In Proceedings of the New Zealand Computer Science Research Students Conference, pp. 57-64.
3. Aha, D. and Kibler, D. (1991), "Instance-based learning algorithms", Machine Learning, vol. 6, pp. 37-66.
4. Cleary, J.G. and Trigg, L.E. (1995), "K*: An Instance-based Learner Using an Entropic Distance Measure". In Proceedings of the 12th International Conference on Machine Learning, pp. 108-114.
5. Chevaleyre, Y. and Zucker, J.-D. (2000), "Noise-Tolerant Rule Induction for Multi-Instance Data". In Proceedings of the ICML-2000 Workshop on "Attribute-Value and Relational Learning".
6. Demiroz, G. and Guvenir, A. (1997), "Classification by Voting Feature Intervals", ECML-97.
7. Quinlan, J.R. (1993), "C4.5: Programs for Machine Learning", Morgan Kaufmann, San Mateo, California, USA.
8. Kalapanidas, E. and Avouris, N. (2002), "Feature Selection Using a Genetic Algorithm Applied on an Air Quality Forecasting Problem". In Proceedings of the BESAI (Binding Environmental Sciences with AI) Workshop, ECAI 2002, Lyon, France.
9. Kearns, M. (1993), "Efficient Noise-Tolerant Learning from Statistical Queries". In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, pp. 392-401, May 16-18, San Diego, California, United States.
10. Li, Q., Li, T., Zhu, S. and Kambhamettu, C. (2002), "Improving Medical/Biological Data Classification Performance by Wavelet Preprocessing". In Proceedings of ICDM 2002.
11. Pendrith, M. (1994), "On Reinforcement Learning of Control Actions in Noisy and Non-Markovian Domains". Technical Report UNSW-CSE-TR-9410, School of Computer Science and Engineering, The University of New South Wales, Sydney, Australia.
12. Teytaud, O. (2001), "Robust Learning: Regression Noise". In Proceedings of IJCNN 2001, pp. 1787-1792.
13. Sarle, W.S. (1998), "Prediction with Missing Inputs". In Wang, P.P. (ed.), JCIS 98 Proceedings, Vol. II, Research Triangle Park, NC, pp. 399-402, ftp://ftp.sas.com/pub/neural/JCIS98.ps.