SVM CLASSIFICATION WITH LINEAR AND RBF KERNELS
Vasileios Apostolidis-Afentoulis
Department of Information Technology
Alexander TEI of Thessaloniki
P.O. Box 141, 574 00,
Thessaloniki, Greece
vapostolidis@gmail.com
Konstantina-Ina Lioufi
Department of Information Technology
Alexander TEI of Thessaloniki
P.O. Box 141, 574 00,
Thessaloniki, Greece
ntinaki_l@hotmail.com
ABSTRACT
The paper attempts to survey the existing research and development efforts involving the use of Matlab for classification. In particular, it aims at providing a representative view of support vector machines and of how they can be trained to learn from various kinds of data. Two kinds of algorithms are presented with short overviews, then discussed separately and finally compared on the basis of the results, including a few figures. A summary of the considered systems is presented together with the experimental results.
Index Terms: SVM, Classification, Matlab, Linear, RBF
1. INTRODUCTION
1.1. Support vector machines
The support vector machines (SVMs) technique was introduced by Vapnik [1] and has developed rapidly in recent years.
Several studies reported that SVMs, generally, are able to
deliver higher classification accuracy than the other existing
classification algorithms [2], [3]. In the last decade Support
Vector Machines (SVMs) have emerged as an important
learning technique for solving classification and regression
problems in various fields, most notably in computational
biology, finance and text categorization. This is due in part
to built-in mechanisms that ensure good generalization, which leads to accurate prediction; the use of kernel functions to model non-linear distributions; the ability to train relatively quickly on large datasets using novel mathematical optimization techniques; and, most significantly, the possibility of theoretical analysis using computational learning theory [5]. The main objective of statistical learning
is to find a description of an unknown dependency between
measurements of objects and certain properties of these
objects. The measurements, also known as "input variables",
are assumed to be observable in all objects of interest. On
the contrary, the properties of the objects, or "output
variables", are in general available only for a small subset of
objects known as examples. The purpose of estimating the
dependency between the input and output variables is to be
able to determine the values of output variables for any
object of interest. In pattern recognition, this relates to
trying to estimate a function f: R^N → {±1} that can correctly classify new examples based on past observations [4]. The SVM software used in this work is LIBSVM [6], with the linear kernel and the RBF (radial basis function) kernel. As the execution time of model selection is an important issue for practical applications of SVMs, a number of studies have been conducted on this topic [7], [8], [9], [10]. The basic approach employed by these recent studies is to reduce the search space of the parameter combinations [11].
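As a concrete illustration of this setup, the short sketch below shows how the two kernels are selected through LIBSVM's Matlab interface. It is only a sketch under assumptions: the LIBSVM Matlab bindings are compiled and on the path, and a feature matrix X (one row per instance) with a label vector y is already in memory.

% Kernel choice in LIBSVM's Matlab interface: -t 0 is the linear kernel,
% -t 2 is the RBF kernel; -c sets the penalty C and -g sets gamma.
linear_model = svmtrain(y, X, '-t 0 -c 1 -q');
rbf_model    = svmtrain(y, X, '-t 2 -c 1 -g 0.1 -q');

% Predictions on the training data itself, purely to illustrate the call;
% svmpredict returns the predicted labels, a 3-element accuracy vector
% (accuracy %, MSE, squared correlation) and the decision values.
[pred_lin, acc_lin, ~] = svmpredict(y, X, linear_model);
[pred_rbf, acc_rbf, ~] = svmpredict(y, X, rbf_model);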
1.2. Linear SVMs
1.2.1 Separable case
In the binary classification setting, let ((x_1, y_1), ..., (x_n, y_n)) be the training dataset, where x_i are the feature vectors representing the instances (i.e. observations) and y_i ∈ {−1, +1} are the labels of the instances. Support vector learning
is the problem of finding a separating hyperplane that
separates the positive examples (labeled +1) from the
negative examples (labeled -1) with the largest margin. The
margin of the hyperplane is defined as the shortest distance
between the positive and negative instances that are closest
to the hyperplane. The intuition behind searching for the
hyperplane with a large margin is that a hyperplane with the
largest margin should be more resistant to noise than a
hyperplane with a smaller margin.
Formally, suppose that all the data satisfy the constraints

\mathbf{x}_i \cdot \mathbf{w} + b \ge +1 \quad \text{for } y_i = +1 \qquad (1)
\mathbf{x}_i \cdot \mathbf{w} + b \le -1 \quad \text{for } y_i = -1 \qquad (2)

where w is the normal to the hyperplane, |b|/||w|| is the perpendicular distance from the hyperplane to the origin, and ||w|| is the Euclidean norm of w.
Figure 1-A hyperplane separating two classes with the maximum margin.
The circled examples that lie on the canonical hyperplanes are called
support vectors.
The two constraints can be conveniently combined into the following:

y_i (\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \ge 0 \quad \forall i \qquad (3)

The training examples for which the equality in (3) holds lie on the canonical hyperplanes (H1 and H2 in figure 1). The margin ρ can then be easily computed as the distance between H1 and H2:

\rho = \frac{2}{\|\mathbf{w}\|} \qquad (4)

Hence, the maximum margin separating hyperplane can be constructed by solving the following primal optimization problem:

\min_{\mathbf{w},\, b} \ \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i (\mathbf{x}_i \cdot \mathbf{w} + b) \ge 1, \quad i = 1, \dots, n \qquad (5)
We switch to the Lagrangian formulation of this problem for two main reasons: i) the constraints are easier to handle, and ii) the training data only appear as dot products between vectors. This formulation introduces a new Lagrange multiplier α_i for each constraint, and the minimization problem then becomes

L_P = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i} \alpha_i y_i (\mathbf{x}_i \cdot \mathbf{w} + b) + \sum_{i} \alpha_i \qquad (6)

with Lagrange multipliers α_i ≥ 0, one for each constraint in (5). The objective is then to minimize (6) with respect to w and b while simultaneously requiring that the derivatives of L_P with respect to all the α_i vanish.
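To make this step explicit (it is the standard derivation, consistent with the soft-margin dual (10) given below), requiring that the gradient of L_P with respect to w and b vanish gives the conditions

\frac{\partial L_P}{\partial \mathbf{w}} = 0 \ \Rightarrow\ \mathbf{w} = \sum_{i} \alpha_i y_i \mathbf{x}_i, \qquad \frac{\partial L_P}{\partial b} = 0 \ \Rightarrow\ \sum_{i} \alpha_i y_i = 0 .

Substituting these conditions back into (6) yields the dual of the separable problem,

\max_{\boldsymbol{\alpha}} \ \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j \quad \text{subject to} \quad \alpha_i \ge 0, \ \sum_{i} \alpha_i y_i = 0,

in which the training data indeed appear only through dot products, as noted above.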
1.2.2 Non-Separable case
The previous section discussed the case where it is
possible to linearly separate the training instances that
belong to different classes. Obviously this SVM formulation
will not find a solution if the data cannot be separated by a
hyperplane. Even in the cases where the data is linearly
separable, SVM may overfit to the training data in its search
for the hyperplane that completely separates all of the
instances of both classes. For instance, an individual outlier
in a dataset, such as a pattern which is mislabeled, can
crucially affect the hyperplane. These concerns prompted
the development of soft margin SVMs [10], which can
handle linearly non-separable data by introducing positive
slack variables ξ_i that relax the constraints in (1) and (2) at a cost proportional to the value of ξ_i. Based on this new criterion, the relaxed constraints with the slack variables then become

\mathbf{x}_i \cdot \mathbf{w} + b \ge +1 - \xi_i \quad \text{for } y_i = +1
\mathbf{x}_i \cdot \mathbf{w} + b \le -1 + \xi_i \quad \text{for } y_i = -1
\xi_i \ge 0 \quad \forall i \qquad (7)
which permits some instances to lie inside the margin or even cross further among the instances of the opposite class (see figure 2). While this relaxation gives the SVM flexibility to decrease the influence of outliers, from an optimization perspective it is not desirable to have arbitrarily large values for ξ_i, as that would cause the SVM to obtain trivial and sub-optimal solutions.
Figure 2 -Soft margin SVM
Thus, the relaxation is constrained by making the slack variables part of the objective function (5), yielding

\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \ \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i} \xi_i \qquad (8)
subject to the constraints in (7). The cost coefficient C>0 is
a hyperparameter that specifies the misclassification penalty
and is tuned by the user based on the classification task and
dataset characteristics. As in the separable case, the solution
to (8) can be shown to have an expansion

\mathbf{w} = \sum_{i} \alpha_i y_i \mathbf{x}_i \qquad (9)

where the training instances with α_i > 0 are the support vectors of the SVM solution. Note that the penalty term related to the slack variables is linear and therefore disappears when (8) is transformed into the dual formulation

\max_{\boldsymbol{\alpha}} \ \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j

subject to

0 \le \alpha_i \le C, \qquad \sum_{i} \alpha_i y_i = 0 \qquad (10)

The dual formulation is conveniently very similar to the linearly separable case, with the only difference being the extra upper bound of C on the coefficients α_i. Obviously, as the misclassification penalty C → ∞, (10) converges to the linearly separable case.
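The resulting decision rule is not written out above; implied by the expansion in (9), a new instance x is classified by the standard rule

f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i} \alpha_i y_i \, \mathbf{x}_i \cdot \mathbf{x} + b \right),

and replacing the dot product with a kernel function K(x_i, x) yields the nonlinear classifiers discussed in the next section.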
1.3 RBF SVMs
1.3.1 General
In general, the RBF kernel is a reasonable first choice.
This kernel nonlinearly maps samples into a higher
dimensional space, so it, unlike the linear kernel, can handle
the case when the relation between class labels and
attributes is nonlinear. Furthermore, the linear kernel is a
special case of RBF [13] since the linear kernel with a
penalty parameter has the same performance as the RBF
kernel with some parameters (C, γ). In addition, the sigmoid
kernel behaves like RBF for certain parameters [14].
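The RBF kernel itself is not written out above; its standard form, which is also the one implemented in LIBSVM, is

K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left( -\gamma \, \|\mathbf{x}_i - \mathbf{x}_j\|^2 \right), \qquad \gamma > 0,

so every kernel value lies in (0, 1], a property referred to again in the discussion of numerical difficulties below.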
The second reason is the number of hyperparameters
which influences the complexity of model selection. The
polynomial kernel has more hyperparameters than the RBF
kernel. Finally, the RBF kernel has fewer numerical
difficulties. One key point is 0 < K_ij ≤ 1, in contrast to polynomial kernels, whose values may go to infinity (γ x_i^T x_j + r > 1) or to zero (γ x_i^T x_j + r < 1) when the degree is large. Moreover, we must note that the sigmoid
kernel is not valid (i.e. not the inner product of two vectors)
under some parameters [15]. There are some situations
where the RBF kernel is not suitable. In particular, when the
number of features is very large, one may just use the linear
kernel.
1.3.2 Cross-validation and Grid-search
There are two parameters for an RBF kernel: C and γ. It is
not known beforehand which C and γ are best for a given
problem; consequently some kind of model selection
(parameter search) must be done. The goal is to identify
good (C, γ) so that the classifier can accurately predict
unknown data (i.e. testing data). Note that it may not be
useful to achieve high training accuracy (i.e. a classifier
which accurately predicts training data whose class labels
are indeed known). As discussed above, a common strategy
is to separate the data set into two parts, of which one is
considered unknown. The prediction accuracy obtained
from the “unknown” set more precisely reflects the
performance on classifying an independent data set. An
improved version of this procedure is known as cross-
validation [17].
In v-fold cross-validation, we first divide the training set
into v subsets of equal size. Sequentially one subset is tested
using the classifier trained on the remaining v − 1 subsets.
Thus, each instance of the whole training set is predicted
once so the cross-validation accuracy is the percentage of
data which are correctly classified.
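As a minimal illustration of this procedure, the sketch below performs the v-fold loop by hand; it assumes the LIBSVM Matlab interface together with cvpartition from the Statistics Toolbox, and a dataset X, y already in memory (the kernel and parameter values are placeholders).

% v-fold cross-validation: each instance is predicted exactly once by a
% model trained on the remaining v-1 folds.
v  = 5;
cv = cvpartition(y, 'KFold', v);          % stratified v-fold split
correct = 0;
for fold = 1:v
    tr = training(cv, fold);              % logical index of the training folds
    te = test(cv, fold);                  % logical index of the held-out fold
    model = svmtrain(y(tr), X(tr, :), '-t 2 -c 1 -g 0.1 -q');
    pred  = svmpredict(y(te), X(te, :), model);
    correct = correct + sum(pred == y(te));
end
cv_accuracy = 100 * correct / numel(y);   % percentage correctly classified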
The cross-validation procedure can prevent the overfitting
problem. Figure 3 represents a binary classification problem
to illustrate this issue. Filled circles and triangles are the
training data while hollow circles and triangles are the
testing data. The testing accuracy of the classifier in Figures
3a and 3b is not good since it overfits the training data. If
we think of the training and testing data in Figure 3a and 3b
as the training and validation sets in cross-validation, the
accuracy is not good. On the other hand, the classifier in 3c
and 3d does not overfit the training data and gives better
cross-validation as well as testing accuracy.
A “grid-search” on C and γ using cross-validation is recommended. Various pairs of (C, γ) values are tried and the
one with the best cross-validation accuracy is picked. It is
found that trying exponentially growing sequences of C and
γ is a practical method to identify good parameters (for
example, C = 2^-5, 2^-3, ..., 2^15, γ = 2^-15, 2^-13, ..., 2^3). The
grid-search is straightforward but seems naive. In fact, there
are several advanced methods which can save computational
cost by, for example, approximating the cross-validation
rate. However, there are two motivations why we prefer the
simple grid-search approach [17].
Figure 3- (a) Training data and an overfitting classifier (b) Applying an
overfitting classifier on testing data
Figure 3 - (c) Training data and a better classifier, (d) Applying a better classifier on testing data. An overfitting classifier and a better classifier (● and ▲: training data; ○ and △: testing data).
One is that, psychologically, we may not feel safe to
use methods which avoid doing an exhaustive parameter
search by approximations or heuristics. The other reason
is that the computational time required to find good
parameters by grid-search is not much more than that by
advanced methods since there are only two parameters.
Furthermore, the grid-search can be easily parallelized
because each (C, γ) is independent. Many of the advanced methods are iterative processes, e.g. walking along a path,
which can be hard to parallelize [17].
Since doing a complete grid-search may still be time-
consuming, we recommend using a coarse grid first. After
identifying a “better” region on the grid, a finer grid search
on that region can be conducted. To illustrate this, we do an
experiment on the problem german from the Statlog
collection [16]. After scaling this set, we first use a coarse
grid (Figure 5) and find that the best (C, γ) is (2^3, 2^-5) with a cross-validation rate of 77.5%. Next we conduct a finer grid search on the neighborhood of (2^3, 2^-5) (Figure 6) and obtain a better cross-validation rate of 77.6% at (2^3.25, 2^-5.25).
After the best (C, γ) is found, the whole training set is
trained again to generate the final classifier.
The above approach works well for problems with
thousands or more data points. For very large data sets a
feasible approach is to randomly choose a subset of the data
set, conduct grid-search on them, and then do a better-
region-only grid-search on the complete data set.
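A minimal sketch of the coarse grid-search described above is given below; it relies on LIBSVM's built-in -v option, which makes svmtrain return the cross-validation accuracy instead of a model, and assumes a training set Xtrain, ytrain is already in memory.

% Coarse grid-search over (C, gamma) with 5-fold cross-validation.
log2c_range = -5:2:15;                    % C     = 2^-5, 2^-3, ..., 2^15
log2g_range = -15:2:3;                    % gamma = 2^-15, 2^-13, ..., 2^3
best_acc = -inf;
for log2c = log2c_range
    for log2g = log2g_range
        opts = sprintf('-t 2 -c %g -g %g -v 5 -q', 2^log2c, 2^log2g);
        acc = svmtrain(ytrain, Xtrain, opts);   % CV accuracy, not a model
        if acc > best_acc
            best_acc   = acc;
            best_log2c = log2c;
            best_log2g = log2g;
        end
    end
end
% A finer grid around (best_log2c, best_log2g) can then be searched in the
% same way; finally the whole training set is retrained with the best pair.
final_model = svmtrain(ytrain, Xtrain, ...
    sprintf('-t 2 -c %g -g %g -q', 2^best_log2c, 2^best_log2g));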
Figure 5 - Loose grid search on C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3. [17]
Figure 6 - Fine grid-search on C = 2^1, 2^1.25, ..., 2^5 and γ = 2^-7, 2^-6.75, ..., 2^-3. [17]
1.4 Dataset Description
Title of Dataset: ISOLET (Isolated Letter Speech
Recognition) [12]
This data set was generated as follows: 150 subjects spoke
the name of each letter of the alphabet twice. Hence, we
have 52 training examples from each speaker. The speakers
are grouped into sets of 30 speakers each, and are referred to
as isolet1, isolet2, isolet3, isolet4, and isolet5. The data
appears in isolet1+2+3+4.data in sequential order, first the
speakers from isolet1, then isolet2, and so on. The test set,
isolet5, is a separate file. Note that 3 examples are missing.
They were dropped due to difficulties in recording. This is a
good domain for a noisy, perceptual task. It is also a very
good domain for testing the scaling abilities of algorithms.
We have formatted the two separate files into one data file
(isolet12345.data) for convenience, and we will provide it as well.
The number of instances from isolet1+2+3+4.data is 6238
and from isolet5.data is 1559. The total number of instances
is 7797. The number of attributes is 617, plus 1 for the class label, which is the last column. All attributes are continuous, real-valued and scaled into the range of -1.0 to 1.0. The
features include spectral coefficients, contour features,
sonorant features, pre-sonorant and post-sonorant features.
There are no missing attribute values.
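A minimal sketch of loading the combined file described above is given below; it assumes the file is plain comma-separated text with the 617 features followed by the class label in column 618, and uses the file name given in the text.

% Load the combined ISOLET file (617 continuous features already scaled to
% [-1, 1], with the class label in the last column).
raw = dlmread('isolet12345.data', ',');
X = raw(:, 1:617);                        % feature matrix
y = raw(:, 618);                          % class labels (one per letter)
fprintf('%d instances, %d attributes\n', size(X, 1), size(X, 2));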
2. EXPERIMENTS
2.1 General Explanation - Linear Experiments
First of all, recall that the given dataset was split into two separate files, so we had to combine them into one. The file combination was verified with WinMerge, an open-source differencing and merging tool for Windows.
A cross-validation method has been used to split the dataset into random pieces of data. For this implementation in Matlab, the «Holdout» parameter is used; it sets the amount of data that is left out of the training procedure. Keeping only 10% of the data for training, two kinds of indexes are created, one for the training and one for the testing data. This process also yields the total number of instances. The next part contains the selection of the SVM training model, where the parameter C (the misclassification penalty explained in the introduction) is set, and a vector is created to store the values needed for the experiments.
After «svmtrain» has been set up, «svmpredict» takes over. As returned values we get the predicted labels and the accuracy, and we ignore the third returned value (the decision values). This step is split into two parts, one for the training and one for the testing data. The results of the training and testing predictions are also printed into a plot with two subplots: the first subplot presents the training data, while the other presents the testing data.
The whole procedure is encapsulated in a loop of one hundred iterations, so it is important to capture an average accuracy value for the training and testing predictions.
Finally, there is one more plot, a semi-logarithmic one, which, because of the nature of the C parameter, helps us comprehend the results more easily. This plot compares the mean accuracy with the C parameter.
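The procedure just described can be condensed into the following sketch; it is not the original code, and it assumes the LIBSVM Matlab interface plus cvpartition (Statistics Toolbox) for the «Holdout» split, with X and y loaded as above.

% Linear-kernel experiment: 10% of the data for training, 90% held out for
% testing, repeated 100 times for each value of C, then a semilog summary.
C_values = [1e-4 1e-3 1e1 1e2];
n_iter   = 100;
mean_test_acc = zeros(size(C_values));
for k = 1:numel(C_values)
    acc = zeros(n_iter, 1);
    for it = 1:n_iter
        cv = cvpartition(size(X, 1), 'Holdout', 0.9);    % 90% left out of training
        model = svmtrain(y(training(cv)), X(training(cv), :), ...
                         sprintf('-t 0 -c %g -q', C_values(k)));
        [~, a, ~] = svmpredict(y(test(cv)), X(test(cv), :), model);
        acc(it) = a(1);                                  % accuracy in percent
    end
    mean_test_acc(k) = mean(acc);
end
semilogx(C_values, mean_test_acc, '-o'); grid on;        % accuracy versus C
xlabel('C'); ylabel('Mean test accuracy (%)');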
In the following figures, you can see the visualization of the training and testing procedure for some specific values of C. The X axis represents the number of classes and the Y axis represents the total number of instances that have been used. From the results, we can see that for C = 10^-4 we get a very low accuracy of 0.85% (figure 7). For C = 10^-3, the accuracy rises to 8.28% (figure 8). For C = 10^1 and higher values, our system reaches its peak (figure 9), with the accuracy reaching 92.61% (figure 10).
Figure 7 - Training & Testing Data for C = 10^-4
Figure 8 - Training & Testing Data for C = 10^-3
Figure 9 - Training & Testing Data for C = 10^1 and higher values
Figure 10 - Final plot, comparison of Accuracy with C
2.2 RBF Experiments
The first step in the SVM-RBF code is to load the dataset and to set values for the parameters C and G. The initial values of both C and G are set to 100. C is kept fixed, while G changes constantly. Then the initialization of the data begins and the path where the graphs will be stored is defined. A large part of the program consists of the iterations. In this section, cross-validation is implemented. Two pairs of indexes have been made:
- The first pair of indexes contains the instances and the labels of the training data.
- The second pair of indexes contains the instances and the labels of the testing data.
The training of the RBF kernel then starts, using the parameters that have already been set. A series of runs tests the model, returning the prediction values, the accuracy and the label information. The elements of the accuracy vector are inspected and converted into string format. The next part of the program refers to the graphs. A plot is created with two subplots; green circles and red dots are displayed in the subplots. Moreover, a legend has been created, including the titles of the graphs, the number of each iteration, the accuracy and the parameter values of C and G. After that, the data are temporarily exported into an xls file, the image saving format is set, and the kind of information that will appear in each graph is selected (minimum/maximum accuracy). The most important part of this section is the reduction of G, by dividing it by 1.1 in each rerun. The X axis represents the number of classes and the Y axis represents the total number of instances that have been used.
Finally, there is a procedure that compares the values of G with the accuracy, shown in the final graph. A quick look at this graph (figure 14) shows that the accuracy reaches very high values, up to 92.71%.
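A condensed sketch of this RBF procedure is given below; again it is not the original code, and it reuses the same assumptions (LIBSVM Matlab interface, cvpartition holdout split, X and y in memory).

% RBF experiment: C fixed at 100, gamma starts at 100 and is divided by 1.1
% on every rerun; accuracy is recorded and finally plotted against gamma.
C = 100;  G = 100;  n_iter = 100;
gammas = zeros(n_iter, 1);
accs   = zeros(n_iter, 1);
for it = 1:n_iter
    cv = cvpartition(size(X, 1), 'Holdout', 0.9);
    model = svmtrain(y(training(cv)), X(training(cv), :), ...
                     sprintf('-t 2 -c %g -g %g -q', C, G));
    [~, a, ~] = svmpredict(y(test(cv)), X(test(cv), :), model);
    gammas(it) = G;
    accs(it)   = a(1);
    G = G / 1.1;                          % shrink gamma for the next rerun
end
semilogx(gammas, accs, '-o'); grid on;    % final comparison of accuracy with G
xlabel('gamma (G)'); ylabel('Test accuracy (%)');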
Figure 11 - Training & Testing Data for C = 10^2 & G = 3×10^-1
Figure 12 - Training & Testing Data for C = 10^2 & G = 8×10^-2
Figure 13 - Training & Testing Data for C = 10^2 & G = 9×10^-3 and lower
Figure 14 - Final plot, Comparison of Accuracy with G
3. COMPARISON OF EXPERIMENTAL RESULTS
In order to show the validity and the accuracy of
classification of our algorithms, we performed a series of
experiments on standard benchmark data-sets. In this series
of experiments, the data were split into training and test sets.
The differences between the algorithms are the following: in the Linear kernel, four values have been used for the C parameter, so that the output can be checked. On the other hand, in the Radial Basis Function kernel, the C parameter is kept fixed at 10^2, while the Gamma parameter changes constantly from 10^2 down to 8×10^-2.
As far as we can perceive from the two final graphical representations, the results achieved by the two kernels are almost the same. More specifically, both kernels achieve the same level of accuracy, almost 93%.
In the current dataset the data are essentially linearly separable, so we cannot make a real comparison of the two kernels. On a different dataset with non-linear data, however, the radial basis function kernel would generalize much better than the Linear kernel.
In the table below, we can observe the elements of each algorithm separately and compare the results.
Table 1 - Linear kernel in comparison with RBF kernel results

Isolet Dataset            | Linear Kernel                | RBF Kernel
Instances                 | 7797                         | 7797
Attributes                | 617                          | 617
Train Data                | 780                          | 780
Test Data                 | 7017                         | 7017
Iterations                | 100                          | 100
C                         | 10^-4 / 10^-3 / 10^1 / 10^2  | 10^2
G                         | -                            | 8×10^-2 to 10^2
Accuracy (%)              | 0.85 / 8.28 / 92.61          | 0.85 / 8.49 / 92.73
Decision boundary         | Linear                       | Nonlinear
Related distance function | Euclidean distance           | Euclidean distance
Regularization [18]       | Training-set cross-validation to select C (defining the misclassification penalty) | Training-set cross-validation to select C and γ (defining the RBF width)
4. REFERENCES
[1] C. Cortes and V. Vapnik, “Support-vector network,” Machine
Learning, vol. 20, pp. 273–297, 1995.
[2] C.-W. Hsu and C.-J. Lin, “A comparison of methods for multi-
class support vector machines,” IEEE Transactions on Neural
Networks, vol. 13, no. 2, pp. 415–425, 2002.
[3] T. Joachims, “Text categorization with support vector
machines: learning with many relevant features,” in Proceedings of
ECML-98, 10th European Conference on Machine Learning, 1998,
number 1398, pp. 137–142.
[4] S. Ertekin, “Learning in Extreme Conditions: Online and
Active Learning with Massive, Imbalanced and Noisy Data,”
Citeseer, 2009.
[5] R. S. Shah, “Support Vector Machines for Classification and
Regression,” McGill University, 2007.
[6] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support
vector machines, 2001, Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[7] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee,
“Choosing multiple parameters for support vector machines,”
Machine Learning, vol. 46, pp. 131–159, 2002.
[8] S. Sathiya Keerthi, “Efficient tuning of SVM hyperparameters
using radius/margin bound and iterative algorithms,” IEEE
Transactions on Neural Networks, 2002.
[9] K. Duan, S. S. Keerthi, and A. N. Poo, “Evaluation of simple
performance measures for tuning SVM hyperparameters,”
Neurocomputing, 2002.
[10] D. DeCoste and K. Wagstaff, “Alpha seeding for support
vector machines,” in Proceedings of International Conference on
Knowledge Discovery and Data Mining (KDD-2000), 2000.
[11] Y.-Y. Ou, C.-Y. Chen, S.-C. Hwang, and Y.-J. Oyang,
“Expediting model selection for support vector machines based on
data reduction,” in Systems, Man and Cybernetics, 2003 IEEE International Conference on, 2003, vol. 1, pp. 786–791.
[12] UCI Repository, UCI Machine Learning Repository,
http://www.ics.uci.edu/mlearn/MLRepository.html
[13] S. S. Keerthi and C.-J. Lin. Asymptotic behaviors of support
vector machines with Gaussian kernel. Neural Computation,
15(7):1667–1689, 2003.
[14] H.-T. Lin and C.-J. Lin. A study on sigmoid kernels for SVM
and the training of non-PSD kernels by SMO-type methods.
Technical report, Department of Computer Science, National
Taiwan University, 2003.
[15] V. Vapnik. The Nature of Statistical Learning Theory.
Springer-Verlag, New York, NY, 1995.
[16] D. Michie, D. J. Spiegelhalter, C. C. Taylor, and J. Campbell,
editors. Machine learning, neural and statistical classification.
Ellis Horwood, Upper Saddle River, NJ, USA, 1994. ISBN 0-13-
106360-X. Data available at http://archive.ics.uci.edu/ml/machine-
learning-databases/statlog/
[17] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A
Practical Guide to Support Vector Classification. National Taiwan
University, Taipei 106, Taiwan. Last updated: April 15, 2010.
[18] M. Misaki, Y. Kim, P. A. Bandettini, and N. Kriegeskorte,
“Comparison of multivariate classifiers and response
normalizations for pattern-information fMRI,” Neuroimage, vol.
53, no. 1, pp. 103–118, Oct. 2010.