SVM CLASSIFICATION WITH LINEAR AND RBF KERNELS
Vasileios Apostolidis-Afentoulis
Department of Information Technology
Alexander TEI of Thessaloniki
P.O. Box 141, 574 00,
Thessaloniki, Greece
vapostolidis@gmail.com
Konstantina-Ina Lioufi
Department of Information Technology
Alexander TEI of Thessaloniki
P.O. Box 141, 574 00,
Thessaloniki, Greece
ntinaki_l@hotmail.com
ABSTRACT
This paper surveys existing research and development efforts involving the use of Matlab for classification. In particular, it aims at providing a representative view of support vector machines and of the way they can be trained on and learn from various kinds of data. Two kernel algorithms are presented with short overviews, then discussed separately and finally compared on the basis of the results, including a few figures. Finally, a summary of the considered systems is presented together with the experimental results.
Index Terms— SVM, Classification, Matlab, Linear,
RBF
1. INTRODUCTION
1.1. Support vector machines
The support vector machine (SVM) technique was introduced by Vapnik [1] and has developed rapidly in recent years. Several studies have reported that SVMs are generally able to deliver higher classification accuracy than other existing classification algorithms [2], [3]. In the last decade, Support Vector Machines (SVMs) have emerged as an important learning technique for solving classification and regression problems in various fields, most notably in computational biology, finance and text categorization. This is due in part to built-in mechanisms that ensure good generalization, which leads to accurate prediction; the use of kernel functions to model non-linear distributions; the ability to train relatively quickly on large datasets using novel mathematical optimization techniques; and, most significantly, the possibility of theoretical analysis using computational learning theory [5]. The main objective of statistical learning
is to find a description of an unknown dependency between
measurements of objects and certain properties of these
objects. The measurements, also known as "input variables",
are assumed to be observable in all objects of interest. On
the contrary, the properties of the objects, or "output
variables", are in general available only for a small subset of
objects known as examples. The purpose of estimating the
dependency between the input and output variables is to be
able to determine the values of output variables for any
object of interest. In pattern recognition, this relates to trying to estimate a function f: R^N → {±1} that can correctly classify new examples based on past observations [4]. The SVM software that has been used is LIBSVM [6], with the linear kernel and the RBF (radial basis function) kernel. As
the execution time of model selection is such an important
issue for practical applications of SVM, a number of studies
have been conducted on this topic [7], [8], [9], [10]. The
basic approach employed by these recent studies is to reduce
the search space of the parameter combinations [11].
1.2. Linear SVMs
1.2.1 Separable case
In the binary classification setting, let $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ be the training dataset, where the $x_i$ are the feature vectors representing the instances (i.e. observations) and $y_i \in \{-1, +1\}$ are the labels of the instances. Support vector learning
is the problem of finding a separating hyperplane that
separates the positive examples (labeled +1) from the
negative examples (labeled -1) with the largest margin. The
margin of the hyperplane is defined as the shortest distance
between the positive and negative instances that are closest
to the hyperplane. The intuition behind searching for the
hyperplane with a large margin is that a hyperplane with the
largest margin should be more resistant to noise than a
hyperplane with a smaller margin.
Formally, suppose that all the data satisfy the constraints

$$ x_i \cdot w + b \ge +1 \quad \text{for } y_i = +1, \qquad (1) $$
$$ x_i \cdot w + b \le -1 \quad \text{for } y_i = -1, \qquad (2) $$

where $w$ is the normal to the hyperplane, $|b| / \|w\|$ is the perpendicular distance from the hyperplane to the origin, and $\|w\|$ is the Euclidean norm of $w$.
Figure 1-A hyperplane separating two classes with the maximum margin.
The circled examples that lie on the canonical hyperplanes are called
support vectors.
The two constraints can be conveniently combined into the following:

$$ y_i (x_i \cdot w + b) - 1 \ge 0 \quad \forall i. \qquad (3) $$
The training examples for which (3) holds with equality lie on the canonical hyperplanes (H1 and H2 in figure 1). The margin ρ can then be easily computed as the distance between H1 and H2:

$$ \rho = \frac{2}{\|w\|}. \qquad (4) $$

Hence, the maximum-margin separating hyperplane can be constructed by solving the following primal optimization problem:

$$ \min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (x_i \cdot w + b) \ge 1 \;\; \forall i. \qquad (5) $$
We switch to the Lagrangian formulation of this problem for two main reasons: i) the constraints are easier to handle, and ii) the training data only appear as dot products between vectors. This formulation introduces a Lagrange multiplier $\alpha_i \ge 0$ for each constraint in (5), and the minimization problem then becomes

$$ L_P = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i y_i (x_i \cdot w + b) + \sum_i \alpha_i. \qquad (6) $$

The objective is then to minimize (6) with respect to $w$ and $b$, and simultaneously require that the derivatives of $L_P$ with respect to all the $\alpha_i$ vanish.
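For completeness, setting the derivatives of $L_P$ with respect to $w$ and $b$ to zero gives the standard stationarity conditions, and substituting them back into (6) yields the dual problem that is solved in practice:

$$ \frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i, \qquad \frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0, $$
$$ L_D = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j), \quad \text{maximized subject to } \alpha_i \ge 0 \text{ and } \sum_i \alpha_i y_i = 0. $$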
1.2.2 Non-Separable case
The previous section discussed the case where it is
possible to linearly separate the training instances that
belong to different classes. Obviously this SVM formulation
will not find a solution if the data cannot be separated by a
hyperplane. Even in the cases where the data is linearly
separable, SVM may overfit to the training data in its search
for the hyperplane that completely separates all of the
instances of both classes. For instance, an individual outlier
in a dataset, such as a pattern which is mislabeled, can
crucially affect the hyperplane. These concerns prompted the development of soft-margin SVMs [10], which can handle linearly non-separable data by introducing positive slack variables $\xi_i$ that relax the constraints in (1) and (2) at a cost proportional to the value of $\xi_i$. Based on this new criterion, the relaxed constraints with the slack variables become

$$ x_i \cdot w + b \ge +1 - \xi_i \quad \text{for } y_i = +1, $$
$$ x_i \cdot w + b \le -1 + \xi_i \quad \text{for } y_i = -1, \qquad (7) $$
$$ \xi_i \ge 0 \quad \forall i, $$

which permits some instances to lie inside the margin or even cross further into the region of the opposite class (see figure 2). While this relaxation gives the SVM flexibility to decrease the influence of outliers, from an optimization perspective it is not desirable to have arbitrarily large values for $\xi_i$, as that would cause the SVM to obtain trivial and sub-optimal solutions.
Figure 2 -Soft margin SVM
Thus, the relaxation is constrained by making the slack variables part of the objective function (5), yielding

$$ \min_{w, b, \xi} \;\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \qquad (8) $$
subject to the constraints in (7). The cost coefficient C > 0 is a hyperparameter that specifies the misclassification penalty and is tuned by the user based on the classification task and dataset characteristics. As in the separable case, the solution to (8) can be shown to have an expansion

$$ w = \sum_i \alpha_i y_i x_i, \qquad (9) $$

where the training instances with $\alpha_i > 0$ are the support vectors of the SVM solution. Note that the penalty term for the slack variables is linear, so the $\xi_i$ disappear when (8) is transformed into the dual formulation

$$ \max_{\alpha} \;\; \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0. \qquad (10) $$
The dual formulation is conveniently very similar to the linearly separable case, the only difference being the extra upper bound C on the coefficients $\alpha_i$. Obviously, as the misclassification penalty $C \to \infty$, (10) converges to the linearly separable case.
1.3 RBF SVMs
1.3.1 General
In general, the RBF kernel is a reasonable first choice.
This kernel nonlinearly maps samples into a higher
dimensional space, so it, unlike the linear kernel, can handle
the case when the relation between class labels and
attributes is nonlinear. Furthermore, the linear kernel is a
special case of the RBF kernel [13], since the linear kernel with a penalty parameter $\tilde{C}$ has the same performance as the RBF
kernel with some parameters (C, γ). In addition, the sigmoid
kernel behaves like RBF for certain parameters [14].
The second reason is the number of hyperparameters
which influences the complexity of model selection. The
polynomial kernel has more hyperparameters than the RBF
kernel. Finally, the RBF kernel has fewer numerical
difficulties. One key point is that $0 < K_{ij} \le 1$, in contrast to polynomial kernels, whose values may go to infinity ($\gamma x_i^T x_j + r > 1$) or zero ($\gamma x_i^T x_j + r < 1$) when the degree is large. Moreover, we must note that the sigmoid kernel is not valid (i.e. not the inner product of two vectors) under some parameters [15]. There are some situations
where the RBF kernel is not suitable. In particular, when the
number of features is very large, one may just use the linear
kernel.
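As a concrete illustration, the following is a minimal Matlab sketch (not part of the original experiments) of the RBF kernel matrix K(i,j) = exp(-γ‖x_i − x_j‖^2) that underlies the discussion above; pdist2 is assumed to be available from the Statistics Toolbox:

% Minimal sketch of the RBF (Gaussian) kernel matrix between two sets of row vectors.
% K(i,j) = exp(-gamma * ||X1(i,:) - X2(j,:)||^2); all values lie in (0, 1].
function K = rbf_kernel(X1, X2, gamma)
    D2 = pdist2(X1, X2).^2;   % squared Euclidean distances (Statistics Toolbox)
    K  = exp(-gamma * D2);    % shrinking gamma flattens K towards all ones
end

For gamma close to 0 the kernel matrix tends to all ones (underfitting), while for very large gamma it tends to the identity (overfitting), which is one more reason why γ has to be tuned.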
1.3.2 Cross-validation and Grid-search
There are two parameters for an RBF kernel: C and γ. It is
not known beforehand which C and γ are best for a given
problem; consequently some kind of model selection
(parameter search) must be done. The goal is to identify
good (C, γ) so that the classifier can accurately predict
unknown data (i.e. testing data). Note that it may not be
useful to achieve high training accuracy (i.e. a classifier
which accurately predicts training data whose class labels
are indeed known). As discussed above, a common strategy
is to separate the data set into two parts, of which one is
considered unknown. The prediction accuracy obtained
from the “unknown” set more precisely reflects the
performance on classifying an independent data set. An
improved version of this procedure is known as cross-
validation [17].
In v-fold cross-validation, we first divide the training set
into v subsets of equal size. Sequentially one subset is tested
using the classifier trained on the remaining v −1 subsets.
Thus, each instance of the whole training set is predicted
once so the cross-validation accuracy is the percentage of
data which are correctly classified.
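With the LIBSVM Matlab interface used in this work, v-fold cross-validation can be obtained directly through the '-v' option, which makes svmtrain return the cross-validation accuracy instead of a model. A small sketch, assuming the training labels and instances are stored in ytrain and Xtrain:

% 5-fold cross-validation accuracy for one (C, gamma) pair with LIBSVM.
% With '-v' in the options, svmtrain returns the CV accuracy (in percent), not a model.
C = 2^3;  gamma = 2^-5;
opts   = sprintf('-t 2 -c %g -g %g -v 5 -q', C, gamma);   % -t 2 selects the RBF kernel
cv_acc = svmtrain(ytrain, Xtrain, opts);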
The cross-validation procedure can prevent the overfitting
problem. Figure 3 represents a binary classification problem
to illustrate this issue. Filled circles and triangles are the
training data while hollow circles and triangles are the
testing data. The testing accuracy of the classifier in Figures
3a and 3b is not good since it overfits the training data. If
we think of the training and testing data in Figure 3a and 3b
as the training and validation sets in cross-validation, the
accuracy is not good. On the other hand, the classifier in 3c
and 3d does not overfit the training data and gives better
cross-validation as well as testing accuracy.
A “grid-search” on C and γ using cross-validation is recommended. Various pairs of (C, γ) values are tried and the one with the best cross-validation accuracy is picked. Trying exponentially growing sequences of C and γ has been found to be a practical method to identify good parameters (for example, C = 2^-5, 2^-3, ..., 2^15, γ = 2^-15, 2^-13, ..., 2^3). The grid-search is straightforward but may seem naive. In fact, there are several advanced methods which can save computational cost by, for example, approximating the cross-validation rate. However, there are two reasons why we prefer the simple grid-search approach [17].
Figure 3 - (a) Training data and an overfitting classifier; (b) applying an overfitting classifier on testing data; (c) training data and a better classifier; (d) applying a better classifier on testing data. An overfitting classifier and a better classifier (● and ▲: training data; O and ∆: testing data).
One is that, psychologically, we may not feel safe to
use methods which avoid doing an exhaustive parameter
search by approximations or heuristics. The other reason
is that the computational time required to find good
parameters by grid-search is not much more than that by
advanced methods since there are only two parameters.
Furthermore, the grid-search can be easily parallelized
because each (C, γ) is independent. Many advanced methods are iterative processes, e.g. walking along a path, which can be hard to parallelize [17].
Since doing a complete grid-search may still be time-consuming, we recommend using a coarse grid first. After identifying a “better” region on the grid, a finer grid search on that region can be conducted. To illustrate this, we do an experiment on the problem german from the Statlog collection [16]. After scaling this set, we first use a coarse grid (Figure 5) and find that the best (C, γ) is (2^3, 2^-5) with a cross-validation rate of 77.5%. Next we conduct a finer grid search on the neighborhood of (2^3, 2^-5) (Figure 6) and obtain a better cross-validation rate of 77.6% at (2^3.25, 2^-5.25). After the best (C, γ) is found, the whole training set is trained again to generate the final classifier.
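A minimal Matlab sketch of this coarse-to-fine search with LIBSVM, assuming the (scaled) training features and labels are stored in Xtrain and ytrain:

% Coarse grid search over (C, gamma) using 5-fold cross-validation accuracy.
best = struct('acc', -Inf, 'C', NaN, 'gamma', NaN);
for log2c = -5:2:15                       % C = 2^-5, 2^-3, ..., 2^15
    for log2g = -15:2:3                   % gamma = 2^-15, 2^-13, ..., 2^3
        opts = sprintf('-t 2 -c %g -g %g -v 5 -q', 2^log2c, 2^log2g);
        acc  = svmtrain(ytrain, Xtrain, opts);      % '-v 5' => CV accuracy in percent
        if acc > best.acc
            best = struct('acc', acc, 'C', 2^log2c, 'gamma', 2^log2g);
        end
    end
end
% A finer grid (e.g. steps of 0.25 in log2) is then searched around (best.C, best.gamma),
% and the final classifier is retrained on the whole training set with the best pair:
final_model = svmtrain(ytrain, Xtrain, sprintf('-t 2 -c %g -g %g -q', best.C, best.gamma));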
The above approach works well for problems with
thousands or more data points. For very large data sets a
feasible approach is to randomly choose a subset of the data
set, conduct grid-search on them, and then do a better-
region-only grid-search on the complete data set.
Figure 5 - Loose grid search on C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3. [17]
Figure 6 - Fine grid search on C = 2^1, 2^1.25, ..., 2^5 and γ = 2^-7, 2^-6.75, ..., 2^-3. [17]
1.4 Dataset Description
Title of Dataset: ISOLET (Isolated Letter Speech
Recognition) [12]
This data set was generated as follows: 150 subjects spoke
the name of each letter of the alphabet twice. Hence, we
have 52 training examples from each speaker. The speakers
are grouped into sets of 30 speakers each, and are referred to
as isolet1, isolet2, isolet3, isolet4, and isolet5. The data
appears in isolet1+2+3+4.data in sequential order, first the
speakers from isolet1, then isolet2, and so on. The test set,
isolet5, is a separate file. Note that 3 examples are missing.
They were dropped due to difficulties in recording. This is a
good domain for a noisy, perceptual task. It is also a very
good domain for testing the scaling abilities of algorithms.
We have formatted the two separate files into one data file (isolet12345.data) for convenience, and we provide it as well.
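A possible Matlab sketch of this file combination (the file names are those used in this paper; the sketch assumes the files contain only comma-separated numeric values, so any non-numeric trailing characters would first have to be stripped):

% Combine the ISOLET training files and the test file into a single data file.
train_part = dlmread('isolet1+2+3+4.data');   % 6238 x 618 (617 features + class label)
test_part  = dlmread('isolet5.data');         % 1559 x 618
all_data   = [train_part; test_part];         % 7797 instances in total
dlmwrite('isolet12345.data', all_data);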
The number of instances from isolet1+2+3+4.data is 6238
and from isolet5.data is 1559. The total number of instances
is 7797. The number of attributes is 617 plus 1 for the class
which is the last column. All attributes are continuous and real-valued, scaled into the range of -1.0 to 1.0. The
features include spectral coefficients, contour features,
sonorant features, pre-sonorant and post-sonorant features.
There are no missing attribute values.
2. EXPERIMENTS
2.1 General Explanation - Linear Experiments
First of all, recall that the given dataset was split into two separate files, so we had to combine them into one. The file combination was verified with a program called WinMerge, an open-source differencing and merging tool for Windows.
A hold-out cross-validation split has been used to divide the dataset into random subsets of data. For this implementation in
Matlab, the «Holdout» parameter is used to set the fraction of data that is left out of the training procedure. Keeping only 10% of the data for training, two kinds of indexes are created, one for the training and one for the testing data. Up to this point, the process serves to determine the total number of instances. The next part contains the selection of the SVM training model, where the parameter C (the misclassification penalty explained in the introduction) is also included, and a vector is created to store the values needed for the experiments.
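A minimal sketch of this split and of the linear-kernel training, assuming the combined data matrix is loaded in a variable named data (last column = class label), cvpartition from the Statistics Toolbox is used for the hold-out split, and svmtrain comes from LIBSVM's Matlab interface:

% Hold-out split: keep only ~10% of the data for training, the rest for testing.
X = data(:, 1:end-1);   y = data(:, end);
cvp      = cvpartition(numel(y), 'HoldOut', 0.9);   % 90% held out for testing
trainIdx = training(cvp);   testIdx = test(cvp);
C = 10;                                             % misclassification penalty
model = svmtrain(y(trainIdx), X(trainIdx,:), sprintf('-t 0 -c %g -q', C));  % -t 0: linear kernel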
After «svmtrain» has been set up, it is «svmpredict»'s turn to take the lead. As returned values, we get the predicted labels and the accuracy, and we ignore the last output, which carries the decision values of the trained model. This step is split into two parts, one for the training and one for the testing data. The results of the training and testing predictions are also printed into a plot with two subplots: the first subplot presents the training data and the other one the testing data.
The whole procedure is encapsulated in a loop of one hundred iterations, so that an average accuracy value can be captured for the training and testing predictions.
Finally, there is one more plot, a semi-logarithmic one which, because of the nature of the C parameter, helps us comprehend the results more easily. This plot compares the mean accuracy with the C parameter.
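A compact sketch of this evaluation loop and of the final semi-logarithmic plot (the variable names are illustrative; the split and training follow the previous sketch):

% Average test accuracy over 100 random hold-out splits, for each value of C.
C_values = [1e-4 1e-3 1e1 1e2];
mean_acc = zeros(size(C_values));
for k = 1:numel(C_values)
    accs = zeros(100, 1);
    for it = 1:100                                    % one hundred iterations
        cvp = cvpartition(numel(y), 'HoldOut', 0.9);  % fresh random 10% training split
        tr = training(cvp);   te = test(cvp);
        model = svmtrain(y(tr), X(tr,:), sprintf('-t 0 -c %g -q', C_values(k)));
        [~, acc, ~] = svmpredict(y(te), X(te,:), model);
        accs(it) = acc(1);                            % acc(1) is the accuracy in percent
    end
    mean_acc(k) = mean(accs);
end
semilogx(C_values, mean_acc, '-o');                   % mean accuracy vs. C on a log axis
xlabel('C');  ylabel('Mean accuracy (%)');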
In the following figures, you can see the visualization of the training and testing procedure for some specific values of C. The X axis represents the number of classes and the Y axis represents the total number of instances that have been used. From the results, we can see that for C=10^-4 we get a very low accuracy of 0.85% (figure 7). For C=10^-3, the accuracy rises to 8.28% (figure 8). For C=10^1 and higher values, our system reaches its highest peak (figure 9), with the accuracy reaching 92.61% (figure 10).
Figure 7 - Training & testing data for C=10^-4
Figure 8 - Training & testing data for C=10^-3
Figure 9 - Training & testing data for C=10^1 and higher values
Figure 10 - Final plot, comparison of accuracy with C
2.2 RBF Experiments
The first step in the SVM-RBF code is to load the dataset and to set values for the parameters C and G (gamma). The initial value of both C and G is set to 100. C is kept fixed, while G changes constantly. Then the data are initialized and the path where the graphs will be stored is defined. A large part of the program consists of the iterations; in this section, cross-validation is implemented. Two pairs of indexes have been made:
- The first pair of indexes contains the instances and the labels of the training data.
- The second pair of indexes contains the instances and the labels of the testing data.
The training of the RBF kernel then starts, followed by the handling of the parameters that have already been set. A series of runs tests the model, and the prediction values, the accuracy and the predicted labels are returned. The elements of the accuracy vector are then inspected and converted into string format. The next part of the program concerns the graphs. A plot is created with two subplots; green circles and red dots are displayed in the subplots. Moreover, a legend is created, including the titles of the graphs, the number of each iteration, the accuracy and the values of the parameters C and G. After that, the output data is temporarily exported into an xls file, the image file type for saving is set, and the kind of information that will appear in each graph (minimum/maximum accuracy) is selected. The most important part of this section is the reduction of G, by dividing it by 1.1 in each rerun. The X axis represents the number of classes and the Y axis represents the total number of instances that have been used.
Finally, there is a procedure that compares the values of G with the accuracy, and this appears in the final graph. Taking a quick look at this graph (figure 14), it turns out that the accuracy reaches very high values, up to 92.71%.
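A minimal sketch of this RBF loop and of the final accuracy-versus-G comparison, reusing the variables of the earlier linear-kernel sketches (X, y, trainIdx, testIdx) and keeping C fixed at 100 while G is divided by 1.1 in every rerun:

% RBF experiments: C fixed, gamma (G) reduced by a factor of 1.1 at each iteration.
C = 100;   G = 100;   n_runs = 100;
test_acc = zeros(n_runs, 1);   G_values = zeros(n_runs, 1);
for it = 1:n_runs
    model = svmtrain(y(trainIdx), X(trainIdx,:), sprintf('-t 2 -c %g -g %g -q', C, G));
    [~, acc, ~] = svmpredict(y(testIdx), X(testIdx,:), model);
    test_acc(it) = acc(1);
    G_values(it) = G;
    G = G / 1.1;                                 % reduce gamma for the next rerun
end
semilogx(G_values, test_acc, '-o');              % final comparison of accuracy with G
xlabel('G (gamma)');   ylabel('Accuracy (%)');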
Figure 11 - Training & testing data for C=10^2 and G=3*10^-1
Figure 12 - Training & testing data for C=10^2 and G=8*10^-2
Figure 13 - Training & testing data for C=10^2, G=9*10^-3 and lower
Figure 14 - Final plot, comparison of accuracy with G
3. COMPARISON OF EXPERIMENTAL RESULTS
In order to show the validity and the accuracy of
classification of our algorithms, we performed a series of
experiments on standard benchmark data-sets. In this series
of experiments, the data were split into training and test sets.
The differences between the algorithms are the following: in the linear kernel, four values have been used for the C parameter, so that the output can be checked. On the other hand, in the radial basis function kernel, the C parameter is kept fixed at 10^2, while the gamma parameter changes constantly from 10^2 down to 8*10^-2.
As far as we can perceive from the two final graphical representations, the results from the two kernels are almost the same. More specifically, both kernels achieve the same level of accuracy, almost 93%.
In the current dataset the data are essentially linearly separable, so we cannot make a real comparison of the two kernels. In a different dataset with non-linear data, however, the radial basis function kernel would generalize much better than the linear kernel.
In the table below, we can observe the results of each algorithm separately and compare them.
Table 1 - Linear kernel in comparison with RBF kernel results (Isolet dataset)

                            Linear Kernel                       RBF Kernel
Instances                   7797                                7797
Attributes                  617                                 617
Train data                  780                                 780
Test data                   7017                                7017
Iterations                  100                                 100
C                           10^-4 / 10^-3 / 10^1 / 10^2         10^2
G                           -                                   8*10^-2 - 10^2
Accuracy (%)                0.85 / 8.28 / 92.61                 0.85 / 8.49 / 92.73
Decision boundary           Linear                              Nonlinear
Related distance function   Euclidean distance                  Euclidean distance
Regularization [18]         Training-set cross-validation       Training-set cross-validation
                            to select C (defining the           to select C and γ (defining
                            misclassification penalty)          the RBF width)
4. REFERENCES
[1] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, pp. 273–297, 1995.
[2] C.-W. Hsu and C.-J. Lin, “A comparison of methods for multi-
class support vector machines,” IEEE Transactions on Neural
Networks, vol. 13, no. 2, pp. 415–425, 2002.
[3] T. Joachims, “Text categorization with support vector
machines: learning with many relevant features,” in Proceedings of
ECML-98, 10th European Conference on Machine Learning, 1998,
number 1398, pp. 137–142.
[4] S. Ertekin, “Learning in Extreme Conditions: Online and
Active Learning with Massive, Imbalanced and Noisy Data,”
Citeseer, 2009.
[5] R. S. Shah, “Support Vector Machines for Classification and
Regression,” McGill University, 2007.
[6] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support
vector machines, 2001, Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[7] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee,
“Choosing multiple parameters for support vector machines,”
Machine Learning, vol. 46, pp. 131–159, 2002.
[8] S. Sathiya Keerthi, “Efficient tuning of SVM hyperparameters
using radius/margin bound and iterative algorithms,” IEEE
Transactions on Neural Networks, 2002.
[9] K. Duan, S. S. Keerthi, and A. N. Poo, “Evaluation of simple
performance measures for tuning SVM hyperparameters,”
Neurocomputing, 2002.
[10] D. DeCoste and K. Wagstaff, “Alpha seeding for support
vector machines,” in Proceedings of International Conference on
Knowledge Discovery and Data Mining (KDD-2000), 2000.
[11] Y.-Y. Ou, C.-Y. Chen, S.-C. Hwang, and Y.-J. Oyang, “Expediting model selection for support vector machines based on data reduction,” in Systems, Man and Cybernetics, 2003 IEEE International Conference on, 2003, vol. 1, pp. 786–791.
[12] UCI Machine Learning Repository, http://www.ics.uci.edu/mlearn/MLRepository.html
[13] S. S. Keerthi and C.-J. Lin, “Asymptotic behaviors of support vector machines with Gaussian kernel,” Neural Computation, 15(7):1667–1689, 2003.
[14] H.-T. Lin and C.-J. Lin. A study on sigmoid kernels for SVM
and the training of non-PSD kernels by SMO-type methods.
Technical report, Department of Computer Science, National
Taiwan University, 2003.
[15] V. Vapnik. The Nature of Statistical Learning Theory.
Springer-Verlag, New York, NY, 1995.
[16] D. Michie, D. J. Spiegelhalter, C. C. Taylor, and J. Campbell,
editors. Machine learning, neural and statistical classification.
Ellis Horwood, Upper Saddle River, NJ, USA, 1994. ISBN 0-13-
106360-X. Data available at http://archive.ics.uci.edu/ml/machine-
learning-databases/statlog/
[17] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A
Practical Guide to Support Vector Classification. National Taiwan
University, Taipei 106, Taiwan. Last updated: April 15, 2010.
[18] M. Misaki, Y. Kim, P. A. Bandettini, and N. Kriegeskorte,
“Comparison of multivariate classifiers and response
normalizations for pattern-information fMRI,” Neuroimage, vol.
53, no. 1, pp. 103–118, Oct. 2010.