Conference Paper

Support Vector Machines--An Overview

Author: Tapan P Bagchi
Abstract

Support Vector Machines (SVMs) are supervised machine learning algorithms used to classify featured objects. The objective is to find a hyperplane in an n-dimensional feature space that cleanly separates the data points representing the objects in that space. This overview covers hard- and soft-margin SVMs, the application of kernel functions, the use of Excel to build and apply modest-size classifiers, and numerical illustrations.
Support Vector Machines
Tapan P Bagchi
Online Workshop held January 27-30, 2022 on
Advanced Business Analytics
Instructor-in-Charge Sujoy Bhattacharya
IIT Kharagpur
1
Twelve graduates have applied for selection to the IPS. We seek two attributes: (a) physical fitness, and (b) leadership. How would you classify them into accepted/rejected? One solution is by SVM, a supervised ML method.
Observe → Model → Train/Learn → Predict/Act
How would you classify IPS applicants?
Stakeholders all want top-performing officers:
- Aspiring candidates
- Government administrators
- Society
- Judiciary
- Serving and retired IPS officers, who can supply data on features:
  - Communication skills
  - Judgement skills
  - Analytical skills
  - Research skills
  - People skills
  - Fitness
  - Perseverance
2
Latent Factors and Features Measured in IPS Selection
Latent factors (cannot be observed directly) and the observed (measurable) features that proxy them:
- Physical Fitness: time to sprint 100 m, standing jump (m), bench press (kg)
- Leadership: represents college, all-rounder, IQ, Dean's list, debater
3
In this lecture we use only two measurable features of each candidate to judge (classify) them as accepted/rejected.
(Figure: scatter plot of weighted scores, 100 m sprint time in secs vs IQ score, for the candidate data.)
SVM's Goal: Classifying sample vectors (x1, x2, etc.) in the feature space as + or −
Inventor: Vladimir Vapnik (1995). Instructor: the late Prof. Patrick Winston, MIT.
4
Cancer Diagnostics: Medical science is a big area for SVM applications
(Figure: sample data points, or vectors.)
5
Pictorial display of SVM in the Feature Space
Sample vector points are linearly separable into two classes.
6
Notations from Linear Algebra used in SVM
Vector x = (x1, x2, …, xl)
Transpose of vector x: xᵀ
Dot product: wᵀx = xᵀw = [w1 w2 w3]·[x1 x2 x3] = w1x1 + w2x2 + w3x3 = ||w|| ||x|| cos θ
Unit vector of w = w / ||w||
Magnitude ||w|| = √(w1² + w2² + …)
Equations for hyperplanes:
y = mx + c … line
y = w1x1 + w2x2 + b = [w1 w2]·[x1 x2] + b = wᵀx + b … plane
w ~ orientation; b ~ offset from origin
7
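As a quick numeric illustration of this notation (a sketch added for this overview, not part of the original slides; the numbers are made up):

```python
import numpy as np

w = np.array([1.0, 2.0, 2.0])
x = np.array([3.0, 0.0, 4.0])

dot = w @ x                                       # w^T x = w1*x1 + w2*x2 + w3*x3 = 11
norm_w = np.linalg.norm(w)                        # ||w|| = sqrt(1 + 4 + 4) = 3
unit_w = w / norm_w                               # unit vector of w
cos_theta = dot / (norm_w * np.linalg.norm(x))    # from w^T x = ||w|| ||x|| cos(theta)
print(dot, norm_w, unit_w, cos_theta)
```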
A Linear Separator
Binary classification can be viewed as the task of separating the two classes present in the feature space by f(x):
f(x) = sign(wᵀx + b)
The separating hyperplane is wᵀx + b = 0; on one side wᵀx + b > 0, on the other wᵀx + b < 0.
w: orientation; b: offset from origin (recall y = mx + c).
Sample vector x = (x1, x2). Each sample x here has 2 (= l) features; w has l dimensions.
wi = coefficient of xi in the equation of the hyperplane; b = bias in the equation of the hyperplane.
8
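A minimal sketch (not from the lecture) of this linear separator, assuming a hypothetical w and b:

```python
import numpy as np

def f(x, w, b):
    """Classify a sample: +1 on one side of the hyperplane w^T x + b = 0, -1 on the other."""
    return int(np.sign(w @ x + b))

w, b = np.array([1.0, -1.0]), 0.5                 # hypothetical orientation w and offset b
print(f(np.array([2.0, 1.0]), w, b))              # 1*2 - 1*1 + 0.5 =  1.5 -> +1
print(f(np.array([0.0, 2.0]), w, b))              # 1*0 - 1*2 + 0.5 = -1.5 -> -1
```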
Many Linear Separators are possible
Which of the linear separators is optimal? The best separator should be farthest from both the red and the blue points: this will minimize misclassification of new points.
(Figure: training data points in the feature space and various possible linear separators.)
9
Challenge: Find the hyperplane from classified training samples by maximizing the margin ρ (the distance between the two classes). Maximizing ρ yields the SVM's optimum decision parameters (w, b).
10
Hard Margin SVM Classification
Maximizing the margin is good according to intuition and to PAC (probably approximately correct) theory.
It implies that only the support vectors matter; the other training examples are ignored in the SVM's hyperplane construction.
(There are n samples, with l features in each sample.)
11
The Classification Margin ρ
The distance r from an example xi to the separator is r = |wᵀxi + b| / ||w||.
Examples closest to the hyperplane are support vectors. The margin ρ of the separator is the distance between the support vectors on the two sides.
Each point in this feature space is a vector xi = (x1, x2); w1 and w2 are the optimum weights that orient the hyperplane in the feature space.
12
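A small numeric check of the distance formula and the resulting margin (a sketch with made-up values, added for this overview):

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -5.0            # hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0
xi = np.array([2.0, 1.0])                     # a sample point

r = abs(w @ xi + b) / np.linalg.norm(w)       # distance from xi to the separator
rho = 2.0 / np.linalg.norm(w)                 # margin width when support vectors satisfy |w^T x + b| = 1
print(r, rho)                                 # 1.0  0.4
```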
The width (= ρ) of the margin needs a dot product of x2 − x1 with the unit vector u along the perpendicular.
13
Distance of line AB from the origin
Take a vector w that is perpendicular to line AB. Find the unit vector u of w. Take an arbitrary point x on AB.
Hence the distance of line AB from the origin = uᵀx = u·x (a dot product).
Note: the unit vector u of w = w / ||w||.
14
The classification rule uses the dot product of the perpendicular vector w (or its unit vector) and the sample vector u. The decision rule puts sample u in class + or − as follows (compare f(x) = sign(wᵀx + b) on slide 8):
15
To make the math for computing the margin ρ easier, we define a variable y to indicate the class of each sample x: y = +1 or −1.
16
The variable y (= +1 or −1) is assigned to each sample x that is + or −.
17
To train the SVM we maximize ρ, the width of the margin.
Now x2 lies on the "+" boundary and x1 lies on the "−" boundary. Hence x2·w + b = 1 and x1·w + b = −1 are two equations involving the vectors x1 and x2.
Recall that u·x is the perpendicular distance of AB from the origin, with u = w/||w||.
Hence u·(x2 − x1) = ρ.
18
The SVM method seeks to maximize the margin width ρ
Width of margin = ρ = u·(x2 − x1) = (x2 − x1)·w / ||w||.
Hence we want to find the optimum parameters w* and b*.
19
The expression for the width of the margin
Subtracting the two boundary equations gives (x2 − x1)·w = 2, so ρ = 2/||w||.
In SVM we maximize ρ = 2/||w||, or equivalently minimize ||w||, or equivalently minimize ||w||² = w1² + w2² + … + wl².
20
Optimization methods: A Linear Programming (LP) Problem
The solution can be obtained by the Simplex method.
21
A Quadratic Programming (QP) Problem
The objective is convex! The solution to the QP can be obtained (1) easily, by greedy algorithms, or (2) by the QP solver in MATLAB.
22
Linear SVMs: Mathematical Optimization
The margin-maximization problem: find w and b such that ρ = 2/||w|| is maximized and, for all (xi, yi), i = 1..n, yi(wᵀxi + b) ≥ 1.
This can be reformulated as a quadratic program (QP): find w and b such that ||w||² = wᵀw = w1² + w2² + … + wl² is minimized, subject to: for all (xi, yi), i = 1..n, yi(wᵀxi + b) ≥ 1.
Minimizing ||w||² is thus a quadratic optimization problem.
23
Note: Primal and Dual problem formulations can help here (the dual is often quicker to solve).
Duality in LP: the dimension of x is l features; the problem has n training samples, and each sample contributes a class constraint.
Duality theory: each primal variable leads to a dual constraint, and vice versa. For example, a primal with 3 x-variables and 2 constraints has a dual with 2 y-variables and 3 constraints.
24
The constraints in minimizing ||w|| or minimizing ||w||² are as follows:
25
SVM's Margin Optimization Problem: The Primal (fundamental) Formulation
(n = number of training samples; each sample x has l features.)
26
Recall that we want to maximize ρ, the width of the margin.
x2 lies on the "+" boundary; x1 lies on the "−" boundary. Hence x2·w + b = 1 and x1·w + b = −1 are two equations.
Recall that u·x is the distance of AB from the origin. Hence u·(x2 − x1) = ρ.
27
Solving the (w, b) SVM Parameter Optimization Problem
Note: xi is sample vector i, and yi is the class (+1/−1) of xi.
Thus we need to optimize a quadratic function subject to linear constraints. Quadratic optimization problems are a well-known class of mathematical programming problems for which several (non-trivial) algorithms exist.
The solution involves constructing a dual problem in which a Lagrange multiplier αi is associated with every inequality constraint in the primal (original) problem. This dual is itself a quadratic programming problem.
Primal: Find w and b such that Φ(w) = wᵀw is minimized, subject to: for all (xi, yi), i = 1..n samples, yi(wᵀxi + b) ≥ 1.
Dual: Solve for α1 … αn such that Q(α) = Σαi − ½ ΣΣ αiαj yiyj xiᵀxj is maximized, subject to
(1) Σ αiyi = 0
(2) αi ≥ 0 for all αi.
The dual is solvable for {αi} by Quadratic Programming.
28
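A minimal sketch of solving this dual numerically (added for this overview; the lecture itself points to MATLAB's QP solver or Excel Solver, and the toy data below is made up). It uses scipy's general-purpose SLSQP routine rather than a dedicated QP solver:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable training data: two features per sample, labels +1/-1
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T      # matrix of y_i y_j x_i.x_j terms

def neg_Q(alpha):                               # minimize -Q(alpha) == maximize Q(alpha)
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

cons = [{"type": "eq", "fun": lambda a: a @ y}] # sum_i alpha_i y_i = 0
bnds = [(0.0, None)] * n                        # alpha_i >= 0 (hard margin)

res = minimize(neg_Q, np.zeros(n), method="SLSQP", bounds=bnds, constraints=cons)
alpha = res.x

w = (alpha * y) @ X                             # w = sum_i alpha_i y_i x_i
sv = np.argmax(alpha)                           # pick a support vector (alpha_k > 0)
b = y[sv] - w @ X[sv]                           # b from y_k (w.x_k + b) = 1
print("alpha:", np.round(alpha, 4), "w:", np.round(w, 4), "b:", round(b, 4))
```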
We use the Lagrangian L to find the optimum b and w
29
30
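In standard form, the Lagrangian referred to here (written in the notation of slides 23 and 28) is
L(w, b, α) = ½ wᵀw − Σ αi [ yi(wᵀxi + b) − 1 ], with αi ≥ 0.
Setting ∂L/∂w = 0 and ∂L/∂b = 0 gives w = Σ αi yi xi and Σ αi yi = 0; substituting these back into L yields the dual objective Q(α) of slide 28.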
31
The optimization is a quadratic optimization problem to find the αs, and it is convex.
→ A solution is also possible with Excel® Solver.
32
The Lagrangian shows that solving for b and w depends only on the dot product of the vectors xi and xj.
33
Soft Margin Classification
What if the training set is not linearly separable? We allow a few misclassifications here, with some penalty.
Slack variables ξi can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.
35
Soft Margin Classification Mathematically
The old (hard margin) formulation: find w and b such that Φ(w) = wᵀw is minimized and, for all (xi, yi), i = 1..n samples, yi(wᵀxi + b) ≥ 1.
The modified primal formulation incorporates slack variables: find w and b such that Φ(w) = wᵀw + C Σξi is minimized and, for all (xi, yi), i = 1..n, yi(wᵀxi + b) ≥ 1 − ξi, with ξi ≥ 0.
The penalty parameter C can be viewed as a way to control overfitting: it "trades off" the relative importance of maximizing the margin and fitting the training data.
See also: the alternative model for the Soft Support Vector Machine by Porwal.
36
Soft Margin Classification: Solution
The dual problem used here is identical to the separable case. (It would not be identical if the 2-norm penalty for slack variables, C Σξi², were used in the primal objective; we would then need additional Lagrange multipliers for the slack variables.)
Find α1 … αN such that Q(α) = Σαi − ½ ΣΣ αiαj yiyj xiᵀxj is maximized, subject to
(1) Σ αiyi = 0
(2) 0 ≤ αi ≤ C for all αi. This is again a quadratic problem!
Again, the xi with non-zero αi will be the support vectors, and the solution to the dual problem gives {αi}. Hence
w = Σ αi yi xi
b = yk(1 − ξk) − Σ αi yi xiᵀxk for any k s.t. αk > 0
But again, we don't need to compute w explicitly for classification; only the dot product appears:
f(x) = Σ αi yi xiᵀx + b
37
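As a rough practical counterpart (a sketch added to this overview, not the workshop's own code), the same soft-margin problem is what scikit-learn's SVC solves; the toy data values below are made up:

```python
import numpy as np
from sklearn.svm import SVC

# Toy training data: 2 features per sample, classes +1 / -1 (illustrative values only)
X = np.array([[2.0, 2.5], [3.0, 3.5], [1.5, 3.0], [-2.0, -1.5], [-3.0, -2.5], [-1.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# C is the soft-margin penalty parameter of slide 36
clf = SVC(kernel="linear", C=10.0).fit(X, y)

print("support vectors:", clf.support_vectors_)   # the xi with non-zero alpha_i
print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("predicted classes:", clf.predict(X))
```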
Theoretical Justification for Maximum Margins
FEEL FREE TO SKIP THIS SLIDE. Subject of research.
Vapnik has proved the following. The class of optimal linear separators has VC dimension h bounded from above as
h ≤ min( ⌈D²/ρ²⌉, m0 ) + 1
where ρ is the margin, D is the diameter of the smallest sphere that can enclose all of the training examples, and m0 is the dimensionality (m0 = number of features).
VC dimension = the largest set of points (vectors) that the classification algorithm can manage to classify without harming generalization (how well the model classifies unseen data).
Intuitively, this implies that regardless of the dimensionality m0 we can minimize the VC dimension by maximizing the margin ρ.
Thus the complexity of the SVM classifier remains small regardless of dimensionality (number of features).
38
Linear SVMs: Overview
1. The classifier is a separating hyperplane.
2. The most "important" training points are the support vectors; they define the hyperplane.
3. Quadratic optimization algorithms can identify which training points xi are support vectors, i.e. those with non-zero Lagrangian multipliers αi.
4. Both in the dual formulation of the problem and in the solution, the training points appear only inside inner (dot) products:
Find α1 … αN such that Q(α) = Σαi − ½ ΣΣ αiαj yiyj xiᵀxj is maximized, subject to (1) Σ αiyi = 0 and (2) 0 ≤ αi ≤ C for all αi.
f(x) = Σ αi yi xiᵀx + b
(f(x) is the classifier function; see slides 30 and 37.)
39
Non-linear SVMs
Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard?
How about mapping the data to a higher-dimensional space? For 1-d data along the x axis, x² is an added feature that enables the use of a linear hyperplane to do the classification.
40
Non-linear SVMs: Feature Spaces Get Added
General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
(Figure: a 2-d feature space (x1, x2) mapped to a 3-d feature space with the φ(x) feature added.)
41
The Kernel Trick
The linear classifier relies on the inner product between vectors, K(xi, xj) = xiᵀxj.
If every datapoint is mapped into a high-dimensional space via some transformation φ: x → φ(x), the inner (dot) product becomes K(xi, xj) = φ(xi)ᵀφ(xj).
- A kernel function is a function that is equivalent to an inner product in some feature space.
Example: 2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiᵀxj)² (extra features get added).
We need to show that K(xi, xj) = φ(xi)ᵀφ(xj):
K(xi, xj) = (1 + xiᵀxj)² = 1 + xi1²xj1² + 2 xi1xj1xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
= [1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2]ᵀ [1, xj1², √2 xj1xj2, xj2², √2 xj1, √2 xj2]
= φ(xi)ᵀφ(xj), where φ(x) = [1, x1², √2 x1x2, x2², √2 x1, √2 x2]
Thus a kernel function implicitly maps data to a high-dimensional space (without the need to compute each φ(x) explicitly).
42
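A quick numerical check of this identity (a sketch added for this overview, with arbitrary sample values):

```python
import numpy as np

def phi(v):
    """phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2] from the slide above."""
    x1, x2 = v
    return np.array([1.0, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, 0.5])   # arbitrary 2-d sample vectors
K = (1.0 + xi @ xj) ** 2                               # kernel value, computed in 2-d
print(K, phi(xi) @ phi(xj))                            # both print 25.0
```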
What Functions are Kernels?
For some functions K(xi, xj), checking that K(xi, xj) = φ(xi)ᵀφ(xj) can be cumbersome.
- Mercer's theorem: every semi-positive definite symmetric function is a kernel.
Semi-positive definite symmetric functions correspond to a semi-positive definite symmetric Gram matrix (whose eigenvalues are nonnegative):
K = [ K(x1,x1) K(x1,x2) K(x1,x3) … K(x1,xn);
      K(x2,x1) K(x2,x2) K(x2,x3) … K(x2,xn);
      … ;
      K(xn,x1) K(xn,x2) K(xn,x3) … K(xn,xn) ]
43
Examples of Kernel Functions (these help add features to x that make samples linearly separable)
Linear: K(xi, xj) = xiᵀxj. Mapping φ: x → φ(x), where φ(x) is x itself.
Polynomial of power p: K(xi, xj) = (1 + xiᵀxj)^p. Mapping φ: x → φ(x), where φ(x) has C(p + d, p) dimensions (d = original dimensionality).
Gaussian (radial-basis function): K(xi, xj) = exp( −||xi − xj||² / (2σ²) ). Mapping φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian); a combination of the functions for the support vectors is the separator.
The higher-dimensional space still has intrinsic dimensionality d (the mapping is not onto), but linear separators in it correspond to non-linear separators in the original space.
44
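Illustrative Python definitions of the three kernels listed above (a sketch added for this overview; the function names and test values are our own):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z                                               # K(x, z) = x.z

def polynomial_kernel(x, z, p=2):
    return (1.0 + x @ z) ** p                                  # K(x, z) = (1 + x.z)^p

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))    # Gaussian kernel

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```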
Non-linear SVM's Mathematical Formulation
Dual problem formulation: find α1 … αn such that Q(α) = Σαi − ½ ΣΣ αiαj yiyj K(xi, xj) is maximized, subject to
(1) Σ αiyi = 0
(2) αi ≥ 0 for all αi
The solution is f(x) = Σ αi yi K(xi, x) + b ← check its sign to find the class of x.
Optimization techniques for finding the αi's remain the same!
45
An excellent review of Support Vector Machines is by Gopal Prasad Malakar: An Intro to Support Vector Machines, https://www.youtube.com/watch?v=ikt7Qze0czE
46
Classification by SVM can be made more efficient by Kernel Functions
A class is a group of things that share common attributes, characteristics, qualities or traits, called features. Dogs, by their pedigree, form classes called breeds. A breed's features may be type, height, skin color, body-hair length, etc.
SVM, a supervised-learning classifier, was invented as a binary classifier: it comprises a separating hyperplane that helps separate a collection of mathematical objects into two labeled classes.
Classification is a process based on the observed data, each sample comprising attributes or features that are similar or different.
SVM learns from a handful of class-labeled training samples and is then able to identify the class of yet-unseen samples based on their observed features.
47
SVM's learning models and algorithms: data → features → learning → prediction of the classes of new objects.
Typically, objects in some environments may be linearly classifiable. Here a linear hyperplane, such as a one-dimensional line (in a 2-d space) or a two-dimensional plane (in a 3-d space), can divide the feature space to create the two classes.
Many objects, however, are not linearly separable in the feature space. These may be mixed and distributed randomly.
A special mathematical device, or function, called a kernel may be used in such cases to enable a linear separator such as SVM to be effective.
This makes the originally linearly inseparable data become linearly separable, greatly easing the classification task. In the SVM world the procedure is called the kernel trick.
48
Non-linear SVMs: Feature Spaces Get Added
General idea: the original feature space can always be mapped by f(x) to some higher-dimensional feature space where the training set is separable:
f: x → f(x)
(Figure: a 2-d feature space (x1, x2) mapped to a 3-d feature space with the f(x) feature added.)
49
The transformation function f(x)
To use SVM, the transformation function f(x) is applied to each object instance (containing the particular values of its features) to map the original non-linear observations into a higher-dimensional space, thus adding one or more extra features to each sample or object.
Mathematically, the kernel function produces the equivalent of the dot product of the transformed data vectors in the transformed higher-dimensional space.
Hence, in this new transformed space the objects become linearly separable by SVM. With the dot product computed, the kernel function helps exploit the similarity between objects that now have the added features. This then analytically enables the emergence of classes that are linearly separable.
Instead of defining a slew of features, you define a single kernel function to compute similarity, say in the data from the breeds of dogs. You provide this kernel, together with the data and labels, to the learning algorithm, and out comes a classifier.
50
Non-linear SVM's Mathematical Formulation: a Kernel is now used!
Dual problem formulation: find α1 … αn such that Q(α) = Σαi − ½ ΣΣ αiαj yiyj K(xi, xj) is maximized, subject to
(1) Σ αiyi = 0
(2) αi ≥ 0 for all αi
The solution is f(x) = Σ αi yi K(xi, x) + b ← check its sign to find the class of x.
Optimization techniques for finding the αi's remain the same!
51
A mathematical illustration of the reduced computational effort from using a Kernel
A key step in the application of SVM is the calculation of the inner or dot product of the vectors comprising the feature assessments.
This calculation may become a huge load: voluminous, and sometimes intractable.
The kernel function, if a useful one can be discovered or created, can greatly simplify this dot-product computation. We later list some very useful kernel functions that machine-learning experts have harnessed to get the SVM to help with practical problems such as cancer detection or handwritten-character recognition.
K(x, y) = <f(x), f(y)>
Many useful kernels have now been devised.
52
A mathematical illustration provided by Lili Jiang
We begin by defining K(x, y) = <f(x), f(y)>.
Here K is the kernel function; x and y are n-dimensional input vectors, each comprising n features; f is a map from the n-dimensional to an m-dimensional space; and <x, y> denotes the dot product. In applications of kernels, usually m is much larger than n.
K(x, y) provides the equivalent of the dot product <f(x), f(y)>. Generally easier to compute than the dot product, a well-researched K thus brings efficiency.
For SVM to work as a classifier, it requires the objects to be linearly separable. This is achieved by transforming the data from the n-dimensional space to the larger m-dimensional space by using f(x).
53
The dot product <f(x), f(y)> is replaced by the Kernel K! (Vapnik)
To achieve linear separation, one needs to calculate the dot product <f(x), f(y)>. Normally this would require us to calculate f(x) and f(y) first, and then take the dot product.
These two computation steps can be quite expensive, as they involve manipulations in the higher, m-dimensional space, where m can be a large number.
But we recall that the result of the dot product is really a scalar. The question Jiang raises is: do we really need to go through all this trouble to get that one number?
- Do we really have to go to the m-dimensional space? The answer is no, provided you can find a clever kernel!
A couple more points must be noted about kernels:
1. Under the conditions given by Mercer, every kernel function can be expressed as a dot product in a feature space.
2. Many machine learning algorithms can be expressed entirely in terms of dot products.
54
A numerical illustration of the reduction in computational effort by use of the kernel
Let the two feature vectors be x = (x1, x2, x3) and y = (y1, y2, y3).
Then suppose the function f(x) is
f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3)
due to the transformation the SVM requires to produce data in the higher-dimensional space. The same holds for f(y).
In that space we require the dot product to get linear separability. For the dot product <f(x), f(y)>, the kernel that we could and should use is
K(x, y) = (<x, y>)² = (x1y1 + x2y2 + x3y3)²
55
Numerical illustration
Let's check the utility of K by plugging in some numbers to make this point firm.
Suppose x = (1, 2, 3) and y = (4, 5, 6). Then f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9) and f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36).
Therefore <f(x), f(y)> = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024.
That is a lot of algebra, mainly because f is a mapping from a 3-dimensional to a 9-dimensional space.
Now let us use the kernel instead. We get K(x, y) = (4 + 10 + 18)² = 32² = 1024.
We get the same result, with calculations that are so much easier.
56
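The slide's arithmetic is easy to reproduce in a few lines (a sketch added for this overview):

```python
import numpy as np

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])

f = lambda v: np.outer(v, v).ravel()     # f(v) = (v1v1, v1v2, ..., v3v3), 9 components
explicit = f(x) @ f(y)                   # dot product computed in the 9-d space
via_kernel = (x @ y) ** 2                # K(x, y) = (<x, y>)^2, computed in 3-d

print(explicit, via_kernel)              # both print 1024.0
```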
57
SVM applications: The challenge is in finding the right kernel!
Vladimir Vapnik created the logic of SVM in 1965 in his PhD work and later expanded it in 1992 and beyond at Bell Labs. SVMs were originally described by Boser, Guyon and Vapnik in 1992 and gained popularity in the late 1990s.
SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
SVMs can be applied to complex data types even beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.
SVM has been used in face recognition, text and hypertext categorization, classification of images, bioinformatics, protein fold and remote homology detection, handwriting recognition, generalized predictive control (GPC), and the geo- and environmental sciences.
SVM techniques have been extended to a number of tasks such as regression (Vapnik et al. '97), principal component analysis (Schölkopf et al. '99), etc.
The most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the αi's at a time, e.g. SMO (Platt '99) and (Joachims '99).
- Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done in a try-and-see manner.
58
Advantages of SVM
- Works better when the data is linearly separable.
- It is more effective in higher dimensions (when there are many features); the SVM then "learns more."
- By using the kernel trick, many complex problems may be solved.
- SVM is not sensitive to outliers. Also, SVM needs fewer samples than other classification methods.
- SVM is quite effective in image classification, though this may require many features.
- It has been used in character and face recognition.
59
A typical database of mobile handsets that needs to be classified as accept/reject
Each handset has been evaluated on 16 features.
→ Does adding more features always improve the accuracy of classification?
60
Disadvantages of SVM
- Choosing a good kernel is not easy.
- It does not show good results on a big dataset.
- The SVM hyperparameters are the cost C and gamma (γ). It is not easy to fine-tune these hyperparameters, and it is hard to visualize their impact.
- Generally, as you add more features, SVM's accuracy (% correct classification) goes up, but only up to a threshold. Beyond this point the model can't learn more and gets "confused": accuracy goes down. This is the curse of dimensionality.
- Strategies based on statistical tests exist to help in the optimum selection of features to add.
61
Numerical Solutions of SVM
Example solved by Dan Ventura (2009): https://axon.cs.byu.edu/Dan/678/miscellaneous/SVM.example.pdf
Review of SVM by Mahesh Huddar: https://www.youtube.com/watch?v=ivPoCcYfFAw and https://www.youtube.com/watch?v=03IrkMM4E6M
Excel Solver solution of SVM (Bagchi)
Python SVM model (Sujoy)
63
Soft Margin SVM Solution by Excel®
Notation is from page 208 of Machine Learning with SVM and Other Kernel Methods, Soman, Loganathan and Ajay, PHI (2011). The logic we give is original.
1. Collect the data that needs to be classified. To build the SVM you will need (1) training data and (2) validation data. Identify the features of the samples (vectors) and arrange the data in a worksheet as shown.
2. Write the feature headings as x1 and x2 (assuming you are classifying a sample population with each member having two features).
3. Add additional columns and headings w1, w2, gamma, etc., as shown.
4. For each sample, record its class in a separate column y, as assigned by the domain expert.
5. Your worksheet should have one row per sample, with columns: Sample #, x1, x2, y, (w1x1 + w2x2 − gamma)·y, slack psi, and (w1x1 + w2x2 − gamma)·y + psi − 1, plus cells holding the decision variables w1, w2 and gamma, the penalty C (= 10), and the optimum psi values (all initially 0).
64
SVM by Excel, contd.
6. Initialize the decision variables w1, w2, and gamma (b) by entering zeros (0) in cells E2 to G2. Also enter zeros in the cells in the range I3 to I12.
7. Enter the Excel formula "=($E$2*A3+$F$2*B3-$G$2)*C3" in D3, so that column D holds (w1x1 + w2x2 − gamma)·y. Copy this formula into the cell range D4 to D12.
8. Enter the formula "=SUM(I3:I12)" in cell I13, the target cell. Save the sheet but keep it open. With the worksheet open, proceed as follows.
9. In the worksheet select Tools, then Solver. Excel will display the dialog shown on the next slide.
65
Excel Solver Dialog Box
10. The first entry is the Target cell. Enter its address against Set Objective.
11. Enter a value of 10 for C in J1.
12. Choose the Min option to minimize (w1² + w2²) + C·Σξi.
13. In By Changing Variable Cells enter the E2:G2 and I3:I12 variables.
14. Next, the constraints are to be entered, in the space provided.
66
SVM by Excel, contd.
15. To enter the constraints click on the Add button. Excel will open the constraint-entry dialog box shown above.
16. In the Cell Reference text box enter I3:I12. These cells hold the values of psi. Choose >= in the middle selection box and enter 0 in the last text box.
17. The next set of constraints is then entered. The quantity is (w1x1 + w2x2 − gamma)·y + psi − 1, with its values present in the column shown under that title on the worksheet; it is constrained to be >= 0. The procedure is identical to what you did to constrain psi above. Select the solving method GRG Nonlinear.
18. To optimize w, psi and gamma now click OK. To end, press the Solve button. Read the results.
67
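For readers without Excel, the same soft-margin primal that Solver minimizes in steps 10 to 18 can be sketched with scipy; this is an illustrative re-implementation added to this overview, not the workshop's own code, and the toy data values are made up:

```python
import numpy as np
from scipy.optimize import minimize

# Toy training samples (x1, x2) and classes y = +1/-1 (illustrative values only)
X = np.array([[2.0, 3.0], [3.0, 3.5], [4.0, 4.0], [1.0, 1.0], [1.5, 0.5], [0.5, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
n, C = len(y), 10.0                        # C = 10, as in the Excel worksheet

def objective(v):
    w, gamma, psi = v[:2], v[2], v[3:]     # decision variables: w1, w2, gamma, psi_1..psi_n
    return w @ w + C * psi.sum()           # minimize (w1^2 + w2^2) + C * sum(psi)

def margin_cons(v):                        # (w1*x1 + w2*x2 - gamma)*y + psi - 1 >= 0
    w, gamma, psi = v[:2], v[2], v[3:]
    return (X @ w - gamma) * y + psi - 1.0

cons = [{"type": "ineq", "fun": margin_cons}]
bnds = [(None, None)] * 3 + [(0.0, None)] * n        # psi_i >= 0

res = minimize(objective, np.zeros(3 + n), method="SLSQP", bounds=bnds, constraints=cons)
w1, w2, gamma = res.x[:3]
print("w1, w2, gamma:", round(w1, 3), round(w2, 3), round(gamma, 3))
print("classes:", np.sign(X @ res.x[:2] - gamma))    # xw - gamma >= 0 -> class +
```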
Partial Display of the SVM Model on the Excel Worksheet
68
Training the Soft Margin SVM
69
Decision Boundary Calculations
70
Classification of Training Samples by SVM
71
SVM Decision Boundaries and Classification Rules
You are given a vector x (= x1, x2, x3, …) and a trained SVM with w and gamma (γ).
Boundary equations in SVM:
The separator (maximum-margin hyperplane): x·w − γ = 0
The positive classification boundary: x·w − γ = 1
The negative classification boundary: x·w − γ = −1
Classification rules:
If x·w − γ ≥ 0, vector x is accepted (class +).
If x·w − γ < 0, vector x is rejected (class −).
73
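A small sketch of this classification rule (added for this overview; the trained parameters shown are hypothetical):

```python
import numpy as np

def classify(x, w, gamma):
    """Return +1 (accepted) if x.w - gamma >= 0, else -1 (rejected)."""
    score = np.dot(x, w) - gamma
    return 1 if score >= 0 else -1

# Hypothetical trained parameters and a candidate's feature vector (illustrative only)
w, gamma = np.array([0.8, -0.3]), 0.5
print(classify(np.array([2.0, 1.0]), w, gamma))   # 0.8*2 - 0.3*1 - 0.5 = 0.8 >= 0 -> +1
```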
IPS Candidate Selection: Input Data
74
SVM Calculations for Candidate Classification
75
Classification Results
76
IPS Selection Decision Boundaries
77
Constructing the Confusion Matrix; α and β
(Figure: results of classification by an imperfect classifier. The separating hyperplane divides the samples classified as negative, containing TN and FN, from the samples classified as positive, containing TP and FP.)
Total samples = P + N = TP + FN + TN + FP
For a perfectly trained classifier: FP = FN = 0
78
Performance of Classifiers: The ROC Curve
TPR = Sensitivity = TP/(TP + FN): the proportion of actual positives that got correctly classified as positive.
FPR = 1 − Specificity = FP/(FP + TN): the proportion of actual negatives that got incorrectly classified as positive.
Empirical risk == training error; structural risk == model complexity.
Confusion-matrix entries:
True Positive (TP: f++): the number of instances that were positive (+) and correctly classified as positive (+). Good.
False Negative (FN: f+−): the number of instances that were positive (+) and incorrectly classified as negative (−). P(Type 2 error) = β = FN/(TP + FN). ← Generally deadly! (In medical terms, + means SICK, like having covid.) Bad!
False Positive (FP: f−+): the number of instances that were negative (−) and incorrectly classified as (+). P(Type 1 error) = α = FP/(TN + FP).
True Negative (TN: f−−): the number of instances that were negative (−) and correctly classified as (−).
79
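An illustrative computation of these rates (a sketch added for this overview; the counts are hypothetical):

```python
def rates(TP, FN, FP, TN):
    TPR = TP / (TP + FN)          # sensitivity
    FPR = FP / (FP + TN)          # 1 - specificity
    alpha = FP / (TN + FP)        # P(Type 1 error)
    beta = FN / (TP + FN)         # P(Type 2 error)
    return TPR, FPR, alpha, beta

# Hypothetical counts for a classifier tested on 100 samples
print(rates(TP=40, FN=10, FP=5, TN=45))   # (0.8, 0.1, 0.1, 0.2)
```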
Performance of the SVM IPS Classifier We Built
80
The Confusion Matrix in Medicine; the ROC Curve
81
References
1. Winston, Patrick (2014). MIT OpenCourseWare video lecture: Learning: Support Vector Machines.
2. Soman, K. P., R. Loganathan and V. Ajay (2011). Machine Learning with SVM and Other Kernel Methods, PHI Learning.
3. Bagchi, Tapan P., Rahul Samant and Milan Joshi (2). SVM Classifiers Built Using Imperfect Training Data, Nonlinear Studies, MESA.
4. Kapri, Aman (2020). Everything One Should Know About Support Vector Machines (SVM), Analytics Vidhya, Medium, Feb 23, 2020.
5. Malakar, Gopal Prasad. An Intro to Support Vector Machines (video review), https://www.youtube.com/watch?v=ikt7Qze0czE
6. Kowalczyk, Alexandre (2017). Support Vector Machines Succinctly, Syncfusion.
7. Introduction to Microsoft Excel course, https://www.youtube.com/watch?v=-ujVQzTtxSg&list=PLWPirh4EWFpEpO6NjjWLbKSCb-wx3hMql
8. Zisserman, A. (2015). Lecture 2: The SVM Classifier, C19 Machine Learning, Hilary 2015.
9. Principe, Jose and Sohan Seth. Statistical Learning Theory: The Structural Risk Minimization Principle, CNEL, University of Florida.
10. Porwal. Soft Support Vector Machine, IIT Bombay.
82