Conference Paper

Support Vector Machines--An Overview

Author: Tapan P Bagchi
Abstract

Support Vector Machines (SVMs) are supervised machine learning algorithms used to classify featured objects. The objective is to find a hyperplane in an n-dimensional feature space that cleanly separates the data points representing the objects in that space. This overview covers hard- and soft-margin SVMs, the application of kernel functions, the use of Excel to build and apply modest-size classifiers, and numerical illustrations.
Support Vector Machines
Tapan P Bagchi
Online Workshop held January 27-30, 2022 on
Advanced Business Analytics
Instructor-in-Charge Sujoy Bhattacharya
IIT Kharagpur
1
Twelve graduates have applied for selection to the IPS. We seek two attributes: (a) physical fitness, and (b) leadership. How would you classify them into accepted/rejected? One solution is by SVM, a supervised ML method.
Observe → Model → Train/Learn → Predict/Act
How would you classify IPS applicants?
Stakeholders all want top-performing officers:
- Aspiring candidates
- Government administrators
- Society
- Judiciary
- Serving and retired IPS officers, who can supply data on features:
  - Communication skills
  - Judgement skills
  - Analytical skills
  - Research skills
  - People skills
  - Fitness
  - Perseverance
2
Latent Factors and Features Measured in IPS Selection
Latent factors (cannot be observed directly) and the observed (measurable) features that proxy them:
- Physical Fitness: time to sprint 100 m, standing jump (m), bench press (kg)
- Leadership: represents college, all-rounder, IQ, Dean's list, debater
3
In this lecture we use only two measurable features of each candidate to judge (classify) them as accepted/rejected.
(Figure: scatter plot of weighted scores, 100 m sprint time in secs vs IQ score, for the candidate data.)
SVM's Goal: Classifying sample vectors (x1, x2, etc.) in the feature space as + or −
Inventor: Vladimir Vapnik (1995). Instructor: the late Prof. Patrick Winston, MIT.
4
Cancer Diagnostics: Medical science is a big area for SVM applications
(Figure: sample data points, or vectors.)
5
Pictorial display of SVM in the Feature Space
Sample vector points are linearly separable into two classes.
6
Notations from Linear Algebra used in SVM
Vector x = (x1, x2, …, xl)
Transpose of vector x: xᵀ
Dot product: wᵀx = xᵀw = [w1 w2 w3]·[x1 x2 x3] = w1x1 + w2x2 + w3x3 = ||w|| ||x|| cos θ
Unit vector of w = w / ||w||
Magnitude ||w|| = √(w1² + w2² + …)
Equations for hyperplanes:
y = mx + c … line
y = w1x1 + w2x2 + b = [w1 w2]·[x1 x2] + b = wᵀx + b … plane
w ~ orientation; b ~ offset from origin
7
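As a quick numeric illustration of this notation (a sketch added for this overview, not part of the original slides; the numbers are made up):

```python
import numpy as np

w = np.array([1.0, 2.0, 2.0])
x = np.array([3.0, 0.0, 4.0])

dot = w @ x                                       # w^T x = w1*x1 + w2*x2 + w3*x3 = 11
norm_w = np.linalg.norm(w)                        # ||w|| = sqrt(1 + 4 + 4) = 3
unit_w = w / norm_w                               # unit vector of w
cos_theta = dot / (norm_w * np.linalg.norm(x))    # from w^T x = ||w|| ||x|| cos(theta)
print(dot, norm_w, unit_w, cos_theta)
```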
A Linear Separator
Binary classification can be viewed as the task of separating the two classes present in the feature space by f(x):
f(x) = sign(wᵀx + b)
The separating hyperplane is wᵀx + b = 0; on one side wᵀx + b > 0, on the other wᵀx + b < 0.
w: orientation; b: offset from origin (recall y = mx + c).
Sample vector x = (x1, x2). Each sample x here has 2 (= l) features; w has l dimensions.
wi = coefficient of xi in the equation of the hyperplane; b = bias in the equation of the hyperplane.
8
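A minimal sketch (not from the lecture) of this linear separator, assuming a hypothetical w and b:

```python
import numpy as np

def f(x, w, b):
    """Classify a sample: +1 on one side of the hyperplane w^T x + b = 0, -1 on the other."""
    return int(np.sign(w @ x + b))

w, b = np.array([1.0, -1.0]), 0.5                 # hypothetical orientation w and offset b
print(f(np.array([2.0, 1.0]), w, b))              # 1*2 - 1*1 + 0.5 =  1.5 -> +1
print(f(np.array([0.0, 2.0]), w, b))              # 1*0 - 1*2 + 0.5 = -1.5 -> -1
```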
Many Linear Separators are possible
Which of the linear separators is optimal? The best separator should be farthest from both the red and the blue points: this will minimize misclassification of new points.
(Figure: training data points in the feature space and various possible linear separators.)
9
Challenge: Find the hyperplane from classified training samples by maximizing the margin ρ (the distance between the two classes). Maximizing ρ yields the SVM's optimum decision parameters (w, b).
10
Hard Margin SVM Classification
Maximizing the margin is good according to intuition and to PAC (probably approximately correct) theory.
It implies that only the support vectors matter; the other training examples are ignored in the SVM's hyperplane construction.
(There are n samples, with l features in each sample.)
11
The Classification Margin ρ
The distance r from an example xi to the separator is r = |wᵀxi + b| / ||w||.
Examples closest to the hyperplane are support vectors. The margin ρ of the separator is the distance between the support vectors on the two sides.
Each point in this feature space is a vector xi = (x1, x2); w1 and w2 are the optimum weights that orient the hyperplane in the feature space.
12
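A small numeric check of the distance formula and the resulting margin (a sketch with made-up values, added for this overview):

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -5.0            # hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0
xi = np.array([2.0, 1.0])                     # a sample point

r = abs(w @ xi + b) / np.linalg.norm(w)       # distance from xi to the separator
rho = 2.0 / np.linalg.norm(w)                 # margin width when support vectors satisfy |w^T x + b| = 1
print(r, rho)                                 # 1.0  0.4
```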
The width (= ρ) of the margin needs a dot product of x2 − x1 with the unit vector u along the perpendicular.
13
Distance of line AB from the origin
Take a vector w that is perpendicular to line AB. Find the unit vector u of w. Take an arbitrary point x on AB.
Hence the distance of line AB from the origin = uᵀx = u·x (a dot product).
Note: the unit vector u of w = w / ||w||.
14
The classification rule uses the dot product of the perpendicular vector w (or its unit vector) and the sample vector u. The decision rule puts sample u in class + or − as follows (compare f(x) = sign(wᵀx + b) on slide 8):
15
To make the math for computing the margin ρ easier, we define a variable y to indicate the class of each sample x: y = +1 or −1.
16
The variable y (= +1 or −1) is assigned to each sample x that is + or −.
17
To train the SVM we maximize ρ, the width of the margin.
Now x2 lies on the "+" boundary and x1 lies on the "−" boundary. Hence x2·w + b = 1 and x1·w + b = −1 are two equations involving the vectors x1 and x2.
Recall that u·x is the perpendicular distance of AB from the origin, with u = w/||w||.
Hence u·(x2 − x1) = ρ.
18
The SVM method seeks to maximize the margin width ρ
Width of margin = ρ = u·(x2 − x1) = (x2 − x1)·w / ||w||.
Hence we want to find the optimum parameters w* and b*.
19
The expression for the width of the margin
Subtracting the two boundary equations gives (x2 − x1)·w = 2, so ρ = 2/||w||.
In SVM we maximize ρ = 2/||w||, or equivalently minimize ||w||, or equivalently minimize ||w||² = w1² + w2² + … + wl².
20
Optimization methods: A Linear Programming (LP) Problem
The solution can be obtained by the Simplex method.
21
A Quadratic Programming (QP) Problem
The objective is convex! The solution to the QP can be obtained (1) easily, by greedy algorithms, or (2) by the QP solver in MATLAB.
22
Linear SVMs: Mathematical Optimization
The margin-maximization problem: find w and b such that ρ = 2/||w|| is maximized and, for all (xi, yi), i = 1..n, yi(wᵀxi + b) ≥ 1.
This can be reformulated as a quadratic program (QP): find w and b such that ||w||² = wᵀw = w1² + w2² + … + wl² is minimized, subject to: for all (xi, yi), i = 1..n, yi(wᵀxi + b) ≥ 1.
Minimizing ||w||² is thus a quadratic optimization problem.
23
Note: Primal and Dual problem formulations can help here (the dual is often quicker to solve).
Duality in LP: the dimension of x is l features; the problem has n training samples, and each sample contributes a class constraint.
Duality theory: each primal variable leads to a dual constraint, and vice versa. For example, a primal with 3 x-variables and 2 constraints has a dual with 2 y-variables and 3 constraints.
24
The constraints in minimizing ||w|| or minimizing ||w||² are as follows:
25
SVM's Margin Optimization Problem: The Primal (fundamental) Formulation
(n = number of training samples; each sample x has l features.)
26
Recall that we want to maximize ρ, the width of the margin.
x2 lies on the "+" boundary; x1 lies on the "−" boundary. Hence x2·w + b = 1 and x1·w + b = −1 are two equations.
Recall that u·x is the distance of AB from the origin. Hence u·(x2 − x1) = ρ.
27
Solving the (w, b) SVM Parameter Optimization Problem
Note: xi is sample vector i, and yi is the class (+1/−1) of xi.
Thus we need to optimize a quadratic function subject to linear constraints. Quadratic optimization problems are a well-known class of mathematical programming problems for which several (non-trivial) algorithms exist.
The solution involves constructing a dual problem in which a Lagrange multiplier αi is associated with every inequality constraint in the primal (original) problem. This dual is itself a quadratic programming problem.
Primal: Find w and b such that Φ(w) = wᵀw is minimized, subject to: for all (xi, yi), i = 1..n samples, yi(wᵀxi + b) ≥ 1.
Dual: Solve for α1 … αn such that Q(α) = Σαi − ½ ΣΣ αiαj yiyj xiᵀxj is maximized, subject to
(1) Σ αiyi = 0
(2) αi ≥ 0 for all αi.
The dual is solvable for {αi} by Quadratic Programming.
28
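A minimal sketch of solving this dual numerically (added for this overview; the lecture itself points to MATLAB's QP solver or Excel Solver, and the toy data below is made up). It uses scipy's general-purpose SLSQP routine rather than a dedicated QP solver:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable training data: two features per sample, labels +1/-1
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T      # matrix of y_i y_j x_i.x_j terms

def neg_Q(alpha):                               # minimize -Q(alpha) == maximize Q(alpha)
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

cons = [{"type": "eq", "fun": lambda a: a @ y}] # sum_i alpha_i y_i = 0
bnds = [(0.0, None)] * n                        # alpha_i >= 0 (hard margin)

res = minimize(neg_Q, np.zeros(n), method="SLSQP", bounds=bnds, constraints=cons)
alpha = res.x

w = (alpha * y) @ X                             # w = sum_i alpha_i y_i x_i
sv = np.argmax(alpha)                           # pick a support vector (alpha_k > 0)
b = y[sv] - w @ X[sv]                           # b from y_k (w.x_k + b) = 1
print("alpha:", np.round(alpha, 4), "w:", np.round(w, 4), "b:", round(b, 4))
```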
We use the Lagrangian L to find the optimum b and w
29
30
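In standard form, the Lagrangian referred to here (written in the notation of slides 23 and 28) is
L(w, b, α) = ½ wᵀw − Σ αi [ yi(wᵀxi + b) − 1 ], with αi ≥ 0.
Setting ∂L/∂w = 0 and ∂L/∂b = 0 gives w = Σ αi yi xi and Σ αi yi = 0; substituting these back into L yields the dual objective Q(α) of slide 28.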
31
The optimization is a quadratic optimization problem to find the αs, and it is convex.
→ A solution is also possible with Excel® Solver.
32
The Lagrangian shows that solving for b and w depends only on the dot product of the vectors xi and xj.
33
Soft Margin Classification
What if the training set is not linearly separable? We allow a few misclassifications here, with some penalty.
Slack variables ξi can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.
35
Soft Margin Classification Mathematically
The old (hard margin) formulation: find w and b such that Φ(w) = wᵀw is minimized and, for all (xi, yi), i = 1..n samples, yi(wᵀxi + b) ≥ 1.
The modified primal formulation incorporates slack variables: find w and b such that Φ(w) = wᵀw + C Σξi is minimized and, for all (xi, yi), i = 1..n, yi(wᵀxi + b) ≥ 1 − ξi, with ξi ≥ 0.
The penalty parameter C can be viewed as a way to control overfitting: it "trades off" the relative importance of maximizing the margin and fitting the training data.
See also: the alternative model for the Soft Support Vector Machine by Porwal.
36
Soft Margin Classification: Solution
The dual problem used here is identical to the separable case. (It would not be identical if the 2-norm penalty for slack variables, C Σξi², were used in the primal objective; we would then need additional Lagrange multipliers for the slack variables.)
Find α1 … αN such that Q(α) = Σαi − ½ ΣΣ αiαj yiyj xiᵀxj is maximized, subject to
(1) Σ αiyi = 0
(2) 0 ≤ αi ≤ C for all αi. This is again a quadratic problem!
Again, the xi with non-zero αi will be the support vectors, and the solution to the dual problem gives {αi}. Hence
w = Σ αi yi xi
b = yk(1 − ξk) − Σ αi yi xiᵀxk for any k s.t. αk > 0
But again, we don't need to compute w explicitly for classification; only the dot product appears:
f(x) = Σ αi yi xiᵀx + b
37
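As a rough practical counterpart (a sketch added to this overview, not the workshop's own code), the same soft-margin problem is what scikit-learn's SVC solves; the toy data values below are made up:

```python
import numpy as np
from sklearn.svm import SVC

# Toy training data: 2 features per sample, classes +1 / -1 (illustrative values only)
X = np.array([[2.0, 2.5], [3.0, 3.5], [1.5, 3.0], [-2.0, -1.5], [-3.0, -2.5], [-1.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# C is the soft-margin penalty parameter of slide 36
clf = SVC(kernel="linear", C=10.0).fit(X, y)

print("support vectors:", clf.support_vectors_)   # the xi with non-zero alpha_i
print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("predicted classes:", clf.predict(X))
```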
Theoretical Justification for Maximum Margins
FEEL FREE TO SKIP THIS SLIDE. Subject of research.
Vapnik has proved the following. The class of optimal linear separators has VC dimension h bounded from above as
h ≤ min( ⌈D²/ρ²⌉, m0 ) + 1
where ρ is the margin, D is the diameter of the smallest sphere that can enclose all of the training examples, and m0 is the dimensionality (m0 = number of features).
VC dimension = the largest set of points (vectors) that the classification algorithm can manage to classify without harming generalization (how well the model classifies unseen data).
Intuitively, this implies that regardless of the dimensionality m0 we can minimize the VC dimension by maximizing the margin ρ.
Thus the complexity of the SVM classifier remains small regardless of dimensionality (number of features).
38
Linear SVMs: Overview
1. The classifier is a separating hyperplane.
2. The most "important" training points are the support vectors; they define the hyperplane.
3. Quadratic optimization algorithms can identify which training points xi are support vectors, i.e. those with non-zero Lagrangian multipliers αi.
4. Both in the dual formulation of the problem and in the solution, the training points appear only inside inner (dot) products:
Find α1 … αN such that Q(α) = Σαi − ½ ΣΣ αiαj yiyj xiᵀxj is maximized, subject to (1) Σ αiyi = 0 and (2) 0 ≤ αi ≤ C for all αi.
f(x) = Σ αi yi xiᵀx + b
(f(x) is the classifier function; see slides 30 and 37.)
39
Non-linear SVMs
Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard?
How about mapping the data to a higher-dimensional space? For 1-d data along the x axis, x² is an added feature that enables the use of a linear hyperplane to do the classification.
40
Non-linear SVMs: Feature Spaces Get Added
General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
(Figure: a 2-d feature space (x1, x2) mapped to a 3-d feature space with the φ(x) feature added.)
41
The Kernel Trick
The linear classifier relies on the inner product between vectors, K(xi, xj) = xiᵀxj.
If every datapoint is mapped into a high-dimensional space via some transformation φ: x → φ(x), the inner (dot) product becomes K(xi, xj) = φ(xi)ᵀφ(xj).
- A kernel function is a function that is equivalent to an inner product in some feature space.
Example: 2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiᵀxj)² (extra features get added).
We need to show that K(xi, xj) = φ(xi)ᵀφ(xj):
K(xi, xj) = (1 + xiᵀxj)² = 1 + xi1²xj1² + 2 xi1xj1xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
= [1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2]ᵀ [1, xj1², √2 xj1xj2, xj2², √2 xj1, √2 xj2]
= φ(xi)ᵀφ(xj), where φ(x) = [1, x1², √2 x1x2, x2², √2 x1, √2 x2]
Thus a kernel function implicitly maps data to a high-dimensional space (without the need to compute each φ(x) explicitly).
42
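A quick numerical check of this identity (a sketch added for this overview, with arbitrary sample values):

```python
import numpy as np

def phi(v):
    """phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2] from the slide above."""
    x1, x2 = v
    return np.array([1.0, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, 0.5])   # arbitrary 2-d sample vectors
K = (1.0 + xi @ xj) ** 2                               # kernel value, computed in 2-d
print(K, phi(xi) @ phi(xj))                            # both print 25.0
```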
What Functions are Kernels?
For some functions K(xi, xj), checking that K(xi, xj) = φ(xi)ᵀφ(xj) can be cumbersome.
- Mercer's theorem: every semi-positive definite symmetric function is a kernel.
Semi-positive definite symmetric functions correspond to a semi-positive definite symmetric Gram matrix (whose eigenvalues are nonnegative):
K = [ K(x1,x1) K(x1,x2) K(x1,x3) … K(x1,xn);
      K(x2,x1) K(x2,x2) K(x2,x3) … K(x2,xn);
      … ;
      K(xn,x1) K(xn,x2) K(xn,x3) … K(xn,xn) ]
43
Examples of Kernel Functions (these help add features to x that make samples linearly separable)
Linear: K(xi, xj) = xiᵀxj. Mapping φ: x → φ(x), where φ(x) is x itself.
Polynomial of power p: K(xi, xj) = (1 + xiᵀxj)^p. Mapping φ: x → φ(x), where φ(x) has C(p + d, p) dimensions (d = original dimensionality).
Gaussian (radial-basis function): K(xi, xj) = exp( −||xi − xj||² / (2σ²) ). Mapping φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian); a combination of the functions for the support vectors is the separator.
The higher-dimensional space still has intrinsic dimensionality d (the mapping is not onto), but linear separators in it correspond to non-linear separators in the original space.
44
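Illustrative Python definitions of the three kernels listed above (a sketch added for this overview; the function names and test values are our own):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z                                               # K(x, z) = x.z

def polynomial_kernel(x, z, p=2):
    return (1.0 + x @ z) ** p                                  # K(x, z) = (1 + x.z)^p

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))    # Gaussian kernel

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```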
Non-linear SVM's Mathematical Formulation
Dual problem formulation: find α1 … αn such that Q(α) = Σαi − ½ ΣΣ αiαj yiyj K(xi, xj) is maximized, subject to
(1) Σ αiyi = 0
(2) αi ≥ 0 for all αi
The solution is f(x) = Σ αi yi K(xi, x) + b ← check its sign to find the class of x.
Optimization techniques for finding the αi's remain the same!
45
An excellent review of Support Vector Machines is by Gopal Prasad Malakar: An Intro to Support Vector Machines, https://www.youtube.com/watch?v=ikt7Qze0czE
46
Classification by SVM can be made more efficient by Kernel Functions
A class is a group of things that share common attributes, characteristics, qualities or traits, called features. Dogs, by their pedigree, form classes called breeds. A breed's features may be type, height, skin color, body-hair length, etc.
SVM, a supervised-learning classifier, was invented as a binary classifier: it comprises a separating hyperplane that helps separate a collection of mathematical objects into two labeled classes.
Classification is a process based on the observed data, each sample comprising attributes or features that are similar or different.
SVM learns from a handful of class-labeled training samples and is then able to identify the class of yet-unseen samples based on their observed features.
47
SVM's learning models and algorithms: data → features → learning → prediction of the classes of new objects.
Typically, objects in some environments may be linearly classifiable. Here a linear hyperplane, such as a one-dimensional line (in a 2-d space) or a two-dimensional plane (in a 3-d space), can divide the feature space to create the two classes.
Many objects, however, are not linearly separable in the feature space. These may be mixed and distributed randomly.
A special mathematical device, or function, called a kernel may be used in such cases to enable a linear separator such as SVM to be effective.
This makes the originally linearly inseparable data become linearly separable, greatly easing the classification task. In the SVM world the procedure is called the kernel trick.
48
Non-linear SVMs: Feature Spaces Get Added
General idea: the original feature space can always be mapped by f(x) to some higher-dimensional feature space where the training set is separable:
f: x → f(x)
(Figure: a 2-d feature space (x1, x2) mapped to a 3-d feature space with the f(x) feature added.)
49
The transformation function f(x)
To use SVM, the transformation function f(x) is applied to each object instance (containing the particular values of its features) to map the original non-linear observations into a higher-dimensional space, thus adding one or more extra features to each sample or object.
Mathematically, the kernel function produces the equivalent of the dot product of the transformed data vectors in the transformed higher-dimensional space.
Hence, in this new transformed space the objects become linearly separable by SVM. With the dot product computed, the kernel function helps exploit the similarity between objects that now have the added features. This then analytically enables the emergence of classes that are linearly separable.
Instead of defining a slew of features, you define a single kernel function to compute similarity, say in the data from the breeds of dogs. You provide this kernel, together with the data and labels, to the learning algorithm, and out comes a classifier.
50
Non-linear SVM's Mathematical Formulation: a Kernel is now used!
Dual problem formulation: find α1 … αn such that Q(α) = Σαi − ½ ΣΣ αiαj yiyj K(xi, xj) is maximized, subject to
(1) Σ αiyi = 0
(2) αi ≥ 0 for all αi
The solution is f(x) = Σ αi yi K(xi, x) + b ← check its sign to find the class of x.
Optimization techniques for finding the αi's remain the same!
51
A mathematical illustration of the reduced computational effort from using a Kernel
A key step in the application of SVM is the calculation of the inner or dot product of the vectors comprising the feature assessments.
This calculation may become a huge load: voluminous, and sometimes intractable.
The kernel function, if a useful one can be discovered or created, can greatly simplify this dot-product computation. We later list some very useful kernel functions that machine-learning experts have harnessed to get the SVM to help with practical problems such as cancer detection or handwritten-character recognition.
K(x, y) = <f(x), f(y)>
Many useful kernels have now been devised.
52
A mathematical illustration provided by Lili Jiang
We begin by defining K(x, y) = <f(x), f(y)>.
Here K is the kernel function; x and y are n-dimensional input vectors, each comprising n features; f is a map from the n-dimensional to an m-dimensional space; and <x, y> denotes the dot product. In applications of kernels, usually m is much larger than n.
K(x, y) provides the equivalent of the dot product <f(x), f(y)>. Generally easier to compute than the dot product, a well-researched K thus brings efficiency.
For SVM to work as a classifier, it requires the objects to be linearly separable. This is achieved by transforming the data from the n-dimensional space to the larger m-dimensional space by using f(x).
53
The dot product <f(x), f(y)> is replaced by the Kernel K! (Vapnik)
To achieve linear separation, one needs to calculate the dot product <f(x), f(y)>. Normally this would require us to calculate f(x) and f(y) first, and then take the dot product.
These two computation steps can be quite expensive, as they involve manipulations in the higher, m-dimensional space, where m can be a large number.
But we recall that the result of the dot product is really a scalar. The question Jiang raises is: do we really need to go through all this trouble to get that one number?
- Do we really have to go to the m-dimensional space? The answer is no, provided you can find a clever kernel!
A couple more points must be noted about kernels:
1. Under the conditions given by Mercer, every kernel function can be expressed as a dot product in a feature space.
2. Many machine learning algorithms can be expressed entirely in terms of dot products.
54
A numerical illustration of the reduction in computational effort by use of the kernel
Let the two feature vectors be x = (x1, x2, x3) and y = (y1, y2, y3).
Then suppose the function f(x) is
f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3)
due to the transformation the SVM requires to produce data in the higher-dimensional space. The same holds for f(y).
In that space we require the dot product to get linear separability. For the dot product <f(x), f(y)>, the kernel that we could and should use is
K(x, y) = (<x, y>)² = (x1y1 + x2y2 + x3y3)²
55
Numerical illustration
Let's check the utility of K by plugging in some numbers to make this point firm.
Suppose x = (1, 2, 3) and y = (4, 5, 6). Then f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9) and f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36).
Therefore <f(x), f(y)> = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024.
That is a lot of algebra, mainly because f is a mapping from a 3-dimensional to a 9-dimensional space.
Now let us use the kernel instead. We get K(x, y) = (4 + 10 + 18)² = 32² = 1024.
We get the same result, with calculations that are so much easier.
56
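The slide's arithmetic is easy to reproduce in a few lines (a sketch added for this overview):

```python
import numpy as np

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])

f = lambda v: np.outer(v, v).ravel()     # f(v) = (v1v1, v1v2, ..., v3v3), 9 components
explicit = f(x) @ f(y)                   # dot product computed in the 9-d space
via_kernel = (x @ y) ** 2                # K(x, y) = (<x, y>)^2, computed in 3-d

print(explicit, via_kernel)              # both print 1024.0
```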
57
SVM applications: The challenge is in finding the right kernel!
Vladimir Vapnik created the logic of SVM in 1965 in his PhD work and later expanded it in 1992 and beyond at Bell Labs. SVMs were originally described by Boser, Guyon and Vapnik in 1992 and gained popularity in the late 1990s.
SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
SVMs can be applied to complex data types even beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.
SVM has been used in face recognition, text and hypertext categorization, classification of images, bioinformatics, protein fold and remote homology detection, handwriting recognition, generalized predictive control (GPC), and the geo- and environmental sciences.
SVM techniques have been extended to a number of tasks such as regression (Vapnik et al. '97), principal component analysis (Schölkopf et al. '99), etc.
The most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the αi's at a time, e.g. SMO (Platt '99) and (Joachims '99).
- Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done in a try-and-see manner.
58
Advantages of SVM
- Works better when the data is linearly separable.
- It is more effective in higher dimensions (when there are many features); the SVM then "learns more."
- By using the kernel trick, many complex problems may be solved.
- SVM is not sensitive to outliers. Also, SVM needs fewer samples than other classification methods.
- SVM is quite effective in image classification, though this may require many features.
- It has been used in character and face recognition.
59
A typical database of mobile handsets that needs to be classified as accept/reject
Each handset has been evaluated on 16 features.
→ Does adding more features always improve the accuracy of classification?
60
Disadvantages of SVM
- Choosing a good kernel is not easy.
- It does not show good results on a big dataset.
- The SVM hyperparameters are the cost C and gamma (γ). It is not easy to fine-tune these hyperparameters, and it is hard to visualize their impact.
- Generally, as you add more features, SVM's accuracy (% correct classification) goes up, but only up to a threshold. Beyond this point the model can't learn more and gets "confused": accuracy goes down. This is the curse of dimensionality.
- Strategies based on statistical tests exist to help in the optimum selection of features to add.
61
Numerical Solutions of SVM
Example solved by Dan Ventura (2009): https://axon.cs.byu.edu/Dan/678/miscellaneous/SVM.example.pdf
Review of SVM by Mahesh Huddar: https://www.youtube.com/watch?v=ivPoCcYfFAw and https://www.youtube.com/watch?v=03IrkMM4E6M
Excel Solver solution of SVM (Bagchi)
Python SVM model (Sujoy)
63
Soft Margin SVM Solution by Excel®
Notation is from page 208 of Machine Learning with SVM and Other Kernel Methods, Soman, Loganathan and Ajay, PHI (2011). The logic we give is original.
1. Collect the data that needs to be classified. To build the SVM you will need (1) training data and (2) validation data. Identify the features of the samples (vectors) and arrange the data in a worksheet as shown.
2. Write the feature headings as x1 and x2 (assuming you are classifying a sample population with each member having two features).
3. Add additional columns and headings w1, w2, gamma, etc., as shown.
4. For each sample, record its class in a separate column y, as assigned by the domain expert.
5. Your worksheet should have one row per sample, with columns: Sample #, x1, x2, y, (w1x1 + w2x2 − gamma)·y, slack psi, and (w1x1 + w2x2 − gamma)·y + psi − 1, plus cells holding the decision variables w1, w2 and gamma, the penalty C (= 10), and the optimum psi values (all initially 0).
64
SVM by Excel, contd.
6. Initialize the decision variables w1, w2, and gamma (b) by entering zeros (0) in cells E2 to G2. Also enter zeros in the cells in the range I3 to I12.
7. Enter the Excel formula "=($E$2*A3+$F$2*B3-$G$2)*C3" in D3, so that column D holds (w1x1 + w2x2 − gamma)·y. Copy this formula into the cell range D4 to D12.
8. Enter the formula "=SUM(I3:I12)" in cell I13, the target cell. Save the sheet but keep it open. With the worksheet open, proceed as follows.
9. In the worksheet select Tools, then Solver. Excel will display the dialog shown on the next slide.
65
Excel Solver Dialog Box
10. The first entry is the Target cell. Enter its address against Set Objective.
11. Enter a value of 10 for C in J1.
12. Choose the Min option to minimize (w1² + w2²) + C·Σξi.
13. In By Changing Variable Cells enter the E2:G2 and I3:I12 variables.
14. Next, the constraints are to be entered, in the space provided.
66
SVM by Excel, contd.
15. To enter the constraints click on the Add button. Excel will open the constraint-entry dialog box shown above.
16. In the Cell Reference text box enter I3:I12. These cells hold the values of psi. Choose >= in the middle selection box and enter 0 in the last text box.
17. The next set of constraints is then entered. The quantity is (w1x1 + w2x2 − gamma)·y + psi − 1, with its values present in the column shown under that title on the worksheet; it is constrained to be >= 0. The procedure is identical to what you did to constrain psi above. Select the solving method GRG Nonlinear.
18. To optimize w, psi and gamma now click OK. To end, press the Solve button. Read the results.
67
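For readers without Excel, the same soft-margin primal that Solver minimizes in steps 10 to 18 can be sketched with scipy; this is an illustrative re-implementation added to this overview, not the workshop's own code, and the toy data values are made up:

```python
import numpy as np
from scipy.optimize import minimize

# Toy training samples (x1, x2) and classes y = +1/-1 (illustrative values only)
X = np.array([[2.0, 3.0], [3.0, 3.5], [4.0, 4.0], [1.0, 1.0], [1.5, 0.5], [0.5, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
n, C = len(y), 10.0                        # C = 10, as in the Excel worksheet

def objective(v):
    w, gamma, psi = v[:2], v[2], v[3:]     # decision variables: w1, w2, gamma, psi_1..psi_n
    return w @ w + C * psi.sum()           # minimize (w1^2 + w2^2) + C * sum(psi)

def margin_cons(v):                        # (w1*x1 + w2*x2 - gamma)*y + psi - 1 >= 0
    w, gamma, psi = v[:2], v[2], v[3:]
    return (X @ w - gamma) * y + psi - 1.0

cons = [{"type": "ineq", "fun": margin_cons}]
bnds = [(None, None)] * 3 + [(0.0, None)] * n        # psi_i >= 0

res = minimize(objective, np.zeros(3 + n), method="SLSQP", bounds=bnds, constraints=cons)
w1, w2, gamma = res.x[:3]
print("w1, w2, gamma:", round(w1, 3), round(w2, 3), round(gamma, 3))
print("classes:", np.sign(X @ res.x[:2] - gamma))    # xw - gamma >= 0 -> class +
```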
Partial Display of the SVM Model on the Excel Worksheet
68
Training the Soft Margin SVM
69
Decision Boundary Calculations
70
Classification of Training Samples by SVM
71
SVM Decision Boundaries and Classification Rules
You are given a vector x (= x1, x2, x3, …) and a trained SVM with w and gamma (γ).
Boundary equations in SVM:
The separator (maximum-margin hyperplane): x·w − γ = 0
The positive classification boundary: x·w − γ = 1
The negative classification boundary: x·w − γ = −1
Classification rules:
If x·w − γ ≥ 0, vector x is accepted (class +).
If x·w − γ < 0, vector x is rejected (class −).
73
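A small sketch of this classification rule (added for this overview; the trained parameters shown are hypothetical):

```python
import numpy as np

def classify(x, w, gamma):
    """Return +1 (accepted) if x.w - gamma >= 0, else -1 (rejected)."""
    score = np.dot(x, w) - gamma
    return 1 if score >= 0 else -1

# Hypothetical trained parameters and a candidate's feature vector (illustrative only)
w, gamma = np.array([0.8, -0.3]), 0.5
print(classify(np.array([2.0, 1.0]), w, gamma))   # 0.8*2 - 0.3*1 - 0.5 = 0.8 >= 0 -> +1
```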
IPS Candidate Selection: Input Data
74
SVM Calculations for Candidate Classification
75
Classification Results
76
IPS Selection Decision Boundaries
77
Constructing the Confusion Matrix; α and β
(Figure: results of classification by an imperfect classifier. The separating hyperplane divides the samples classified as negative, containing TN and FN, from the samples classified as positive, containing TP and FP.)
Total samples = P + N = TP + FN + TN + FP
For a perfectly trained classifier: FP = FN = 0
78
Performance of Classifiers: The ROC Curve
TPR = Sensitivity = TP/(TP + FN): the proportion of actual positives that got correctly classified as positive.
FPR = 1 − Specificity = FP/(FP + TN): the proportion of actual negatives that got incorrectly classified as positive.
Empirical risk == training error; structural risk == model complexity.
Confusion-matrix entries:
True Positive (TP: f++): the number of instances that were positive (+) and correctly classified as positive (+). Good.
False Negative (FN: f+−): the number of instances that were positive (+) and incorrectly classified as negative (−). P(Type 2 error) = β = FN/(TP + FN). ← Generally deadly! (In medical terms, + means SICK, like having covid.) Bad!
False Positive (FP: f−+): the number of instances that were negative (−) and incorrectly classified as (+). P(Type 1 error) = α = FP/(TN + FP).
True Negative (TN: f−−): the number of instances that were negative (−) and correctly classified as (−).
79
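An illustrative computation of these rates (a sketch added for this overview; the counts are hypothetical):

```python
def rates(TP, FN, FP, TN):
    TPR = TP / (TP + FN)          # sensitivity
    FPR = FP / (FP + TN)          # 1 - specificity
    alpha = FP / (TN + FP)        # P(Type 1 error)
    beta = FN / (TP + FN)         # P(Type 2 error)
    return TPR, FPR, alpha, beta

# Hypothetical counts for a classifier tested on 100 samples
print(rates(TP=40, FN=10, FP=5, TN=45))   # (0.8, 0.1, 0.1, 0.2)
```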
Performance of the SVM IPS Classifier We Built
80
The Confusion Matrix in Medicine; the ROC Curve
81
References
1. Winston, Patrick (2014). MIT OpenCourseWare video lecture: Learning: Support Vector Machines.
2. Soman, K. P., R. Loganathan and V. Ajay (2011). Machine Learning with SVM and Other Kernel Methods, PHI Learning.
3. Bagchi, Tapan P., Rahul Samant and Milan Joshi (2). SVM Classifiers Built Using Imperfect Training Data, Nonlinear Studies, MESA.
4. Kapri, Aman (2020). Everything One Should Know About Support Vector Machines (SVM), Analytics Vidhya, Medium, Feb 23, 2020.
5. Malakar, Gopal Prasad. An Intro to Support Vector Machines (video review), https://www.youtube.com/watch?v=ikt7Qze0czE
6. Kowalczyk, Alexandre (2017). Support Vector Machines Succinctly, Syncfusion.
7. Introduction to Microsoft Excel course, https://www.youtube.com/watch?v=-ujVQzTtxSg&list=PLWPirh4EWFpEpO6NjjWLbKSCb-wx3hMql
8. Zisserman, A. (2015). Lecture 2: The SVM Classifier, C19 Machine Learning, Hilary 2015.
9. Principe, Jose and Sohan Seth. Statistical Learning Theory: The Structural Risk Minimization Principle, CNEL, University of Florida.
10. Porwal. Soft Support Vector Machine, IIT Bombay.
82