Support Vector Machine
Mads Møller
LinkedIn: https://www.linkedin.com/in/madsmoeller1/
Mail: mads.moeller@outlook.com
This paper is the third in a series about machine learning algorithms. The Support Vector Machine
(SVM) algorithm can be used both for classification and for regression problems. In this paper we will
focus on how SVM is used for classification. Like linear and logistic regression, SVM is a supervised
machine learning algorithm, meaning that our data need to be labelled. There is also a variant of SVM
called SVR (Support Vector Regression) which uses the same principles to solve regression problems.
The core objective of SVM is to find an optimal hyperplane which linearly separates two classes by
maximizing the margin (we will come back to this part).
1 Linear Algebra for SVM
Figure 1: Length of orthogonal projection
In order to understand how SVM works we need to brush up some basic linear algebra. First, the norm
or length of a vector $x = (x_1, x_2, x_3)$ can be calculated as:

$$\|x\| = \sqrt{x_1^2 + x_2^2 + x_3^2} \tag{1}$$
The inner product between two vectors $u = (u_1, u_2)^T$ and $v = (v_1, v_2)^T$ can be calculated as:

$$u^T v = p \cdot \|u\| = u_1 v_1 + u_2 v_2 = v^T u \tag{2}$$
Where $p \in \mathbb{R}$ is the length of the orthogonal projection of $v$ onto $u$ as illustrated in figure 1.
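To make these two definitions concrete, here is a minimal NumPy sketch (the vectors are arbitrary made-up examples) that computes the norm from equation 1 and verifies the projection identity from equation 2:

```python
import numpy as np

# Norm (length) of a vector, as in equation 1
x = np.array([3.0, 4.0, 12.0])
norm_x = np.sqrt(np.sum(x ** 2))          # same as np.linalg.norm(x) -> 13.0

# Inner product and orthogonal projection, as in equation 2
u = np.array([2.0, 1.0])
v = np.array([1.0, 3.0])
inner = u @ v                             # u1*v1 + u2*v2 = 5.0
p = inner / np.linalg.norm(u)             # length of the projection of v onto u
print(norm_x, inner, p * np.linalg.norm(u))  # last value equals the inner product
```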
2 Intuition
A hyperplane is a plane, or imagine a cut, that separates n-dimensional data points into two classes.
In the simplest case the data is 2-dimensional and the hyperplane is just a straight line. With
3-dimensional data we can imagine the hyperplane as a tabletop separating points. In n dimensions it is
hard for our brains to picture what the hyperplane looks like, but the principle is identical to the 2-
and 3-dimensional cases. SVM linearly separates the data into two components. Let us see what this looks
like in figure 2.
Figure 2: Support Vector Machine Intuition
As we see in figure 2 (a) the data can be linearly separated in an infinite number of ways. The problem
SVM solves is finding the ideal line; in the figure neither of the two lines is ideal. But how do we find
the hyperplane (line) that best separates the two classes? Let us take a look at figure 2 (b). This is
the hyperplane that best separates the two classes. Why? Because the margin to the two classes is
maximized at exactly this hyperplane.
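As a quick illustration of this idea, the sketch below fits a maximum-margin line with scikit-learn on a small made-up dataset; the data values and the SVC settings are only illustrative assumptions, not taken from the paper:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-dimensional data: two loosely separated clusters (made-up values)
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A linear SVM looks for the separating line with the largest margin
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)       # the learned hyperplane: w·x + b = 0
print(clf.support_vectors_)            # the points that define the margin
```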
3 Optimal Hyperplane (Decision boundary)
In order to find the hyperplane which maximizes the margin we need to mathematically formulate our
problem, or objective function. SVM is actually a constrained optimization problem. Let us see how the
objective of SVM can be formulated:
$$\min_{\omega} \; \frac{1}{2} \sum_{j=1}^{n} \omega_j^2 \tag{3}$$

$$\text{s.t.} \quad \omega^T x^{(i)} + b \geq 1 \;\; \text{if } y^{(i)} = 1$$
$$\qquad \;\;\, \omega^T x^{(i)} + b \leq -1 \;\; \text{if } y^{(i)} = 0$$
This formulation of SVM is called the hard margin (because we demand that all observations are correctly
classified). If we label the negative class as $y^{(i)} = -1$ instead of $0$, we can rearrange the two
constraints into one compact form:

$$y^{(i)}(\omega \cdot x^{(i)} + b) \geq 1$$
In words, what the optimization problem in equation 3 says is: "maximize the margin by minimizing
the objective function". In a moment we will dive into why this expression maximizes the margin.
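To see what the constraints mean in practice, here is a small NumPy sketch that checks the hard margin constraint for a made-up candidate hyperplane and toy data with labels in {-1, +1} (all values are illustrative assumptions):

```python
import numpy as np

# Candidate hyperplane (made-up values) and a tiny toy dataset with labels in {-1, +1}
w = np.array([1.0, 1.0])
b = -5.0
X = np.array([[1.0, 2.0], [2.0, 1.0], [5.0, 4.0], [6.0, 6.0]])
y = np.array([-1, -1, 1, 1])

# Hard margin feasibility: every point must satisfy y_i * (w·x_i + b) >= 1
functional_margins = y * (X @ w + b)
print(functional_margins)        # [2. 2. 4. 7.] -> all >= 1, so (w, b) is feasible
print(2 / np.linalg.norm(w))     # width of the margin = 2 / ||w||
```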
Let us take an example where we have two explanatory variables. We can then reformulate the problem
using the mathematics we defined in section 1. In this example we assume that the intercept is zero, so
b = 0:
$$\min_{\omega} \; \frac{1}{2} \sum_{j=1}^{n} \omega_j^2 = \frac{1}{2}\left(\omega_1^2 + \omega_2^2\right) = \frac{1}{2}\left(\sqrt{\omega_1^2 + \omega_2^2}\right)^2 = \frac{1}{2}\|\omega\|^2 \tag{4}$$

$$\text{s.t.} \quad \omega^T x^{(i)} = p^{(i)} \|\omega\| \geq 1 \;\; \text{if } y^{(i)} = 1$$
$$\qquad \;\;\, \omega^T x^{(i)} = p^{(i)} \|\omega\| \leq -1 \;\; \text{if } y^{(i)} = 0$$
Where $p^{(i)}$ is the length of the projection of each observation $x^{(i)}$ onto the vector $\omega$.
Since the objective minimizes $\|\omega\|$, the constraint $p^{(i)}\|\omega\| \geq 1$ can only be
satisfied if $|p^{(i)}| \geq 1/\|\omega\|$, i.e. if every observation lies far from the decision boundary.
So we can see that SVM really is all about maximizing the margin, because the constraints can only be
fulfilled if the margin is large.
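A tiny numerical check of the identity $\omega^T x = p\|\omega\|$, with made-up values, may help:

```python
import numpy as np

# Numerical check of the identity w·x = p * ||w||, where p is the
# length of the projection of x onto w (made-up values)
w = np.array([0.6, 0.8])                  # ||w|| = 1 here, to keep the numbers simple
x = np.array([2.0, 1.0])

p = (w @ x) / np.linalg.norm(w)           # orthogonal projection length
print(w @ x, p * np.linalg.norm(w))       # both equal 2.0

# The constraint p * ||w|| >= 1 forces p >= 1 / ||w||:
# the smaller ||w|| is, the further every point must lie from the boundary.
print(1 / np.linalg.norm(w))              # lower bound on p for this w
```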
4 Soft Margin Formulation and Dual Form
Until now we have assumed that the data is linearly separable. This is often not the case, so let us
formulate SVM in a way where not all observations need to be correctly classified:
$$\min_{\omega} \; \frac{1}{2} \sum_{j=1}^{n} \omega_j^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i \tag{5}$$

$$\text{s.t.} \quad y_i(\omega^T x_i) \geq 1 - \xi_i, \quad i = 1, \dots, n$$
$$\qquad \;\;\, \omega = (\omega_0, \dots, \omega_k)^T \in \mathbb{R}^{k+1}$$
$$\qquad \;\;\, \xi_i \geq 0, \quad i = 1, \dots, n$$
In the soft margin formulation we allow observations to be inside the margin and we allow
misclassifications. If $\xi_i = 0$ the individual observation $x^{(i)}$ is correctly classified. If
$0 < \xi_i \leq 1$, the observation $x^{(i)}$ is inside the margin but on the correct side of the
hyperplane; imagine a point inside the margin on the correctly classified side in figure 2 (b). If
$\xi_i > 1$, the observation is misclassified. We call the hyperparameter $C$ the trade-off parameter
between the width of the margin and correct classification, which is exactly what it controls in
equation 5: a large $C$ penalizes the slack variables heavily, while a small $C$ tolerates more
violations in exchange for a wider margin.
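To illustrate the role of the slack variables, the following NumPy sketch computes $\xi_i = \max(0, 1 - y_i(\omega^T x_i + b))$ for a made-up hyperplane and dataset (all values are illustrative assumptions):

```python
import numpy as np

# Slack variables for a given hyperplane (w, b): xi_i = max(0, 1 - y_i * (w·x_i + b)).
# xi = 0 -> correctly classified outside the margin, 0 < xi <= 1 -> inside the margin
# on the correct side, xi > 1 -> misclassified. All values below are made up.
w, b = np.array([1.0, 1.0]), -5.0
X = np.array([[1.0, 2.0], [4.0, 0.8], [5.0, 4.0], [2.0, 2.0]])
y = np.array([-1, -1, 1, 1])

xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(xi)   # [0.  0.8 0.  2. ] -> second point is inside the margin, last one is misclassified
```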
However, when we later introduce kernels in SVM we need to write the optimization problem in what is
called the dual form. Fully understanding it requires a strong mathematical background, including
knowledge of Lagrange multipliers and the KKT conditions. Nevertheless, I will try to explain the dual
formulation of SVM as simply as possible. The dual formulation looks like this:
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \tag{6}$$

$$\text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$
$$\qquad \;\;\, 0 \leq \alpha_i \leq \frac{C}{n}, \quad i = 1, \dots, n$$
The observations with $\alpha_i > 0$ are the support vectors, so only the support vectors matter. Support
vectors are the observations that lie on the margin of the decision boundary, i.e. the observations
closest to it; I have illustrated the two support vectors in figure 2 (b). If a point is a support vector
then $\alpha_i > 0$; if a point is not a support vector then $\alpha_i = 0$. Let us inspect equation 6.
The last part of the objective function ($x_i^T x_j$) is the dot product between one data point and the
other data points, and this dot product measures the similarity between the observations. So, in the dual
formulation only the support vectors and the observations that are not correctly classified matter. When
a solution to the dual problem is found, a solution to the primal problem can easily be recovered:
$$\omega = \sum_{i=1}^{n} \alpha_i y_i x_i \tag{7}$$
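As a sanity check of equation 7, the sketch below recovers $\omega$ from the dual coefficients that scikit-learn exposes for a linear kernel; the toy data is made up:

```python
import numpy as np
from sklearn.svm import SVC

# For a linear kernel, scikit-learn exposes alpha_i * y_i for the support vectors
# as dual_coef_, so w = sum_i alpha_i * y_i * x_i can be recovered directly.
X = np.array([[1, 2], [2, 1], [2, 3], [6, 5], [7, 7], [5, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w_from_dual = clf.dual_coef_ @ clf.support_vectors_   # equation 7, restricted to the support vectors
print(w_from_dual)
print(clf.coef_)                                       # scikit-learn's own w: the two should match
```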
5 Kernel Trick for SVM
In general, the best way to separate data into two classes is not a linear one. Fortunately, SVM supports
kernels. Kernels are functions we use to transform the data, and we will see graphically later how this
can look. Until now we have worked with what is called a linear kernel; with a linear kernel we work
directly on the original dataset. When using another kernel than the linear one, SVM works on a
transformed dataset. Like neural networks, SVM then often becomes a black box because we cannot fully
explain what the model does. To formulate SVM with kernels we need the dual form as well:
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \tag{8}$$

$$\text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$
$$\qquad \;\;\, 0 \leq \alpha_i \leq \frac{C}{n}, \quad i = 1, \dots, n$$
As you can see, the dual problem is very similar to the soft margin dual formulation we saw in equation
6. Now the dot product has been replaced by a kernel, $K(\cdot, \cdot)$. As mentioned earlier, a kernel
transforms the data, so the similarity between $x_i$ and $x_j$ is now measured in a transformed space.
Why does this make sense? One way the kernel trick can work is by adding another dimension to the data.
Let us take an example and inspect figure 3 below:
Figure 3: Transformation to higher dimension
When we look at figure 3 (a) it is clear that this dataset is not linearly separable. However, we can
simply add another dimension to the dataset to make it linearly separable. To make this transformation we
have added the dimension $z = x^2 + y^2$ to our data. Figure 3 (b) therefore has a third dimension $z$;
just imagine that this is how the data points look in the three-dimensional space seen from the side. By
adding one more dimension our data is linearly separable again! In real life the SVM algorithm is not
this simple: often you get many more dimensions, or the SVM uses more advanced kernels. Hopefully the
idea behind adding more dimensions and kernels is now in place. The last thing we need to cover in this
paper is the kernels themselves, because I have not really introduced any specific kernel yet. So let us
take a look at the default kernel in most ML libraries.
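Here is a small scikit-learn sketch of exactly this trick, generating circle-shaped data and adding the extra dimension $z = x^2 + y^2$ by hand (the dataset parameters are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original (x, y) space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Add the extra dimension z = x^2 + y^2 from figure 3 by hand
z = (X ** 2).sum(axis=1, keepdims=True)
X3d = np.hstack([X, z])

# A purely linear SVM now separates the classes almost perfectly in 3 dimensions
linear_2d = SVC(kernel="linear").fit(X, y).score(X, y)
linear_3d = SVC(kernel="linear").fit(X3d, y).score(X3d, y)
print(linear_2d, linear_3d)   # the 3-d score should be close to 1.0
```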
6 Kernels
As mentioned earlier, kernels are ways to transform our data in order to make it more linearly separable.
The default kernel in most ML libraries is called the Radial Basis Function (RBF). For two points $x_i$
and $x_j$ the RBF kernel computes their similarity, i.e. how close the two observations are to each
other. The mathematical representation of this kernel is:
$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) \tag{9}$$
Where $\|x_i - x_j\|$ is the Euclidean norm of the difference between our two points $x_i$ and $x_j$;
this norm measures the distance between the two points. The maximum value of the RBF kernel is 1, which
only happens when $\|x_i - x_j\|$ is zero, meaning that the two points are identical (maximally similar).
If the value of the RBF kernel is near zero the points are very dissimilar, meaning that they are far
from each other; distance can be thought of as equivalent to dissimilarity. When inspecting equation 9 we
also see the parameter $\sigma$. This is a hyperparameter and it decides how close a point has to be
before it is considered similar. Let us take some examples to illustrate the effect of $\sigma$:
Figure 4: RBF illustration of σ
As we can see, the value of $\sigma$ has a huge effect on when data points are considered similar.
Normally the $\sigma$ parameter is chosen with hyperparameter tuning in a software library that supports
SVM. Of course there are lots of other kernels that we will not cover in this paper. The reason why RBF
is the default kernel is that it often performs best when you do not know what your data looks like.
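As a concrete companion to equation 9 and figure 4, the sketch below evaluates the RBF similarity for one made-up pair of points and a few values of $\sigma$, and compares it with scikit-learn's rbf_kernel (which is parameterised with $\gamma = 1/(2\sigma^2)$):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# RBF similarity from equation 9 for a pair of made-up points and different sigma values.
# scikit-learn parameterises the kernel with gamma = 1 / (2 * sigma^2).
x_i = np.array([[0.0, 0.0]])
x_j = np.array([[1.0, 1.0]])

for sigma in [0.5, 1.0, 3.0]:
    manual = np.exp(-np.sum((x_i - x_j) ** 2) / (2 * sigma ** 2))
    sklearn_val = rbf_kernel(x_i, x_j, gamma=1 / (2 * sigma ** 2))[0, 0]
    print(sigma, manual, sklearn_val)   # a larger sigma makes distant points look more similar
```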