
However, to introduce kernels later on, we need to rewrite the SVM optimization problem in what is called the dual problem. Fully understanding this requires a strong mathematical background, in particular knowledge of Lagrange multipliers and the KKT conditions. Nevertheless, I will try to explain the dual formulation of SVM as simply as possible. The dual formulation looks like this:
\[
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j && (6) \\
\text{s.t.} \quad & \sum_{i=1}^{n} \alpha_i y_i = 0 \\
& 0 \le \alpha_i \le \frac{C}{n}, \quad i = 1, \dots, n
\end{aligned}
\]
Here, the observations with α_i > 0 are the support vectors. So, only the support vectors matter. Support vectors are the observations that lie on the margin of the decision boundary, i.e., the observations closest to it. I have illustrated the two support vectors in figure 2 (b). If a point is a support vector, then α_i > 0; if a point is not a support vector, then α_i = 0. Let us inspect equation 6. The last part of the objective function, x_i^T x_j, is the dot product between one data point and another. This dot product measures the similarity between two observations. So, in the dual formulation only the support vectors and the observations that are not correctly classified matter. Once a solution to the dual problem is found, a solution to the primal problem can easily be recovered:
\[
\omega^{*} = \sum_{i=1}^{n} \alpha_i y_i x_i \qquad (7)
\]
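To make equation 7 concrete, here is a minimal sketch in Python using scikit-learn (the toy data and C = 1.0 are my own assumptions for illustration; note that scikit-learn's SVC uses the box constraint 0 ≤ α_i ≤ C instead of C/n, but the relation in equation 7 is the same). SVC stores the products α_i y_i for the support vectors in dual_coef_, so summing them against the support vectors should reproduce the primal weight vector ω*:

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    # Toy, (almost) linearly separable data (an assumption for illustration)
    X, y = make_blobs(n_samples=40, centers=2, random_state=0)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)

    # dual_coef_ holds alpha_i * y_i, but only for the support vectors;
    # all other observations have alpha_i = 0 and drop out of the sum.
    w_from_dual = clf.dual_coef_ @ clf.support_vectors_   # equation (7)

    print(np.allclose(w_from_dual, clf.coef_))             # expect: True
    print("number of support vectors:", len(clf.support_vectors_))

The check also makes the earlier point visible: every observation with α_i = 0 contributes nothing to ω*, so only the support vectors determine the decision boundary.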
5 Kernel Trick for SVM
In general, the best way to separate data into two classes is not a linear boundary. However, SVMs support kernels. Kernels are functions we use to transform the data; we will see a graphical example later of what this can look like. Until now we have worked with what is called a linear kernel, which means we work directly on the original dataset. When using a kernel other than the linear one, the SVM instead works on a transformed dataset. Like neural networks, SVMs then often become a black box, because we cannot fully explain what the model does. To formulate SVM with kernels, we again need to formulate it as a dual problem:
\[
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) && (8) \\
\text{s.t.} \quad & \sum_{i=1}^{n} \alpha_i y_i = 0 \\
& 0 \le \alpha_i \le \frac{C}{n}, \quad i = 1, \dots, n
\end{aligned}
\]
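To illustrate what the kernel K(·, ·) computes, here is a small sketch, again with scikit-learn (the RBF kernel, the value of gamma, and the toy dataset are my own choices for illustration, not something dictated by the formulation above). It builds the matrix of kernel values K(x_i, x_j) explicitly and hands it to the solver, which gives the same model as letting SVC apply the kernel itself:

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    # Data that no straight line can separate (an assumption for illustration)
    X, y = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=0)

    gamma = 0.5  # hypothetical kernel width, chosen only for this example

    # RBF (Gaussian) kernel: K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dists)

    # SVC can take the kernel matrix directly, or build it from the data itself
    clf_pre = SVC(kernel="precomputed", C=1.0).fit(K, y)
    clf_rbf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

    # Both routes should give the same decisions on the training data
    print(np.array_equal(clf_pre.predict(K), clf_rbf.predict(X)))  # expect: True

The point is that the optimization only ever sees the kernel values K(x_i, x_j), never an explicit transformation of the data points.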
As you can see, this dual problem is very similar to the soft margin dual formulation we saw in equation 6. The dot product has simply been replaced by a kernel, K(·, ·). As mentioned earlier, a kernel transforms the data, so the similarity between x_i and x_j is now measured in a transformed space. Now,