In a 2D or 3D setting the Euclidean distance can be seen as the length of the straight line that connects
two observations in our feature space.
The Manhattan distance is very similar to the Euclidean distance. We can calculate the Manhattan
distance in the following way:
\[
d(x_i, x_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ik} - x_{jk}| \tag{2}
\]
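To make equation (2) concrete, here is a minimal Python sketch; the function name and the example vectors are our own illustration, not part of the original text.

def manhattan_distance(x_i, x_j):
    # Equation (2): sum of absolute coordinate-wise differences.
    return sum(abs(a - b) for a, b in zip(x_i, x_j))

# Example: two observations with k = 3 features.
print(manhattan_distance([1.0, 2.0, 3.0], [4.0, 0.0, 3.0]))  # |1-4| + |2-0| + |3-3| = 5.0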
The last distance measure we will see is the Maximum distance:
\[
d(x_i, x_j) = \max\{\, |x_{i1} - x_{j1}|, |x_{i2} - x_{j2}|, \dots, |x_{ik} - x_{jk}| \,\} \tag{3}
\]
You can think of the Maximum distance as the worst case: the largest distance in any single variable between
two observations. As mentioned earlier, all of these distances belong to a class of distances called the
$\ell_p$-distance. The general formulation of the $\ell_p$-distance is:
\[
d(x_i, x_j) = \left( |x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \dots + |x_{ik} - x_{jk}|^p \right)^{1/p} \tag{4}
\]
From the general formulation we can see that for p = 2 we have the Euclidean, for p = 1 we have the
Manhattan, and in the limit p → ∞ we have the Maximum.
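The following sketch implements the general $\ell_p$-distance of equation (4) and checks the three special cases; the function name and the example values are illustrative assumptions.

import math

def lp_distance(x_i, x_j, p):
    # Equation (4); p = math.inf gives the Maximum distance of equation (3).
    diffs = [abs(a - b) for a, b in zip(x_i, x_j)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(lp_distance(x, y, 1))         # Manhattan: 5.0
print(lp_distance(x, y, 2))         # Euclidean: sqrt(13) ≈ 3.606
print(lp_distance(x, y, math.inf))  # Maximum: 3.0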
1.2 Distances for categorical variables
Until now we have only been looking at how we can measure distance between continuous variables.
However, in many cases a dataset also contains categorical variables. We also need to be able to measure
how "far" an observation is from another observation when we have a categorical variable. There are three
popular distance measures for categorical variables. These are called Hamming, Simple Matching
and Jaccard. The Hamming distance is based on counting:
\[
d(x_i, x_j) = \text{cardinality}\{\, s : x_{is} \neq x_{js} \,\} \tag{5}
\]
It counts the number of categorical variables in which two observations differ. So, imagine
a dataset with three categorical variables. If two observations have the same category in only one of the
three variables, then the Hamming distance is 2, because the observations differ in the other two
variables. The other two measures are simple as well, but we will not go into them
in this paper.
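A minimal sketch of the Hamming count in equation (5), reusing the three-variable example above; the category labels are made up for illustration.

def hamming_distance(x_i, x_j):
    # Equation (5): count the positions where the categories differ.
    return sum(1 for a, b in zip(x_i, x_j) if a != b)

# Three categorical variables, agreement in only one of them: distance 2.
print(hamming_distance(["red", "small", "round"], ["red", "large", "square"]))  # 2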
When you have a dataset with both continuous and categorical variables, the algorithm in the software
you use will normally combine distance measures in order to handle both data types, as sketched below.
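One way such a mixture could look is the sketch below, which combines a Manhattan distance on the numeric features with a Hamming count on the categorical ones; the equal weighting is an assumption made for the example (real implementations, such as Gower's distance, also rescale the numeric part), not a specific method from this text.

def mixed_distance(num_i, num_j, cat_i, cat_j, w=0.5):
    # Illustrative assumption: weighted sum of a Manhattan distance on the
    # numeric features and a Hamming distance on the categorical features.
    numeric = sum(abs(a - b) for a, b in zip(num_i, num_j))
    categorical = sum(1 for a, b in zip(cat_i, cat_j) if a != b)
    return w * numeric + (1 - w) * categorical

print(mixed_distance([1.0, 2.0], [2.0, 2.5], ["red"], ["blue"]))  # 0.5*1.5 + 0.5*1 = 1.25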
2 The KNN Algorithm
The KNN algorithm is based on similarity arguments. It does not deliver a model per
se; rather, it is an algorithm for classifying new observations using an already existing dataset. Similarity defines