K-Nearest Neighbors (KNN)
Mads Møller
LinkedIn: https://www.linkedin.com/in/madsmoeller1/
Mail: mads.moeller@outlook.com
This paper is the fifth in the series about machine learning algorithms. The Nearest Neighbor algorithm is
used for classification problems but can also be used for regression. Nearest Neighbor is also a supervised
machine learning algorithm, meaning that our data need to be labelled. It is one of the most popular
classification algorithms and is, moreover, a multi-class classification algorithm.
1 Dissimilarities and Distances
The intuition behind the KNN algorithm is reflected in its name. We measure how similar each
observation is to the other observations, and in KNN similar observations are assigned to the same
class. Similarity is the keyword when we talk about KNN. There are many different methods for
measuring similarity between observations; we will dive into some of them in a moment.
The complement of similarity is dissimilarity, and when we talk about KNN it is actually the dissimilarity
we are interested in. We define a dissimilarity as a function δ that satisfies:
\[
\delta_{ij} \geq 0, \qquad \delta_{ii} = 0, \qquad \delta_{ij} = \delta_{ji}
\]
where $i$ and $j$ denote two observations.
From the dissimilarity function we can see that one observation is identical to another (100% similar, 0% dissimilar)
if $\delta_{ij} = 0$. Some well-known distances for continuous variables are the Euclidean distance,
the Manhattan distance and the Maximum distance. All three belong to a family of distances
called $l_p$-distances.
1.1 Distances for continuous variables
The first distance we will take a look at is the Euclidean distance:
\[
d(x_i, x_j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ik} - x_{jk})^2} \tag{1}
\]
In a 2D or 3D setting the Euclidean distance can be seen as the length of the straight line that connects
two observations in our feature space.
The Manhattan distance is very similar to the Euclidean distance. We can calculate the Manhattan
distance in the following way:
\[
d(x_i, x_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ik} - x_{jk}| \tag{2}
\]
The last distance measure we will see is the Maximum distance:
\[
d(x_i, x_j) = \max\{\,|x_{i1} - x_{j1}|,\ |x_{i2} - x_{j2}|,\ \dots,\ |x_{ik} - x_{jk}|\,\} \tag{3}
\]
You can think of the Maximum distance as the worst case: the largest single-variable distance between
two observations. As mentioned earlier, all of these distances belong to a class of distances called
$l_p$-distances. The general formulation of the $l_p$-distance is:
\[
d(x_i, x_j) = \bigl(|x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \dots + |x_{ik} - x_{jk}|^p\bigr)^{1/p} \tag{4}
\]
From the general formulation we can see that for $p = 2$ we have the Euclidean distance, for $p = 1$ the
Manhattan distance, and in the limit $p \to \infty$ the Maximum distance.
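As a minimal sketch, the snippet below computes these three special cases with NumPy; the two example points and the variable names are assumptions made purely for illustration.

```python
import numpy as np

# Two hypothetical observations with k = 3 continuous features
x_i = np.array([1.0, 4.0, 2.0])
x_j = np.array([3.0, 1.0, 2.0])

# Euclidean distance (l_p with p = 2)
euclidean = np.linalg.norm(x_i - x_j, ord=2)      # sqrt(2^2 + 3^2 + 0^2) ≈ 3.61

# Manhattan distance (l_p with p = 1)
manhattan = np.linalg.norm(x_i - x_j, ord=1)      # 2 + 3 + 0 = 5

# Maximum distance (the limit p -> infinity)
maximum = np.linalg.norm(x_i - x_j, ord=np.inf)   # max(2, 3, 0) = 3

print(euclidean, manhattan, maximum)
```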
1.2 Distances for categorical variables
Until now we have only been looking at how we can measure distance between continuous variables.
However, in many cases a dataset also contains categorical variables. We also need to be able to measure
how "far" an observation is from another observation when we have a categorical variable. There are three
popular distance measures for categorical variables, called Hamming, Simple Matching
and Jaccard. The Hamming distance is based on counting:
\[
d(x_i, x_j) = \operatorname{cardinality}\{\,s : x_{is} \neq x_{js}\,\} \tag{5}
\]
It counts the number of categorical variables in which two observations differ. So, imagine
a dataset with three categorical variables. If two observations share the same category in only one of the
three variables, then the Hamming distance is 2, because the observations differ in two of the variables.
The other two measures are simple as well, but we will not go into them in this paper.
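A minimal sketch of the Hamming count in equation (5), assuming the categorical features are stored as plain Python lists; the example categories are made up for illustration.

```python
def hamming_distance(x_i, x_j):
    """Count the positions in which the two observations disagree."""
    return sum(a != b for a, b in zip(x_i, x_j))

# Three hypothetical categorical variables: colour, size, shape
obs_1 = ["red",  "large", "round"]
obs_2 = ["blue", "large", "square"]

print(hamming_distance(obs_1, obs_2))  # 2 -> they differ on colour and shape
```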
When a dataset contains both continuous and categorical variables, the implementation you use will
typically combine distance measures so that both data types are handled.
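One common way to combine the two data types is a Gower-style average: rescale each continuous difference to [0, 1] by its feature range and use a 0/1 mismatch indicator for each categorical variable. The sketch below is only one possible mixture, not the method any particular program uses; the feature names, ranges and index arguments are assumptions.

```python
def mixed_distance(x_i, x_j, num_idx, cat_idx, ranges):
    """Average of range-scaled absolute differences (continuous features)
    and 0/1 mismatch indicators (categorical features)."""
    contributions = []
    for s in num_idx:
        contributions.append(abs(x_i[s] - x_j[s]) / ranges[s])  # scaled to [0, 1]
    for s in cat_idx:
        contributions.append(0.0 if x_i[s] == x_j[s] else 1.0)  # simple mismatch
    return sum(contributions) / len(contributions)

# Hypothetical observations: (age, income, colour)
a = (25, 40_000, "red")
b = (35, 60_000, "blue")
print(mixed_distance(a, b, num_idx=[0, 1], cat_idx=[2],
                     ranges={0: 50, 1: 100_000}))  # (0.2 + 0.2 + 1.0) / 3 ≈ 0.47
```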
2 The KNN Algorithm
The KNN algorithm is based on similarity arguments. The KNN algorithm does not deliver a model per
se; rather, it is an algorithm that classifies new observations using an already labelled dataset. Similarity defines
the "neighbors" in KNN. The neighbors are the observations most similar to a new observation. The K
in KNN is a constant (hyperparameter) that defines how many neighbors the algorithm should find. The
"problem" with KNN is that it is a computationally expensive algorithm.
2.1 Class Assignment
For new observations, the predicted class is defined by the K closest (most similar) neighbors: KNN
classifies a new observation by looking at the majority class among its K nearest neighbors, and the new
observation is assigned to that majority class. Given a new individual $x_{new}$, the KNN algorithm
calculates $\delta_{new,i}$ for all $i$ in the training dataset. Afterwards the dissimilarities are rearranged
in increasing order, i.e.
\[
\delta_{new,(1)} \leq \delta_{new,(2)} \leq \dots \leq \delta_{new,(K)} \leq \delta_{new,(K+1)} \leq \dots \leq \delta_{new,(n)}
\]
The first K observations are the so-called nearest neighbors, and $\hat{y}_{new}$ is the majority class
among these K nearest neighbors.
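To make the steps concrete, here is a minimal from-scratch sketch of the class-assignment rule: compute all dissimilarities to the new observation, keep the K smallest, and take the majority vote. The tiny training set, the Euclidean dissimilarity and K = 3 are assumptions chosen only for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by the majority class among its k nearest neighbors."""
    # delta_{new,i} for every observation i in the training data (Euclidean here)
    dissimilarities = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k smallest dissimilarities, i.e. the nearest neighbors
    neighbour_idx = np.argsort(dissimilarities)[:k]
    # majority vote among the neighbors' labels
    votes = Counter(y_train[neighbour_idx])
    return votes.most_common(1)[0][0]

# Hypothetical labelled training data with two continuous features
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [0.9, 1.1]])
y_train = np.array(["A", "A", "B", "B", "A"])

print(knn_predict(X_train, y_train, x_new=np.array([1.1, 1.0]), k=3))  # "A"
```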
2.2 Hyperparameter Tuning
As in almost all machine learning models, there are hyperparameters we can tune to increase our "model"
performance. In the case of KNN we do not really have a model as such, but we can still tune some
hyperparameters to improve classification performance. The number of neighbors, K, is the most obvious
hyperparameter to tune. You can also tune the $p$ in the $l_p$-distance measure.
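A sketch of how such tuning could look with scikit-learn's grid search; the stand-in dataset, the grid values and the 5-fold cross-validation are arbitrary choices made for illustration, not a prescription.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in labelled dataset for the example

# Search over K (n_neighbors) and the power p of the l_p-distance
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11],
              "p": [1, 2]}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # e.g. {'n_neighbors': ..., 'p': ...}
print(search.best_score_)   # mean cross-validated accuracy of the best setting
```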