
Here $x$ is called our independent variable (input variable), $w_1$ is called the weight, and $w_0$ is called the bias term. Therefore, we want to find the values of $w_0$ and $w_1$ that give us the best model fit. In order to find the best line we have to set up a formula that measures the distance from all points to the line, so that we can minimize it. First of all we need to get familiar with two essential machine learning principles called cost functions and gradient descent.
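Before doing so, here is a minimal Python sketch of the model itself, assuming the univariate form $\hat{y} = w_0 + w_1 x$ (consistent with equation 2 below); the function name is illustrative, not from the text:

```python
def predict(x, w0, w1):
    """Predicted value y_hat for input x, given bias w0 and weight w1."""
    return w0 + w1 * x
```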
1.1 Cost Function
If we inspect figure 1 again, the blue line represents predictions of our dependent variable $y$. So the problem that linear regression solves is fitting the line that minimizes the distance between the actual values (the observations on the graph) and our predicted values (the line). This distance is also referred to as the cost. Therefore, the function we are going to minimize is called the cost function. In linear regression we define our cost function as:
\[
J(w_0, w_1) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \frac{1}{n} \sum_{i=1}^{n} \big((w_0 + w_1 x_i) - y_i\big)^2 \tag{2}
\]
We denote our cost function with $J$, the predicted values as $\hat{y}$, the observations from our dataset as $y$, and the number of observations as $n$. This cost function is also referred to as the mean squared error, or just MSE. By using MSE we are going to minimize the error between the predicted values and the actual values. As seen from equation 2, we square the error difference, sum over all $n$ data points, and divide that value by the total number of data points. The reason we square the error difference is that negative and positive differences should both add to our cost instead of cancelling each other out.
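As a quick illustration, the following is a minimal NumPy sketch of the cost in equation 2; the function name and the example data are ours, chosen purely for illustration:

```python
import numpy as np

def mse_cost(w0, w1, x, y):
    """Cost J(w0, w1) from equation 2: mean squared error of the line w0 + w1*x."""
    y_hat = w0 + w1 * x                 # predicted values for every data point
    return np.mean((y_hat - y) ** 2)    # average of the squared differences

# Example with a small synthetic dataset (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(mse_cost(0.0, 2.0, x, y))
```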
However, in order to minimize our cost function we need to figure out how to change the weights $w_0$ and $w_1$ so that the cost decreases. To understand how to change the weights properly we have to define an optimization algorithm for our cost function.
1.2 Gradient Descent
Gradient Descent is a simple optimization algorithm, i.e., an algorithm that tries to minimize our cost function. The math behind Gradient Descent is a simple update rule:
\[
w_i = w_i - \alpha \frac{\partial J}{\partial w_i} \tag{3}
\]
Gradient Descent repeats this process until convergence. All of the weights have to be updated simultaneously, so for our simple univariate linear regression the two weights $i \in \{0, 1\}$ are updated simultaneously. The variable $\alpha$ is called the learning rate, and its value decides how much the weights are changed in each iteration. The partial derivatives w.r.t. the weights can easily be calculated from equation 2:
\[
\frac{\partial J}{\partial w_0} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \tag{4}
\]
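Putting equations 2 to 4 together, a minimal NumPy sketch of the update loop could look as follows. The gradient with respect to $w_1$ is the analogous derivative of equation 2, $\frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)x_i$, and the learning rate and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, iterations=1000):
    """Fit w0 and w1 with the update rule in equation 3."""
    w0, w1 = 0.0, 0.0
    n = len(x)
    for _ in range(iterations):
        y_hat = w0 + w1 * x                          # current predictions
        grad_w0 = (2 / n) * np.sum(y_hat - y)        # equation 4
        grad_w1 = (2 / n) * np.sum((y_hat - y) * x)  # analogous derivative w.r.t. w1
        # simultaneous update of both weights (equation 3)
        w0, w1 = w0 - alpha * grad_w0, w1 - alpha * grad_w1
    return w0, w1

w0, w1 = gradient_descent(np.array([1.0, 2.0, 3.0, 4.0]),
                          np.array([2.1, 3.9, 6.2, 8.1]))
print(w0, w1)
```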