Logistic Regression
Mads Møller
LinkedIn: https://www.linkedin.com/in/madsmoeller1/
Mail: mads.moeller@outlook.com
This paper is the second in the series on machine learning algorithms, and logistic regression is our first classification algorithm. Even though "regression" appears in the name, logistic regression is a classification algorithm, not a regression algorithm.
Logistic regression is closely connected to linear regression; in fact, as you will see, a linear regression model sits inside logistic regression. Since logistic regression is used for classification problems, let us first define what we mean by "classification". The formal definition of classification is
"Classification is a process of categorizing a given set of data into classes"
An example of classification could be determining whether a person has cancer based on an x-ray image, which is also a good example of what deep learning is used for in real life. Another example could be classifying whether an email is spam or not. Logistic regression can be applied to such problems. In order to understand how the logistic regression algorithm works, let us inspect the simple case of binary logistic regression.
1 Binary Logistic Regression
Even though logistic regression is a classification algorithm, its raw output is continuous. We inspect a simple example with a binary outcome. We define the output of a linear regression as z, which we will use to understand logistic regression:
z = w_0 + w_1 x = w^T x    (1)
In binary logistic regression we classify our data as either 0 or 1. Logistic regression works by squeezing the output of linear regression to lie between 0 and 1; it transforms the linear output with the help of the sigmoid function. The sigmoid function (sometimes called the logistic function) is defined mathematically as:
σ(z) = 1 / (1 + e^{-z}) = 1 / (1 + e^{-w^T x})    (2)
The output of the sigmoid function is probabilistic, meaning that it reflects the probability of the data belonging to each class. If we inspect the sigmoid function visually, it is clear that the output will be between 0 and 1:
Figure 1: Sigmoid Function
In order to decide when we would classify an output from the sigmoid function as either 1 or 0 we need
to set up a decision boundary. The decision boundary maps our probability score to a discrete class
(0 or 1). In figure 1 the decision boundary is illustrated as the dotted line. The decision boundary is the
line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function. In
general, the decision boundary is set with a threshold of 0.5, such that:
σ(z) ≥ 0.5 ⇒ ŷ = 1
σ(z) < 0.5 ⇒ ŷ = 0
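To make the squashing and the threshold concrete, here is a minimal NumPy sketch (the z-values are made up for illustration):
import numpy as np

def sigmoid(z):
    # squashes any real number into the interval (0, 1)
    return 1 / (1 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])   # illustrative linear outputs w^T x
probs = sigmoid(z)                          # roughly [0.02, 0.27, 0.50, 0.73, 0.98]
y_hat = (probs >= 0.5).astype(int)          # apply the 0.5 threshold -> [0, 0, 1, 1, 1]
print(probs, y_hat)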
Our hypothesis function, which is denoted by h_w, tells us the probability of a point belonging to the class "1":
h_w(x) = p(y = 1 | x, w)    (3)
Because we have a binary classification problem, it must also hold that:
p(y = 0 | x, w) = 1 - p(y = 1 | x, w)    (4)
It is worth mentioning that the input to the sigmoid function does not need to be linear. We could have w^T x = w_0 + w_1 x_1^2 + w_2 x_2^2, such that we separate our classes with a circle.
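As a small, hypothetical sketch of this idea, the weights below are chosen so that the unit circle x_1^2 + x_2^2 = 1 becomes the decision boundary:
import numpy as np

# hypothetical weights: w0 = -1, w1 = 1, w2 = 1 put the decision boundary
# on the unit circle x1^2 + x2^2 = 1
w = np.array([-1.0, 1.0, 1.0])

def predict_prob_circle(x1, x2):
    # z = w0 + w1*x1^2 + w2*x2^2, squashed by the sigmoid
    z = w[0] + w[1]*x1**2 + w[2]*x2**2
    return 1 / (1 + np.exp(-z))

print(predict_prob_circle(0.0, 0.0))   # inside the circle:  probability below 0.5
print(predict_prob_circle(2.0, 2.0))   # outside the circle: probability above 0.5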
1.1 Cost Function
As in linear regression and all other machine learning algorithms, we also have a cost function for logistic regression. The cost function is used to update the weights, which in turn change the decision boundary. In linear regression we used the Mean Squared Error (MSE) as our cost function. In logistic regression we use a cost function called Cross-Entropy. This cost function can be divided into two separate cost functions: one for y = 0 and one for y = 1:
J(w) = (1/m) Σ_{i=1}^{m} Cost(h_w(x^{(i)}), y^{(i)})    (5)
As for the linear regression model, we also need to be able to compute the cost of a single prediction for logistic regression:
Cost(h_w(x), y) = -log(h_w(x))        if y = 1
Cost(h_w(x), y) = -log(1 - h_w(x))    if y = 0    (6)
Here h_w(x) is the predicted value and y is the actual value. But why does this make sense? Let us illustrate the cost function:
Figure 2: Cost function
Note that writing the cost function in this way guarantees that J(w) is convex for logistic regression. When the actual value of an observation is 1 and our hypothesis function returns 0.7, the cost is low; however, if the hypothesis is 0.1, the cost is very high. This can be seen from figure 2. If the actual value is 0 and the hypothesis is 0.1, the cost is very low (the light blue graph in figure 2); however, a hypothesis of 0.9 gives a high cost. We can rewrite the cost from equation 6 as a single expression:
Cost(h_w(x), y) = -y · log(h_w(x)) - (1 - y) · log(1 - h_w(x))    (7)
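A quick numerical check (with an illustrative prediction of 0.7) confirms that equation 7 reduces to the two cases of equation 6:
import numpy as np

def cost(hw, y):
    # combined cross-entropy cost from equation 7
    return -y*np.log(hw) - (1 - y)*np.log(1 - hw)

print(cost(0.7, 1))   # y = 1: -log(0.7),     about 0.36 (low cost)
print(cost(0.7, 0))   # y = 0: -log(1 - 0.7), about 1.20 (high cost)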
Therefore, we can write our cost function from equation 5 as:
J(w) = (1/m) Σ_{i=1}^{m} Cost(h_w(x^{(i)}), y^{(i)}) = -(1/m) Σ_{i=1}^{m} [ y^{(i)} · log(h_w(x^{(i)})) + (1 - y^{(i)}) · log(1 - h_w(x^{(i)})) ]    (8)
However, as in the case of linear regression, it is easier to implement these algorithms in vectorized form. Therefore, we rewrite the cost function from equation 8 in vectorized form:
h_w = σ(X · w),    where X is m×(n+1), w is (n+1)×1 and h_w is m×1

J(w) = -(1/m) [ y^T · log(h_w) + (1 - y)^T · log(1 - h_w) ]    (9)

where y^T and (1 - y)^T are 1×m, log(h_w) and log(1 - h_w) are m×1, and J(w) is a scalar (1×1).
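A direct NumPy translation of equation 9 could look like the following sketch, where X_b is assumed to already contain the intercept column of ones:
import numpy as np

def cross_entropy(w, X_b, y):
    # vectorized cross-entropy cost from equation 9
    # X_b is m x (n+1) and already contains the intercept column of ones
    m = len(y)
    hw = 1 / (1 + np.exp(-np.dot(X_b, w)))
    return -1/m*(np.dot(y.T, np.log(hw)) + np.dot((1 - y).T, np.log(1 - hw)))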
1.2 Gradient Descent
For the logistic regression algorithm we will again use gradient descent as our optimization algorithm. In practice, other optimization algorithms are often used, notably an algorithm called Adam, which we will cover later in this series. For now we will stick with gradient descent as our optimizer. Our problem is the same as in the case of linear regression:
min_w J(w)
We will minimize our cost by simultaneously updating the weights using gradient descent:
w_j := w_j - α · ∂J(w)/∂w_j = w_j - (α/m) Σ_{i=1}^{m} (h_w(x^{(i)}) - y^{(i)}) · x_j^{(i)}    (10)
A vectorized version of this update can be written as:
w := w - (α/m) · X^T (h_w - y)    (11)
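As a minimal sketch, a single vectorized update step from equation 11 could be written like this (again assuming an X_b that already contains the intercept column):
import numpy as np

def gradient_step(w, X_b, y, alpha):
    # one gradient descent step from equation 11
    m = len(y)
    hw = 1 / (1 + np.exp(-np.dot(X_b, w)))        # current predictions h_w
    return w - alpha/m*np.dot(X_b.T, (hw - y))    # simultaneous update of all weights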
As the sketch above suggests, the gradient descent algorithm can easily be implemented in Python. Before the full implementation, let us inspect the case of multi-class classification.
2 Multi-Class Logistic Regression
In the last section we saw how one could use the logistic regression algorithm for binary classification.
However, it is not always the case that we only have two different classes; we might just as well have many. But how does the algorithm work if we have more than two classes?
The logistic regression algorithm handles multi-class problems in a straightforward way, using a method called one-vs-all. One-vs-all turns a multi-class classification problem into a set of binary classification problems: the algorithm looks at "one" class versus all the others as a binary problem. The concept of one-vs-all is illustrated below in figure 3:
Figure 3: One-Vs-All
Since y ∈ {0, 1, . . . , n}, we divide our problem into n + 1 binary classification problems. In each binary problem we predict the probability that y belongs to one particular class:
h_w^{(0)}(X) = P(y = 0 | X, w)
h_w^{(1)}(X) = P(y = 1 | X, w)
. . .
h_w^{(n)}(X) = P(y = n | X, w)    (12)
We find our prediction as:
ŷ = argmax_i h_w^{(i)}(X)    (13)
We are choosing one class and then lumping all the others into a single second class. We do this repeatedly,
applying binary logistic regression to each case, and then use the hypothesis that returned the highest
value as our prediction.
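A rough sketch of one-vs-all could look like the following, assuming a binary classifier object with fit and predict_prob methods, such as the class we build from scratch in section 3:
import numpy as np

def one_vs_all_fit(X, y, classes, make_classifier):
    # train one binary classifier per class: class k vs. all other classes
    models = {}
    for k in classes:
        y_binary = (y == k).astype(int)
        model = make_classifier()
        model.fit(X, y_binary)
        models[k] = model
    return models

def one_vs_all_predict(models, X):
    # pick the class whose classifier returns the highest probability (equation 13)
    classes = sorted(models)
    probs = np.column_stack([models[k].predict_prob(X) for k in classes])
    return np.array(classes)[np.argmax(probs, axis=1)]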
3 Implementation
For our implementation of logistic regression we will first implement the algorithm from scratch. I like to use object-oriented programming when coding these algorithms from scratch; if you are not familiar with the concepts of object-oriented programming, I recommend reading this article before trying to understand the code implementation. Afterwards we will see how logistic regression can be implemented using the Python machine learning library Scikit-learn.
3.1 Implementation - From Scratch
For implementing logistic regression without fancy machine learning packages we are going to use the following libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Let us import the dataset. We are looking at a binary classification problem.
data = pd.read_csv('ex2data1.txt', header = None)
X = np.array(data.iloc[:, [0,1]])
y = np.array(data.iloc[:, -1])
print('dim of X:', X.shape)
print('dim of y:', y.shape)
We are now ready to implement our Logistic Regression class:
class LogisticRegression:
    def __init__(self, alpha = 0.01, iterations = 100000):
        self.alpha = alpha
        self.iterations = iterations
Within our LogisticRegression class we will now create functions in order to perform logistic regression
on a binary dataset:
    def add_intercept(self, X):
        # add a column of ones to X for the intercept term w_0
        ones = np.ones((X.shape[0], 1))
        return np.concatenate((ones, X), axis = 1)

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def cost(self, hw, y):
        # cross-entropy cost from equation 9
        m = len(y)
        return 1/m*(-np.dot(y.T, np.log(hw)) - np.dot((1 - y).T, np.log(1 - hw)))

    def fit(self, X, y):
        X = self.add_intercept(X)
        # initialize weights
        self.w = np.zeros(X.shape[1])
        # gradient descent
        cost_iterations = []
        for i in range(self.iterations):
            z = np.dot(X, self.w)
            hw = self.sigmoid(z)
            m = len(y)
            gradient = 1/m*np.dot(X.T, (hw - y))
            self.w -= self.alpha*gradient
            cost_iterations.append(self.cost(hw, y))
        self.final_cost = self.cost(hw, y)

    def predict_prob(self, X):
        X = self.add_intercept(X)
        return self.sigmoid(np.dot(X, self.w))

    def predict(self, X):
        # classify with the 0.5 decision boundary
        return self.predict_prob(X).round()

    def plot_scatter(self, X, y):
        plt.figure(figsize=(19.20, 10.80))
        plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='#06b2d6', marker='o', label='0')
        plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='#0085A1', marker='o', label='1')
        plt.legend()
        x1_min, x1_max = X[:, 0].min(), X[:, 0].max()
        x2_min, x2_max = X[:, 1].min(), X[:, 1].max()
        xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max), np.linspace(x2_min, x2_max))
        grid = np.c_[xx1.ravel(), xx2.ravel()]
        probs = self.predict_prob(grid).reshape(xx1.shape)
        # draw the decision boundary where the predicted probability equals 0.5
        plt.contour(xx1, xx2, probs, [0.5], linewidths=1, colors='black')
        plt.xlim([x1_min - x1_min*0.05, x1_max + x1_max*0.02])
        plt.ylim([x2_min - x2_min*0.05, x2_max + x2_max*0.02])
This is all the code we need in order to implement logistic regression from scratch. Let us test it on our dataset:
LR = LogisticRegression(alpha = 0.0016)
LR.fit(X, y)
print('min cost:', LR.final_cost)
min cost: 0.3181404335967254
LR.plot_scatter(X, y)
LR.w
array([-6.87627765, 0.06105823, 0.05445471])
LR.final_cost
0.3181404335967254
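With the fitted weights we can also classify new observations; the feature values below are made up purely for illustration:
# hypothetical new observations (two made-up feature values each)
X_new = np.array([[45.0, 85.0], [30.0, 40.0]])
print(LR.predict_prob(X_new))   # predicted probabilities of belonging to class 1
print(LR.predict(X_new))        # predicted class labels (0 or 1)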
3.2 Implementation - Scikit-Learn
We will import the following libraries:
from sklearn.linear_model import LogisticRegression
As usual, the Scikit-learn implementation is straightforward:
model = LogisticRegression()
model.fit(X, y)
print(model.intercept_)
print(model.coef_)
[-25.05219314]
[[0.20535491 0.2005838 ]]
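To compare the two models we could, for example, look at their accuracy on the training data (a quick sketch; the exact numbers depend on the dataset and solver settings):
from sklearn.metrics import accuracy_score

print(model.score(X, y))                 # training accuracy of the Scikit-learn model
print(accuracy_score(y, LR.predict(X)))  # training accuracy of our from-scratch model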
What we can see is that we get noticeably different weights from our Scikit-learn implementation, and this model fits the data better than the one we implemented from scratch. This is because Scikit-learn does not use plain gradient descent as its optimization algorithm; as I mentioned earlier, there exist more efficient optimizers than gradient descent. Since the cross-entropy cost for logistic regression is convex, the gap is not caused by the local minima we discussed in the linear regression paper; the most likely explanation is simply that our gradient descent, with its small learning rate, has not yet fully converged to the minimum of the cost function.