Logistic Regression: A Simple Beginner’s Guide in 4 Steps

Introduction

There are multiple types of algorithm methods used in machine learning. One such popular and commonly used machine learning method is logistic regression. This blog covers the various concepts related to logistic regression to help you better understand the subject and become a better machine learning practitioner. Let’s get started.

  1. What is Logistic Regression?
  2. What Are the Types of Logistic Regression?
  3. How Does Logistic Regression Work?
  4. How to Build a Logistic Regression Model in Python?

1. What is Logistic Regression?

Logistic regression definition: Logistic regression is a type of supervised machine learning used to predict the probability of a target variable. It is used to estimate the relationship between a dependent (target) variable and one or more independent variables. The output of the dependent variable is represented in discrete values such as 0 and 1. 

Here’s a look at the math behind logistic regression. The logistic regression equation can be represented as-

  • Equation for logistic regression:

logit(p) = ln(p/(1-p)) = b0+b1X1+b2X2+b3X3….+bkXk

Where;

p= probability of the occurrence of the feature

x1,x2,..xk= set of input features of x

b1,b2…bk= parameter values to be estimated in the logistic regression formula

2. What Are the Types of Logistic Regression?

Logistic regression models are generally used for predictive analysis for binary classification of data. However, they can also be used for multi-class classification. Logistic regression models can be classified into three main logistic regression analysis categories. They are:

  • Binary Logistic Regression Model

This is one of the most widely-used logistic regression models, used to predict and categorize data into either of the two classes. For example, a patient can have cancerous cells, or they cannot. The data can’t belong to two categories at the same time.

  • Multinomial Logistic Regression Model

The multinomial logistic regression model is used to classify the target variable into multiple classes, irrespective of any quantitative significance. For instance, the type of food an individual is likely to order based on their diet preferences – vegetarians, non-vegetarians, and vegan.

  • Ordinal Logistic Regression Model

The ordinal logistic regression model is used to classify the target variable into classes and also in order. For example, a pupil’s performance in an examination can be classified as poor, good, and excellent in a hierarchical order. Thus, we can see that the data is not only classified into three distinct categories, but each category has a unique level of importance.

The logistic regression algorithm can be used in a plethora of cases such as tumor classification, spam detection, and sex categorization, to name a few. Let’s have a look at some logistic regression examples to get a better idea.

Example 1: Identifying spam emails

To detect whether an email is a spam(1) or not(0), various attributes of the email are extracted and analyzed such as;

  • The sender
  • Keywords in the email such as “winner,” “congratulations,” “bank details.”
  • Spelling errors
  • Contact information or URL in the email

The extracted data is fed into a logistic regression algorithm, which analyzes the data and then outputs a score between 0 and 1. If the score lies in the range of 0.5 to 1, then the email is classified as spam. Similarly, if the score lies between 0 to 0.5, it is marked non-spam.

Example 2: Credit card fraud

Banks can employ a logistic regression-based machine learning program to identify fraud online credit card transactions. Details such as the point-of-sale, card number, transaction value, and the date of transaction are fed into the algorithm, which then determines whether a particular transaction is genuine(0) or fraud(1). For instance, if the purchase value is too high and deviates from usual values, the regression model assigns a value (between 0.5 and 1) classifies the transaction as fraud.

3. How Does Logistic Regression Work?

The Sigmoid function (logistic regression model) is used to map the predicted predictions to probabilities. The Sigmoid function represents an ‘S’ shaped curve when plotted on a map. The graph plots the predicted values between 0 and 1. The values are then plotted towards the margins at the top and the bottom of the Y-axis, with the labels as 0 and 1. Based on these values, the target variable can be classified in either of the classes.

The equation for the Sigmoid function is given as:

y=1/(1+e^x), where

e^x= the exponential constant with a value of 2.718.

This equation gives the value of y(predicted value) close to zero if x is a considerable negative value. Similarly, if the value of x is a large positive value, the value of y is predicted close to one.

A decision boundary can be set to predict the class to which the data belongs. Based on the set value, the estimated values can be classified into classes. 

For instance, let us take the example of classifying emails as spam or not. If the predicted value(p) is less than 0.5, then the email is classified spam and vice versa. 

4. How to Build a Logistic Regression Model in Python?

Let’s take the Social Network Ads dataset to carry out logit regression analysis and predict whether an individual will purchase a car or not.

  • Importing the libraries

First, we need to import the libraries that we will use to build our logical regression model. We’ll use the Pandas library to load in the CSV or the dataset, and Numpy to convert the data frame into arrays.

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

dataset = pd.read_csv(‘Social_Network_Ads.csv’)

  • Now, we need to split the data into dependent and independent variables.

For our example, we will consider the purchased value as the dependent variable and the Age and Estimated Salary of the individuals as the independent variables.

X = dataset.iloc[:, [2,3]].values

y = dataset.iloc[:, 4].values

  • Now that we have defined the target variable(Y) and the independent variables, we need to split the data set into the training set and the test set. We will use the training set to train our logistic regression algorithm. Similarly, the test data set will be used to validate the logistic regression model. To split the data into two sets, we will use Sklearn. The train_split_function can be used and we can specify the amount of data we want to set aside for training and testing. 

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 0)

For our example, we have defined the test size as 0.33. This means 33% of the data set will be used as a test data set while the rest 66% will be used for training.

  • Scaling

To get better accuracy for our model, we need to rescale the data to bring value that may have extremely varying values into alignment with one another.

from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()

X_train = sc_X.fit_transform(X_train)

X_test = sc_X.transform(X_test)

  • Building the logistic regression model

Now, we need to build the logistic regression model and fit it to the training data set. First, we will need to import the logistic regression algorithm from Sklearn.

from sklearn.linear_model import LogisticRegression

Next, we need to create an instance classifier and fit it to the training data.

classifier = LogisticRegression(random_state=0)

classifier.fit(X_train, y_train)

Next, we need to create predictions on the test dataset.

y_pred = classifier.predict(X_test)

Lastly, we can check the performance of our model by using the Confusion matrix.

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

acc = accuracy_score(y_test, y_pred)

print(acc)

print(cm)

We can now use the matplotlib to plot our dataset and visualize the training set result.

Conclusion

We hope that this blog helped answer your doubts regarding logistic regression. If you’re interested in learning more about logistic regression and machine learning, you can consider our guaranteed placement Postgraduate Diploma in Data Science.

Also Read

Related Articles

loader
Please wait while your application is being created.
Request Callback