Machine learning uses many types of algorithms. One of the most popular and commonly used is logistic regression. This blog covers the key concepts behind logistic regression to help you better understand the subject and become a better machine learning practitioner. Let’s get started.
Logistic regression definition: Logistic regression is a type of supervised machine learning used to predict the probability of a target variable. It is used to estimate the relationship between a dependent (target) variable and one or more independent variables. The output of the dependent variable is represented in discrete values such as 0 and 1.
Here’s a look at the math behind logistic regression. The logistic regression equation can be represented as:
logit(p) = ln(p / (1 - p)) = b0 + b1x1 + b2x2 + ... + bkxk
p = probability that the event of interest occurs (that is, the target variable equals 1)
x1, x2, ..., xk = the input features
b0, b1, ..., bk = the intercept (b0) and the coefficients to be estimated in the logistic regression formula
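To make the equation concrete, here is a minimal Python sketch that computes the log-odds for one observation and converts it back to a probability. The intercept, coefficients, and feature values are made-up numbers for illustration, not estimates from any real dataset, and the function names are hypothetical.

```python
import math

def log_odds(features, intercept, coefficients):
    """Linear part of the model: b0 + b1*x1 + ... + bk*xk, which equals logit(p)."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))

def probability_from_log_odds(z):
    """Invert logit(p) = z to recover p = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

# Hypothetical parameters and inputs (assumptions for illustration only)
b0 = -1.0            # intercept
b = [0.5, 2.0]       # coefficients b1, b2
x = [2.0, 0.25]      # features x1, x2

z = log_odds(x, b0, b)               # -1.0 + 0.5*2.0 + 2.0*0.25 = 0.5
p = probability_from_log_odds(z)     # a probability strictly between 0 and 1
print(z, p, logit(p))                # logit(p) recovers z up to rounding
```

Applying `logit` to the resulting probability returns the original linear score, which is exactly what the equation above states.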
Logistic regression models are generally used for predictive analysis for binary classification of data. However, they can also be used for multi-class classification. Logistic regression models can be classified into three main logistic regression analysis categories. They are:
The binary logistic regression model is one of the most widely used logistic regression models. It is used to predict and categorize data into one of two classes. For example, a patient either has cancerous cells or does not. The data can’t belong to both categories at the same time.
The multinomial logistic regression model is used to classify the target variable into three or more classes that have no natural ordering. For instance, the type of food an individual is likely to order can be predicted from their diet preference: vegetarian, non-vegetarian, or vegan.
The ordinal logistic regression model is used to classify the target variable into three or more classes that have a natural order. For example, a pupil’s performance in an examination can be classified as poor, good, or excellent, in hierarchical order. Thus, the data is not only classified into three distinct categories, but each category also has a rank relative to the others.
The logistic regression algorithm can be used in a plethora of cases such as tumor classification, spam detection, and sex categorization, to name a few. Let’s have a look at some logistic regression examples to get a better idea.
Example 1: Identifying spam emails
To detect whether an email is spam (1) or not (0), various attributes of the email are extracted and analyzed.
The extracted data is fed into a logistic regression algorithm, which analyzes the data and outputs a score between 0 and 1. If the score is 0.5 or higher, the email is classified as spam. If the score is below 0.5, it is marked as non-spam.
Example 2: Credit card fraud
Banks can employ a logistic regression-based machine learning program to identify fraudulent online credit card transactions. Details such as the point of sale, card number, transaction value, and transaction date are fed into the algorithm, which then determines whether a particular transaction is genuine (0) or fraudulent (1). For instance, if the purchase value is unusually high and deviates from the usual values, the regression model assigns a score between 0.5 and 1 and classifies the transaction as fraud.
The Sigmoid function (also called the logistic function) is used to map the model’s predicted values to probabilities. When plotted on a graph, the Sigmoid function forms an ‘S’-shaped curve whose values always lie between 0 and 1. The curve flattens toward 0 at the bottom of the Y-axis and toward 1 at the top. Based on where a predicted value falls, the target variable can be assigned to one of the two classes.
The equation for the Sigmoid function is given as:
y = 1 / (1 + e^(-x))
e = Euler’s number, the base of the natural logarithm, with a value of approximately 2.718.
This equation gives a value of y (the predicted value) close to zero if x is a large negative value. Similarly, if x is a large positive value, y is close to one.
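As a quick numerical check of this behaviour, here is a minimal Python sketch of the function in its standard form, y = 1 / (1 + e^(-x)). The function name `sigmoid` and the sample inputs are illustrative choices. The two-branch form is one common way to avoid overflow for large negative inputs; it is not the only way to write it.

```python
import math

def sigmoid(x):
    """Standard logistic function, written to avoid overflow in math.exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x)
    z = math.exp(x)
    return z / (1.0 + z)

# Large negative inputs approach 0, large positive inputs approach 1
for value in (-10, -1, 0, 1, 10):
    print(value, sigmoid(value))
```

At x = 0 the function returns exactly 0.5, which is why 0.5 is a natural default cut-off between the two classes.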
A decision boundary can be set to predict the class to which the data belongs. Based on the set value, the estimated values can be classified into classes.
For instance, take the example of classifying emails as spam or not with a decision boundary of 0.5. If the predicted value (p) is 0.5 or higher, the email is classified as spam; if it is less than 0.5, it is classified as not spam.
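The decision boundary can be sketched in a few lines of Python. This follows the convention from the spam example earlier, where scores of 0.5 and above mean spam (1) and lower scores mean not spam (0). The function name `classify` and the scores are hypothetical.

```python
def classify(probabilities, threshold=0.5):
    """Map predicted probabilities to labels: 1 (spam) at or above the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]

scores = [0.05, 0.49, 0.5, 0.93]            # hypothetical model outputs
labels = classify(scores)                   # default boundary of 0.5
strict = classify(scores, threshold=0.9)    # a stricter boundary flags fewer emails
print(labels, strict)
```

Raising the threshold trades missed spam for fewer false alarms, so the right value depends on which mistake is more costly.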
Let’s take the Social Network Ads dataset to carry out logistic regression analysis and predict whether an individual will purchase a car or not.
First, we need to import the libraries that we will use to build our logistic regression model. We’ll use the pandas library to load the CSV dataset, NumPy to work with the data as arrays, and Matplotlib for plotting.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Social_Network_Ads.csv')
For our example, we will consider the purchased value as the dependent variable and the Age and Estimated Salary of the individuals as the independent variables.
X = dataset.iloc[:, [2,3]].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 0)
For our example, we have defined the test size as 0.33. This means about 33% of the dataset will be used as the test set, while the remaining 67% will be used for training.
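To illustrate what a split like this does, here is a rough pure-Python sketch. It is not scikit-learn’s implementation. The function name `simple_train_test_split`, the rounding rule, and the 100-row stand-in dataset are assumptions made for the example.

```python
import math
import random

def simple_train_test_split(rows, test_size=0.33, seed=0):
    """Shuffle a copy of rows and hold out a test_size fraction (rounded up)."""
    rng = random.Random(seed)          # a fixed seed makes the split repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

rows = list(range(100))                # hypothetical stand-in for 100 records
train, test = simple_train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed plays the same role as `random_state = 0` in the call above: the same rows land in the test set on every run, so results are comparable between experiments.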
To get better accuracy from our model, we need to rescale the data so that features with very different ranges, such as Age and Estimated Salary, are brought onto a comparable scale.
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
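The idea behind this step can be sketched without scikit-learn. Standardization subtracts the mean and divides by the standard deviation, and the statistics learned from the training data are reused for the test data. The helper names and the age values here are hypothetical, and a single feature column is used for simplicity.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return mean(column), pstdev(column)

def transform(column, mu, sigma):
    """Rescale values to z-scores using previously learned statistics."""
    return [(value - mu) / sigma for value in column]

train_age = [20.0, 30.0, 40.0, 50.0]   # hypothetical training values
test_age = [35.0, 60.0]                # hypothetical test values

mu, sigma = fit_scaler(train_age)                 # fit on the training data only
train_scaled = transform(train_age, mu, sigma)    # mean 0, standard deviation 1
test_scaled = transform(test_age, mu, sigma)      # reuses the training statistics
print(train_scaled, test_scaled)
```

This mirrors why the code above calls `fit_transform` on the training set but only `transform` on the test set: the test data should not influence the scaling parameters.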
Now, we need to build the logistic regression model and fit it to the training data set. First, we will need to import the logistic regression algorithm from Sklearn.
from sklearn.linear_model import LogisticRegression
Next, we need to create an instance classifier and fit it to the training data.
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
Next, we need to create predictions on the test dataset.
y_pred = classifier.predict(X_test)
Lastly, we can check the performance of our model by using the confusion matrix and the accuracy score.
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
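To show what these two metrics contain, here is a small pure-Python sketch that counts the four outcomes for a binary problem and derives accuracy from them. It uses rows for actual classes and columns for predicted classes, the layout scikit-learn documents for `confusion_matrix`. The label lists are invented for illustration and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1   # row = actual class, column = predicted class
    return matrix

def accuracy(matrix):
    """Share of predictions on the diagonal (correct) out of all predictions."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

actual_labels = [0, 0, 0, 1, 1, 1, 0, 1]      # hypothetical true labels
predicted_labels = [0, 0, 1, 1, 0, 1, 0, 1]   # hypothetical model predictions

counts = confusion_counts(actual_labels, predicted_labels)
print(counts, accuracy(counts))
```

The off-diagonal cells show which kind of mistake the model makes (false positives versus false negatives), which a single accuracy number hides.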
We can now use Matplotlib to plot our dataset and visualize the training set results.
We hope that this blog helped answer your doubts regarding logistic regression. If you’re interested in learning more about logistic regression and machine learning, you can consider our guaranteed placement Postgraduate Diploma in Data Science.