Let us look into building Random Forest models in Python. Random Forest is a supervised, flexible, and easy-to-use learning algorithm based on Ensemble Learning. Ensemble Learning is a Machine Learning method that combines several instances of the same or different algorithms to form a more powerful prediction model; in a Random Forest, the ensemble consists of multiple Decision Trees. Random Forest builds Decision Trees on randomly selected data samples, gets a prediction from each tree, and chooses the best answer by voting. The more trees a Random Forest has, the more robust it is. Random Forest can be used for both Classification and Regression.
In this article, we will see how the Random Forest algorithm works, what its drawbacks are, and how to build a Random Forest classifier in Python.
The Random Forest algorithm works in the following way:
1. Pick N random samples (with replacement) from the dataset.
2. Build a Decision Tree on these samples.
3. Repeat steps 1 and 2 for as many trees as you want in your forest.
4. For classification, each tree votes for a class and the class with the most votes wins; for regression, the predictions of all trees are averaged.
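The final voting step can be sketched in a few lines of plain Python. The `majority_vote` helper below is illustrative, not part of any library:

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Aggregate per-tree class predictions by majority vote,
    which is how a Random Forest classifier picks its final label."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Three hypothetical trees vote on a single sample:
print(majority_vote([1, 0, 1]))  # -> 1
```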
Like any other algorithm, Random Forest also has a few drawbacks: a large number of trees makes predictions slow, which can be a problem for real-time applications, and an ensemble of many trees is harder to interpret than a single Decision Tree.
For example, consider predicting whether a bank currency note is authentic or not based on four attributes: the variance, skewness, kurtosis, and entropy of the wavelet-transformed image of the note.
This task is a binary classification problem, and we will solve it with a Random Forest classifier in Python. The steps are as follows:
The dataset for this task can be downloaded from the following link:
Import the dataset using the following code.
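A minimal sketch of the import step, assuming the downloaded file is named `bill_authentication.csv` (the file name is an assumption). For illustration, a tiny in-memory stand-in with the same four attributes and label is used so the snippet runs on its own:

```python
import io
import pandas as pd

# Tiny in-memory stand-in with the banknote dataset's columns.
csv_data = io.StringIO(
    "Variance,Skewness,Kurtosis,Entropy,Class\n"
    "3.62,8.66,-2.81,-0.45,0\n"
    "-1.39,-4.88,6.48,0.34,1\n"
    "4.55,8.16,-2.46,-1.46,0\n"
)
dataset = pd.read_csv(csv_data)
# Real usage (assumed file name):
# dataset = pd.read_csv("bill_authentication.csv")
print(dataset.shape)
```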
For getting a high-level view of the dataset, execute the following command:
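A sketch of the inspection step using pandas, shown on a small stand-in DataFrame with the dataset's columns:

```python
import pandas as pd

# Stand-in DataFrame with the same columns as the banknote dataset.
dataset = pd.DataFrame({
    "Variance": [3.62, -1.39, 4.55],
    "Skewness": [8.66, -4.88, 8.16],
    "Kurtosis": [-2.81, 6.48, -2.46],
    "Entropy": [-0.45, 0.34, -1.46],
    "Class": [0, 1, 0],
})
print(dataset.head())      # first rows of the dataset
print(dataset.describe())  # count, mean, std, min/max per column
```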
Use the following code to divide data into attributes and labels:
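A sketch of the split into attributes (X) and labels (y), again on a small stand-in DataFrame so it runs on its own; in the tutorial `dataset` would come from the CSV file:

```python
import pandas as pd

dataset = pd.DataFrame({
    "Variance": [3.62, -1.39, 4.55],
    "Skewness": [8.66, -4.88, 8.16],
    "Kurtosis": [-2.81, 6.48, -2.46],
    "Entropy": [-0.45, 0.34, -1.46],
    "Class": [0, 1, 0],
})

# X: the four attribute columns; y: the Class label column.
X = dataset.drop("Class", axis=1).values
y = dataset["Class"].values
print(X.shape, y.shape)
```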
Use the following code to divide data into training and testing sets:
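A sketch of the train/test split with scikit-learn's `train_test_split`, using an 80/20 split (the split ratio is an assumption) on synthetic stand-in arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the four attributes and binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# Hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```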
Use the following code for feature scaling:
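A sketch of feature scaling with `StandardScaler`. The scaler is fitted on the training data only and then reused on the test data, so no test-set information leaks into the scaling statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the already-split attribute arrays.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(80, 4))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 4))

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit on training data only
X_test = sc.transform(X_test)        # reuse the training statistics
```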
After scaling the dataset, we train our Random Forest classifier to solve this classification problem. To do so, execute the following code.
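A sketch of the training step with `RandomForestClassifier` and 20 trees (matching the `n_estimators` value discussed later). Synthetic data stands in for the scaled banknote features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the banknote data: 4 features, binary label.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train a forest of 20 Decision Trees.
classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train)

# Predict labels for the held-out test set.
y_pred = classifier.predict(X_test)
```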
For a classification problem solved with a Random Forest classifier in Python, the metrics used to evaluate the algorithm are accuracy, the confusion matrix, precision, recall, and the F1 score. Use the following script to find these values:
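A sketch of the evaluation step with scikit-learn's metrics. Small hand-made label arrays stand in for the true and predicted labels that the trained classifier would produce:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Stand-in labels; in the tutorial these come from the test split
# and the trained classifier's predictions.
y_test = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0])

print(confusion_matrix(y_test, y_pred))        # per-class error counts
print(classification_report(y_test, y_pred))   # precision, recall, F1
print(accuracy_score(y_test, y_pred))          # overall accuracy
```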
The output will look like this:
The Random Forest classifier with 20 trees achieves an accuracy of 98.90%, which is good for this task. Changing the number of estimators did not significantly improve the results, as the following chart shows: the X-axis is the number of estimators and the Y-axis is the accuracy.
In this article, we have demonstrated how a Random Forest in Python is built, how it works, its advantages, and its disadvantages. Random Forests have various applications, such as Recommendation Engines, Image Classification, and Feature Selection.
If you wish to learn more about Data Science tools and Machine Learning algorithms, take a look at our 11-month in-person Postgraduate Diploma in Data Science (PGD-DS). You can learn more about this placement-guaranteed program by clicking here.