Boosting vs Random Forest Classifiers

11 Sep 2017

Classification problems are common in domains such as finance and telecommunications, for example, predicting customer churn for a telecom operator. The choice of method, whether a classical statistical technique or a machine learning algorithm, depends on business priorities. Each comes with certain advantages and disadvantages. In this blog, I will examine two machine learning algorithms: boosting and random forest.

Boosting

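In boosting, classifiers are trained one after another on reweighted versions of the training data. Some implementations realize the weights by drawing n bootstrapped samples (sampling with replacement) in which heavily weighted observations are more likely to appear. Boosting works iteratively on weak learners, which have high bias and low variance, and gives more weight to misclassified observations in the next iteration. The final classification combines the predictions of all the classifiers, typically by a weighted vote. Boosting generally uses a decision tree as the base model. However, linear regression or logistic regression can be used as a base model too. The trees are kept small and are not pruned; trees with 2-8 leaves work well. The process flow of a common boosting method, AdaBoost, is as follows: every observation starts with the same weight; a weak learner is fitted to the weighted data; the learner receives a vote that grows as its weighted error shrinks; the weights of the observations it misclassified are increased; and the cycle repeats for a fixed number of rounds before the learners are combined by weighted vote.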
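The AdaBoost loop can be sketched in plain Python. This is only an illustration under stated assumptions, not the code behind the comparison in this post: it assumes a two-class problem with labels coded as +1 and -1, uses one-split trees (stumps) as the weak learner, and the function names fit_stump, stump_predict, adaboost and predict are made up for the example.

```python
import math

def fit_stump(X, y, w):
    """Exhaustively search for the one-split tree with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):                      # every predictor
        for t in sorted({row[j] for row in X}):     # every observed value as threshold
            for polarity in (1, -1):                # which side predicts +1
                pred = [polarity if row[j] <= t else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, polarity)
    return best

def stump_predict(stump, row):
    _, j, t, polarity = stump
    return polarity if row[j] <= t else -polarity

def adaboost(X, y, rounds=10):
    n = len(X)
    w = [1.0 / n] * n                               # start with equal weights
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)     # vote of this learner
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified observations, lower the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]                # renormalize to sum to 1
    return ensemble

def predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy data (made up): the -1 class sits in the middle, so one split cannot isolate it.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(X, y, rounds=3)
print([predict(ensemble, row) for row in X])
```

On this toy data no single stump classifies more than four of the six points correctly, while the weighted vote of three boosted stumps fits all six. That is the sense in which boosting turns high-bias learners into a stronger classifier.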

Random forest

Random forest is a bagged model: this ensemble method trains decorrelated classifiers on bootstrapped samples. In other words, the model learns from many fully grown trees, and the final decision is made by majority vote. In this method, a random subset of the predictors is also sampled at each node. It works best with overfitted base models that have low bias and high variance.
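The three ingredients named above (bootstrapped samples, predictors sampled at each node, majority vote) can be put together in a short plain-Python sketch. It is a simplified illustration, not a reference implementation: the Gini criterion, the mtry parameter name, the function names and the toy data are all assumptions made for the example.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Best (feature, threshold) among the sampled features, or None if none splits the rows."""
    best, best_score = None, None
    for j in features:
        values = sorted({r[j] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0                     # midpoint between neighbouring values
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best_score is None or score < best_score:
                best, best_score = (j, t), score
    return best

def grow_tree(rows, labels, mtry, rng):
    """Grow an unpruned tree; leaves are labels, inner nodes are tuples."""
    if len(set(labels)) == 1:
        return labels[0]                            # pure leaf
    features = rng.sample(range(len(rows[0])), mtry)  # predictors sampled at this node
    split = best_split(rows, labels, features)
    if split is None:
        return Counter(labels).most_common(1)[0][0]   # fall back to a majority leaf
    j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return (j, t,
            grow_tree([r for r, _ in left], [y for _, y in left], mtry, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], mtry, rng))

def tree_predict(node, row):
    while isinstance(node, tuple):                  # assumes labels are not tuples
        j, t, left, right = node
        node = left if row[j] <= t else right
    return node

def random_forest(rows, labels, n_trees=25, mtry=1, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample with replacement
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], mtry, rng))
    return forest

def forest_predict(forest, row):
    votes = Counter(tree_predict(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]               # majority vote

# Toy data (made up): two well separated groups described by two predictors.
rows = [[1.0, 1.2], [1.5, 0.8], [2.0, 1.0], [1.2, 1.9],
        [6.0, 6.5], [6.5, 5.8], [7.0, 6.2], [5.8, 7.1]]
labels = ["A"] * 4 + ["B"] * 4
forest = random_forest(rows, labels)
print([forest_predict(forest, r) for r in rows])
```

Because each tree sees a different bootstrap sample and a different subset of predictors at every node, the trees disagree on some points individually, and the vote averages out much of their variance.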

I used the fgl dataset that ships with R to compare these two methods. Its predictors are the refractive index (RI) and the oxide content of eight chemical elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The target variable is the type of forensic glass: WinF, WinNF, Veh, Con, Tabl, or Head.

In short, on this data set the random forest was not only more accurate than the boosting model but also more explainable, because it reports the importance of each predictor. See the importance of predictors as given by the random forest. Boosting is often used for data sets that have a high number of dimensions.