A decision tree is a predictive model that, as its name implies, can be viewed as a tree. Predictions are made on the basis of a series of decisions, much like the game of 20 questions. For instance, if a loan company wants to create a set of rules to identify potential defaulters, the resulting decision tree may look something like this.
We started the analysis with historical data on 10,000 customers, 500 of whom were defaulters, i.e. a default rate of 5%. We asked a series of questions and, on the basis of the responses, broke the population into smaller groups. In the end we identified 3 smaller groups with default rates of 50% or above.
In this data, there are 200 customers whose income is not greater than 1.2 times their loan instalment. 150 of these customers defaulted, i.e. 75% of this group, compared with 5% of the population overall. This suggests that a customer whose income is not greater than 1.2 times the instalment amount is far more likely than average to default on the loan. The bank can reject such loan applications in the future to reduce its default rate.
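The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch using the counts quoted above; the helper name default_rate and the "lift" measure are illustrative additions, not part of the original analysis.

```python
def default_rate(defaulters, customers):
    """Share of customers in a group who defaulted."""
    return defaulters / customers

# Counts quoted in the text.
overall = default_rate(500, 10_000)   # whole population
low_income = default_rate(150, 200)   # income <= 1.2 x instalment

# Everyone outside the low-income segment, obtained by subtraction.
remaining = default_rate(500 - 150, 10_000 - 200)

# Lift: how many times riskier the segment is than the average customer.
lift = low_income / overall

print(f"overall={overall:.1%} segment={low_income:.1%} "
      f"rest={remaining:.2%} lift={lift:.0f}x")
```

Removing the 200 customers in this segment takes 150 of the 500 defaulters out of the book, which is why a single question can lower the default rate of the remaining population.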
Decision tree algorithms
The most crucial step in creating a decision tree is to identify the right split, i.e. the right question to ask at each stage. Most algorithms look at all possible split options and pick the most effective one according to one of several commonly used measures. The split will result in 2 or more child nodes from the parent node. The same process is then repeated at each child node: all possible options are evaluated to identify the best split as per a pre-set criterion.
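The exhaustive search described above can be sketched as follows. This is a minimal illustration, not any particular algorithm's implementation: it assumes numeric features, binary splits only, and Gini impurity as the pre-set criterion. The toy data and the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means a pure node)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Try every (feature, threshold) pair and return the one giving the
    lowest weighted Gini impurity across the two child nodes."""
    best = None  # (weighted_impurity, feature_index, threshold)
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue  # every row fell on one side: not a real split
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

# Hypothetical toy data: [income / instalment ratio, age]
rows = [[1.0, 25], [1.1, 40], [1.2, 33], [1.5, 51], [2.0, 29], [3.0, 45]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = defaulted

score, feature, threshold = best_split(rows, labels)
print(f"split on feature {feature} at <= {threshold} (impurity {score})")
```

A full tree builder would call best_split again on the rows in each child node and stop when a node is pure or too small to split further.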
Although there are many decision tree algorithms, they basically work on 2 principles. One kind works on increasing the purity of the resulting nodes, while the other works on maximising the statistically significant difference from the parent node. The 3 most commonly used methods are Gini, Chi-square and Information gain. If the target variable is continuous, for example the probability of something happening (0 to 1) or income, we use the reduction in variance or the F test.
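To make the three measures concrete, the sketch below applies them to the income split from the loan example (10,000 customers with 500 defaulters, split into a group of 200 with 150 defaulters and the remaining 9,800 with 350). The formulas are the standard textbook definitions for a binary target. The function names are illustrative, and the chi-square shown is the plain Pearson statistic without the significance test a real algorithm would apply.

```python
import math

def gini(pos, n):
    """Gini impurity of a node with `pos` positives out of `n`."""
    p = pos / n
    return 2 * p * (1 - p)

def entropy(pos, n):
    """Shannon entropy (in bits) of a node with `pos` positives out of `n`."""
    p = pos / n
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def weighted(measure, children):
    """Size-weighted average of an impurity measure over child nodes."""
    total = sum(n for _, n in children)
    return sum(n / total * measure(pos, n) for pos, n in children)

def chi_square(parent, children):
    """Pearson chi-square of child counts against the parent's rate."""
    pos, n = parent
    rate = pos / n
    stat = 0.0
    for c_pos, c_n in children:
        exp_pos = c_n * rate          # defaulters expected at parent rate
        exp_neg = c_n * (1 - rate)    # non-defaulters expected
        stat += (c_pos - exp_pos) ** 2 / exp_pos
        stat += ((c_n - c_pos) - exp_neg) ** 2 / exp_neg
    return stat

parent = (500, 10_000)                  # (defaulters, customers)
children = [(150, 200), (350, 9_800)]   # low-income group, everyone else

gini_decrease = gini(*parent) - weighted(gini, children)
information_gain = entropy(*parent) - weighted(entropy, children)
chi2 = chi_square(parent, children)

print(gini_decrease, information_gain, chi2)
```

The first two measures reward purer child nodes, while the chi-square statistic grows as the children's default rates move away from the parent's rate. These correspond to the two principles described above.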
Applying decision trees to business
Because of their graphical structure, decision trees are easy to understand and explain. Hence, this technique is very popular in business analytics. This clarity also makes it easy to add more complex profit and ROI models on top of the predictive model.
Because of their high level of automation and the ease of translating decision tree models into SQL for deployment in relational databases, the technology has also proven easy to integrate with existing IT processes, requiring little pre-processing and cleansing of the data.
Apart from predictive modelling, decision trees are useful in other ways too. During the data exploration stage of a project, decision trees are a quick and effective way of understanding the impact of various variables as well as identifying the important ones. A real-life business situation could involve thousands of variables, and decision trees are a quick and easy way of reducing this to a more manageable number.
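As an illustration of how directly a tree maps onto SQL, the sketch below turns a small tree, stored as nested dictionaries, into a CASE expression. The dictionary layout, the function name tree_to_sql, and the table and column names (customers, customer_id, income, instalment, missed_payments) are all hypothetical. Only the first rule comes from the loan example, and the second is invented to show nesting.

```python
def tree_to_sql(node):
    """Render a nested-dict decision tree as a SQL CASE expression."""
    if "label" in node:  # leaf node: emit the predicted class
        return "'" + node["label"] + "'"
    return ("CASE WHEN " + node["condition"]
            + " THEN " + tree_to_sql(node["yes"])
            + " ELSE " + tree_to_sql(node["no"]) + " END")

# Each internal node holds a SQL condition and two subtrees.
tree = {
    "condition": "income <= 1.2 * instalment",
    "yes": {"label": "high risk"},
    "no": {
        "condition": "missed_payments > 2",  # hypothetical second rule
        "yes": {"label": "high risk"},
        "no": {"label": "low risk"},
    },
}

case_expr = tree_to_sql(tree)
query = f"SELECT customer_id, {case_expr} AS risk_band FROM customers"
print(query)
```

Because the result is ordinary SQL, the scoring rule can run inside the database without a separate model-serving step.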
Decision trees are also very useful in understanding variable interactions. Sometimes two or more variables combine to become far more powerful than any one of them alone. For example, the target market for Yamaha motorcycles will be males within a certain age group and income band. Income, gender or age on its own may not be as powerful in segmenting this market.
Trees can be used to impute missing values. Sometimes trees can also help interpret model results that would otherwise be difficult to understand. For example, a neural net may generate output that can’t be interpreted easily. Fitting a decision tree with the neural net’s predictions as the target variable can generate rules that provide a broad understanding of the neural net model.
Features of a good decision tree
A good decision tree follows Occam’s razor. It delivers a high level of accuracy using as few variables as possible. A good tree is easy to visualize and interpret. Finally, like all models, it should make intuitive sense.
When to use decision trees
Decision trees are the technique of choice when the problem is binary (0/1, yes/no). They are not the first choice for estimating continuous values. They are widely used in business, especially in situations where the interpretation of results is more important than accuracy. Use decision trees when time is of the essence. Finally, whenever available, use decision trees for initial data exploration in any project.