Before we understand what a regression tree is, let us first understand a decision tree. A decision tree is a model used to make decisions: the goal is to predict an outcome based on the various input variables. It is widely used in computer programming and data analysis, where a set of criteria is applied step by step to decide on an option. Can decision trees be used for regression? Yes, and a regression tree is exactly that: a decision tree used to predict a continuous numeric outcome, which is why regression trees find use in data analysis and modelling.
A decision tree has two components: the problem statement, which forms the root of the tree, and the solutions or consequences, which form the branches of the tree. The tree is drawn inverted, with the root at the top.
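As a sketch of this structure, the function below asks the root question and each return value is one branch outcome (the weather conditions and answers are hypothetical, chosen only for illustration):

```python
def decide(weather: str) -> str:
    # Root of the tree: the problem statement ("what should I carry?").
    # Each branch below is one possible solution or consequence.
    if weather == "rain":
        return "take an umbrella"
    elif weather == "sun":
        return "take sunglasses"
    else:
        return "take nothing extra"
```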
Decision trees are of two kinds: classification trees, which predict a category, and regression trees, which predict a numeric value.
Regression trees are built using a process known as binary recursive partitioning. In this process, the data is split into two branch partitions, and each partition keeps getting split into smaller groups as the method moves down each branch.
In the initial phase, all the records are grouped into a single partition. The algorithm then allocates the data into the first two branch partitions by considering a binary split on each field, and it selects the split that minimizes the sum of squared deviations from the mean within the two resulting partitions.
The splitting rule is then applied to each new branch, and the process continues until every node reaches the minimum node size specified by the user and becomes a terminal node.
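The partitioning process described above can be sketched in plain Python. This is a minimal single-feature version for illustration, not a production implementation: `best_split` tries every threshold and keeps the one that minimizes the sum of squared deviations from the two partition means, and `grow` applies the splitting rule recursively until a node falls below the minimum node size and becomes a terminal node:

```python
def sse(values):
    # Sum of squared deviations from the mean of a partition.
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    # Try every threshold between consecutive sorted x values; return the
    # (threshold, score) pair that minimizes SSE(left) + SSE(right).
    pairs = sorted(zip(xs, ys))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = sse(left) + sse(right)
        if score < best_score:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold, best_score

def grow(xs, ys, min_size=2):
    # Terminal node: too few records to split further; predict the mean.
    if len(ys) < 2 * min_size:
        return sum(ys) / len(ys)
    threshold, _ = best_split(xs, ys)
    left = [(x, y) for x, y in zip(xs, ys) if x <= threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x > threshold]
    if not left or not right:
        return sum(ys) / len(ys)
    return {
        "threshold": threshold,
        "left": grow([x for x, _ in left], [y for _, y in left], min_size),
        "right": grow([x for x, _ in right], [y for _, y in right], min_size),
    }
```

On a toy dataset with two clearly separated groups, the first split lands between the groups and each side becomes a terminal node predicting its group mean.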
Pruning the tree is also an important concept for regression trees. The tree is grown from the training set, and a fully grown tree tends to suffer from what is known as overfitting: it performs poorly on real-life data it has not seen before.
This is why the tree has to be pruned, which is done by making use of a validation set. The cost complexity is calculated at every step during tree growth, and this decides which decision nodes the pruned tree should keep.
The tree is pruned, one terminal node at a time, in order to lower the variance of the output variable on the validation data. Pruning also minimizes the cost-complexity measure, which penalizes the number of terminal nodes in the tree.
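The cost-complexity idea can be illustrated with a short computation. The criterion adds a penalty, proportional to the number of terminal nodes, to the tree's error; the `alpha` value and the error and leaf counts below are hypothetical numbers chosen only for illustration:

```python
def cost_complexity(error, n_leaves, alpha):
    # Cost-complexity criterion: error plus a penalty that grows with the
    # number of terminal nodes. A larger alpha favors smaller trees.
    return error + alpha * n_leaves

# A fully grown tree: low training error but many terminal nodes.
full_tree = cost_complexity(error=2.0, n_leaves=8, alpha=1.0)
# A pruned tree: higher error but far fewer terminal nodes.
pruned_tree = cost_complexity(error=5.0, n_leaves=3, alpha=1.0)
```

At this alpha the pruned tree scores better overall, even though its raw error is higher; with alpha set to zero the penalty vanishes and the fully grown tree would win, which is exactly the overfitting that pruning guards against.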
A common everyday example of this kind of branching decision logic is the alarm clock: it keeps track of the time, and if the time matches the pre-set time, it starts to ring. A coin toss is another popular example of a binary branch, where you get one of two results, either a head or a tail.
Regression tree-based methods are used in the field of analytics because of their many advantages. They let the user visualize every step, which in turn helps in making a rational decision, and priority can be given to the decision criteria based on what you think is of the utmost urgency.
It is easier to make decisions with a regression tree than with many other methods. Since most of the undesired data gets filtered out at each split, less data remains as you go further down the tree. A regression tree is also easy to build, and it can be presented to higher authorities in the form of a simple chart or diagram.
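The "filtering" described above is simply a walk from the root down one branch at a time. A minimal sketch, assuming a fitted tree stored as nested dictionaries; the feature name, thresholds, and leaf values here are made up purely for illustration:

```python
# Hypothetical fitted regression tree: internal nodes are dicts,
# terminal nodes are plain numbers (the mean of their partition).
tree = {
    "feature": "hours_studied", "threshold": 6.5,
    "left": 1.0,
    "right": {"feature": "hours_studied", "threshold": 10.5,
              "left": 7.0, "right": 11.0},
}

def predict(node, record):
    # Follow one branch at each decision node; the data that would take
    # the other branch is filtered out, so less remains as we descend.
    while isinstance(node, dict):
        branch = "left" if record[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node
```

Because each step keeps only one branch, a prediction touches just a handful of nodes, which is also why the same structure is easy to read off a chart.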
If you are interested in making it big in the world of data and evolving as a Future Leader, you may consider our Integrated Program in Business Analytics, a 10-month online program, in collaboration with IIM Indore!