As an analyst, I cannot start building a model as soon as I get the data. My destination is the final model, but first I have to go through a lot of important, if tedious, data cleaning and data preparation activities. If I skip these and go straight to building my model, the model might not serve the purpose for which it was built and could be dismissed as “UNUSABLE”.
In this article, I will talk about 3 ways in which one can transform the variables before including them in a model.
1. Let’s assume we have a telecom churn dataset. In this dataset, we have a categorical variable that captures the state to which a customer belongs. For simplicity, there are 3 states in my dataset: Bihar, Orissa and Gujarat. Let’s say I create a new variable and assign the numeric values 1, 2 and 3 to Bihar, Orissa and Gujarat. I then plug this variable into my logistic model that predicts churn, and I end up with a single coefficient (for example, 1.5) for state.
This is how I would interpret this coefficient: if two customers have the same values for every other variable and differ only in the state they are from, then the increase in the log-odds of churning from Bihar to Orissa is exactly the same as the increase from Orissa to Gujarat (1.5 in each case).
In reality, this may not be right. The increase need not be constant: the rise in churn risk might be larger from Bihar to Orissa and smaller from Orissa to Gujarat. The ordering 1, 2, 3 is also arbitrary, since nothing says Orissa lies "between" the other two states.
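To make the constraint concrete, here is a minimal sketch of what a single coefficient on the 1/2/3 coding implies. The coefficient 1.5 is the example value used above; the intercept of -3.0 is a made-up number chosen purely for illustration.

```python
import math

# Coefficient from the example in the text; the intercept is a hypothetical value.
coef_state = 1.5
intercept = -3.0

# The numeric coding assigned to each state.
state_code = {"Bihar": 1, "Orissa": 2, "Gujarat": 3}

# Log-odds of churn for each state, with all other variables held equal.
log_odds = {s: intercept + coef_state * code for s, code in state_code.items()}

# Convert log-odds to probabilities with the logistic function.
prob = {s: 1 / (1 + math.exp(-lo)) for s, lo in log_odds.items()}

step_1 = log_odds["Orissa"] - log_odds["Bihar"]
step_2 = log_odds["Gujarat"] - log_odds["Orissa"]

for s in state_code:
    print(f"{s}: log-odds = {log_odds[s]:.2f}, churn probability = {prob[s]:.3f}")
print("Bihar -> Orissa step in log-odds:", step_1)
print("Orissa -> Gujarat step in log-odds:", step_2)
```

Both steps come out identical in log-odds no matter what the data say. The model has no way to express a large jump between one pair of states and a small jump between another.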
In order to account for these differences, it is advisable to create dummy variables for the different levels of the categorical variable. With one state kept as the reference level, each of the other states gets its own coefficient, measured relative to that reference.
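A minimal sketch of dummy coding in plain Python follows. The customer list and the helper name `make_dummies` are hypothetical, and Bihar is picked as the reference level only for illustration.

```python
# Hypothetical toy data: the state of each customer.
states = ["Bihar", "Orissa", "Gujarat", "Orissa", "Bihar"]


def make_dummies(values, reference):
    """Return one 0/1 column per level, skipping the reference level."""
    levels = sorted(set(values) - {reference})
    return {
        f"state_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }


# Bihar is the reference, so it is represented by zeros in both columns.
dummies = make_dummies(states, reference="Bihar")
for name, column in dummies.items():
    print(name, column)
```

Each resulting column enters the model as its own variable, so Orissa and Gujarat each get a separate coefficient relative to Bihar. In practice a library routine such as pandas `get_dummies` would normally do this job.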
2. Even for continuous variables, the single coefficient generated may not turn out to be significant, even though one has good reason to expect the variable to be a strong predictor.
There are 2 alternatives to this:
3. There is a third, very interesting transformation that one can think of. If, in the data exploration stage, we see that a particular variable has a non-linear relationship with the dependent variable, the independent variable should be transformed to linearize that relationship. The treatment can be squaring or cubing the independent variable, a log transformation, etc. For example, adding a squared term alongside the original variable lets the model capture diminishing returns.
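As a rough sketch, the transformed columns can be created like this. The variable `usage` and its values are hypothetical stand-ins for any continuous predictor.

```python
import math

# Hypothetical values of a continuous predictor, e.g. monthly usage in minutes.
usage = [10.0, 50.0, 100.0, 400.0]

# Squared term, entered into the model alongside the original variable.
usage_sq = [x ** 2 for x in usage]

# Log transformation; only valid for strictly positive values.
log_usage = [math.log(x) for x in usage]

for x, sq, lg in zip(usage, usage_sq, log_usage):
    print(f"usage={x:7.1f}  usage_sq={sq:10.1f}  log_usage={lg:.3f}")
```

When both `usage` and `usage_sq` are in the model, a positive coefficient on the first and a negative one on the second describes an effect that flattens out as usage grows. Which transformation fits best is something to check against the plots from the exploration stage.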
Thus, before any modeling can begin, a lot of time has to be spent on data preparation. Also remember that data preparation is not a one-time process; it is iterative. Sometimes, even after building the model, certain transformations might have to be done. For example, if there is heteroscedasticity, a log transformation of the dependent variable might be appropriate. So, don’t rush. Your modeling will only be accurate if you have prepared your data well and have made the required transformations along the way.