The last post briefly examined the quality versus quantity debate where data is concerned. Data is the fundamental building block of any model, and getting it right is critical if the final models are to generate business value. It can help to think of a model as a jigsaw puzzle: every piece of data has to be present for the final picture to be complete. That said, it is very common not to have every piece, and the first step is to recognize that something is missing, either before or during the modeling process. This post is by no means a treatise on the subject; it is a high-level view of some of the problems that incomplete data creates.
To understand the consequences of incomplete data, we need to look at some fundamental assumptions behind OLS (ordinary least squares). The first is that once modeling is complete, the residuals, that is, the unexplained part of the variation, are completely random. Until that point is reached, the variation in the dependent variable has not been fully explained and a predictable component still sits within the residual. The second is that the explanatory variables retained in the final model should ideally be uncorrelated with each other. Strictly, OLS only requires that no variable be a perfect linear combination of the others, but in practice collinearity should be kept to a minimum.
When a variable is missing because the data is incomplete, the variation it accounts for should ideally sit in the residual as a whole. Post-modeling checks would then show a violation of the random residual assumption, which would point to one or more missing variables in the regression model. In practice, however, this is rarely the case, and diagnosing and fixing incomplete data can almost be considered an art. If the missing variable is correlated with variables that are in the model, part of its effect is absorbed by their coefficients, and the model suffers from omitted variable bias. The result can be either an overestimate or an underestimate of the relationship between the included variables and the dependent variable. Because the missing variable is unobserved, the exact size of the bias cannot be determined, which adds another layer of complexity to the problem.
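The direction of the bias can be made concrete with the textbook two-regressor case. The notation below is introduced here for illustration and is not from the original post: x is an included variable, z is the missing one, and the auxiliary regression of z on x is a device for describing their correlation, not something an analyst could actually estimate when z is unobserved.

```latex
% True model, with z unobserved:
y = \beta_0 + \beta_1 x + \beta_2 z + \varepsilon
% Auxiliary relationship between the omitted and included variable:
z = \delta_0 + \delta_1 x + \nu
% Fitted (short) model: y regressed on x alone, giving \tilde{\beta}_1.
% Substituting the second line into the first:
y = (\beta_0 + \beta_2 \delta_0) + (\beta_1 + \beta_2 \delta_1)\, x + (\beta_2 \nu + \varepsilon)
% Hence the expected value of the short-model coefficient is
E[\tilde{\beta}_1] = \beta_1 + \beta_2 \delta_1
```

The bias term is the product of the missing variable's own effect and its association with the included variable, so it can be positive or negative, and it vanishes only when one of the two is zero. Both factors involve z, which is why the size of the bias cannot be pinned down from the available data alone.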
To illustrate the problem, consider a simple linear regression model with sales as the dependent variable and the various marketing and sales levers used in the marketplace as the explanatory variables. Outside CPG/retail, data on competitor activity is usually scarce, yet this factor is crucial to getting a clear picture of market dynamics. It is fairly well known that the marketing and pricing activities of the brand being analysed and those of its competition tend to be strongly correlated, in both magnitude and timing. Leaving competitor activity out of the model therefore has the potential to bias the regression estimates. Since the rationale behind the exercise is to gauge market responsiveness to marketing levers, biased coefficients may give misleading results and fail to deliver tangible benefits.
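A small simulation can make the sales example tangible. The sketch below uses entirely made-up numbers: the true effects (3.0 for the brand's own spend, -2.0 for competitor spend), the 0.8 link between the two spends, and the function names are all assumptions chosen for illustration, not estimates from any real market. It fits the own-spend slope twice, once with competitor spend in the model and once without.

```python
import random


def _cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def simulate_omitted_competitor(n=5000, seed=0):
    """Return (slope with competitor included, slope with competitor omitted)."""
    rng = random.Random(seed)
    own = [rng.gauss(10.0, 2.0) for _ in range(n)]
    # Assumption: competitor spend tracks the brand's own spend.
    comp = [0.8 * x + rng.gauss(0.0, 1.0) for x in own]
    # Assumed true model: own spend lifts sales, competitor spend lowers them.
    sales = [50.0 + 3.0 * x - 2.0 * z + rng.gauss(0.0, 1.0)
             for x, z in zip(own, comp)]

    s_xx, s_zz, s_xz = _cov(own, own), _cov(comp, comp), _cov(own, comp)
    s_xy, s_zy = _cov(own, sales), _cov(comp, sales)

    # OLS slope on own spend when competitor spend is also in the model
    # (closed form for two regressors plus an intercept).
    slope_full = (s_zz * s_xy - s_xz * s_zy) / (s_xx * s_zz - s_xz ** 2)
    # OLS slope on own spend when competitor spend is left out.
    slope_short = s_xy / s_xx
    return slope_full, slope_short


if __name__ == "__main__":
    full, short = simulate_omitted_competitor()
    print(f"own-spend slope, competitor included: {full:.2f}")
    print(f"own-spend slope, competitor omitted:  {short:.2f}")
```

Under these assumptions the full model recovers a slope close to the true 3.0, while the short model lands near 3.0 + (-2.0)(0.8) = 1.4, understating the responsiveness of sales to the brand's own spend by roughly half. Flipping the sign of either assumed relationship would turn the understatement into an overstatement.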
The purpose of building a model is to generate value for a business, and detecting a missing variable is imperative if we are to realize any benefits from the activity. One way to do this is to look at model fits and scrutinize the residuals: is there systematic over- or under-prediction, are there abnormally large residuals, is there a seasonal pattern in the large residuals, and so on. Another is to critically examine the coefficients and validate them against knowledge of the dynamics in the field. These are general rules of thumb, and an analyst will sharpen these skills with experience.
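Two of the residual checks mentioned above can be sketched in a few lines. The example below is an illustration under stated assumptions, not a complete diagnostic suite: the data is simulated, the omitted driver is a made-up 12-period seasonal cycle, and the helper names are hypothetical. It computes the Durbin-Watson statistic (values near 2 suggest no first-order autocorrelation, values well below 2 suggest positive autocorrelation) and the average residual at each position in the cycle.

```python
import math
import random


def fit_line(x, y):
    """Simple OLS of y on x; returns (intercept, slope)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope


def durbin_watson(resid):
    """Sum of squared successive differences over sum of squared residuals."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    return num / sum(e ** 2 for e in resid)


def mean_residual_by_period(resid, period=12):
    """Average residual at each position within a repeating cycle."""
    buckets = [[] for _ in range(period)]
    for t, e in enumerate(resid):
        buckets[t % period].append(e)
    return [sum(b) / len(b) for b in buckets]


def demo_residual_checks(n=240, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Assumption: a seasonal driver affects y but is left out of the model.
    season = [5.0 * math.sin(2 * math.pi * t / 12) for t in range(n)]
    y = [10.0 + 2.0 * a + s + rng.gauss(0.0, 1.0) for a, s in zip(x, season)]

    intercept, slope = fit_line(x, y)
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    return durbin_watson(resid), mean_residual_by_period(resid)


if __name__ == "__main__":
    dw, by_period = demo_residual_checks()
    print(f"Durbin-Watson: {dw:.2f}")
    print("mean residual by period:", [round(m, 1) for m in by_period])
```

With the seasonal driver omitted, the Durbin-Watson statistic falls far below 2 and the per-period averages trace out the missing cycle, with consistent under-prediction in some periods and over-prediction in others. A pattern like this is a prompt to look for the missing variable; it does not by itself say what that variable is.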
Knowing your data, studying data diagnostics before modeling, and understanding what happens on the ground are all very important when creating models. To be considered a good study, the analysis has to mirror what is happening on the ground as closely as possible, generate actionable insights and create value. While model creation is largely a science, there is an art to it as well, and a good analyst has to marry the two to produce analyses that are meaningful.
Susan Mani- All for the love of Data
The opinions expressed here are my own and do not reflect those of my employer.