Top 5 Mistakes in Predictive Modeling

24 Jun 2013

Using data to build predictive models is an exciting and interesting field to be in. And while many of us are caught up in the excitement of generating models using the latest techniques and algorithms, we should not forget that there could be severe implications if the models are inaccurate or wrong. Businesses invest a lot of resources in implementing recommendations generated from models, and stand to lose substantially if sub-par models are deployed.

In this article, I list some common mistakes that analysts make in building predictive models, and how to avoid them.

1. Not investing time in understanding business context

Most analysts I know, especially in the first couple of years of their careers, are raring to build models from the word go. Model building is, after all, the exciting part; all the pre-processing is perceived as “boring”. However, to build an effective model that generates insights or recommendations that can be implemented, it is critical to understand the business context, the business KPIs, the constraints that need to be accounted for, and the overall business needs. Without that understanding, you may end up generating technically superior models whose outcomes the business is not interested in or is unable to implement. It is therefore critical at the start of a modeling project to budget time for a thorough understanding of context, which may be achieved through client interaction or through industry and company research.

2. Using tool-driven technical criteria to assess model validity and usefulness

This is another common mistake. Analytics tools are enablers: they help process vast amounts of data and apply statistical algorithms to that data to generate models. However, they are simply tools. Sophisticated tools offer all manner of technical criteria for assessing model fit and predictive power, along with great visualizations. While technical criteria are important, they are not enough. The analyst needs to modify or build models that meet both the technical criteria and the business criteria of interpretability and, more importantly, implementability.

3. Skipping the data exploration stage

In a hurry to get to the models, a lot of analysts either skip the data exploration process entirely or do a very perfunctory analysis. This mistake has many implications: generating inaccurate models, generating accurate models on the wrong data, not creating the right types of variables during data preparation, and using resources inefficiently by realizing only after the models are built that the data is skewed, has outliers, has too many missing values, or contains values that are potentially wrong. Investing time upfront in checking the data’s values, distributions, completeness, consistency, and reliability saves a lot of time later in the modeling process.
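As a rough illustration of the kind of check that exploration involves, here is a minimal sketch in Python using only the standard library. The `profile_column` helper, the sample `incomes` list, and the 1.5 × IQR outlier rule are all assumptions made for this example; a real project would typically rely on its analytics tool of choice and on thresholds agreed with the business.

```python
import statistics

def profile_column(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    # Quartiles of the non-missing values (default "exclusive" method).
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    # Common rule of thumb: flag points more than 1.5 * IQR beyond the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": median,
        "outliers": [v for v in present if v < low or v > high],
    }

# Hypothetical income column (in $k) with one missing value and one suspicious entry.
incomes = [52, 61, 58, None, 75, 68, 940, 80, 66, 71]
report = profile_column(incomes)

for name, value in report.items():
    print(f"{name}: {value}")
```

A mean far above the median, as in this made-up column, is a quick hint that the data is skewed or contains an entry worth questioning before any model is built.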

4. Not validating models on a separate sample

Many times analysts use up all the data to build the model and don’t validate it on a holdout or test sample. There is then no way of evaluating whether the model has been overfit, which can lead to potentially disastrous consequences. It’s always critical to save some of your data to run a test once the model has been finalized, so that you can be doubly sure about the applicability of your model and avoid nasty surprises post implementation. Even with limited data, there are now many resampling techniques, such as cross-validation and bootstrapping, that can generate multiple random samples from existing data.
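The holdout idea can be sketched in a few lines of Python. Everything here is illustrative: the synthetic data, the 30% test fraction, the helper names `holdout_split`, `fit_line`, and `mean_abs_error`, and the choice of a simple least-squares line as the model are assumptions, not a recommended setup.

```python
import random

def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_abs_error(model, points):
    """Average absolute gap between predicted and actual y."""
    intercept, slope = model
    return sum(abs(y - (intercept + slope * x)) for x, y in points) / len(points)

# Synthetic data for illustration: y is roughly 2x + 1 plus noise.
rng = random.Random(42)
data = [(x, 2 * x + 1 + rng.gauss(0, 1)) for x in range(40)]

train, test = holdout_split(data)
model = fit_line(train)           # the model only ever sees the training rows
train_error = mean_abs_error(model, train)
test_error = mean_abs_error(model, test)
print(f"train MAE: {train_error:.2f}, holdout MAE: {test_error:.2f}")
```

If the holdout error is much worse than the training error, that gap is the warning sign of overfitting that cannot be seen when all the data is used to build the model.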

5. Making invalid predictions

Invalid predictions are generated when inferences from a model built on data with certain attributes are applied to data that has very different attributes. For example, if I am building a scoring model for consumers with incomes of, say, $50k to $150k, and income is a critical predictor in my model, it’s not a good idea to use this model to score customers with incomes greater than $175k or less than $40k. When implementing a model, it’s a good idea to ensure that the attributes of the data being scored are similar in most respects to those of the data used to build the model.
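One simple guard, sketched below in Python, is to record the range of each predictor at build time and flag any record that falls outside it before scoring. The helper names `fit_ranges` and `out_of_range` and the sample records are hypothetical; the income figures just mirror the example above, and a min/max check is only one of several ways to compare build data with scoring data.

```python
def fit_ranges(training_rows, columns):
    """Record the min and max of each predictor seen during model building."""
    return {
        c: (min(r[c] for r in training_rows), max(r[c] for r in training_rows))
        for c in columns
    }

def out_of_range(row, ranges):
    """Return the predictors whose values fall outside the training range."""
    return [c for c, (lo, hi) in ranges.items() if not lo <= row[c] <= hi]

# Hypothetical build sample covering incomes from $50k to $150k.
training = [{"income": 50_000}, {"income": 90_000}, {"income": 150_000}]
ranges = fit_ranges(training, ["income"])

# Hypothetical customers to be scored.
customers = [{"income": 120_000}, {"income": 175_000}, {"income": 40_000}]
flags = [out_of_range(c, ranges) for c in customers]

for customer, flagged in zip(customers, flags):
    status = "ok to score" if not flagged else f"outside training range: {flagged}"
    print(customer["income"], status)
```

Flagged records do not have to be discarded, but their scores should at least be treated with caution or routed for review.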
