Data: Quality, Quantity or Both?

Marketing guru John Little[1] said, “Data is prolific but poorly digested, often irrelevant and some issues entirely lack the illumination of measurement.”

It is a profound statement that highlights a problem still common today, close to 45 years after he said it. Data is the bedrock on which analytics practices are built, and a carefully crafted data strategy is crucial for any organization embarking on this journey.

Laying the foundation and getting high-quality data can often take years, but it is well worth the time and effort. Does that mean we should simply collect all the data available to us? Not quite: what is required is an investment of time and effort to identify and prioritize the areas where the data can eventually be monetized in some form. Data is like food, says Hal Varian[2]. “We used to be calorie poor and now the problem is obesity. We used to be data poor, now the problem is data obesity.”

In 2010, approximately 800 exabytes of data were generated, compared with the roughly 5 exabytes generated up to 2003[3]. This rate of increase in data generation necessitates a clearly articulated data collection strategy, closely linked to how the data will be used to create tangible business value.

The other aspect to be carefully considered while articulating a data strategy is the quality of data. Varian alludes to this when he says, “You need to focus on quality. You’ll be better off with a small but carefully structured sample rather than a large sloppy sample.” Coming from a man in the thick of things at Google, this is advice you would do well to heed.

Accurate data describes ground realities and comes from a source that can be audited. It has to be consistently defined and measured, and values should fall within acceptable ranges. Missing or incomplete data are also causes for concern on the quality front. Very often the modeling database does not account for all the dynamics governing the area being analyzed. All models allow for an error component, which includes this inability to capture every factor in the form of data, but that error component should account for no more than twenty percent of the variation in the dependent variable. Another problem is missing data: if the parameter being captured has not been entered in a particular row, that is an instance of missing data. For any model, if more than 20% of the database is excluded on account of missing data, it should raise an alarm, and this threshold may need to be smaller where sample sizes are smaller.
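
These two rules of thumb lend themselves to simple automated checks. Below is a minimal Python sketch, assuming the modeling database is a pandas DataFrame; the function names, toy data and the 20% thresholds are illustrative choices, not part of any standard library.

```python
# Minimal sketch of the two rule-of-thumb quality checks described above.
# Assumes a pandas DataFrame; column names and thresholds are illustrative.
import numpy as np
import pandas as pd


def missing_data_alarm(df: pd.DataFrame, threshold: float = 0.20) -> bool:
    """Return True if dropping rows with any missing value would exclude
    more than `threshold` of the modeling database."""
    excluded_share = 1 - len(df.dropna()) / len(df)
    print(f"Rows excluded due to missing data: {excluded_share:.1%}")
    return excluded_share > threshold


def unexplained_variation_alarm(y_actual: np.ndarray, y_predicted: np.ndarray,
                                threshold: float = 0.20) -> bool:
    """Return True if the model's error component accounts for more than
    `threshold` of the variation in the dependent variable (i.e. R^2 < 0.8)."""
    ss_res = np.sum((y_actual - y_predicted) ** 2)
    ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)
    unexplained = ss_res / ss_tot
    print(f"Unexplained variation: {unexplained:.1%}")
    return unexplained > threshold


# Toy example with deliberately sparse columns.
df = pd.DataFrame({
    "sales": [100, 120, np.nan, 140, 150, np.nan, 160, 170],
    "spend": [10, 12, 13, 14, np.nan, 16, 17, 18],
})
if missing_data_alarm(df):
    print("Alarm: more than 20% of rows would be excluded; revisit data collection.")
```

Checks like these are cheap to run every time the modeling database is refreshed, so quality problems surface before they contaminate the models built on top of it.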

While a certain minimum quantity of data is required to run models, and this can sometimes be a bottleneck, more often than not quality tends to be the bigger problem. Technology can be an enabler in the quest for better data quality, but strengthening and re-engineering business processes where required plays a bigger role. Even as organizations move from making choices based on gut feeling to assessments driven by data, it is imperative that the basis for vital decision making is sound and underpinned by reliable data.

Susan Mani - All for the love of Data
The opinions expressed here are my own and do not reflect those of my employer.


[1] John D.C. Little is Institute Professor at MIT, known for his seminal contributions to marketing science; many consider him its founder.
[2] Hal Varian is the chief economist at Google and a professor emeritus at UC Berkeley, where he was the founding dean of the School of Information.
