The Collection and Collation of Data: Tips for the Business Analyst

29 May 2020

“Data is the new oil”ⁱ is a phrase commonly heard on TV and read in the print media. An interesting article in HBRⁱⁱ argues that the comparison is not really valid, given that data is “the ultimate renewable resource”. However, there is no denying the tremendous value that data can provide if used in the right way. The average person is not aware of what happens to all the data they provide to retailers, restaurateurs, Facebook, Google and so on, and therefore may not appreciate the extent to which organizations depend on this data to stay both relevant and successful in an increasingly competitive marketplace. What is increasingly apparent is that the value in the data can be realized only at the end of an extensive process of collection, collation and analysis, and frankly there are no shortcuts to realizing sustainable benefits.

However, before we get to the value of the end product, we need to begin with the collection and collation process. Hal Varian’sⁱⁱⁱ approach is to adopt a process that will generate a “small but carefully structured sample”, which I would strongly advocate as a starting point. Often the sample needs to be created from a number of different sources, and the collation process involves mapping the relationships between the various elements from the sources being brought together. My previous blog, “Why Missing or Incomplete Data is Crucial to the Data Analyst”, covered the problems associated with incomplete data. Sales of a product, for instance, are dictated not just by the policies and marketing activities of the manufacturer but also by macro-economic conditions, the competitive landscape and so on.

Let’s try to understand this with an example. Getting data on the general economic health of a country involves identifying sources such as government bodies and the central bank. These bodies generally publish data regularly, and it is easily downloadable. The Bureau of Labor Statistics is one such source in the US, and the Central Statistical Organization (CSO) and the Ministry of Statistics and Programme Implementation (MOSPI) are similar agencies in India, apart from numerous privately owned organizations that offer subscription-based data services. However, each source has to be validated for veracity, consistency, relevance and timely availability of data.

In my experience, government bodies re-calibrate their data very infrequently, so their data can be considered consistent enough to use for further analysis. However, data from CSO and MOSPI is very rarely available at a state level, which means that analysts have to split the data themselves or rely on private partners to do so. Splitting the data requires identifying variables that are available at a state level, are of a high standard, are available for three years and reflect what happens on the ground. This is fraught with complications and is a major project in itself. Sourcing state-level data from private parties is an easier option, but my experience is that some of them resort to frequent re-calibration of data, rendering it unusable in analyses where historical data is needed.
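To make the idea of splitting concrete, here is a minimal Python sketch of one common approach: allocating a national figure to states in proportion to a state-level proxy variable. The function name, the state names and all the numbers are hypothetical and chosen purely for illustration; which proxy actually reflects what happens on the ground is the hard part, and this sketch does not address it.

```python
def split_national_total(national_total, state_proxy):
    """Allocate a national figure to states in proportion to a proxy variable.

    state_proxy maps each state to the value of a variable that IS available
    at state level and is assumed to move with the national figure.
    """
    proxy_sum = sum(state_proxy.values())
    if proxy_sum <= 0:
        raise ValueError("proxy values must sum to a positive number")
    return {
        state: national_total * value / proxy_sum
        for state, value in state_proxy.items()
    }


# Hypothetical figures, for illustration only.
state_proxy = {"State A": 30.0, "State B": 50.0, "State C": 20.0}
estimates = split_national_total(1000.0, state_proxy)

for state, value in estimates.items():
    print(state, round(value, 1))
```

The state estimates always add back up to the national total, which is a useful sanity check, but the split is only as good as the assumption that the proxy and the target variable share the same geographic distribution.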

Assuming none of these problems exist, or that the national trends reflect what is happening in the states or districts, the analyst then has to bring the multiple sources of data together. The common fields are often time period, geography, product or a combination of these, which can then be used in a VLOOKUP in Excel or a MERGE in SAS. The “merge” is fairly straightforward if there is a one-to-one or a one-to-many relationship.
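For readers who work outside Excel or SAS, the same lookup-style merge can be sketched in plain Python. This is an illustration under assumptions, not a prescribed method: the field names (period, state, product, units, cpi), the `left_merge` helper and all the values are hypothetical. It attaches one row of macro-economic data to many rows of sales data using period and state as the common fields.

```python
def left_merge(left, right, keys):
    """Attach columns from `right` to each row of `left` on the common keys.

    `right` must hold at most one row per key (the "one" side of the
    relationship); `left` may hold many rows per key.
    """
    lookup = {}
    for row in right:
        key = tuple(row[k] for k in keys)
        if key in lookup:
            raise ValueError(f"duplicate key in lookup table: {key}")
        lookup[key] = row

    merged = []
    for row in left:
        key = tuple(row[k] for k in keys)
        match = lookup.get(key, {})
        extra = {k: v for k, v in match.items() if k not in keys}
        # Keep unmatched rows and flag them, instead of dropping them silently.
        merged.append({**row, **extra, "matched": key in lookup})
    return merged


# Hypothetical data, for illustration only.
sales = [
    {"period": "2020-01", "state": "KA", "product": "dal", "units": 120},
    {"period": "2020-01", "state": "KA", "product": "rice", "units": 300},
    {"period": "2020-02", "state": "KA", "product": "dal", "units": 110},
]
macro = [
    {"period": "2020-01", "state": "KA", "cpi": 101.2},
]

merged = left_merge(sales, macro, keys=["period", "state"])
for row in merged:
    print(row)
```

Flagging unmatched rows is a deliberate choice here: a missing match usually points to a gap in one of the sources, and it is better to see it than to lose the row.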

Complexity often comes in the guise of many-to-many relationships or a complete absence of common elements. As an example, suppose a promotional scheme, i.e. a discount of INR 10/- on a kilogram of dal, is being offered in all of the sample stores being audited in Bangalore district. Assuming that the promotion is being offered in all of their stores in this geography is reasonable, but the business owner will have to validate such assumptions.
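Before attempting a merge, it helps to check which kind of relationship the key fields actually have. The sketch below is one simple way to do that in Python, using only the standard library; the `relationship` function, the district names and the sample keys are all hypothetical.

```python
from collections import Counter


def relationship(left_keys, right_keys):
    """Classify how two lists of key values relate to each other."""
    if not set(left_keys) & set(right_keys):
        return "no common keys"
    left_dup = any(count > 1 for count in Counter(left_keys).values())
    right_dup = any(count > 1 for count in Counter(right_keys).values())
    if left_dup and right_dup:
        return "many-to-many"
    if left_dup or right_dup:
        return "one-to-many"
    return "one-to-one"


# Hypothetical keys: the district recorded on each store row and on each
# promotion row.
store_districts = ["Bangalore", "Bangalore", "Mysore"]
promo_districts = ["Bangalore", "Bangalore"]

print(relationship(store_districts, promo_districts))  # many-to-many
print(relationship(store_districts, ["Bangalore"]))    # one-to-many
print(relationship(["Mysore"], ["Bangalore"]))         # no common keys
```

A many-to-many result is a signal to stop and decide, with the business owner, which rows genuinely belong together before any merge is run.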

Sourcing data is time-consuming and often painstaking. However, the power of analytics can be unleashed only if the underlying data is as complete and accurate as possible, i.e. drawn from an auditable source and available in a timely fashion. It is very important to know your data: the source, definition, periodicity and granularity. It will only make you a better analyst.

i https://www.google.co.in/searchq=Data+is+the+new+oil&oq=Data+is+the+new+oil&aqs=chrome..69i57j69i65l3j69i60.3965j0j8&sourceid=chrome&es_sm=93&ie=UTF-8 for a Google search on the phrase

ii https://blogs.hbr.org/2012/11/data-humans-and-the-new-oil/

iii Hal Varian is Chief Economist at Google and Professor Emeritus at UC Berkeley
