Analytics on Unstructured data – Twitter, Facebook and social media

8 Oct 2012

Quoting Wikipedia: – Unstructured Data (or unstructured information) refers to information that either does not have a predefined data model and/or does not fit well into relational tables. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional computer programs as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.

Yes, most big data sources, including Facebook and Twitter, hold unstructured data, and almost no analytics technique can work on it directly. Unstructured data is the starting point, but it has to be converted into some structured format before any actual analytics technique can be applied. So what is the process?

Business requirement: – Tweets, Facebook postings and other social comments have to be analysed to determine the sentiment of the population.

Creating semi-structured or structured data (which can fit into relational tables) involves dissecting the text into words and phrases, which can then be categorised on a scale from ‘good’ to ‘bad’ and everything in between. Converted numerically, this gives a score in the range +1 to -1. This fully numeric data set is then ready for use, and the usual analysis techniques can be applied to it to arrive at results. Thus, a new step, extracting structured data from unstructured data, gets added to the analysis process.
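To make the conversion step concrete, here is a minimal sketch in Python. It assumes a tiny hand-made word list (the `LEXICON` values below are invented for illustration) and averages the polarity of the words it recognises; the names `tokenize`, `score` and `to_rows` are hypothetical. Real projects would use a much larger lexicon or a trained model, so treat this only as one possible shape for the step.

```python
import re

# Hypothetical toy lexicon: word -> polarity between -1 (bad) and +1 (good).
# The words and values are illustrative assumptions, not a published resource.
LEXICON = {
    "love": 1.0, "great": 0.8, "good": 0.5,
    "okay": 0.0,
    "slow": -0.4, "bad": -0.5, "terrible": -1.0,
}

def tokenize(text):
    """Dissect free text into lower-case words."""
    return re.findall(r"[a-z']+", text.lower())

def score(text):
    """Mean polarity of the lexicon words found; 0.0 when none are found."""
    hits = [LEXICON[word] for word in tokenize(text) if word in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

def to_rows(posts):
    """Turn raw posts into rows that would fit a relational table."""
    return [
        {"id": i, "text": text, "sentiment": score(text)}
        for i, text in enumerate(posts, start=1)
    ]

if __name__ == "__main__":
    sample = [
        "I love this phone, great battery",
        "Terrible service, so slow",
        "Just landed in Paris",
    ]
    for row in to_rows(sample):
        print(row)
```

Each row now has a numeric `sentiment` column, so it can be loaded into a table and aggregated like any other measure. Posts with no recognised words score 0.0 here, which is a simplification: "neutral" and "unknown" are not really the same thing.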

Thus, all the analytics skills and techniques remain valid in this new paradigm. Only the type of data, its source and our general understanding of it have to be revamped. And most of us analysts can breathe a sigh of relief.

So does this process of converting unstructured data to structured data have to be manual, through heuristics, or machine driven, through algorithms? Algorithms typically trade some accuracy for scale. So a judicious choice between the two, or a gradual shift from manual to algorithmic conversion, can be used to standardise this process within the organisation.
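One way to picture the gradual shift is a routing rule: the algorithm scores a post only when it has enough evidence, and everything else goes to a person. The sketch below is an assumption of mine rather than a standard method. It uses the number of matched lexicon words as a crude confidence proxy, and `LEXICON`, `auto_score`, `route` and `min_hits` are all hypothetical names.

```python
# Hypothetical toy lexicon; words and values are illustrative assumptions.
LEXICON = {"good": 0.5, "great": 0.8, "bad": -0.5, "awful": -1.0}

def auto_score(text):
    """Return (score, hits): mean polarity and count of lexicon words matched."""
    matched = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    score = sum(matched) / len(matched) if matched else 0.0
    return score, len(matched)

def route(posts, min_hits=2):
    """Split posts into an automated queue and a manual-review queue.

    A post is scored by the algorithm only when at least `min_hits`
    lexicon words were matched; otherwise a person reviews it.
    """
    automated, manual = [], []
    for text in posts:
        score, hits = auto_score(text)
        if hits >= min_hits:
            automated.append((text, score))
        else:
            manual.append(text)
    return automated, manual

if __name__ == "__main__":
    posts = ["great phone good camera", "bad", "hello world"]
    for threshold in (2, 1, 0):
        automated, manual = route(posts, min_hits=threshold)
        print(threshold, len(automated), "automated,", len(manual), "manual")
```

Lowering `min_hits` over time moves more posts from the manual queue to the automated one, which is the gradual shift in miniature. The right threshold would have to come from comparing the algorithm's scores with manual ones on the organisation's own data.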

In fact, in this whole landscape, deciding what data to delete becomes of paramount importance. And there is a new set of data guardians who are experts in helping organisations retain only relevant data.

This understanding, and the coming together of all the bits and pieces, has given a lot of confidence to the decision-making process on dynamic and big data.