Hadoop And Unstructured Data

30 Sep 2014

Let’s begin by understanding the term ‘unstructured data’ and how it differs from other forms of data.

Data that carries metadata (data about data) is generally classified as structured or semi-structured. Relational databases, whose tables follow a schema, XML files, whose tags describe their content, and simple tables with named columns are all examples.

Image credit: Dileep Govindaraju

Now consider blog posts, comments, email messages, text documents such as a company’s legal policies, audio files, video files and images. Together these make up about 80 to 90% of all data available for analysis. Such data does not follow any specific structure, nor does it carry information describing its content. All of it is classified as unstructured data.

Given these proportions, old-school database analytics methods that work only on structured data limit access to just 0.5% of the information available for analysis. With technologies like Hadoop growing fast, the focus is shifting towards tapping the huge, largely unexplored and chaotic realm of unstructured data.

How is Hadoop suitable for analysing unstructured data?

Hadoop provides a distributed storage and distributed processing framework, which is essential for analysing unstructured data given its size and complexity.
Hadoop is designed to support Big Data, that is, data too big for traditional database technologies to accommodate. Unstructured data is big, really big in most cases.
Data in HDFS is stored as files. Hadoop does not require the data it stores to have a schema or structure.
The Hadoop ecosystem also includes tools such as Sqoop, Hive and HBase for storing, querying and exchanging data with other popular traditional and non-traditional databases. This allows Hadoop to be used to structure unstructured data and then export the resulting semi-structured or structured data into traditional databases for further analysis.
Hadoop is a very powerful platform for writing custom code. Analysing unstructured data typically involves complex algorithms. Programmers can implement algorithms of any complexity while relying on the Hadoop framework for efficiency and reliability. This gives users the flexibility to understand the data at a raw level and program whatever algorithm is appropriate.
Because Hadoop is an open-source project, numerous applications for video and audio processing, image analysis and text analytics have been developed on top of it by vendors in the market; Pivotal and Pythian, to mention a few.
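
To make the points above concrete, here is a minimal sketch of the map, shuffle and reduce steps that Hadoop's processing model is built around, applied to schema-less text. It is a local, single-process simulation in plain Python and does not use any Hadoop API; the sample comments and all function names are hypothetical, chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical sample input: free-form comments with no schema.
COMMENTS = [
    "Great phone, battery lasts two days!",
    "Battery died after a week. Terrible.",
    "Love the camera; the battery is fine.",
]


def map_phase(line):
    """Emit a (word, 1) pair for every word found in one line of raw text."""
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    counts = word_count(COMMENTS)
    print(counts["battery"])
```

Structure is imposed only when the text is read (the map step decides what counts as a word), which mirrors the idea that the stored files themselves need no schema. On a real cluster the map and reduce functions would run in parallel on many machines, each working on a different part of the input.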

Let’s take an example of unstructured data analysis:

Consider the video feed from an enterprise’s CCTV surveillance system. Currently, these videos are monitored by humans. Detecting incidents requires the person monitoring to watch multiple video feeds at once and to stay attentive at all times.

Assume this monitoring process needs to be automated. The amount of data fed in is huge: a few terabytes every hour. Processing must happen in close to real time to detect incidents as they occur. Clearly, this requires a system that can store very large volumes of streaming data, process it at very high speed, and be flexible enough to run any custom algorithm on the data.
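
A rough back-of-the-envelope calculation shows what "a few terabytes every hour" means for such a system. The sketch below assumes 3 TiB per hour purely for illustration, along with a 128 MiB HDFS block size and a replication factor of 3, which are common defaults in Hadoop 2 and later; a real deployment may be configured differently. The function name `ingest_estimate` is hypothetical.

```python
TIB = 1024 ** 4  # bytes in one tebibyte
MIB = 1024 ** 2  # bytes in one mebibyte


def ingest_estimate(tib_per_hour, block_mib=128, replication=3):
    """Estimate sustained throughput and HDFS footprint for an hourly volume.

    All inputs are assumptions supplied by the caller, not measured values.
    """
    bytes_per_hour = tib_per_hour * TIB
    return {
        # Sustained write rate needed to keep up with the feed.
        "mib_per_second": bytes_per_hour / 3600 / MIB,
        # Number of HDFS blocks created per hour (before replication).
        "blocks_per_hour": bytes_per_hour // (block_mib * MIB),
        # Raw disk consumed per hour once every block is replicated.
        "raw_tib_per_hour": tib_per_hour * replication,
    }


if __name__ == "__main__":
    for name, value in ingest_estimate(3).items():
        print(f"{name}: {value:,.0f}")
```

Under these assumptions the cluster has to absorb roughly 874 MiB every second, create 24,576 blocks per hour, and set aside 9 TiB of raw disk per hour, which is why storage and processing both have to be spread across many machines.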

Hadoop has the capabilities listed above and can be used effectively in this scenario. However, for many kinds of unstructured data, mainly video and audio, designing optimized algorithms to extract useful information is still a challenging research problem. With the pace of innovation in the data space, we are sure to see new and improved techniques and tools in the very near future. Watch this space: the team at Jigsaw will keep you updated as they arrive.

Interested in a career in Big Data? Check out Jigsaw Academy’s Big Data courses and see how you can get trained to become a Big Data specialist.
