Big data and Apache Hadoop are often mentioned in the same breath in industry conversations and practice. We have been talking about big data and unstructured data quite a lot recently, and a March 2012 post on the SAS blog notes that:
‘The SAS/ACCESS Interface to Hadoop enables Hadoop users to tap into the power of SAS by extending support for the complete analytics life cycle to Hadoop, including discovery, data preparation, modeling and deployment.’
What is Apache Hadoop?
Apache is a non-profit community of developers who work on free and open-source software. Hadoop is an open-source software framework, written in Java, that supports data-intensive distributed applications. It was inspired by Google's published work on MapReduce and the Google File System, the technology Google built while indexing the web and analysing user behaviour to improve its algorithms and extract other useful, actionable data.
Hadoop thus helps you store and process large volumes of unstructured and complex data that may not fit into structured tables, and it lets you run analytics such as clustering and targeting on that data.
Hadoop is designed to run on multiple machines that do not share any hardware. A master node keeps track of where the different pieces of the data are stored, and multiple copies are made of each block of data. Thus, it is a distributed, fault-tolerant storage system rather than a centralised database.
Complex computational queries are worked on across the multiple processors, and the outputs are then combined to give a unified answer or result.
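The idea of splitting a job across machines and then combining the partial results can be sketched with a tiny word-count example. This is not Hadoop code, just an illustrative Python sketch: each string in `chunks` stands in for a block of data stored on a separate machine, `map_count` plays the role of the map step run by each worker, and `reduce_counts` merges the partial outputs into one unified answer.

```python
from collections import Counter
from multiprocessing import Pool

# Hypothetical data: each "chunk" stands in for a block of data
# that would live on a different machine in a real cluster.
chunks = [
    "big data and hadoop",
    "hadoop stores big data",
    "data is big",
]

def map_count(chunk):
    # Map step: each worker counts words in its own chunk independently.
    return Counter(chunk.split())

def reduce_counts(partials):
    # Reduce step: merge the partial counts into one unified result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    # Run the map step in parallel, one process per chunk.
    with Pool(3) as pool:
        partials = pool.map(map_count, chunks)
    print(reduce_counts(partials))
```

The point is the shape of the computation, not the word count itself: the map work needs no coordination between workers, so it scales out to as many machines as there are chunks, and only the small partial results travel back to be reduced.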
So which applications on Hadoop are free? Not many: companies like IBM, SAS, and others sell paid solutions that work on or with Hadoop.
And who are the largest users of Hadoop? Yahoo and Facebook are the largest. Other notable names include Amazon.com, American Airlines, AOL, Apple, eBay, the Federal Reserve Board of Governors, Hewlett-Packard, IBM, ISI, Twitter, SAS Institute, LinkedIn, Microsoft, and more.
Interestingly, Hadoop was the name of a toy elephant owned by the son of Doug Cutting, the creator of Hadoop!