Machine Learning in Hadoop

17 Feb 2015

Machine learning is one of the vital components of data analytics. It is widely used in the medical domain (for example, cancer prediction), natural language processing, search engines, recommendation engines, bioinformatics, image processing, text analytics and much more.

Image courtesy: Hugo Penedones

Statistical tools like R and SAS have packages designed specifically for executing machine learning algorithms on structured and unstructured data. Machine learning algorithms also gain in significance as the data grows, especially when it is unstructured, because the task then becomes making sense of thousands of parameters across billions of data values. With the analytics industry’s interest expanding towards Big Data, let’s evaluate Hadoop MapReduce with respect to implementing machine learning algorithms.

How easy is it to code machine learning jobs in Java MapReduce?

Writing Java MapReduce code even for the most common analytics tasks, like join and group-by, is tedious and time-consuming. Machine learning algorithms are often far more complex. Fitting algorithms for clustering, classification, neural networks etc. into the MapReduce framework and coding them in Java can be nearly impossible for analysts. Higher-level languages such as Pig and Hive, both Apache projects, exist for the convenience of analysts. Similarly, to facilitate machine learning on Big Data, the Apache Software Foundation is working on a project called ‘Apache Mahout’.
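To give a feel for why even a group-by takes some effort in this model, here is a minimal sketch of the map, shuffle and reduce phases in plain Python. It is not the Hadoop API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `group_by_sum`) and the `sales` data are hypothetical, and everything runs in a single process purely for illustration.

```python
from collections import defaultdict

# Illustrative only: a single-process imitation of the MapReduce phases,
# not the Hadoop API. All names and data here are made up for the example.

def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for region, amount in records:
        yield region, amount

def shuffle_phase(pairs):
    """Collect all values that share a key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Collapse each key's values into one aggregate (here, a sum)."""
    return {key: sum(values) for key, values in groups.items()}

def group_by_sum(records):
    """Equivalent of: SELECT region, SUM(amount) ... GROUP BY region."""
    return reduce_phase(shuffle_phase(map_phase(records)))

sales = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 1)]
totals = group_by_sum(sales)
print(totals)
```

A one-line group-by in a query language becomes three separate phases here, and a real Java job adds job configuration, key and value types, and input and output handling on top of that.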

The goal of Apache Mahout is to provide scalable libraries that enable running various machine learning algorithms on Hadoop in a distributed manner. As of now, Mahout supports only clustering, classification and recommendation mining. Apache Mahout algorithms are currently implemented on top of the Hadoop MapReduce framework.

Is MapReduce efficient for machine learning algorithms?

Even though the Mahout libraries make it much easier to apply machine learning algorithms, the underlying MapReduce framework in Hadoop has performance limitations, since MapReduce writes data to disk while processing. With the advent of YARN in Hadoop 2.0, Apache Spark, an alternative framework to MapReduce, is gaining popularity. In contrast to Hadoop’s two-stage, disk-based MapReduce paradigm, Spark’s in-memory primitives provide performance up to 100 times faster for certain applications. Many machine learning algorithms are iterative and make repeated passes over the same data, so by allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well suited to them. Work is in progress to migrate Mahout’s machine learning libraries from MapReduce to Spark.
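The point about repeated passes can be sketched with a tiny one-dimensional k-means in plain Python. This is an illustration under stated assumptions, not Mahout or Spark code: the names `assign` and `kmeans_1d`, the sample points and the starting centroids are all hypothetical. The pass counter shows how many times the full dataset is scanned before the centroids settle; in a disk-based framework each such scan would mean reading the input again, while an in-memory framework can reuse the loaded data.

```python
from collections import defaultdict

# Illustrative only: single-process k-means on 1-D points.
# Names and data are made up; this is not Mahout or Spark code.

def assign(point, centroids):
    """Map step: return the index of the centroid nearest to the point."""
    return min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))

def kmeans_1d(points, centroids, iterations=10):
    """Run k-means and count how many full passes over the data were needed."""
    centroids = list(centroids)
    passes = 0
    for _ in range(iterations):
        groups = defaultdict(list)
        for p in points:  # one full pass over the whole dataset
            groups[assign(p, centroids)].append(p)
        passes += 1
        # Reduce step: each new centroid is the mean of its assigned points;
        # a centroid with no points keeps its previous position.
        new_centroids = [
            sum(groups[i]) / len(groups[i]) if groups[i] else centroids[i]
            for i in range(len(centroids))
        ]
        if new_centroids == centroids:  # converged, stop iterating
            break
        centroids = new_centroids
    return centroids, passes

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, passes = kmeans_1d(points, [0.0, 5.0])
print(centroids, passes)
```

Even this six-point example needs three passes over the data, and real datasets typically need many more, which is why the cost of each pass matters so much for machine learning workloads.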

Until 2013, the focus was on developing technologies to meet the various challenges of Big Data; interest is now moving towards enabling analytics on Big Data. Apart from the development activity in Apache’s open-source projects, a number of start-ups are emerging with products for performing advanced analytics, such as predictive modelling, regression, and supervised and unsupervised learning, on Big Data in Hadoop. With more than 100 developers actively contributing to Apache Spark and Mahout, we can surely look forward to more efficient libraries and products for machine learning in Hadoop in the coming days.

