The big daddy of big data, Google, bid goodbye to MapReduce in 2014-15. Google's original papers inspired the development of Hadoop, with MapReduce as its core parallel processing engine. This certainly sounds like a death knell for MapReduce and Hadoop, and it immediately turns our heads towards Spark. However, let's look at the real story.
Almost every article on Spark on the web lambasts MapReduce, citing Spark's performance as 100x faster and making MapReduce and Hadoop appear puny. The reality is that the 100x figure is a best case, achieved in highly idealized scenarios; in the worst case, Spark is roughly 3x faster than MapReduce. One of the key reasons is that MapReduce depends extensively on disk I/O between stages, which is slow compared with Spark's in-memory operations. But this is not the end of the story.
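To make the disk-I/O point concrete, here is a toy simulation in plain Python (not actual MapReduce or Spark APIs; the stage functions and file names are illustrative assumptions). It contrasts a MapReduce-style pipeline, where every stage round-trips its output through disk, with a Spark-style pipeline that keeps intermediates in memory:

```python
import json
import os
import tempfile

# Three illustrative pipeline stages applied in sequence.
stages = [
    lambda nums: [n * 2 for n in nums],       # stage 1: scale
    lambda nums: [n for n in nums if n > 4],  # stage 2: filter
    lambda nums: [n + 1 for n in nums],       # stage 3: shift
]

def run_with_disk_io(data, workdir):
    """MapReduce style: each stage reads its input from disk and
    writes its output back, paying serialization costs every time."""
    path = os.path.join(workdir, "stage_0.json")
    with open(path, "w") as f:
        json.dump(data, f)
    for i, stage in enumerate(stages, start=1):
        with open(path) as f:
            current = json.load(f)
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump(stage(current), f)
    with open(path) as f:
        return json.load(f)

def run_in_memory(data):
    """Spark style: stages are chained directly in memory,
    with no intermediate serialization to disk."""
    for stage in stages:
        data = stage(data)
    return data

# Both strategies compute the same result; only the I/O pattern differs.
with tempfile.TemporaryDirectory() as d:
    assert run_with_disk_io([1, 2, 3, 4], d) == run_in_memory([1, 2, 3, 4])
```

The results are identical; the cost difference is entirely in the repeated serialization and disk round-trips, which is where the Spark speedups come from.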
Speed is not Spark's only USP; speed and flexibility in a single package are. Flexibility here refers to the ability to handle both batch-oriented jobs and interactive, iterative workloads such as machine learning. These are areas where MapReduce lags severely, restricting it to batch-oriented jobs: something close to traditional data-warehousing applications, predominantly involving ETL operations.
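What makes iterative workloads special is that the same dataset is scanned on every iteration. The sketch below (plain Python, not Spark code; the dataset and learning rate are made-up assumptions) shows the access pattern with a simple gradient-descent fit. Spark can serve these repeated scans from a dataset cached in memory, whereas MapReduce would launch a fresh job, with fresh disk reads, for every pass:

```python
# (x, y) samples that roughly follow y = 2x.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]

def gradient_step(w, points, lr=0.02):
    """One full pass over the dataset: gradient of mean squared
    error for the 1-D model y = w * x."""
    grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
    return w - lr * grad

w = 0.0
for _ in range(200):           # 200 full scans of the SAME data
    w = gradient_step(w, data)

assert abs(w - 2.0) < 0.1      # converges near the true slope of ~2
```

Two hundred iterations means two hundred scans of the identical input. In Spark the loop body is cheap once the data is cached; in MapReduce each of those scans carries full job-launch and disk-read overhead, which is why iterative machine learning on MapReduce is so painful.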
It is obviously not doom and gloom for Hadoop and MapReduce; they will stay afloat for quite some time. After all, mainframe systems (the grandfathers of computers) are still in use at major Wall Street financial trading firms. The advent of Spark has simply shifted the momentum away from Hadoop and MapReduce, and the real focus is now on fast, interactive, real-time analytics involving streaming data. The promising candidates to look out for in this space include Spark, Storm, and Kafka, and definitely not MapReduce, which was never built for this.