Operation Santa Claus!!!
Last weekend, I happened to watch a Christmas flick “Arthur Christmas” featured on HBO. It is an animation movie released in theatres worldwide three years ago and the plot revolves centrally around the Christmas present’s distribution to children by Santa Claus and his family. There’s more to this plot with the ingredients of a Christmas classic – Santa missing out a child present out of millions and Arthur, Santa’s youngest son comes to the rescue with only few hours remaining before the Christmas dawn. Over the years, we all wondered how Santa Claus manages to go around the whole world in just one night and deliver presents in every house. Well, this movie tries to provide a solution to that question and surprisingly it is not with the help of traditional Santa cart pulled by reindeers but by using cutting edge technologies to solve the biggest delivery problem one could think of. If you are more a numbers person, then this would translate to about Santa roughly covering about land area of 60 million square miles to deliver 2 billion presents within the time span between 12:00 AM and 6:00 AM (360 minutes) on the Christmas day. Surely this sounds like a gigantic optimization problem given the diversity of presents to be delivered, vastness of the area to be covered and further with the help of smallest time frame possible. If you break down the time frame available for delivery across different continents, this would equate to Santa’s spending time of 108 minutes for Asia, 73 minutes for Africa, and 60 minutes for North America and probably 24 minutes for entire Europe.
In the movie “Arthur Christmas”, Santa stationed with thousands of elves at North Pole runs a state of the art facility which looks into tasks such as gathering children’s wish lists for Christmas, sorting and labelling presents based on locations, and devising a strategic execution plan for the final delivery on the Christmas day. On the D-Day, Santa along with a huge army of elves drives around the world in a most sophisticated air craft which drives about 150,000 miles per hour, covers continent by continent and delivers labelled presents to the children present in every corner of the world. But the ultimate question would be how Santa would go about deciding on the time frame for each country in terms of present distribution. One way to go about this would be based on land area distribution of each country as in higher the area higher the time required. But this technique would work assuming the present wishes from the children are directly proportional to the size of the countries which is more of an ideal scenario. This leads a plausible conclusion that Santa needs to first identify the distribution of presents across each country, and using this output then would devise a delivery plan by better prioritizing his time and resources. Definitely manual process is ruled out as we are talking about millions of presents and one can think of using MapReduce framework as a possible solution for the time allotment problem.
MapReduce framework is highly effective in handling big data which addresses scalability and efficiency issues most commonly encountered while processing large amounts of data. MapReduce forms a core component of Hadoop, the most common big data technology, which handles big data in a parallel and distributed manner rather than as a whole. Typically Hadoop cluster consists of multiple nodes often ranging between 10 to 1 million which would divide the big data into multiple chunks to enable parallel processing. Using MapReduce framework, one can write Java programs which would be executed chunks of data present on individual nodes separately and the final output is reported to the user. Another important thing to remember is that MapReduce programs deal with input and output data in terms of key-value pairs.
Let us now see how Santa applies the MapReduce framework to the time allotment problem.
After gathering of the children’s wish lists across the world, first thing would be loading the data into a Hadoop cluster. Say, this data would have attributes like name, present wished, street, city and country for all the 2 billion children. Our objective here would be to get an overall distribution of presents wished for across various countries.
Generally, MapReduce program has 3 stages: Map/Shuffle/Reduce. The Shuffle part is done automatically by Hadoop, and Santa just need to implement the Map and Reduce parts.
For Map part, the input data should be as <Key, Value> pairs. Based on the time allotment problem requirements, Santa would enter the Key as country and Value as the respective present wished. As Map part executes, the input values are first sent to the mapper that will count each present wished for each country: a list of (key/value) pairs is thus created with the Country as Key and its count as Value which would be numeric value 1.
Now, the shuffle task will run on the output of the Map task. Shuffling task output is what the Reduce task will get as input: the Key, List<Value>. The Reduce task is the one that does the logic on the data, in Santa’s case this would be computing frequency distribution of presents wished by country. As Reduce part executes, the reducer basically aggregates the key-value pairs producing a final list with the countries and number of presents wished.
This above demonstration is a simple and straight forward one, and for Santa’s time allotment problem, actual application would be quite complex and involves processing millions of rows. Thanks to MapReduce, now even exponential growth of presents wished year after year across countries wouldn’t bother Santa even a bit. After all, being Santa is the best job in the world, the one which involves spreading smiles on every child’s face.