More data usually beats better algorithms … point is, adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set – Anand Rajaraman, Co-Author of “Mining of Massive Datasets”, Entrepreneur and Academic teaching at Stanford
The above statement came to mind when we were working on New York City taxi trip data.
At Jigsaw Academy our pursuit of real-world datasets for Big Data hands-on exercises and case studies led us to New York City’s taxi trip data. These datasets are collected by the NYC Taxi and Limousine Commission (TLC). Each dataset includes trip records from all trips completed in yellow and green taxis in NYC from 2009 onwards. Records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
For May 2016 alone, the Green Taxi trip data contains 1.5 million records, which works out to close to 50,000 taxi trips per day. And this is just for Green taxis, not counting Yellow Taxi trips, which are perhaps even more numerous. With the data elements mentioned above, this dataset is a rich source of information, and the scope for analytics on it is immense.
The first thing that comes to mind is estimating demand: finding the location with the most pick-ups during a given time slot, or the time of day with the highest demand for cabs at a given location. With this much data available, one can use techniques such as regression to predict the demand for a given location or time slot. We are all too familiar with the surge pricing that taxi aggregators charge in Indian cities, which applies such methods to the demand details they capture.
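As a minimal sketch of the simplest demand question above (which hour of the day sees the most pick-ups), the following Java class counts pick-ups per hour. The class name, method names, and the "yyyy-MM-dd HH:mm:ss" timestamp layout are assumptions made for illustration; the actual TLC files may need a different pattern, and on the full dataset the same grouping would normally be done in Pig or Hive rather than in memory.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PickupDemand {
    // Assumed timestamp layout, e.g. "2016-05-01 08:15:00" (hypothetical; adjust to the real file).
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Counts pick-ups per hour of day (0-23), returned sorted by hour. */
    public static Map<Integer, Integer> pickupsPerHour(List<String> pickupTimestamps) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String ts : pickupTimestamps) {
            int hour = LocalDateTime.parse(ts, FORMAT).getHour();
            counts.merge(hour, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the hour with the most pick-ups, or -1 if the map is empty. */
    public static int busiestHour(Map<Integer, Integer> counts) {
        int bestHour = -1;
        int bestCount = 0;
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            // Strict comparison: on a tie, the hour seen first in iteration order wins.
            if (entry.getValue() > bestCount) {
                bestHour = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return bestHour;
    }

    public static void main(String[] args) {
        // Made-up sample timestamps, not taken from the TLC data.
        List<String> sample = Arrays.asList(
                "2016-05-01 08:15:00",
                "2016-05-01 08:45:10",
                "2016-05-01 17:05:00");
        Map<Integer, Integer> counts = pickupsPerHour(sample);
        System.out.println("Pick-ups per hour: " + counts);
        System.out.println("Busiest hour: " + busiestHour(counts));
    }
}
```

The same counting idea extends to locations by grouping on a location key instead of the hour.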
From the viewpoint of New York City’s MTA (Metropolitan Transportation Authority), this data can be used to derive inputs for changes or additions to bus schedules and routes by looking at the co-occurrences of pickup and drop-off locations. An end user can use this data to decide the best time to visit a particular location on a weekday if the aim is to minimize commute time, or to find the time of day when the taxi fare is likely to be lowest if the priority is to reduce cost. Most of these questions can be answered with an initial analysis of the data using tools like MapReduce, Pig and/or Hive.
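To make the idea of co-occurring pickup and drop-off locations concrete, here is a rough Java sketch that snaps coordinates to a coarse grid and counts how often each (pickup cell, drop-off cell) pair appears. The grid size of 0.01 degrees, the class and method names, and the sample coordinates are all hypothetical; this illustrates the counting step only and is not the analysis used in our case study.

```java
import java.util.HashMap;
import java.util.Map;

public class RouteCooccurrence {
    // Hypothetical grid size in degrees; smaller values give finer cells.
    private static final double CELL_DEGREES = 0.01;

    /** Maps a coordinate pair to a grid-cell label such as "4075:-7399". */
    public static String cellOf(double latitude, double longitude) {
        long row = (long) Math.floor(latitude / CELL_DEGREES);
        long col = (long) Math.floor(longitude / CELL_DEGREES);
        return row + ":" + col;
    }

    /**
     * Counts trips per (pickup cell, drop-off cell) pair.
     * Each trip is {pickupLat, pickupLon, dropLat, dropLon}.
     */
    public static Map<String, Integer> countPairs(double[][] trips) {
        Map<String, Integer> counts = new HashMap<>();
        for (double[] t : trips) {
            String key = cellOf(t[0], t[1]) + " -> " + cellOf(t[2], t[3]);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up coordinates for illustration only.
        double[][] trips = {
            {40.7512, -73.9876, 40.7612, -73.9712},
            {40.7534, -73.9851, 40.7688, -73.9765},
            {40.7012, -73.9976, 40.7612, -73.9712}
        };
        countPairs(trips).forEach((pair, n) -> System.out.println(pair + " : " + n));
    }
}
```

Pairs with high counts point to corridors that many riders cover by taxi, which is the kind of input a route planner might examine.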
The data has the date and time of pickup and drop-off in the standard timestamp format but does not include the day of the week. Likewise, the pickup and drop-off locations are given as longitude and latitude, not as a complete address with a neighborhood. So for our case study, we had to add these details as derived fields to each of the tens of thousands of records. We wrote Pig UDFs (User Defined Functions) in Java and used them in our Pig scripts to augment the data.
Although recent versions of Pig have DateTime as a data type and several built-in DateTime functions, we did not find a built-in function that returns the day of the week. The UDF we wrote takes the date and time in the timestamp format and returns the day of the week as a string. For getting the complete address and neighborhood for a given pair of longitude and latitude, there are several web services available. These web services provide APIs that take the longitude and latitude as parameters and return the complete address. We picked one of these web service APIs and called it from our Java code for the Pig UDF.
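The core of a day-of-week function like the one described above can be sketched in plain Java with the standard java.time classes. This is an illustration under stated assumptions, not the UDF we actually wrote: the class name, the method name, and the "yyyy-MM-dd HH:mm:ss" timestamp layout are hypothetical. To use it from Pig, the logic would need to be wrapped in a class extending org.apache.pig.EvalFunc<String>, which is left out here so the example has no dependency on the Pig libraries.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.TextStyle;
import java.util.Locale;

public class DayOfWeekName {
    // Assumed timestamp layout (hypothetical); change the pattern to match the real data.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /**
     * Returns the English day name (e.g. "Sunday") for a timestamp string,
     * or null when the input is null or blank.
     */
    public static String dayOfWeek(String timestamp) {
        if (timestamp == null || timestamp.trim().isEmpty()) {
            return null;
        }
        LocalDateTime dateTime = LocalDateTime.parse(timestamp.trim(), FORMAT);
        return dateTime.getDayOfWeek().getDisplayName(TextStyle.FULL, Locale.ENGLISH);
    }

    public static void main(String[] args) {
        System.out.println(dayOfWeek("2016-05-01 00:29:01")); // prints "Sunday"
    }
}
```

Returning null for missing input, rather than throwing, is a design choice that lets a script carry on past incomplete records; malformed timestamps still raise an exception here.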
One point to note is that most of these web services require users to register and obtain a key if they plan to call the service a substantial number of times. Without a key, the service can be used about 10 or 12 times a day; with a key, you can make a couple of thousand calls per day. To go beyond this number, one has to use their paid services. This is the option we have taken, as we have a few hundred thousand calls to make in order to get the pickup and drop-off addresses and neighborhoods for all the records we want to use.
As these functions are not readily available in Pig, we plan to contribute them to PiggyBank, the repository of UDFs maintained by Apache Pig. The JAR files will be given to the students as part of the case study, and they will be asked to use these UDFs in Pig scripts that add the additional data fields to the dataset. This provides a good hands-on exercise in addition to the questions for which they need to write MapReduce routines, Pig scripts and Hive statements.
Coming back to the statement quoted at the outset: adding more independent data surpasses using more sophisticated algorithms for data analysis. This rings true, especially in predictive analytics problems. For instance, by adding data such as the time of the year (whether it is summer vacation time or festival season), and/or the schedules and routes of public transport buses or a city’s metro or subway, to the taxi trip data, one can tease several more patterns from the data and draw more insightful inferences or, to use a more current term, perform more effective prescriptive analytics.