By: Gunnvant Singh, Faculty
Let’s be honest: once you start doing serious data wrangling in R, the code starts to look ugly. One of the reasons for this is the functional nature of R. To accomplish a decent data manipulation task, one has to write nested functions. Writing nested functions is not that difficult; what’s difficult is reading them! Sample the code below:
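A minimal sketch of the kind of nested call discussed here, assuming the dplyr package is installed. To keep it self-contained, it uses a tiny made-up stand-in for the babynames data (the counts are hypothetical; only the column names year, sex, name and n follow the real data set):

```r
library(dplyr)  # provides filter() and select()

# Hypothetical stand-in for the babynames data, for illustration only
babynames <- data.frame(
  year = c(1880, 1880, 1881, 1881),
  sex  = c("F", "M", "F", "F"),
  name = c("Mary", "John", "Mary", "Anna"),
  n    = c(10, 20, 30, 40)
)

# Nested call: filter() runs first, then select(), then sum()
total_mary <- sum(select(filter(babynames, sex == "F", name == "Mary"), n))
total_mary  # 40 for this toy data
```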
Now, if you pay attention to the last line, it takes a while to figure out what is happening in that piece of code. Let me break it down for you:
1. First I subset the data based on sex and name. Essentially, the line of code filter(babynames, sex == "F", name == "Mary") selects all the observations in the data where the sex is female and the name is Mary.
2. After this, the command ‘select’ is used to pick the column named ‘n’ from the subsetted data.
3. The function sum() is used to add the numbers present in the selected column ‘n’.
The key to understanding this line of code is to read it from the inside out. You will all agree that this is not very straightforward, and one has to think it through to understand the piece of code.
What we just did above by nesting the functions ‘filter’, ‘select’ and ‘sum’ is called function composition. Essentially, if I have functions ‘f’, ‘g’ and ‘z’, this is what I am trying to achieve through nesting:
z(g(f(x))) ~ sum(select(filter(data))).
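To make the idea of composition concrete, here is a small sketch in base R. The functions ‘f’, ‘g’ and ‘z’ are hypothetical toy functions chosen only for illustration; the point is that the nested call gives the same result as applying the functions one step at a time:

```r
# Hypothetical toy functions, for illustration only
f <- function(x) x + 1
g <- function(x) x * 2
z <- function(x) x^2

x <- 3

# Nested form: f runs first, then g, then z
nested <- z(g(f(x)))

# The same computation, one step at a time
step1 <- f(x)      # 4
step2 <- g(step1)  # 8
step3 <- z(step2)  # 64

identical(nested, step3)  # TRUE
```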
Can function composition be achieved in R without using nested functions? Well, until about a year ago, the answer to this question would have been ‘no’. But then two important things happened:
(1) Stefan Milton Bache came out with a package called ‘magrittr’ in January 2014, implementing the %>% (pipe) operator.
(2) Hadley Wickham adopted Stefan’s pipe operator in his dplyr package.
Since dplyr is a very powerful data manipulation package in R, its adoption of the %>% operator has contributed to the popularity of pipes in R.
Before I describe the %>% (pipe) operator in the context of R, let’s take a look at how piping works.
When we use pipes, instead of providing the arguments directly inside the function, we provide them ‘near’ the function: the value on the left of the pipe is passed as the first argument of the function on its right.
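A small sketch of this behaviour, assuming the magrittr package is installed. The vector x is a made-up example; the two comparisons show that piping a value into a function is the same as passing it as that function's first argument:

```r
library(magrittr)  # provides the %>% operator

x <- c(1, 4, 9)  # made-up example values

# x %>% f() is the same as f(x)
a <- x %>% sqrt()
b <- sqrt(x)
identical(a, b)

# x %>% f(y) is the same as f(x, y):
# the piped value becomes the first argument
c1 <- x %>% head(2)
c2 <- head(x, 2)
identical(c1, c2)
```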
Let us see how the nested function call written above can be simplified using the %>% operator. Sample the code below:
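A minimal sketch of the piped version, assuming dplyr is installed (dplyr makes %>% available when loaded). As a self-contained illustration it uses a tiny made-up stand-in for the babynames data, with hypothetical counts:

```r
library(dplyr)  # provides filter(), select() and %>%

# Hypothetical stand-in for the babynames data, for illustration only
babynames <- data.frame(
  year = c(1880, 1880, 1881, 1881),
  sex  = c("F", "M", "F", "F"),
  name = c("Mary", "John", "Mary", "Anna"),
  n    = c(10, 20, 30, 40)
)

# Piped call: read from top to bottom
total_mary <- babynames %>%
  filter(sex == "F", name == "Mary") %>%
  select(n) %>%
  sum()

total_mary  # 40 for this toy data
```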
There are three functions being used in the whole code: filter, select and sum.
Let us look at how %>% works:
1. babynames is piped to the filter function. Note that babynames is a data frame.
2. The result of babynames %>% filter(sex == "F", name == "Mary") is piped into the select function.
3. Finally, the result of select(n) is piped to the sum() function.
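The stages described above can be sketched one at a time by storing each intermediate result. This assumes dplyr is installed and again uses a small made-up stand-in for the babynames data; the names step1, step2 and step3 are mine, introduced only for illustration:

```r
library(dplyr)

# Hypothetical stand-in for the babynames data, for illustration only
babynames <- data.frame(
  year = c(1880, 1880, 1881, 1881),
  sex  = c("F", "M", "F", "F"),
  name = c("Mary", "John", "Mary", "Anna"),
  n    = c(10, 20, 30, 40)
)

step1 <- babynames %>% filter(sex == "F", name == "Mary")  # data frame: rows for female Marys
step2 <- step1 %>% select(n)                               # data frame: only the column n
step3 <- step2 %>% sum()                                   # single number: total of n

nrow(step1)   # 2 rows in this toy data
names(step2)  # "n"
step3         # 40
```

Each stage hands its result to the next as the first argument, which is why the data frame never has to be named again after the first line.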
Now, if we look at the code as a whole, babynames %>% filter(sex == "F", name == "Mary") %>% select(n) %>% sum, we can read it as: take the data babynames, then subset it according to sex and name, then select column ‘n’ from this subsetted data, and then find the sum of all the values in this column. Compare this with our original code: sum(select(filter(babynames, sex == "F", name == "Mary"), n)).
Clearly, piping enhances the readability of R code and makes complex data manipulation much easier to write and follow.