Using Pipes in R

By: Gunnvant Singh, Faculty

 

Let’s be honest: once you start doing serious data wrangling in R, the code starts to look ugly. One of the reasons for this is the functional nature of R. To accomplish a decent data manipulation task, one has to write nested function calls. Writing nested function calls is not that difficult; what is difficult is reading them. Consider the code below:

library(babynames)

library(dplyr)

data(babynames)

head(babynames)

sum(select(filter(babynames, sex == "F", name == "Mary"), n))

If you look at the last line, it takes a while to figure out what that piece of code does. Let me break it down for you:

1. First, the data is subset based on sex and name. The call filter(babynames, sex == "F", name == "Mary") keeps all the observations in the data where the sex is female and the name is Mary.

2. After this, the command ‘select’ is being used to pick the column named ‘n’ from the subsetted data.

3. The function sum() is being used to add the numbers present in the selected column ‘n’.

The key to understanding this line of code is to read it from the inside out. You will agree that this is not very straightforward: one has to think it through to understand the piece of code.
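The three steps above can also be written out one at a time with intermediate variables. This is only a minimal sketch: the data frame toy is a made-up stand-in with the same column names as babynames (its values are invented), used so that the result is easy to check by hand.

```r
library(dplyr)

# Hypothetical toy data with the same columns as babynames (values are made up)
toy <- data.frame(
  year = c(1880, 1880, 1881, 1881),
  sex  = c("F", "M", "F", "F"),
  name = c("Mary", "John", "Mary", "Anna"),
  n    = c(10, 20, 30, 40)
)

# Step 1: keep only the rows where sex is "F" and name is "Mary"
marys <- filter(toy, sex == "F", name == "Mary")

# Step 2: keep only the column n
counts <- select(marys, n)

# Step 3: add up the values in that column
total <- sum(counts)
total  # 10 + 30 = 40
```

Each intermediate variable corresponds to one layer of the nested call, read from the inside out.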

What we just did above by nesting the functions ‘filter’, ‘select’ and ‘sum’ is called function composition. Essentially, if I have functions ‘f’, ‘g’ and ‘z’, this is what I am trying to achieve through nesting:

z(g(f(x))) ~ sum(select(filter(data)))
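To make the composition concrete, here is a small sketch in base R. The names f, g and z come from the text above; their bodies are assumptions chosen purely for illustration.

```r
# Illustrative function bodies (assumed, not from any package)
f <- function(x) x + 1   # applied first
g <- function(x) x * 2   # applied to the result of f
z <- function(x) x^2     # applied to the result of g

# Nested call, evaluated from the inside out:
# f(3) = 4, then g(4) = 8, then z(8) = 64
result <- z(g(f(3)))
result  # 64
```

The innermost function runs first, which is exactly why nested code has to be read from the inside out.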

Can function composition be achieved in R without using nested functions? Well, until about a year ago, the answer to this question would have been ‘no’. But then two important things happened:

(1) Stefan Milton Bache came out with a package called ‘magrittr’ in January 2014, implementing the %>% (pipe) operator.

(2) Hadley Wickham adopted Stefan’s pipe operator in his dplyr package.

Since dplyr is a very powerful data manipulation package in R, its adoption of the %>% operator has contributed to the popularity of pipes in R.

Before I describe the %>% (pipe) operator in the context of R, let’s take a look at how piping works.

When we use pipes, instead of writing the argument inside the function call, we write it to the left of the function. The value on the left of %>% is passed as the first argument of the function on the right, so x %>% f(y) means f(x, y).
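A quick way to convince yourself of this is to compare a piped call with its ordinary form. The sketch below assumes the magrittr package is installed; the vector x is an arbitrary example.

```r
library(magrittr)

x <- c(1.234, 5.678)  # arbitrary example values

# The left-hand side becomes the first argument of the call on the right
piped  <- x %>% round(1)
nested <- round(x, 1)
identical(piped, nested)  # TRUE

# Pipes can be chained: this is the same as sum(round(x, 1))
chained <- x %>% round(1) %>% sum()
identical(chained, sum(round(x, 1)))  # TRUE
```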

Let us see how the nested function call written above can be simplified using the %>% operator. Consider the code below:

babynames %>% filter(sex == "F", name == "Mary") %>% select(n) %>% sum

Three functions are used in this code: filter, select and sum.

Let us look at how %>% works, one step at a time.

babynames %>% filter(sex == "F", name == "Mary")

babynames is piped to the filter function. Note that babynames is a data frame.

babynames %>% filter(sex == "F", name == "Mary") %>% select(n)

The result of babynames %>% filter(sex == "F", name == "Mary") is piped into the select function.

babynames %>% filter(sex == "F", name == "Mary") %>% select(n) %>% sum

Finally, the result of select(n) is piped to the sum() function.

Now if we look at the code as a whole, babynames %>% filter(sex == "F", name == "Mary") %>% select(n) %>% sum, we can read it as: take the data babynames, then subset it according to sex and name, then from this subsetted data select the column ‘n’, and then find the sum of all the values in this column. Compare this with our original code: sum(select(filter(babynames, sex == "F", name == "Mary"), n)).
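If you want to check that the two styles really give the same answer, a sketch like the following should do. It uses a made-up data frame toy (hypothetical values, same column names as babynames) so that it runs without the babynames package, and it writes the pipeline with one step per line, which is a common way to lay out longer pipes.

```r
library(dplyr)  # dplyr makes the %>% operator available

# Hypothetical toy data with the same columns as babynames (values are made up)
toy <- data.frame(
  sex  = c("F", "M", "F", "F"),
  name = c("Mary", "John", "Mary", "Anna"),
  n    = c(10, 20, 30, 40)
)

# Nested style: read from the inside out
nested <- sum(select(filter(toy, sex == "F", name == "Mary"), n))

# Piped style: read from top to bottom
piped <- toy %>%
  filter(sex == "F", name == "Mary") %>%
  select(n) %>%
  sum()

identical(nested, piped)  # TRUE, both are 40
```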

Clearly, piping enhances the readability of R code and makes complex data manipulation very easy.

