If You’re a Data Analyst, You Should Read This Review of Hadley’s readr 0.1.0 Right Now

 

I must admit that I am a big fan of Hadley Wickham and his packages. So the moment I heard that his new readr package was out on CRAN, I decided to check it out.

I thought it would be interesting to compare the file read times of readr’s read_csv() function with those of data.table’s fread() and the base R function read.csv(). To do this, I used my Linux machine with 4 GB of RAM and a very old Core 2 Duo processor.

 

I imported a 67.3 MB CSV file using each of the functions mentioned above. The code used and the read times are below:

library(readr)
library(data.table)
setwd("/media/ramius/E2A02905A028E1B1/Work/Jigsaw Academy")
#Read time for read.csv()
pt<-proc.time()
data<-read.csv("telecom.csv")
proc.time()-pt

##    user  system elapsed
##  15.063   0.071  15.140

#Data stored as data.frame
class(data)

## [1] "data.frame"

#Read time for readr's read_csv()
pt1<-proc.time()
data1<-read_csv("telecom.csv")
proc.time()-pt1

##    user  system elapsed
##   2.449   0.040   2.489

#Data stored as tbl_df (a data.frame subclass)
class(data1)

## [1] "tbl_df"     "tbl"        "data.frame"

#Read time for data.table's fread()
pt2<-proc.time()
data2<-fread(input = "telecom.csv")
proc.time()-pt2

##    user  system elapsed
##   1.604   0.040   1.644

#Data stored as data.table
class(data2)

## [1] "data.table" "data.frame"

As one can see, the read time was lowest for data.table’s fread() (no surprises there!). It is also worth noting that read_csv() was about 6 times faster than read.csv() in this test (15.1 s versus 2.5 s elapsed).
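For readers who want to repeat the comparison without the telecom.csv file, here is a minimal sketch that times the same three readers on a synthetic CSV. The row count, column names, and temporary file are assumptions made for illustration, not the data used above, so the absolute numbers will differ from machine to machine.

```r
# Sketch: time the three readers on a synthetic CSV (not the article's telecom.csv).
library(readr)
library(data.table)

set.seed(1)
n <- 1e5  # hypothetical row count; increase it for a larger file
df <- data.frame(
  id    = seq_len(n),
  value = rnorm(n),
  group = sample(letters, n, replace = TRUE)
)
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)

# system.time() does the same proc.time() subtraction used above in one call
timings <- rbind(
  read.csv = system.time(read.csv(path)),
  read_csv = system.time(read_csv(path)),
  fread    = system.time(fread(path))
)
print(timings[, c("user.self", "sys.self", "elapsed")])

# Speed-up of each reader relative to read.csv(), based on elapsed time
speedup <- timings["read.csv", "elapsed"] / timings[, "elapsed"]
print(round(speedup, 1))

unlink(path)  # remove the temporary file
```

A single run like this is noisy, so repeating the timings a few times and comparing medians would give a fairer picture.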

According to Hadley (https://github.com/hadley/readr), readr is fast but not as fast as fread(). The question, then, is why one should bother using readr at all. The simple answer: everything read by readr functions such as read_csv() is a data.frame wrapped as a tbl_df, whereas fread() produces a data.table. Why does that matter? Think data manipulation: data frames (including tbl_df) behave quite differently from a data.table. If you are already using packages like dplyr and ggplot2 that work on data frames, then using readr is probably more convenient than loading data through fread().
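To make the difference concrete, here is a small sketch on a toy file. The file contents and object names are hypothetical. It shows the same row filter written in data.frame style and in data.table style, and also that fread() can return a plain data.frame through its data.table = FALSE argument if you prefer its speed but want data.frame behaviour.

```r
library(readr)
library(data.table)

# Toy CSV written to a temporary file (hypothetical data for illustration)
path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), path, row.names = FALSE)

tb <- read_csv(path)                   # tbl_df, a data.frame subclass
dt <- fread(path)                      # data.table
df <- fread(path, data.table = FALSE)  # plain data.frame from fread()

# Same row filter, two idioms
tb[tb$x > 1, ]  # data.frame semantics: columns referenced through tb$
dt[x > 1]       # data.table semantics: column names evaluated inside [ ]

inherits(tb, "data.frame")
inherits(dt, "data.table")
class(df)

unlink(path)  # remove the temporary file
```

Both tb and dt inherit from data.frame, so most functions accept either; the difference shows up mainly in how `[` is interpreted.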

It is expected that, just like his earlier packages such as dplyr and ggplot2, readr will become an integral part of any data analyst’s workflow. Life for R users would have been very dull had Hadley decided to remain just an academic!

Long live R and the “Hadleyverse”!

The complete code for this .Rmd file is available at https://github.com/Gunnvant/RFiles/blob/master/readr%20comparison.Rmd

