If You’re a Data Analyst, You Should Read This Review of Hadley’s readr 0.1.0 Right Now

21 Apr 2015

I must admit that I am a big fan of Hadley Wickham and his packages, so the moment I heard that his new readr package was out on CRAN, I decided to check it out.

I thought it would be interesting to compare the file read times of readr’s read_csv() function with those of data.table’s fread() and the base R function read.csv(). To do this, I used my Linux machine with 4 GB of RAM and a very old Core 2 Duo processor.

I imported a 67.3 MB CSV file using each of these functions. The code used and the resulting read times are below.

library(readr)
library(data.table)
setwd("/media/ramius/E2A02905A028E1B1/Work/Jigsaw Academy")

# Read time for base R's read.csv()
pt <- proc.time()
data <- read.csv("telecom.csv")
proc.time() - pt

##    user  system elapsed
##  15.063   0.071  15.140

# Data stored as a data.frame
class(data)

## [1] "data.frame"

# Read time for readr's read_csv()
pt1 <- proc.time()
data1 <- read_csv("telecom.csv")
proc.time() - pt1

##    user  system elapsed
##   2.449   0.040   2.489

# Data stored as a tbl_df (a data.frame with extra classes)
class(data1)

## [1] "tbl_df"     "tbl"        "data.frame"

# Read time for data.table's fread()
pt2 <- proc.time()
data2 <- fread(input = "telecom.csv")
proc.time() - pt2

##    user  system elapsed
##   1.604   0.040   1.644

# Data stored as a data.table
class(data2)

## [1] "data.table" "data.frame"
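
Each of the three timings above comes from a single run, so it can be affected by disk caching and whatever else the machine is doing. The following is a minimal sketch of how the comparison could be repeated with system.time() over several runs. It is not part of the original .Rmd file: the helper time_reader() is a hypothetical name, and the small synthetic CSV is an assumption standing in for telecom.csv, which is not distributed with this post. readr and data.table are only timed if they are installed.

```r
# Hypothetical helper (not from the original post): median elapsed
# time in seconds of one reader function over several runs.
time_reader <- function(reader, path, runs = 3) {
  elapsed <- vapply(
    seq_len(runs),
    function(i) system.time(reader(path))[["elapsed"]],
    numeric(1)
  )
  median(elapsed)
}

# Synthetic stand-in for telecom.csv (assumption: any CSV will do here).
set.seed(1)
n <- 10000
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = seq_len(n), x = rnorm(n), g = sample(letters, n, replace = TRUE)),
  tmp,
  row.names = FALSE
)

# Always time base R; add the other readers only if they are available.
readers <- list(read.csv = read.csv)
if (requireNamespace("readr", quietly = TRUE)) {
  readers$read_csv <- function(path) suppressMessages(readr::read_csv(path))
}
if (requireNamespace("data.table", quietly = TRUE)) {
  readers$fread <- data.table::fread
}

timings <- vapply(readers, time_reader, numeric(1), path = tmp)
print(timings)

unlink(tmp)
```

On a file this small the differences will be far less dramatic than on the 67.3 MB file, so point tmp at a larger CSV of your own to get meaningful numbers.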

As the telecom.csv timings show, the read time was lowest for data.table’s fread(). (No surprises there!) It is also worth noting that read_csv() was about 6 times faster than read.csv() in this test (2.5 seconds versus 15.1 seconds elapsed).
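
For anyone who wants to check the arithmetic, here is a short sketch that computes the speed-ups from the elapsed times reported for telecom.csv. The variable names elapsed and speedup are mine, and the numbers describe one run on one machine, so treat them as indicative only.

```r
# Elapsed times in seconds, copied from the proc.time() output in this post.
elapsed <- c(read.csv = 15.140, read_csv = 2.489, fread = 1.644)

# How many times faster each reader is than base R's read.csv().
speedup <- elapsed[["read.csv"]] / elapsed
print(round(speedup, 1))
# read.csv is 1.0 by definition, read_csv is about 6.1, fread is about 9.2

# How many times faster fread() is than read_csv(): about 1.5.
fread_vs_readr <- elapsed[["read_csv"]] / elapsed[["fread"]]
print(round(fread_vs_readr, 1))
```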

According to Hadley (https://github.com/hadley/readr), readr is fast but not as fast as fread(). The question, then, is why one should bother using readr at all. The simple answer: everything read by readr functions such as read_csv() is a data.frame wrapped as a tbl_df, whereas fread() produces a data.table. Why does that matter? Think of data manipulation: data frames (including tbl_df) behave quite differently from a data.table. If you are already using packages like dplyr and ggplot2, which work on data frames, then using readr is probably more convenient than loading data through fread().
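
To make the behavioural difference a little more concrete, here is a minimal sketch that is not taken from the original post. The toy table and its column names (group, value) are assumptions chosen only for illustration. The data.table part runs only if data.table is installed.

```r
# Toy data (assumption: stands in for whatever was read from disk).
df <- data.frame(group = c("a", "a", "b"), value = c(1, 2, 3))

# With a plain data frame, a grouped mean goes through a helper function
# such as aggregate() (or dplyr verbs, if you use dplyr).
df_means <- aggregate(value ~ group, data = df, FUN = mean)
print(df_means)

if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df)

  # In a data.table, the second argument of [ ] is an expression evaluated
  # inside the table, so column names can be used like variables.
  dt_means <- dt[, list(value = mean(value)), by = group]
  print(dt_means)

  # If the rest of the workflow expects an ordinary data frame,
  # the data.table can be converted back explicitly.
  back <- as.data.frame(dt)
  print(class(back))
}
```

Neither style is wrong; the point is only that code written for one kind of object does not always carry over unchanged to the other.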

I expect that, just like his earlier packages such as dplyr and ggplot2, readr will become an integral part of any data analyst’s workflow. Life for R users would have been very dull had Hadley decided to remain just an academic!

Long live R and the “Hadleyverse”!

The complete code for this .Rmd file is available at https://github.com/Gunnvant/RFiles/blob/master/readr%20comparison.Rmd