R is a highly capable open-source language for statistics, used for data mining, data preparation, visualization, credit-card scoring and more. Its biggest drawback, however, is that it is memory-bound: all the data required for an analysis has to be held in memory (RAM) to be processed. With data volumes growing exponentially, organisations are reluctant to deploy R beyond research, mainly because of this drawback.
This limitation not only restricts the volume of data that can be handled, but also adversely affects performance and scalability. For example, allocating a data frame as small as 5,000,000 rows and 3 columns can throw an error on a PC with 3 GB of RAM due to memory restrictions. Even though high-end servers with much more memory can load big data into memory, standard R will be prohibitively slow on such large files.
To overcome this limitation, efforts have been made to improve R so that it scales to Big Data. The best way to achieve this is by implementing external memory storage and parallel processing mechanisms in R.
We will discuss two such technologies that enable Big Data processing and analytics using R.
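As a rough illustration of the parallel processing idea, here is a minimal sketch using the parallel package that ships with base R. The workload (a sum of squares), the chunk count and the number of cores are assumptions chosen for the example; this is not how any particular Big Data package is implemented.

```r
library(parallel)  # ships with base R

# Hypothetical workload: the sum of squares of a numeric vector.
x <- as.numeric(1:100000)

# Split the values into 4 roughly equal chunks.
chunk_id <- cut(seq_along(x), breaks = 4, labels = FALSE)
chunks <- split(x, chunk_id)

# mclapply() forks worker processes on Unix-alikes; on Windows it only
# supports mc.cores = 1, so fall back to serial execution there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each worker computes a partial result for its own chunk.
partial_sums <- mclapply(chunks, function(chunk) sum(chunk^2),
                         mc.cores = n_cores)

# Combine the partial results into the final answer.
total <- Reduce(`+`, partial_sums)
```

The same split, compute, combine shape applies whether the chunks are handled by cores on one machine or by nodes in a cluster.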
1. RevoScaleR: As part of its “Big Data” initiative, Revolution Analytics released an add-on package for R called RevoScaleR. Really breaking through the memory/performance barrier requires implementing external memory algorithms and data structures that explicitly manage “data placement and movement”. RevoScaleR brings parallel external memory algorithms and a new, very efficient data file format to R.
The three main components of this package are:
RevoScaleR provides unprecedented levels of performance and capacity for statistical analysis in the R environment. For the first time, R users can process, visualize and model their largest data sets in a fraction of the time of legacy systems, without the need to deploy expensive or specialized hardware.
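To make the external memory idea concrete, here is a minimal base R sketch that computes a column mean by reading a CSV file in fixed-size chunks, so that only one chunk is in RAM at a time. It does not use RevoScaleR or its file format; the function name chunked_mean, the demo file and the chunk size are all hypothetical, and the header parsing assumes a simple CSV with no commas inside quoted fields.

```r
# Write a small demo file; in practice this would be a file too large for RAM.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, amount = as.numeric(1:1000)),
          csv_path, row.names = FALSE)

# Hypothetical helper: mean of one numeric column, computed chunk by chunk.
chunked_mean <- function(path, column, chunk_rows = 100L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Read the header once so that later chunks can reuse the column names.
  header <- strsplit(readLines(con, n = 1L), ",", fixed = TRUE)[[1]]
  header <- gsub('"', "", header, fixed = TRUE)
  total <- 0
  n <- 0L
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break  # end of file
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    # Keep only the running sum and row count, never the full data.
    total <- total + sum(chunk[[column]])
    n <- n + nrow(chunk)
  }
  total / n
}

result <- chunked_mean(csv_path, "amount")
```

Only the running total and the row count survive between chunks, which is why the memory needed stays roughly constant regardless of file size.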
To learn more about each of the components and how to program in R using RevoScaleR, refer to: https://www.rochester.edu/College/psc/thestarlab/help/Big-Data-WP.pdf
(Find out why R is an essential skill for data analysts. The Power of R – And Why it’s an Essential Skill for Data Analysts)
2. RHadoop Integration: Hadoop is an Apache open-source project specifically designed for storing and processing Big Data, that is, data that is growing in size, speed and type. The Hadoop framework implements reliable and scalable distributed storage and distributed processing.
Hadoop is a pioneer in Big Data handling, R is long established and widely used in the data analytics domain, and both are open source. Revolution Analytics has therefore been working towards empowering R by integrating it with Hadoop.
RHadoop is a collection of five R packages that allow users to manage and analyze data with Hadoop. The packages are regularly tested (and always before a release) on recent releases of the Cloudera and Hortonworks Hadoop distributions and should have broad compatibility with open-source Hadoop and MapR’s distribution.
Using these integration packages, R algorithms can be run on a Hadoop cluster, i.e. in a distributed storage and processing framework.
For more information on how to use R with Hadoop, refer to: https://www.revolutionanalytics.com/free-webinars/using-r-hadoop
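The processing model that Hadoop distributes across a cluster is MapReduce. The sketch below imitates its three steps (map, shuffle, reduce) in plain base R on a tiny in-memory data set, so it runs without Hadoop. The data, map_fn and reduce_fn are hypothetical; on a real cluster the RHadoop packages (for example rmr2) would take care of distributing the map and reduce functions, and their API differs from this illustration.

```r
# Hypothetical input: one record per card transaction.
transactions <- data.frame(
  card   = c("A", "B", "A", "C", "B", "A"),
  amount = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = FALSE
)

# Map step: emit a (key, value) pair for every record.
map_fn <- function(record) list(key = record$card, value = record$amount)

# Reduce step: collapse all values that share a key.
reduce_fn <- function(key, values) sum(values)

# Apply the map function to each record independently.
records <- split(transactions, seq_len(nrow(transactions)))
pairs <- lapply(records, map_fn)

# Shuffle step: group values by key (Hadoop does this between map and reduce).
keys <- vapply(pairs, function(p) p$key, character(1))
values <- vapply(pairs, function(p) p$value, numeric(1))
grouped <- split(values, keys)

# Apply the reduce function once per key.
totals <- mapply(reduce_fn, names(grouped), grouped)
```

Because each map call looks at one record and each reduce call looks at one key, both steps can be spread over many machines without changing the functions themselves.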
There are many valuable free resources available that will help you enhance your knowledge of R. Take a look at the article: Best Free Resources on R
Want to know more about PROC R, a tool developed as an interface between R and SAS? Using PROC R, SAS users can access the extensive statistical and visualization capabilities of R. Read more here: How to use PROC R in SAS
Interested in a career in Big Data? Check out Jigsaw Academy’s Big Data courses and see how you can get trained to become a Big Data specialist.