How to Access Datasets Stored in Hadoop?

31 Mar 2015

The main aim of integrating SAS with Hadoop is to extend SAS Analytics to Hadoop. SAS/ACCESS lets you work natively in SAS with data sets stored in Hadoop. With SAS/ACCESS Interface to Hadoop, a LIBNAME statement makes Hive tables look like SAS data sets, so SAS procedures and DATA steps can operate on them directly. PROC SQL can be used to execute explicit Hive SQL (HiveQL) commands on Hadoop. In addition, PROC HADOOP can submit MapReduce programs, Apache Pig code, and HDFS commands directly from the SAS execution environment to your CDH cluster.
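The two access paths described above can be sketched as follows. This is a minimal illustration, not a tested configuration: the server name, port, credentials, and the Hive table `sales` with columns `region` and `amount` are all placeholder assumptions, and the connection options your site needs may differ.

```sas
/* Assumed placeholders: hive.example.com, myuser, mypassword, table "sales". */

/* 1. LIBNAME access: the Hive schema is exposed as a SAS library. */
libname hdp hadoop server="hive.example.com" port=10000 schema=default
        user=myuser password="mypassword";

/* Ordinary SAS procedures can then read the Hive table like a SAS data set. */
proc means data=hdp.sales;
   var amount;
run;

/* 2. Explicit pass-through: the HiveQL inside the parentheses is sent to Hive as written. */
proc sql;
   connect to hadoop (server="hive.example.com" port=10000
                      user=myuser password="mypassword");

   /* Run a HiveQL query and return the result set to SAS. */
   select * from connection to hadoop
      (select region, count(*) as n from sales group by region);

   /* Run a HiveQL statement that returns no rows. */
   execute (create table sales_copy as select * from sales) by hadoop;

   disconnect from hadoop;
quit;
```

With the LIBNAME route SAS generates the Hive queries for you, while pass-through leaves the HiveQL entirely under your control.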

The SAS/ACCESS interface is available from the SAS 9.3 release. It enables users familiar with SAS to operate seamlessly on data stored in Hadoop while bringing the power of SAS to these data sets. SAS also introduced a dedicated procedure, PROC HADOOP, to support access to Hadoop from SAS. The PROC HADOOP statement supports several options that control access to the Hadoop server, and the procedure enables you to submit HDFS commands, MapReduce programs, and Pig language code against Hadoop data. To connect to the Hadoop server, a Hadoop configuration file is required that specifies the name node and JobTracker addresses for the specific server. As noted above, the LIBNAME statement makes Hive tables look just like SAS data sets, and PROC SQL can pass explicit HiveQL commands to Hadoop.
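A minimal PROC HADOOP sketch might look like the following. All paths, the user name, the password, and the Pig script are hypothetical, and the configuration file is assumed to be a standard Hadoop XML file whose properties (for example `fs.default.name` and `mapred.job.tracker` on older Hadoop releases) point at your cluster.

```sas
/* Assumed placeholders: every path, myuser, mypassword. */

/* Fileref for the Hadoop configuration file that holds the                */
/* name node and JobTracker addresses of the target cluster.               */
filename cfg "/path/to/hadoop_config.xml";

/* Fileref for a Pig Latin script to submit (hypothetical file). */
filename pigcode "/path/to/script.pig";

proc hadoop options=cfg username="myuser" password="mypassword" verbose;
   /* HDFS commands: create a directory and upload a local file into it. */
   hdfs mkdir="/user/myuser/demo";
   hdfs copyfromlocal="/tmp/input.txt"
        out="/user/myuser/demo/input.txt";

   /* Submit the Pig script referenced by the fileref. */
   pig code=pigcode;
run;
```

A MAPREDUCE statement can be added in the same step to submit a MapReduce job; it is left out here because it needs a JAR file and class names specific to your job.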

As for prerequisites, Base SAS must be installed before you install the SAS/ACCESS Interface to Hadoop software. HiveServer2 and Kerberos are supported as of SAS 9.4.

Hadoop’s HDFS within SAS

SAS provides the ability to execute Hadoop functionality by enabling MapReduce programming, scripting support and the execution of HDFS commands from within the SAS environment. This complements the capability that SAS/ACCESS provides for Hive by extending support for Pig, MapReduce and HDFS commands. The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system for the Hadoop framework. HDFS is designed to hold large amounts of data and provide access to the data to many clients that are distributed across a network.

There are many benefits to using SAS with HDFS. First, it combines world-class analytics with strong distributed processing and a commodity-based storage infrastructure. Second, it lets you work with Hadoop data in SAS, a language you already know.

Why SAS/ACCESS Interface to Hadoop Platform-as-a-Service

Hadoop already provides inherent data protection and self-healing capabilities, low cost, high scalability, and computing power. SAS adds further benefits, such as flexibility in analysis and storage, and brings its “Write once, run anywhere” strategy to Hadoop. The interface also helps you integrate data stored in Hadoop with data from other sources, and it gives direct, easy, and secure data access through native interfaces. It improves performance by minimizing data movement, and it supports the needs of both technical and business users.

Image credit: Stuart Miles, https://www.freedigitalphotos.net
