HBase is a Hadoop ecosystem component that I would describe as an all-rounder in the big data analytics playground. Let me explain with an example. Take a typical Hadoop-based big data analytics project, be it building a decision tree or random forest model, a recommender system, or a regression model. The data pipeline in such a project can be depicted as below.
This phase of the project can easily take a few months or even a few quarters, and some of these exercises are iterative as well. During this period, additional data will usually arrive, with new records and updates to existing records. New records can be appended to the existing data on Hadoop; however, updating existing records can be cumbersome. This is where HBase can be very valuable as a data store, because of its ability to provide random, real-time read/write access to your big data, as mentioned on the official HBase website.
HBase is an open-source, distributed, versioned, non-relational database modelled after Google’s Bigtable, and it runs on top of Hadoop. It is built to handle tables with billions of rows and millions of columns. The following illustrates HBase’s data model without getting into too many details.
HBase’s data model is quite different from that of an RDBMS. An HBase table consists of rows and column families, and each column family can contain any number of columns. Rows are sorted by row key. There are no data types in HBase; data is stored as byte arrays in the cells of an HBase table. The value in a cell is versioned by the timestamp at which it was stored, so each cell of an HBase table may contain multiple versions of data. Because any row, or any cell of a row, can be accessed directly using the row key, HBase gives us random read/write access to any required row among the billions of rows of big data.
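As a minimal, hypothetical HBase shell sketch of versioning (the table name, column family and values here are made up, and the family is created with VERSIONS => 3 so that older values are retained):

create 'people', {NAME => 'info', VERSIONS => 3}
# two puts to the same cell; each value is stored with its own timestamp
put 'people', 'row1', 'info:city', 'Austin'
put 'people', 'row1', 'info:city', 'Dallas'
# read back both stored versions of the cell
get 'people', 'row1', {COLUMN => 'info:city', VERSIONS => 3}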
The logical representation of the rows in an HBase table can be shown as below:
HBase comes with a command-line interface, the HBase shell. Using the HBase shell you can create an HBase table, add rows to it, get rows by row key, and run commands such as scan, which displays the rows of a table and can take filters to restrict the output based on specified criteria, among others.
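A short, hypothetical shell session covering these commands might look like the following (table name and values are only illustrative):

create 'contacts', 'info'
put 'contacts', 'r1', 'info:name', 'Ann Smith'
put 'contacts', 'r1', 'info:city', 'Caguas'
put 'contacts', 'r2', 'info:name', 'Mary Jones'
put 'contacts', 'r2', 'info:city', 'San Marcos'
# fetch one row by its row key
get 'contacts', 'r1'
# list every row in the table
scan 'contacts'
# scan with a filter on a column value
scan 'contacts', {FILTER => "SingleColumnValueFilter('info', 'city', =, 'binary:Caguas')"}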
We can also work with HBase using Pig, which is very useful for uploading data in bulk into HBase tables. For example, you can load the following records into an HBase table using the Pig script given below.
1,Richard Hernandez,6303 Heather Plaza,Brownsville,TX,78521
2,Mary Barrett,9526 Noble Embers Ridge,Littleton,CO,80126
3,Ann Smith,3422 Blue Pioneer Bend,Caguas,PR,00725
4,Mary Jones,8324 Little Common,San Marcos,CA,92069
5,Robert Hudson,10 Crystal River Mall ,Caguas,PR,00725
6,Mary Smith,3151 Sleepy Quail Promenade,Passaic,NJ,07055
7,Melissa Wilcox,9453 High Concession,Caguas,PR,00725
8,Megan Smith,3047 Foggy Forest Plaza,Lawrence,MA,01841
9,Mary Perez,3616 Quaking Street,Caguas,PR,00725
10,Melissa Smith,8598 Harvest Beacon Plaza,Stafford,VA,22554
11,Mary Huffman,3169 Stony Woods,Caguas,PR,00725
12,Christopher Smith,5594 Jagged Embers By-pass,San Antonio,TX,78227
To try it out, follow these steps; the create and scan commands run in the HBase shell, while the remaining statements run in the Pig Grunt shell started with pig -x local:
create 'customer', 'details'
pig -x local
input_data = LOAD 'customerdet.csv' USING PigStorage(',');
dump input_data;
STORE input_data INTO 'hbase://customer' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('details:name details:address details:city details:state details:zipcode');
scan 'customer'
We can use the same class, org.apache.pig.backend.hadoop.hbase.HBaseStorage, to read (load) data from HBase into Pig scripts for any data manipulation or transformation.
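For example, a sketch of reading the customer table loaded above back into Pig for a simple transformation (the -loadKey option brings the row key in as the first field) could look like this:

-- load the HBase table back into Pig; the row key comes first because of -loadKey
customers = LOAD 'hbase://customer'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'details:name details:address details:city details:state details:zipcode',
        '-loadKey true')
    AS (id:chararray, name:chararray, address:chararray, city:chararray, state:chararray, zipcode:chararray);
-- an example transformation: keep only the Texas customers
tx_customers = FILTER customers BY state == 'TX';
dump tx_customers;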
Hive also integrates closely with HBase. We can access HBase through Hive and run Hive queries on HBase tables. For example, we can create an external table in Hive, point it to the HBase table created above, and query the HBase data through Hive queries.
You can try it out with the following steps.
CREATE DATABASE retail1;
USE retail1;
SET hive.cli.print.current.db=true;
CREATE EXTERNAL TABLE customer1 (customerid int, name string, address string, city string, state string, zipcode string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,details:name,details:address,details:city,details:state,details:zipcode")
TBLPROPERTIES ("hbase.table.name" = "customer");
DESCRIBE FORMATTED customer1;
SELECT * FROM customer1 WHERE name LIKE 'Mary%';
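Once the external table is in place, ordinary HiveQL works against the HBase-backed data; for example, a simple illustrative aggregation:

SELECT state, count(*) AS customer_count
FROM customer1
GROUP BY state;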
HBase also has a Thrift server and a REST server, which enable remote clients such as web applications to access HBase data. Moreover, we can use Sqoop to ingest data into an HBase table from any RDBMS.
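For instance, once the REST server is started on the HBase node, a remote client can fetch a cell over HTTP; the hostname below is a placeholder and 8080 is the default REST port:

hbase rest start
curl -H "Accept: application/json" http://<hbase-host>:8080/customer/1/details:name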
Let us now take an RDBMS and look at how to design an HBase table to ingest its data. The following is the schema of a database from the retail domain.
We can describe the relationships among the database tables as follows: customers place orders, each order has several order items, each order item is for a product, and each product belongs to a particular category. For analysing customer purchase behaviour and/or building a recommender system, the data from this RDBMS needs to be ingested into an HBase table.
In a relational database design, the starting point would be the entities and their relationships with one another; factors like how the data is accessed and the indexes to be built come in later. In HBase, the major factors to consider for the table design are the data access patterns, the row key design and the column families.
One other aspect is normalization. While designing a relational database, the schema is normalized to avoid duplicated data: repeating data is put in a separate table. This, however, means the data has to be retrieved through joins, and joins across multiple tables can be slow. In HBase table design you actually de-normalize the data; although this causes redundancy, it can be looked at as a substitute for joins.
If the data is accessed mainly by product, let us see whether we can use the product id as the row key. We could then group the product details in a column family called products and group the customer details into a customers column family with multiple columns, one set per customer who bought the product. In this case the rows in the HBase table could look as shown below:
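As a rough, purely hypothetical sketch in the HBase shell, such a wide row would grow by one more column for every customer who buys the product:

create 'retail_wide', 'products', 'customers'
put 'retail_wide', '1001', 'products:product_name', 'Tennis Racket'
# every additional buyer becomes another column in the same row
put 'retail_wide', '1001', 'customers:cust_12345_name', 'Mary Barrett'
put 'retail_wide', '1001', 'customers:cust_67890_name', 'Robert Hudson'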
This approach is clearly not feasible, because the number of customers who purchase a product can be very large. Even with a use case and dataset where the number of columns is large but finite, filtering on column values in HBase tables can be slow. Hence tall, narrow tables are preferable to flat, wide tables when designing HBase tables and their row keys.
So a more appropriate way is to store the details of one customer and one product in each row. Since the row key has to be unique, we need to use a combination (concatenation) of the product id and the customer id as the row key. This makes each row narrow, with one column family for product data and one for customer data. The number of rows in the table will be large, since there will be as many rows as there are customers who bought the product, and the product details will repeat in each of these rows, as shown below.
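A minimal HBase shell sketch of this tall, narrow layout (table name, row keys and values are hypothetical) might be:

create 'retail_narrow', 'products', 'customers'
# row key = product id + customer id; the product details repeat in every row for that product
put 'retail_narrow', '1001-12345', 'products:product_name', 'Tennis Racket'
put 'retail_narrow', '1001-12345', 'customers:customer_name', 'Mary Barrett'
put 'retail_narrow', '1001-67890', 'products:product_name', 'Tennis Racket'
put 'retail_narrow', '1001-67890', 'customers:customer_name', 'Robert Hudson'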
The advantage of storing the data in HBase is that if any field changes, for instance if a product is returned or its quantity is decreased or increased, only that cell needs to be updated. You can do this with a store from Pig or Hive, with the put command in the HBase shell, or through the equivalent API call if a client or web-based application is built. Based on the row key, HBase updates the cell by adding a new version of the value with the current timestamp rather than inserting another row. Such updates are not easy or straightforward when the data is processed as HDFS files with Pig scripts or with Hive tables defined on top of those files.
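In the HBase shell such an update is just another put to the same cell, which HBase stores as a new version rather than as a new row (the values below are illustrative and continue the hypothetical retail_narrow sketch):

# the quantity on this order changes: overwrite only that cell
put 'retail_narrow', '1001-12345', 'products:product_quantity', '1'
# the cell now returns the latest value; older versions are kept up to the family's VERSIONS setting
get 'retail_narrow', '1001-12345', {COLUMN => 'products:product_quantity'}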
You may see one problem with the above design of the HBase table. If the same customer orders a product again after a month or so, the row key for the new order will be the same as that of the old order, so instead of inserting a new row HBase will update the existing row with that row key. We can remedy this by adding the order date or order id to the row key.
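With the order id appended, a repeat purchase gets its own row; continuing the hypothetical sketch (in practice every row key in the table would use this extended format):

# same product and customer, but a different order id yields a new, unique row key
put 'retail_narrow', '1001-12345-77001', 'products:product_name', 'Tennis Racket'
put 'retail_narrow', '1001-12345-77001', 'customers:customer_name', 'Mary Barrett'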
All the steps of the above process can be performed by loading the dataset into any RDBMS and using Sqoop to load the data from there into HBase.
A MySQL database was created with the schema described above, and data was loaded into all the tables using the mysqlimport tool. To keep things simple, an SQL join statement was used to combine the records from the required tables into a new table, retail2.
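The join itself might look roughly like the following; the source table names, column names and join keys below are assumptions based on the schema described above, not taken from the actual dataset, and the composite hbaserowkey concatenates the ids as discussed earlier:

-- a sketch only: source table and column names are assumed
CREATE TABLE retail2 AS
SELECT CONCAT(p.product_id, '-', c.customer_id, '-', o.order_id) AS hbaserowkey,
       p.product_id,
       cat.category_name      AS product_category,
       p.product_name,
       p.product_description,
       p.product_price,
       p.product_image,
       oi.order_item_quantity AS product_quantity,
       c.customer_id, c.customer_name, c.customer_email, c.customer_street,
       c.customer_city, c.customer_state, c.customer_zipcode
FROM customers   c
JOIN orders      o   ON o.order_customer_id    = c.customer_id
JOIN order_items oi  ON oi.order_item_order_id = o.order_id
JOIN products    p   ON p.product_id           = oi.order_item_product_id
JOIN categories  cat ON cat.category_id        = p.product_category_id;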
An HBase table is created with the following HBase shell command, and then the Sqoop import commands below are run to load the data from the RDBMS table into the HBase table, one import per column family.
hbase(main)> create 'retail2', 'products', 'customers'
$ sqoop import \
--connect jdbc:mysql://<hostname.domainname>/retaildb \
--username <username> \
-P \
--query 'select hbaserowkey, product_id, product_category, product_name, product_description, product_price, product_image, product_quantity from retail2 WHERE $CONDITIONS' \
--hbase-table retail2 \
--column-family products \
--hbase-row-key hbaserowkey -m 1

$ sqoop import \
--connect jdbc:mysql://<hostname.domainname>/retaildb \
--username <username> \
-P \
--query 'select hbaserowkey, customer_id, customer_name, customer_email, customer_street, customer_city, customer_state, customer_zipcode from retail2 WHERE $CONDITIONS' \
--hbase-table retail2 \
--column-family customers \
--hbase-row-key hbaserowkey -m 1
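After the two imports complete, a quick check in the HBase shell should show both column families populated under each row key:

count 'retail2'
scan 'retail2', {LIMIT => 2}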
Now, with the data available in HBase, we can access and manage it as required for any processing or analysis, using any of the tools that work with HBase.
You can use HBase not only with tools like Pig, Hive and Sqoop, as shown above, but also with Spark; you can also write client or web-based applications using its Thrift or REST services. As we saw, HBase requires only the table name and the column family names at the time of table creation, and there are no data types since all data is treated as byte arrays.
HBase gives us a very flexible data store; we just need to design the tables carefully, bearing in mind factors such as the data access patterns, the column families and, above all, the row key design.
PS: You can download the sample datasets used and mentioned in this blog here <LINK> by signing in with a valid e-mail address.