Big Data interviews can be broad in scope or concentrate on a specific system or method. This article focuses on Apache Hive, a frequently used Big Data tool. After going through this article, you will have a detailed understanding of the Apache Hive questions employers ask in Big Data interviews.
Hadoop is an open-source framework designed to facilitate the storing and processing of large volumes of data. Hive is a data warehouse tool that works in the Hadoop ecosystem to process and summarize the data, making it easier to use. Now that you know what Hive is in the Hadoop ecosystem read on to find out the most common Hive interview questions.
Apache Hive is a popular data warehouse system. It is built on top of Hadoop and is extensively used for analyzing structured and semi-structured data. It provides an easy and reliable mechanism to project structure onto the data and perform queries written in HQL (Hive Query Language), similar to SQL statements.
Most companies today consider Apache Hive their go-to resource for analytics on large data sets. Because it supports SQL-like query statements, it is quite popular among professionals from a non-programming background who are looking to work on the Hadoop MapReduce framework.
Hive-related questions are most often an integral part of any Data Science related interview. Being prepared to answer them with confidence helps you build a good image in the eyes of the interviewer and chart out a successful career. The following Hive interview questions have been specifically curated to help you get acquainted with the nature of questions you might have to answer in the interview. If you are a beginner, the interviewer will be looking to check how strong your foundation is and might ask you questions related to basic concepts. As your experience increases, so will the difficulty level of the questions, with them becoming more technical and application-oriented.
Hive Interview Questions
Here is the comprehensive list of the most commonly asked Hive interview questions. Interview questions on Hive may be direct or application-based.
Hive supports client applications based on the Java, PHP, Python, C++, and Ruby programming languages.
There are two types of tables available in Hive – managed and external.
While external tables give Hive control of the schema but not of the underlying data, managed tables give Hive both schema and data control.
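The difference can be illustrated with a minimal sketch (the table names and HDFS path below are hypothetical):

```sql
-- Managed table: Hive owns both schema and data;
-- DROP TABLE removes the metadata AND the data files.
CREATE TABLE sales_managed (
  id INT,
  amount DOUBLE
)
STORED AS TEXTFILE;

-- External table: Hive owns only the schema;
-- DROP TABLE removes the metadata but leaves the files in place.
CREATE EXTERNAL TABLE sales_external (
  id INT,
  amount DOUBLE
)
STORED AS TEXTFILE
LOCATION '/data/sales';  -- hypothetical HDFS path
```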
The Hive table gets stored in an HDFS directory – /user/hive/warehouse, by default. You can adjust it by setting the desired directory in the configuration parameter hive.metastore.warehouse.dir in hive-site.xml.
Since Hive does not support row-level data insertion, it is unsuitable for OLTP systems.
Yes, you can change a table name in Hive. You can rename a table by using: ALTER TABLE table_name RENAME TO new_name;
Hive table data is stored in an HDFS directory by default – /user/hive/warehouse. This can be altered.
Yes, the default managed table location can be changed in Hive by using the LOCATION ‘<hdfs_path>’ clause.
A Metastore is a relational database that stores the metadata of Hive partitions, tables, databases, and so on.
Local and remote metastores are the two types of Hive metastores.
Local metastores run on the same Java Virtual Machine (JVM) as the Hive service, whereas remote metastores run on a separate, distinct JVM.
The default database for metastore is the embedded Derby database provided by Hive, which is backed by the local disk.
No, metastore sharing is not supported by Hive.
The three modes in which Hive can be operated are Local mode, distributed mode, and pseudo-distributed mode.
The TIMESTAMP data type in Hive stores all date and time information in the java.sql.Timestamp format.
Partitioning is used in Hive as it allows for the reduction of query latency. Instead of scanning entire tables, only relevant partitions and corresponding datasets are scanned.
Dynamic partitioning is where the values of the partition columns become known only at runtime, i.e., during the loading of data into the Hive table.
A dynamic partition can be used in the following two cases:
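A hedged sketch of a dynamic-partition insert (the table and column names are hypothetical):

```sql
-- Enable dynamic partitioning; nonstrict mode allows all
-- partition columns to be determined at runtime.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The value of the partition column `country` is taken from the
-- query results at load time rather than being hard-coded.
INSERT OVERWRITE TABLE customers_partitioned PARTITION (country)
SELECT id, name, country
FROM customers_staging;
```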
ARRAY, MAP, and STRUCT are the three Hive collection data types.
Yes, one can run shell commands in Hive by prefixing the command with '!'. For example: !ls;
Yes, one can do so with the help of the source command. For example: hive> source /path/queryfile.hql
It is a file consisting of a list of commands that are run when the Command Line Interface (CLI) is launched.
Use the following command: SHOW PARTITIONS table_name PARTITION (partitioned_column='partition_value')
By using the following command: SHOW DATABASES LIKE 'c.*'
No, there is no way to delete the DBPROPERTY.
The ‘org.apache.hadoop.mapred.TextInputFormat’ class.
The ‘org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat’ class.
The data remains in the old directory and needs to be transferred manually.
No, archiving Hive tables only helps reduce the number of files, which makes the data easier to manage.
Use the ENABLE OFFLINE clause along with the ALTER TABLE command.
MapReduce is a programming framework that allows Hive to divide large datasets into smaller units and process them parallelly.
You can make Hive avoid MapReduce to return query results by setting the hive.exec.mode.local.auto property to ‘true’.
This is not possible as it cannot be implemented in MapReduce programming.
A view is a logical construct that allows search queries to be treated as tables.
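A minimal sketch of creating and querying a view (the view, table, and column names are hypothetical):

```sql
-- A view stores only the query definition, not the data.
CREATE VIEW high_value_orders AS
SELECT id, amount
FROM orders
WHERE amount > 1000;

-- The view can then be queried like an ordinary table.
SELECT COUNT(*) FROM high_value_orders;
```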
No, the name of the view must always be unique in the database.
No, these commands cannot be used with respect to a view in Hive.
Hive indexing is a query optimization technique to reduce the time needed to access a column or a set of columns within a Hive database.
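A sketch of creating a compact index (the index, table, and column names are hypothetical):

```sql
-- Build a compact index on a single column; DEFERRED REBUILD
-- postpones index construction until an explicit rebuild.
CREATE INDEX orders_customer_idx
ON TABLE orders (customer_id)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Populate (or refresh) the index.
ALTER INDEX orders_customer_idx ON orders REBUILD;
```

Note that built-in indexing was removed in Hive 3.0, so this syntax applies to earlier Hive versions.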
No, multi-line comments are not supported by Hive.
By using the following command: SHOW INDEX ON table_name
It helps to analyze the structure of individual columns and rows and provides access to the complex objects that are stored within the database.
Bucketing is the process of hashing the values in a column into several user-defined buckets which helps avoid over-partitioning.
Bucketing helps optimize the sampling process and shortens the query response time.
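A sketch of a bucketed table and a bucket-based sample (the table and column names, and the bucket count, are hypothetical):

```sql
-- Ensure the number of reducers matches the number of buckets.
SET hive.enforce.bucketing = true;

-- Hash `user_id` into 32 buckets; rows with the same hash value
-- land in the same bucket file.
CREATE TABLE events_bucketed (
  user_id INT,
  event STRING
)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Sampling can then read a subset of buckets instead of
-- scanning the whole table.
SELECT * FROM events_bucketed
TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id);
```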
Yes, by using the TBLPROPERTIES clause. For example – TBLPROPERTIES ('creator' = 'john')
HCatalog is a tool that helps share data structures with other external systems in the Hadoop ecosystem.
UDF is a user-designed function created with a Java program to address a specific function that is not part of the existing Hive functions.
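Assuming the UDF has already been compiled into a JAR (the JAR path, function name, and Java class below are hypothetical), it can be registered and used from HQL as follows:

```sql
-- Make the compiled UDF available to the session.
ADD JAR /path/to/my-udfs.jar;  -- hypothetical JAR path

-- Register the Java class as a temporary function.
CREATE TEMPORARY FUNCTION normalize_name
AS 'com.example.hive.NormalizeName';

-- Use it like any built-in function.
SELECT normalize_name(customer_name) FROM customers;
```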
A query hint allows for a table to be streamed into memory before a query is executed.
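For example, the STREAMTABLE hint tells Hive which side of a join to stream through the reducers while the other side is buffered (the table aliases below are hypothetical):

```sql
-- Hint Hive to stream the large `orders` table through the join
-- instead of buffering it in memory.
SELECT /*+ STREAMTABLE(o) */ o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
```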
Hive has the following limitations:
HCatalog is a necessary tool for sharing data structures with external systems. It offers access to the Hive metastore for reading and writing data in a Hive data warehouse.
Following are the components of a Hive query processor:
Here are the two main reasons for bucketing a partition:
Hive uses the formula hash_function(bucketing_column) modulo num_of_buckets to calculate a row's bucket number. Here, hash_function depends on the data type of the column. For an integer column, the hash function is:
hash_function(int_type_column) = value of int_type_column
The command SET hive.enforce.bucketing=true; allows you to have the correct number of reducers while using the CLUSTER BY clause for bucketing a column. If this is not done, the number of files generated in the table directory may not equal the number of buckets. As an alternative, one may also set the number of reducers equal to the number of buckets with set mapred.reduce.tasks = num_buckets.
You can store Hive data in the ORC (Optimized Row Columnar) format, which helps overcome several of Hive's limitations.
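A minimal sketch of creating an ORC-backed table (the table and column names, and the compression codec choice, are hypothetical):

```sql
-- Store the table in the columnar ORC format with compression.
CREATE TABLE logs_orc (
  ts TIMESTAMP,
  message STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');
```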
Following are the five components of a Hive Architecture:
The above-listed Hive interview questions and answers cover most of the important topics under Hive, but this is in no way an exhaustive list. If you are interested in making it big in the world of data and evolving as a Future Leader, you may consider our Integrated Program In Business Analytics, a 10-month online program in collaboration with IIM Indore!