53 Crucial Hive Interview Questions With Answers | UNext

Introduction

Big Data interviews may cover the field in general terms or concentrate on a specific system or method. This article focuses on Apache Hive, one of the most frequently used Big Data tools. Going through these Apache Hive interview questions will give you a detailed understanding of the questions employers ask in Big Data interviews connected with Apache Hive.

Hadoop is an open-source framework designed to facilitate the storing and processing of large volumes of data. Hive is a data warehouse tool that works in the Hadoop ecosystem to process and summarize the data, making it easier to use. Now that you know what Hive is in the Hadoop ecosystem, read on to find out the most common Hive interview questions.

Apache Hive – A Brief Introduction

Apache Hive is a popular data warehouse system. It is built on top of Hadoop and is extensively used for analyzing structured and semi-structured data. It provides an easy and reliable mechanism to project structure onto the data and perform queries written in HQL (Hive Query Language), similar to SQL statements.

Apache Hive Job Trends:

Most companies today consider Apache Hive their go-to resource for analytics on large data sets. As it also supports SQL-like query statements, it is quite popular amongst professionals from a non-programming background who are looking to work on the Hadoop MapReduce framework.

Hive Interview Questions

Hive-related questions are most often an integral part of any Data Science related interview. Being prepared to answer them with confidence helps you build a good image in the eyes of the interviewer and also chart out a successful career. The following Hive interview questions have been specifically curated to help you get acquainted with the nature of questions you might have to answer in the interview. If you are a beginner, the interviewer will be looking to check how strong your foundation is and might ask you questions related to basic concepts. As your experience increases, so will the difficulty level of the questions, with them becoming more technical and application-oriented.

Hive Interview Questions

  1. What applications are supported by Hive?
  2. What are the different tables available in Hive?
  3. What is the difference between external and managed tables?
  4. Where does the data of a Hive table get stored?
  5. Can Hive be used in OLTP systems?
  6. Can a table name be changed in Hive?
  7. Where is Hive table data stored?
  8. Can the default location of a managed table be changed in Hive?
  9. What is a Hive Metastore?
  10. What are the types of metastores?
  11. What is the difference between Local and Remote metastores?
  12. What is the default Apache Hive metastore database?
  13. Can multiple users use one metastore?
  14. What are the three different modes in which Hive can be operated?
  15. Is there a data type in Hive to store date information?
  16. Why is partitioning used in Hive?
  17. What is dynamic partitioning and when is it used?
  18. What are the Hive collection data types?
  19. Is it possible to run UNIX shell commands in Hive?
  20. Is it possible to execute Hive queries from a script file?
  21. What is a .hiverc file?
  22. How can you check if a specific partition exists?
  23. If you had to list all databases that began with the letter ‘c’, how would you do it?
  24. Is it possible to delete DBPROPERTY in Hive?
  25. Which Java class handles the input record encoding into files that store Hive tables?
  26. Which Java class handles output record encoding into Hive query files?
  27. When a Hive table partition is pointed to a new directory, what happens to the data?
  28. Do you save space in the HDFS by archiving Hive tables?
  29. How can you stop a partition from being accessed in a query?
  30. What is a table generating function on Hive?
  31. Can you avoid MapReduce on Hive?
  32. Can a Cartesian join be created between two Hive tables?
  33. What is a view in Hive?
  34. Can the name of a view be the same as a Hive table name?
  35. Can we use the LOAD or INSERT command with a view?
  36. What is indexing in Hive?
  37. Are multi-line comments supported by Hive?
  38. How can you view the indexes of a Hive table?
  39. What is the Hive ObjectInspector function?
  40. What is bucketing?
  41. How is bucketing helpful?
  42. Can you specify the name of the table creator in Hive?
  43. What is HCatalog?
  44. What is UDF in Hive?
  45. What does /*streamtable(table_name)*/ do? 
  46. What are the limitations of Hive?
  47. Why do you need HCatalog?
  48. Name the components of a Hive query processor?
  49. Why do we need buckets? 
  50. How does Hive distribute the rows into buckets?
  51. What will happen in case you have not issued the command:  ‘SET hive.enforce.bucketing=true;’ before bucketing a table in Hive in Apache Hive 0.x or 1.x? 
  52. How do ORC format tables help Hive to enhance its performance?
  53. What are the different components of a Hive architecture?

Hive Interview Questions for 2022 

Here is the comprehensive list of the most commonly asked Hive interview questions. Interview questions on Hive may be direct or application-based.

1. What applications are supported by Hive?

Hive supports client applications written in Java, PHP, Python, C++, and Ruby.

2. What are the different tables available in Hive?

There are two types of tables available in Hive – managed and external.

3. What is the difference between external and managed tables?

With a managed table, Hive controls both the schema and the data itself: dropping the table deletes the underlying data. With an external table, Hive controls only the schema; the data stays in its original location and survives a DROP TABLE.

4. Where does the data of a Hive table get stored?

The Hive table gets stored in an HDFS directory – /user/hive/warehouse, by default. You can adjust it by setting the desired directory in the configuration parameter hive.metastore.warehouse.dir in hive-site.xml.

5. Can Hive be used in OLTP systems?

Hive is designed for batch analytics and does not offer the low-latency, row-level insert, update, and delete operations that OLTP workloads require, so it is unsuitable for OLTP systems.

6. Can a table name be changed in Hive?

Yes, you can change a table name in Hive. You can rename a table by using: ALTER TABLE table_name RENAME TO new_name;

7. Where is Hive table data stored?

Hive table data is stored in an HDFS directory by default – /user/hive/warehouse. This location can be altered.

8. Can the default location of a managed table be changed in Hive?

Yes, the default managed table location can be changed in Hive by using the LOCATION ‘<hdfs_path>’ clause.
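As an illustrative sketch (the table name and HDFS path are placeholders, not from the original article):

```sql
-- Create a managed table whose data lives in a custom HDFS directory
-- instead of the default /user/hive/warehouse
CREATE TABLE sales (id INT, amount DOUBLE)
LOCATION '/data/custom/sales';
```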

9. What is a Hive Metastore?

A Metastore is a relational database that stores the metadata of Hive partitions, tables, databases, and so on.

10. What are the types of metastores?

Local and Remote metastores are the two types of Hive metastores.

11. What is the difference between Local and Remote metastores?

Local metastores run on the same Java Virtual Machine (JVM) as the Hive service, whereas remote metastores run in their own separate JVM.

12. What is the default Apache Hive metastore database?

The default database for metastore is the embedded Derby database provided by Hive, which is backed by the local disk.

13. Can multiple users use one metastore?

Not with the default embedded Derby metastore, which allows only one connection at a time. Multiple users can share a metastore if Hive is configured with a standalone database (for example, MySQL) in local or remote mode.

14. What are the three different modes in which Hive can be operated?

The three modes in which Hive can be operated are Local mode, distributed mode, and pseudo-distributed mode.

15. Is there a data type in Hive to store date information?

Yes. The TIMESTAMP data type in Hive stores date and time information in java.sql.Timestamp format, and since Hive 0.12 there is also a DATE type for calendar dates.

16. Why is partitioning used in Hive?

Partitioning is used in Hive as it allows for the reduction of query latency. Instead of scanning entire tables, only relevant partitions and corresponding datasets are scanned.

17. What is dynamic partitioning, and when is it used?

Dynamic partitioning is partitioning in which the values of the partition columns become known only at runtime, i.e., while the data is being loaded into the Hive table.

A dynamic partition can be used in the following two cases:

  • Loading data from an existing non-partitioned table, to improve sampling and thus decrease query latency.
  • When the values of the partitions are unknown beforehand, so finding the partition values manually from huge data sets would be a tedious task.
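A minimal sketch of a dynamic-partition load, assuming hypothetical staging and target tables:

```sql
-- Enable dynamic partitioning (strict mode requires at least one static partition)
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Partition values are taken from the trailing SELECT column at load time
INSERT OVERWRITE TABLE sales_partitioned PARTITION (country)
SELECT id, amount, country FROM sales_staging;
```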

18. What are the Hive collection data types?

ARRAY, MAP, and STRUCT are the three Hive collection data types.
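For illustration, a table that uses all three collection types (the table and column names are invented):

```sql
CREATE TABLE employees (
  name    STRING,
  skills  ARRAY<STRING>,                   -- ordered list of values
  phones  MAP<STRING, STRING>,             -- key/value pairs, e.g. 'home' -> number
  address STRUCT<city:STRING, zip:STRING>  -- fixed set of named fields
);
```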

19. Is it possible to run UNIX shell commands in Hive?

Yes, one can run shell commands in Hive by adding a ‘!’ before the command.

20. Is it possible to execute Hive queries from a script file?

Yes, one can do so with the help of the source command. For example – hive> source /path/queryfile.hql;

21. What is a .hiverc file?

It is a file containing a list of commands that are run automatically when the Command Line Interface (CLI) is started.

22. How can you check if a specific partition exists?

Use the following command: SHOW PARTITIONS table_name PARTITION (partitioned_column=’partition_value’)

23. If you had to list all databases that began with the letter ‘c’, how would you do it?

By using the following command: SHOW DATABASES LIKE ‘c.*’

24. Is it possible to delete DBPROPERTY in Hive?

No, there is no way to delete the DBPROPERTY.

25. Which Java class handles the input record encoding into files that store Hive tables?

The ‘org.apache.hadoop.mapred.TextInputFormat’ class.

26. Which Java class handles output record encoding into Hive query files?

The ‘org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat’ class.

27. When a Hive table partition is pointed to a new directory, what happens to the data?

The data remains in the old directory and needs to be transferred manually. 

28. Do you save space in the HDFS by archiving Hive tables?

No, archiving Hive tables does not save HDFS space; it only reduces the number of files, which makes the data easier to manage.
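A hedged sketch of archiving a partition (the table name and partition value are illustrative):

```sql
-- Archiving bundles a partition's files into a Hadoop archive (HAR);
-- it reduces file count, not storage space
SET hive.archive.enabled = true;
ALTER TABLE logs ARCHIVE PARTITION (dt = '2022-01-01');
```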

29. How can you stop a partition from being accessed in a query?

Use the ENABLE OFFLINE clause along with the ALTER TABLE command.

30. What is a table generating function on Hive?

A table-generating function (UDTF) transforms a single input row into multiple output rows. EXPLODE, which emits one row for each element of an array or map, is the most common example.

31. Can you avoid MapReduce on Hive?

You can make Hive avoid MapReduce to return query results by setting the hive.exec.mode.local.auto property to ‘true’.

32. Can a Cartesian join be created between two Hive tables?

Hive can express a Cartesian (cross) join, but it cannot be implemented efficiently in MapReduce: the full cross product cannot be parallelized across reducers, so such joins are strongly discouraged on large tables.

33. What is a view in Hive?

A view is a logical construct that allows search queries to be treated as tables.
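For example, a minimal sketch (the view and table names are invented for illustration):

```sql
-- A view stores only the query definition, not data
CREATE VIEW high_value_sales AS
SELECT id, amount FROM sales WHERE amount > 1000;

-- It can then be queried like a table
SELECT COUNT(*) FROM high_value_sales;
```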

34. Can the name of a view be the same as a Hive table name?

No, the name of the view must always be unique in the database.

35. Can we use the LOAD or INSERT command with a view?

No, these commands cannot be used with respect to a view in Hive.

36. What is indexing in Hive?

Hive indexing is a query optimization technique to reduce the time needed to access a column or a set of columns within a Hive database.
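A sketch of creating a compact index on a hypothetical table (note that index support was removed in Hive 3.0, so this applies only to older versions):

```sql
-- Compact index on one column, built explicitly after creation
CREATE INDEX idx_sales_country ON TABLE sales (country)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

ALTER INDEX idx_sales_country ON sales REBUILD;
```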

37. Are multi-line comments supported by Hive?

No, multi-line comments are not supported by Hive.

38. How can you view the indexes of a Hive table?

By using the following command: SHOW INDEX ON table_name

39. What is the Hive ObjectInspector function?

It helps to analyze the structure of individual columns and rows and provides access to the complex objects that are stored within the database.

40. What is bucketing?

Bucketing is the process of hashing the values in a column into several user-defined buckets which helps avoid over-partitioning.

41. How is bucketing helpful? 

Bucketing helps optimize the sampling process and shortens the query response time.
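A minimal sketch of a bucketed table and bucket-based sampling, with invented names:

```sql
-- Hash rows by user_id into 4 buckets
CREATE TABLE user_events (user_id INT, event STRING)
CLUSTERED BY (user_id) INTO 4 BUCKETS;

-- Sample roughly a quarter of the data by reading one bucket
SELECT * FROM user_events TABLESAMPLE (BUCKET 1 OUT OF 4 ON user_id);
```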

42. Can you specify the name of the table creator in Hive?

Yes, by using the TBLPROPERTIES clause. For example – TBLPROPERTIES (‘creator’= ‘john’)

43. What is Hcatalog?

HCatalog is a tool that helps to share data structures with other external systems in the Hadoop ecosystem.

44. What is UDF in Hive?

UDF is a user-designed function created with a Java program to address a specific function that is not part of the existing Hive functions.

45. What does /*streamtable(table_name)*/ do? 

It is a query hint that tells Hive which table in a join to stream through the reducers rather than buffer in memory; by default, Hive streams the right-most table in the join and buffers the others.

46. What are the limitations of Hive?

Hive has the following limitations: 

  • Real-time queries cannot be executed, and there is no support for row-level updates or deletes.
  • Hive cannot be used for online transaction processing.

47. Why do you need HCatalog? 

HCatalog is necessary for sharing data structures with external systems. It offers access to the Hive metastore for reading and writing data in a Hive data warehouse.

48. Name the components of a Hive query processor?

Following are the components of a Hive query processor:

  • Logical Plan Generation.
  • Physical Plan Generation.
  • Execution Engine.
  • UDFs and UDAFs.
  • Operators.
  • Optimizer.
  • Parser.
  • Semantic Analyzer.
  • Type Checking.

49. Why do we need buckets? 

Here are the two main reasons for performing bucketing to a partition: 

  • A map-side join requires all data for a given join key to be located together. When the partition key differs from the join key, bucketing the tables on the join key makes a map-side join possible.
  • Bucketing makes the sampling process more efficient and, thus, allows us to decrease the query time. 

50. How does Hive distribute the rows into buckets?

Hive uses the formula: hash_function (bucketing_column) modulo (num_of_buckets) to calculate a row’s bucket number. Here, hash_function depends on the data type of the column. For an integer column:

hash_function (int_type_column) = value of int_type_column
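The formula above can be illustrated with a small Python sketch (a simplification: Hive’s actual hash for non-integer types uses Java hash codes, and negative hashes are masked to stay non-negative):

```python
def bucket_number(int_value, num_buckets):
    """Bucket assignment for an integer column, per the formula above:
    hash_function(int_column) is simply the column value itself."""
    return int_value % num_buckets

# Example: with 4 buckets, a row with user_id = 10 lands in bucket 2
print(bucket_number(10, 4))  # -> 2
```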

51. What will happen in case you have not issued the command:  ‘SET hive.enforce.bucketing=true;’ before bucketing a table in Hive in Apache Hive 0.x or 1.x? 

The command ‘SET hive.enforce.bucketing=true;’ lets Hive use the correct number of reducers when the ‘CLUSTER BY’ clause is used to bucket a column. If it is not issued, the number of files generated in the table directory may not equal the number of buckets. As an alternative, one may set the number of reducers equal to the number of buckets with SET mapred.reduce.tasks = num_buckets;.

52. How do ORC format tables help Hive to enhance its performance?

ORC (Optimized Row Columnar) format stores Hive data in a compressed, columnar layout with lightweight built-in indexes. This reduces storage requirements and lets Hive skip irrelevant data during reads, which significantly improves query performance.
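For illustration, a sketch of creating an ORC-backed table (the table name and compression codec are placeholders):

```sql
-- Store a table in ORC format with Zlib compression
CREATE TABLE sales_orc (id INT, amount DOUBLE)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');
```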

53. What are the different components of a Hive architecture?

Following are the five components of a Hive Architecture:

  1. User Interface: It lets users submit queries and other operations to the Hive system. Hive provides the Hive Web UI, the Hive Command Line, and Hive HDInsight.
  2. Driver: It creates a session handle for the query and sends the query to the compiler to generate an execution plan.
  3. Metastore: It contains organized data and information on various warehouse tables and partitions.
  4. Compiler: It creates the execution plan for the queries, performs semantic analysis on different query blocks, and generates query expressions.
  5. Execution Engine: It executes the plans created by the compiler.

Conclusion

The above-listed Hive interview questions and answers cover most of the important topics under Hive, but this is in no way an exhaustive list. If you are interested in making it big in the world of data and evolving as a Future Leader, you may consider our Integrated Program In Business Analytics, a 10-month online program in collaboration with IIM Indore!
