Apache Hive 10 Best Practices
Apache Hive looks like traditional SQL software: it uses Hadoop to give users the capability of performing SQL-like queries in its own language, HiveQL, quickly and efficiently. Compared to traditional SQL, HiveQL gives users additional query and analytical abilities that are not available in traditional SQL.
With Apache Hive you can use HiveQL or traditional MapReduce systems, depending on your individual needs and preferences. Hive is mainly used for analyzing large data sets and also includes a variety of storage options.
Hive is full of unique tools that allow users to quickly and efficiently perform data queries and analysis. In order to make full use of all these tools, it's important for users to follow best practices for Hive implementation. Here are 10 ways to make the most of Hive.
1. Partitioning Tables:
Partitioning is one of the best ways to improve query performance on larger tables. A Hive partition divides a large table's data into a number of parts, each stored in a separate sub-directory under the table's location. Partition the data only on columns it is naturally associated with: for example, if your data is associated with locations such as city, state, or country, partition it by the location column. Detailed information about partitioning tables (Hive Partition).
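As a sketch, a location-partitioned table might be declared and loaded like this (the sales and staging_sales tables and their columns are illustrative, not from the article):

```sql
-- Hypothetical sales table, partitioned by location columns.
-- Each (country, state) pair becomes its own sub-directory.
CREATE TABLE sales (
  id BIGINT,
  amount DOUBLE,
  sale_date STRING
)
PARTITIONED BY (country STRING, state STRING);

-- Dynamic partition inserts need these settings enabled:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Partition columns go last in the SELECT list:
INSERT OVERWRITE TABLE sales PARTITION (country, state)
SELECT id, amount, sale_date, country, state FROM staging_sales;
```

A query with a filter such as WHERE country = 'US' then reads only the matching sub-directories instead of scanning the whole table.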
2. De-normalizing data:
Normalization is a set of rules for structuring data to reduce redundancy (duplication). Joins are a very useful concept, but they are difficult and expensive operations to perform and are one of the common reasons for performance issues. Highly normalized table structures force queries to join tables to derive the desired metrics, so it's a good idea to avoid them in Hive and de-normalize where it helps.
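A minimal sketch of the idea, using hypothetical orders and customers tables: storing the frequently queried attribute directly on the fact table removes the join entirely.

```sql
-- Normalized: deriving revenue per city requires a join.
SELECT c.city, SUM(o.amount)
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.city;

-- De-normalized: the city is duplicated onto every order row,
-- so the same metric needs no join at all.
SELECT city, SUM(amount)
FROM orders_denorm
GROUP BY city;
```

The trade-off is extra storage and duplicated values in exchange for cheaper, join-free queries.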
3. Compress map/reduce output:
Compression reduces the size of intermediate data, which in turn reduces the amount of data transferred between mappers and reducers over the network. Compression can be applied separately to map output and to job output. Note that gzip-compressed files are not splittable; Snappy, LZO, and bzip2 are other codec options.
- For map output compression, set mapred.compress.map.output to true.
- For job output compression, set mapred.output.compress to true.
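The two settings above can be combined with a codec choice; a sketch using the older mapred property names the article refers to (newer Hadoop releases renamed them under the mapreduce prefix):

```sql
-- Compress intermediate map output (reduces shuffle traffic):
SET mapred.compress.map.output = true;
SET mapred.map.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final job output:
SET mapred.output.compress = true;
SET mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;
```

Snappy is a common choice for intermediate data because it favors speed over compression ratio.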
4. Map join:
Map joins are very useful when the table on one side of a join is small enough to fit in memory. Hive supports the parameter hive.auto.convert.join; when it is set to true, Hive tries to perform map joins automatically. When relying on this behavior, be sure auto conversion is enabled in the Hive environment.
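A sketch of enabling the automatic conversion described above (the orders and small_dim tables are illustrative):

```sql
-- Let Hive convert joins to map joins automatically when one
-- side is small enough to fit in memory:
SET hive.auto.convert.join = true;

-- Size threshold (bytes) below which a table counts as "small";
-- 25 MB is a commonly cited default:
SET hive.mapjoin.smalltable.filesize = 25000000;

-- The small dimension table is loaded into memory on each mapper,
-- so no reduce phase is needed for the join:
SELECT o.id, d.name
FROM orders o
JOIN small_dim d ON o.dim_id = d.id;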
5. Bucketing:
A table divided into a number of parts is a partitioned Hive table, and each Hive partition can be further subdivided into clusters, or buckets.
For more about bucketing, click here (Hive Buckets).
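A sketch of a table that is both partitioned and bucketed (names and bucket count are illustrative):

```sql
-- Partitioned by country; within each partition, rows are hashed
-- on customer_id into 32 bucket files:
CREATE TABLE sales_bucketed (
  id BIGINT,
  customer_id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (country STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS;

-- On older Hive versions, enable this before inserting so the
-- data is actually written into the declared number of buckets:
SET hive.enforce.bucketing = true;
```

Bucketing helps joins on the bucketed column and makes efficient sampling possible.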
6. Input Format Selection:
Input formats play a very important role in Hive performance. The primary choices are Text, SequenceFile, RCFile, and ORC. For detailed information on the differences between Text, SequenceFile, RCFile, and ORC, click here (Hive Input Format Selection).
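As a sketch, choosing a format is just a STORED AS clause at table creation time (table names are illustrative):

```sql
-- Create a table in the ORC columnar format:
CREATE TABLE sales_orc (
  id BIGINT,
  amount DOUBLE
)
STORED AS ORC;

-- Or convert an existing text-format table by copying it:
CREATE TABLE sales_converted
STORED AS ORC
AS SELECT * FROM sales_text;
```

Columnar formats such as ORC typically read far less data for queries that touch only a few columns.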
7. Parallel execution:
Hadoop can execute MapReduce jobs in parallel, and many Hive queries can take advantage of this: stages of a query that do not depend on each other can run at the same time, which makes better use of the Hadoop cluster, reduces query execution time, and improves overall performance. The configuration in Hive to change this behavior is merely switching a single flag: SET hive.exec.parallel=true. More information: click here (Parallel Execution).
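A minimal sketch of the settings involved; the thread-count parameter is an additional knob not mentioned in the article:

```sql
-- Allow independent stages of a query to run concurrently:
SET hive.exec.parallel = true;

-- Optionally cap how many stages run at once:
SET hive.exec.parallel.thread.number = 8;
```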
8. Vectorization:
By default Hive processes one row at a time, but vectorization allows Hive to process a batch of rows together instead, which improves instruction pipelining and cache usage. To enable vectorization, set this configuration parameter: SET hive.vectorized.execution.enabled=true.
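A sketch of turning it on; the reduce-side flag is a companion setting available in later Hive versions, not something the article mentions:

```sql
-- Process rows in batches instead of one at a time:
SET hive.vectorized.execution.enabled = true;

-- Also vectorize the reduce side where supported:
SET hive.vectorized.execution.reduce.enabled = true;
```

Note that vectorized execution was initially tied to the ORC format, so it pairs naturally with the input-format advice above.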
9. Unit Testing:
Unit testing verifies that a small piece of code works exactly as you expect. With unit testing you can easily find errors and modify code with confidence, which can save a huge amount of time. In Hive, you can unit test UDFs, SerDes, streaming scripts, Hive queries and more. There are several tools available that help you test Hive queries; some you might want to look at are HiveRunner, Hive_test and Beetest.
10. Sampling:
Sampling lets you try a query on a sample of the data without having to analyze the entire data set. The query can still return meaningful results while finishing quicker and consuming fewer compute resources.
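A sketch of Hive's TABLESAMPLE clause, using the illustrative table names from the earlier examples:

```sql
-- Sample roughly 10% of the table's input data:
SELECT * FROM sales TABLESAMPLE(10 PERCENT) s;

-- On a bucketed table, read just one bucket out of 32:
SELECT *
FROM sales_bucketed TABLESAMPLE(BUCKET 1 OUT OF 32 ON customer_id) s;
```

Bucket-based sampling is especially cheap because Hive can read a single bucket file rather than scanning and filtering the whole table.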