In previous posts we covered Hive partitioning and Hive bucketing, and looked closely at the difference between static and dynamic partitions. Now it is time to look at the main differences between partitioning and bucketing in Hive.
- Hive partitioning divides a large amount of data into a number of folders based on the values of one or more table columns.
- Partitioning is often used to distribute load horizontally; it improves performance and helps organize data in a logical fashion.
- To partition a table, use the PARTITIONED BY (COL1, COL2, …) clause when creating the table; each partition column is declared there together with its type.
- A table can be partitioned on any number of columns.
- Partitioning works on both managed and external Hive tables.
- Partitioning works best when the cardinality of the partitioning field is not too high. For example, if we partition on a date column, a new partition directory is created for every date, which puts a heavy burden on the NameNode's metadata.
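As a rough sketch of the PARTITIONED BY clause, here is a table partitioned by year and month rather than by individual date, which is one way to keep the number of partition directories down. The table and column names are hypothetical and only for illustration.

```sql
-- Hypothetical table: one directory per (year, month) instead of one per day.
CREATE TABLE IF NOT EXISTS page_views (
  user_id   BIGINT,
  url       STRING,
  view_time TIMESTAMP
)
PARTITIONED BY (year INT, month INT);

-- List the partitions (directories) that currently exist for the table.
SHOW PARTITIONS page_views;
```

Whether monthly granularity is fine enough depends on how your queries filter the data.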
Assume that you are storing information about people across the entire world: 196+ countries and around 500 crore (5 billion) entries. If you want to query people from one particular country (say, Vatican City) and the table is not partitioned, you have to scan all 500 crore entries just to fetch the thousand or so entries for that country. If you partition the table by country, the query only has to read the data in that one country's partition. Hive creates a separate directory for each distinct value (or combination of values) of the partition column(s).
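The country example could look roughly like this in HiveQL. This is a minimal sketch: the tables `people` and `people_staging` and their columns are assumptions made for illustration, and the data is loaded with dynamic partitioning so that Hive creates one partition per country found in the staging data.

```sql
-- Hypothetical tables; names and columns are assumptions for illustration.
CREATE TABLE IF NOT EXISTS people_staging (
  person_id BIGINT,
  name      STRING,
  country   STRING
);

CREATE TABLE IF NOT EXISTS people (
  person_id BIGINT,
  name      STRING
)
PARTITIONED BY (country STRING);

-- Let Hive create partitions from the values found in the data.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE people PARTITION (country)
SELECT person_id, name, country
FROM people_staging;

-- Filtering on the partition column lets Hive read only the
-- directory for that country (.../people/country=<value>/).
SELECT person_id, name
FROM people
WHERE country = 'Vatican City';
```

Without the filter on `country`, the same query would have to read every partition directory.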
Advantages with Hive Partition
- Distribute execution load horizontally
- Faster execution of queries when the partition holds a low volume of data, e.g. getting the population of "Vatican City" returns very quickly instead of searching the entire population of the world.
- No need to scan the entire table to find the records that belong to a single partition.
Disadvantages with Hive Partition
- There is a possibility of creating too many folders in HDFS, which is an extra burden on the NameNode's metadata.
- Partitioning is effective when a given partition holds a low volume of data, but some queries, such as a GROUP BY over a high-volume partition, still take a long time to execute. For example, grouping the population of China takes much longer than grouping the population of Vatican City. Partitioning does not solve the responsiveness problem when data is skewed towards a particular partition value.
- So there is no guarantee that queries will be optimized in every case.
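One way to spot both problems, too many partitions and skewed partitions, is to look at how many partitions exist and how many rows each one holds. The sketch below assumes a hypothetical table `people` partitioned by `country`; it is not taken from the original post.

```sql
-- Assumes a hypothetical table `people` partitioned by `country`.

-- How many partition directories does the table have?
SHOW PARTITIONS people;

-- Which partitions hold the most rows? A few very large values
-- at the top of this list indicate skew.
SELECT country, COUNT(*) AS row_count
FROM people
GROUP BY country
ORDER BY row_count DESC
LIMIT 10;
```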
- Hive bucketing divides the data into a fixed number of roughly equal parts (buckets).
- To bucket a table, use the CLUSTERED BY (Col) INTO n BUCKETS clause when creating the table.
- Bucketing works on both managed and external Hive tables.
- Bucketing is usually done on a single column, although the CLUSTERED BY clause also accepts a list of columns.
- The value of this column is hashed, and the hash modulo the user-defined number of buckets decides which bucket a row goes to.
- Bucketing works well when the field has high cardinality and the data is evenly distributed among the buckets. If your queries run against a date, timestamp, or other column with a large number of distinct values, bucketing is a good fit.
- We assign the number of buckets when creating the table.
- Bucketing is also very useful for efficient map-side joins.
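A bucketed table could be declared roughly as follows. The table `user_events`, its columns, and the bucket count of 32 are hypothetical choices for illustration only.

```sql
-- Hypothetical table bucketed on a high-cardinality column.
CREATE TABLE IF NOT EXISTS user_events (
  user_id    BIGINT,
  event_type STRING,
  event_time TIMESTAMP
)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Sample one bucket out of 32 on the bucketing column.
SELECT *
FROM user_events TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id);
```

Sampling on the bucketing column is one place where the fixed bucket layout pays off, since Hive can pick out whole buckets instead of scanning every row.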
Clustering (aka bucketing), on the other hand, results in a fixed number of files, since you specify the number of buckets. Hive takes the field, calculates a hash, and assigns the record to the corresponding bucket.
But what happens if you use, say, 256 buckets and the field you're bucketing on has low cardinality (for instance, a US state, so only 50 different values)? You'll have at most 50 buckets with data and at least 206 buckets with no data.
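You can get a feel for this before choosing a bucket count by counting how many distinct bucket numbers a column would map to. The sketch below uses Hive's built-in `hash` and `pmod` functions on a hypothetical `customers` table; it only approximates the bucket assignment, because the exact hash Hive uses internally depends on the Hive version.

```sql
-- Hypothetical table with a low-cardinality column `state`.
CREATE TABLE IF NOT EXISTS customers (
  customer_id BIGINT,
  name        STRING,
  state       STRING
);

-- Approximate number of buckets (out of 256) that would receive data.
-- With only 50 distinct states this can never exceed 50.
SELECT COUNT(DISTINCT pmod(hash(state), 256)) AS non_empty_buckets
FROM customers;
```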
Advantages with Hive Bucketing
- Because each bucket holds a roughly equal volume of data, joins on the map side are quicker.
- Faster query response, as with partitioning.
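A map-side join over bucketed tables might be set up along these lines. The tables `orders` and `customers_bucketed` are hypothetical, both bucketed on the join key with the same number of buckets. Whether Hive actually chooses a bucket map join depends on the Hive version and configuration, so treat this as a sketch rather than a guaranteed plan.

```sql
-- Hypothetical tables, both bucketed on the join key with the same bucket count.
CREATE TABLE IF NOT EXISTS orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DOUBLE
)
CLUSTERED BY (customer_id) INTO 32 BUCKETS;

CREATE TABLE IF NOT EXISTS customers_bucketed (
  customer_id BIGINT,
  name        STRING
)
CLUSTERED BY (customer_id) INTO 32 BUCKETS;

-- Ask the optimizer to consider a bucket map join.
SET hive.optimize.bucketmapjoin = true;

-- The MAPJOIN hint may be ignored on newer Hive versions, depending on settings.
SELECT /*+ MAPJOIN(c) */ o.order_id, c.name, o.amount
FROM orders o
JOIN customers_bucketed c
  ON o.customer_id = c.customer_id;
```

Because matching keys land in matching buckets, each mapper only needs the corresponding bucket of the smaller table rather than the whole table.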
Disadvantages with Hive Bucketing
- You can define the number of buckets during table creation, but loading an equal volume of data into each bucket has to be handled manually by the programmer.
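In practice this usually means populating the bucketed table with an INSERT ... SELECT from another table, so that Hive does the hashing. A minimal sketch follows, with hypothetical tables `user_events_staging` and `user_events`. The `hive.enforce.bucketing` setting applies to older Hive releases; newer releases enforce bucketing on insert by default, so the line may need to be dropped there.

```sql
-- Hypothetical unbucketed staging table holding the raw rows.
CREATE TABLE IF NOT EXISTS user_events_staging (
  user_id    BIGINT,
  event_type STRING,
  event_time TIMESTAMP
);

-- Hypothetical bucketed target table.
CREATE TABLE IF NOT EXISTS user_events (
  user_id    BIGINT,
  event_type STRING,
  event_time TIMESTAMP
)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Older Hive releases only: make inserts honor the bucket definition.
-- Omit this line on releases that no longer recognize the setting.
SET hive.enforce.bucketing = true;

-- Hive hashes user_id and writes each row to the matching bucket file.
INSERT OVERWRITE TABLE user_events
SELECT user_id, event_type, event_time
FROM user_events_staging;
```

How evenly the buckets fill still depends on how the values of the bucketing column are distributed.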
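A table can be both partitioned and bucketed:
-- The opening of the original statement is missing; the table name here is a placeholder.
CREATE TABLE employee_bucketed (
employee_id int )
PARTITIONED BY (year STRING, month STRING, day STRING)
CLUSTERED BY (employee_id) INTO 256 BUCKETS;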
This is the main difference between partitioning and bucketing in Hive.