Hive Buckets Optimization Techniques



Hive Buckets

Dividing a table into a number of partitions is called Hive partitioning. Each Hive partition can be further subdivided into clusters, or buckets.

We already learned about Hive partitioning and SQL partitioning concepts in our last post. Now we look at Hive buckets. Bucketing is another technique for decomposing data into more manageable, roughly equal parts. For example, suppose we have an employee table with columns such as date, employee_name, employee_id, salary, and leaves. Using the date column as the top-level partition and employee_id as a second-level partition would lead to far too many small partitions. So instead, the employee table is partitioned by date and bucketed by employee_id. The value of that column is hashed into a user-defined number of buckets, so records with the same employee_id will always be stored in the same bucket.
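The employee example above can be sketched as a table definition. This is an illustrative sketch only: the column types and the bucket count of 32 are assumptions, not values from the original post, and `date` is backquoted because it clashes with a type keyword.

```sql
-- Hypothetical employee table: partitioned by date, bucketed by employee_id.
CREATE TABLE employee (
    employee_name STRING,
    employee_id   INT,
    salary        DOUBLE,
    leaves        INT
)
PARTITIONED BY (`date` STRING)              -- top-level split: one directory per date
CLUSTERED BY (employee_id) INTO 32 BUCKETS; -- second-level split: fixed number of files
```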

With partitioning alone there is a real chance of creating thousands of tiny partitions. With Hive buckets this cannot happen, because the number of buckets is fixed: we must declare it at table-creation time. Another difference: each Hive partition is created as a directory, while each bucket is created as a file. Partitions are declared with PARTITIONED BY, buckets with CLUSTERED BY.
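The directory-versus-file distinction can be pictured as an on-disk layout. The paths below are illustrative, assuming the default Hive warehouse location and the hypothetical employee table partitioned by date with two buckets:

```
/user/hive/warehouse/employee/date=2015-01-01/000000_0   <- bucket file
/user/hive/warehouse/employee/date=2015-01-01/000001_0
/user/hive/warehouse/employee/date=2015-01-02/000000_0
/user/hive/warehouse/employee/date=2015-01-02/000001_0
```

Each `date=...` partition is a directory; each numbered file inside it is one bucket.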

Also read: Difference Between Partitioning and Bucketing in Hive

Advantages With Hive Buckets

The number of buckets is fixed, so it does not fluctuate with the data

Enables more efficient queries

Makes sampling more efficient

hash(column) MOD (number of buckets) distributes records evenly across buckets

Useful for query optimization
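The hash-MOD rule above can be sketched in Python. This is a stand-in, not Hive's actual implementation: Hive hashes an integer column to the value itself, and Python's `hash()` happens to do the same for small ints, so the bucket arithmetic comes out the same.

```python
def bucket_for(value, num_buckets):
    """Bucket assignment: hash(column) MOD (number of buckets)."""
    # For integer keys both Hive's hash and Python's hash() return the
    # value itself, so this mirrors Hive's bucket arithmetic.
    return hash(value) % num_buckets

# Records with the same employee_id always land in the same bucket,
# and keys spread across the fixed number of buckets.
ids = [101, 102, 103, 101, 126]
print([bucket_for(i, 25) for i in ids])  # note: 101 and 126 share bucket 1
```

Note that distinct keys can share a bucket (101 and 126 both hash to bucket 1 with 25 buckets); bucketing groups rows, it does not give each key its own file.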

CREATE TABLE orders (      -- "order" is a reserved word, so the table is named "orders"
    username   STRING,
    orderdate  STRING,
    amount     DOUBLE,
    tax        DOUBLE
) PARTITIONED BY (company STRING)
CLUSTERED BY (username) INTO 25 BUCKETS;

Here we divided the table into 25 buckets. Set the number of reducers to match the number of buckets specified in the table metadata (i.e. 25):

set mapred.reduce.tasks = 25;

Alternatively, use the following command to enforce bucketing; Hive will then set the reducer count to match the bucket count automatically:

set hive.enforce.bucketing = true;
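With bucketing enforced, loading the table and sampling a single bucket might look like the sketch below. The source table `raw_orders` is hypothetical, and `orders` refers to the bucketed example table above (renamed from `order`, which is a reserved word):

```sql
-- Populate the bucketed table; with hive.enforce.bucketing = true,
-- Hive spawns one reducer per bucket so each bucket file is written correctly.
INSERT OVERWRITE TABLE orders PARTITION (company = 'acme')
SELECT username, orderdate, amount, tax
FROM raw_orders            -- hypothetical staging table
WHERE company = 'acme';

-- Sampling reads only 1 of the 25 bucket files instead of scanning everything.
SELECT * FROM orders TABLESAMPLE (BUCKET 1 OUT OF 25 ON username);
```

This is where the sampling advantage comes from: because rows were clustered by username at load time, Hive can answer the TABLESAMPLE query by reading a single bucket file.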
[Screenshot: creating the Hive bucketed table]

Load Data Into Table

[Screenshot: loading data into the bucketed table]

Check the screenshot below: three files named 000000_0, 000001_0, and 000002_0 have been created; these are our bucket data files.

[Screenshot: bucket files created after the data load]
