Introduction To Hive Partition Big Data

Our HadoopTpoint App is now available in google play store,please rate and comment it in play store : W3Schools

Hive Partition Big Data

Partition

The general definition of Partition is  horizontally  dividing the data into number of slice in a equal and manageable manner.Every partition is stored as directory within data warehouse table. In data warehouse this partition concept is common but there is two types of Partitions are available in data warehouse concepts.There are

i) SQL Partition

ii) Hive Partition

SQL Partition

The main work of SQL Partition is dividing the large amount of data into number of slices by based on table single column only. SQL Partition is very hard comparing to Hive Partition because SQL server only support one column for Partition.It can be very harder to apply the query,indexes take up more space,and SQL Server’s partitioning requires Enterprise Edition on top of that expensive SAN you’re adding to cope with the extra space.

Hive Partition 

The main work of Hive partition is also same as SQL Partition but the main difference between SQL partition and hive partition is SQL partition is only supported for single column in table  but in Hive partition it supported for Multiple columns in a table .In Hive we can apply Hive Partition concept on Managed tables and External tables.If we not crated dynamic partition for hive, Hive also creates an automatic partition scheme when the table is created

Example Hive Partition 

create table cityreport(cityid string,creport string, ctover string)
partitioned by (city string)
row format delimited
fields terminated by ‘|’
stored as textfile;

In the Above table showing how to create a simple Hive partition table.In the above table columns are cityid,creport,ctover and table is partitioned by city.The above table is partitioned by only one column city .

Load Data into Hive Partition Table

hive> load data local inpath ‘/home/mahesh/hive-related/hyderabad.log’ into table cityreport partition (city = ‘hyderabad’);

hive> load data local inpath ‘/home/mahesh/hive-related/bengalore.log’ into  table cityreport  partition (city = ‘banglore’);

hive> load data local inpath ‘/home/mahesh/hive-related/chennai.log’ into table cityreport partition (city = ‘chennai’);

Now the total table data is partitioned by city,so when the query is applied on table it can easily access the required row by the help partitions

hive> select * from cityreport where cityid = ‘1003_hyd’ and city=’hyderabad';

The above query directly searched in city=’hyderabad’ partiton and not search in other city partition so the data look up time is decreased and the details displayed very quickly because of Hive Partition  

Multiple column supported Hive Partition Tables

CREATE TABLE customer (
    id         INT,
    name         STRING,
    address1     STRING,
    address2     STRING,
    city         STRING,
    state         STRING,
    zip         STRING    
)
PARTITION BY (
    REGION        STRING,
    country        STRING
);

 This above approach is great once you’re using Hive in production but it can be tedious to initially load a large data warehouse when you can only write to one partition at a time. There is a better way.

AUTOMATIC PARTITIONING

Hive lets us change settings for a session using the SET command. Changing these settings permanently would require opening a text file and restarting the Hive cluster – it’s not a difficult change, but it’s outside of our scope.

SET hive.exec.dynamic.partition = true; SET hive.exec.dynamic.partition.mode = nonstrict;

 

A few final important notes on partitioning:

i) Not to over partition your data in hive.If your partitions are relativity small(not so matched) then data scanning and retrieving very difficult  comparing to sample scanning.so your table all partitions columns should be close relation to each other columns in a table  .

ii) If you want to use hive partition in your table you should enable below query in your hive-site.xml property.

set hive.mapred.mode=strict

 we can also enable hive partition without changing hive-site.xml,just type the above command in your hive terminal but it only works for that session of time only,If you close the terminal the settings also cleared so better to set in hive-site.xml file.

 

Comments

  1. DHRUV ARORA says:

    Hi Mahesh,

    In the case of cityreport table, it is partitioned by city but city is not the column in the table definition. Can you please help me in understanding this ?
    Thanks
    Dhruv

    • mahesh chimmiri says:

      if you mentioned the column name in table along with partition at that time you will get semanticExcepetion [Error 10035] : Column repeated in partitioning columns

  2. ambati naresh says:

    hi

    how how to write sqoop export commnad for exporting HDFS data into Mysql ,would you please give me the brief idea about

  3. Muthusamy says:

    Can you give the sample data you have tried? so that it will be easy to practices us..

Speak Your Mind

*