Hive Partition Big Data
The general definition of Partition is horizontally dividing the data into number of slice in a equal and manageable manner.Every partition is stored as directory within data warehouse table. In data warehouse this partition concept is common but there is two types of Partitions are available in data warehouse concepts.There are
i) SQL Partition
ii) Hive Partition
The main work of SQL Partition is dividing the large amount of data into number of slices by based on table single column only. SQL Partition is very hard comparing to Hive Partition because SQL server only support one column for Partition.It can be very harder to apply the query,indexes take up more space,and SQL Server’s partitioning requires Enterprise Edition on top of that expensive SAN you’re adding to cope with the extra space.
The main work of Hive partition is also same as SQL Partition but the main difference between SQL partition and hive partition is SQL partition is only supported for single column in table but in Hive partition it supported for Multiple columns in a table .In Hive we can apply Hive Partition concept on Managed tables and External tables.If we not crated dynamic partition for hive, Hive also creates an automatic partition scheme when the table is created
Example Hive Partition
create table cityreport(cityid string,creport string, ctover string)
partitioned by (city string)
row format delimited
fields terminated by ‘|’
stored as textfile;
In the Above table showing how to create a simple Hive partition table.In the above table columns are cityid,creport,ctover and table is partitioned by city.The above table is partitioned by only one column city .
Load Data into Hive Partition Table
hive> load data local inpath ‘/home/mahesh/hive-related/hyderabad.log’ into table cityreport partition (city = ‘hyderabad’);
hive> load data local inpath ‘/home/mahesh/hive-related/bengalore.log’ into table cityreport partition (city = ‘banglore’);
hive> load data local inpath ‘/home/mahesh/hive-related/chennai.log’ into table cityreport partition (city = ‘chennai’);
Now the total table data is partitioned by city,so when the query is applied on table it can easily access the required row by the help partitions
hive> select * from cityreport where cityid = ‘1003_hyd’ and city=’hyderabad';
The above query directly searched in city=’hyderabad’ partiton and not search in other city partition so the data look up time is decreased and the details displayed very quickly because of Hive Partition
Multiple column supported Hive Partition Tables
CREATE TABLE customer ( id INT, name STRING, address1 STRING, address2 STRING, city STRING, state STRING, zip STRING ) PARTITION BY ( REGION STRING, country STRING );
This above approach is great once you’re using Hive in production but it can be tedious to initially load a large data warehouse when you can only write to one partition at a time. There is a better way.
Hive lets us change settings for a session using the
SET command. Changing these settings permanently would require opening a text file and restarting the Hive cluster – it’s not a difficult change, but it’s outside of our scope.
SET hive.exec.dynamic.partition = true; SET hive.exec.dynamic.partition.mode = nonstrict;
A few final important notes on partitioning:
i) Not to over partition your data in hive.If your partitions are relativity small(not so matched) then data scanning and retrieving very difficult comparing to sample scanning.so your table all partitions columns should be close relation to each other columns in a table .
ii) If you want to use hive partition in your table you should enable below query in your hive-site.xml property.
we can also enable hive partition without changing hive-site.xml,just type the above command in your hive terminal but it only works for that session of time only,If you close the terminal the settings also cleared so better to set in hive-site.xml file.