Apache Pig Introduction

Pig is an Apache open source project that runs on Hadoop and provides an engine for executing data flows in parallel. It includes a language called Pig Latin for expressing these data flows. Pig Latin supports operations such as join, sort, and filter, and also lets you write User Defined Functions (UDFs) for processing, reading, and writing data. Pig uses both HDFS and MapReduce, i.e., storing and processing.
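For instance, a UDF plugs straight into a data flow. A hypothetical sketch, assuming a jar named myudfs.jar that contains a Normalize function (both invented for the example):

REGISTER myudfs.jar;
logs  = LOAD 'weblogs' AS (url:chararray);              -- 'weblogs' is an assumed input
clean = FOREACH logs GENERATE myudfs.Normalize(url);    -- hypothetical UDF call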

Pig Latin is a data flow language: it reads data from one or more inputs, processes it, and stores the output in HDFS as one or more results. This makes it different from other languages; it has no loops or if conditions, because traditional procedural and object-oriented languages describe control flow, with data flow as a side effect of the program. Put precisely, a Pig Latin script describes a directed acyclic graph (DAG) in which the nodes are operators that process the data and the edges are the data flows between them.
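A minimal Pig Latin data flow might look like the sketch below; the input path, delimiter, and field names are assumptions made for the example:

users  = LOAD '/data/users.txt' USING PigStorage(',') AS (name:chararray, age:int);
adults = FILTER users BY age >= 18;   -- one processing step in the flow
STORE adults INTO '/output/adults';   -- final output written to HDFS

Each statement is a node in the DAG, and the named relations (users, adults) are the edges connecting them.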
Pig Latin is, roughly, a procedural version of SQL. Pig certainly has similarities with SQL, but more differences. SQL is a query language: the user asks a question in query form, and SQL describes the answer wanted but not how to compute it. If a user wants to perform multiple operations on tables, they must write multiple queries and store intermediate results in temporary tables, or use subqueries. Many SQL users find subqueries confusing and difficult to form properly, because they create an inside-out design where the first step in the data pipeline is the innermost query. Pig is designed with long series of data operations in mind, so there is no need to write the data pipeline as an inverted set of subqueries or to worry about storing data in temporary tables.
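Here is a hedged sketch of such a pipeline written top-down, with no subqueries or temporary tables (the input files, field names, and relation names are invented for the example):

users   = LOAD 'users'  AS (name:chararray, city:chararray);
orders  = LOAD 'orders' AS (name:chararray, amount:double);
joined  = JOIN users BY name, orders BY name;
grouped = GROUP joined BY users::city;
totals  = FOREACH grouped GENERATE group AS city, SUM(joined.orders::amount) AS total;

Each step names its result and the next step consumes it directly, so the script reads in the same order it executes.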

SQL is designed for RDBMSs, which normalize data, enforce schemas, and apply constraints to the columns of a table. Pig is designed for the Hadoop data-processing environment, where schemas are sometimes unknown or inconsistent. In Pig, data is loaded directly into HDFS rather than into tables.
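Because the schema may be unknown, declaring one at load time is optional. A small sketch, with the path and field names assumed:

raw   = LOAD '/logs/clicks';                 -- no schema: fields are referenced by position, e.g. $0, $1
typed = LOAD '/logs/clicks' AS (user:chararray, url:chararray, ts:long);   -- schema supplied at load time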

In MapReduce, a group-by is performed on the reduce side, while filters and projections are implemented in the map phase. Pig Latin provides standard operations similar to these, such as FILTER, GROUP BY, and ORDER BY. A Pig script can be analyzed to understand its data flow, and errors can be found early. Pig Latin is also much lower cost to write and maintain than Java code for MapReduce.
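As a sketch of how these operators line up with MapReduce phases (the input path and fields are assumptions):

logs  = LOAD 'weblogs' AS (ip:chararray, bytes:int);
big   = FILTER logs BY bytes > 1024;          -- filter: runs in the map phase
by_ip = GROUP big BY ip;                      -- group by: runs in the reduce phase
hits  = FOREACH by_ip GENERATE group AS ip, COUNT(big) AS cnt;
srtd  = ORDER hits BY cnt DESC;               -- order by: compiled into a separate sort job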

Pig is used in three broad categories: 1) ETL data pipelines 2) research on raw data 3) iterative processing.
The most common use case for Pig is the data pipeline. For example, web-based companies collect weblogs, and before storing the data in a warehouse they apply transformations such as cleaning and aggregation operations, as sketched below.
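A hedged sketch of such a weblog pipeline; the paths, field names, and cleaning rule are invented for illustration:

raw     = LOAD '/logs/raw' AS (ip:chararray, url:chararray, status:int);
cleaned = FILTER raw BY status IS NOT NULL AND status != 404;    -- cleaning step
by_url  = GROUP cleaned BY url;                                  -- aggregation step
agg     = FOREACH by_url GENERATE group AS url, COUNT(cleaned) AS views;
STORE agg INTO '/warehouse/url_views';                           -- hand off to the warehouse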

Pig is a good choice when you need to process gigabytes or terabytes of data. It is not a good choice for looking up individual records in random order.

Datatypes:

Pig has two kinds of datatypes:
1) Scalar datatypes
int - 4 bytes
float - 4 bytes
double - 8 bytes
long - 8 bytes
chararray
bytearray
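As a sketch, each scalar type can be declared in a LOAD schema (the path and field names are assumptions):

rec = LOAD 'data' AS (id:int, ratio:float, price:double, ts:long, name:chararray, payload:bytearray);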

2) Complex datatypes
Complex types can be nested: for example, a map whose value field is a bag containing tuples, where one of the tuple's fields is itself a map.
map:
A map in Pig is a mapping from a chararray key to a data element, where the element can be of any Pig datatype, including a complex type.
Example of a map: ['city'#'hyd', 'pin'#500086]
In this example, city and pin are the keys, mapping to the values 'hyd' and 500086.
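A small sketch of loading and reading a map field (the path and names are assumed); values are looked up with the # operator:

people = LOAD 'people' AS (name:chararray, details:map[]);
cities = FOREACH people GENERATE name, details#'city' AS city;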

tuple:
A tuple is a fixed-length, ordered collection of fields, and each field can hold any datatype.
For example, (hyd, 500086) is a tuple containing two fields.
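Fields inside a tuple are projected with a dot. A sketch with assumed names:

recs = LOAD 'recs' AS (name:chararray, location:tuple(city:chararray, pin:int));
pins = FOREACH recs GENERATE name, location.pin;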

bag:
A bag is an unordered collection of tuples. Bag constants are constructed using braces, with the tuples in the bag separated by commas. For example, {('hyd', 500086), ('chennai', 510071), ('bombay', 500185)}
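Bags are usually produced by grouping and can be flattened back into rows. A sketch with assumed names:

recs = LOAD 'recs' AS (name:chararray, places:bag{t:(city:chararray, pin:int)});
flat = FOREACH recs GENERATE name, FLATTEN(places);   -- one output row per tuple in the bag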
