Introduction to Pig Latin in Hadoop

Pig Latin is a data flow language. Each processing step results in a new dataset, called a relation.
In x = load 'a'; x is the name of the relation that holds the result. Once made, an assignment is permanent. It is possible to reuse a relation name in a later statement, but doing so creates a new relation under the same name rather than changing the old one. Pig Latin also has field names. For example, in x = load 'a' as (b, c, d); the field names are b, c, and d. Both relation names and field names must start with an alphabetic character.

Case Sensitivity

Pig Latin is case sensitive in some places and not in others. Keywords are not case sensitive: LOAD is equivalent to load.
Relation names and field names are case sensitive: A = load 'b'; is not equivalent to a = load 'b';
UDF names are also case sensitive: count is not equivalent to COUNT.
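
The short script below is a minimal sketch of these rules. The file name 'b' comes from the example above; the relation names grpd and n are made up for illustration.

```pig
-- Keywords are not case sensitive: LOAD and load are the same operator.
A = LOAD 'b';
a = load 'b';
-- Relation names are case sensitive, so A and a are two separate relations.

-- Function names are case sensitive: COUNT is the built-in function,
-- while count would not be found.
grpd = group A all;
n = foreach grpd generate COUNT(A);
```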

Pig Latin supports two kinds of comments:
1) SQL-style single-line comments (--)
2) Java-style multiline comments (/* */)
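
As a brief illustration, both comment styles can appear in the same script. The file name 'input' and the relation names are placeholders, not taken from a real dataset.

```pig
/*
 * A multiline comment can span
 * several lines.
 */
A = load 'input'; -- a single-line comment runs to the end of the line
B = limit A /* a multiline comment can also sit inside a statement */ 10;
```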

Input and Output


The first step in any data flow is to specify the input, which is done with the load keyword. By default, load looks for your data on HDFS in a tab-delimited file using the default load function, PigStorage. If we want to load data from HBase instead, we would use the HBase loader, HBaseStorage.
Example of the PigStorage loader:

A = LOAD '/home/ravi/work/flight.tsv' using PigStorage('\t') AS (origincode:chararray, destinationcode:chararray, origincity:chararray, destinationcity:chararray, passengers:int, seats:int, flights:int, distance:int, year:int, originpopulation:int, destpopulation:int);
Example of the HBaseStorage loader:
x = load 'a' using HBaseStorage();

If you do not specify a load function, the built-in PigStorage is used.
The load statement can also take an as clause, which lets you specify the schema of the data you are loading.
PigStorage and TextLoader are the two built-in Pig load functions that operate on HDFS files.
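
To make the difference between the two built-in loaders concrete, here is a small sketch. The file 'input.tsv' and its two columns are hypothetical.

```pig
-- 'input.tsv' is a hypothetical tab-delimited file with two columns.

-- No using clause: PigStorage is used and splits each line on tabs.
fields = load 'input.tsv' as (name:chararray, city:chararray);

-- TextLoader does no splitting: each line becomes one chararray field.
lines = load 'input.tsv' using TextLoader() as (line:chararray);
```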

Once processing is complete, the result needs to be written somewhere. Pig provides the store statement for this purpose:
store processed into '/data/ex/process';
If you do not specify a store function, PigStorage will be used. You can specify a different store function with a using clause:

store processed into 'processed' using HBaseStorage();

We can also pass arguments to the store function. For example, this writes comma-separated output:

store processed into 'processed' using PigStorage(',');


The dump statement displays the output on the screen. It takes a relation name, not a quoted path:
dump processed;

Relational operations:
Relational operations are the main tools for operating on your data. They allow you to transform it by sorting, grouping, joining, projecting, and filtering.

foreach takes a set of expressions and applies them to every record in the data pipeline:
A = load 'input' as (user:chararray, id:long, address:chararray, phone:chararray, preferences:map[]);
B = foreach A generate user, id;
Positional references are preceded by a $ (dollar sign) and start from 0. Here d is a previously loaded relation whose second and third fields are numeric:

c = foreach d generate $2 - $1;

A range of fields can be referenced with .. (two dots). Suppose we want all four fields:

A = load 'input' as (high, mediumhigh, avg, low);
B = foreach A generate high..low; -- produces high, mediumhigh, avg, low
C = foreach A generate ..low;     -- produces high, mediumhigh, avg, low

To extract data from complex data types such as tuples, bags, and maps, use the projection operators.
For a map, use # followed by the key as a quoted string:
bball = load 'baseball' as (name:chararray, team:chararray, position:bag{t:(p:chararray)}, bat:map[]);
avg = foreach bball generate bat#'batting_average';

For a tuple, use . (dot) followed by a field name or position:
A = load 'input' as (t:tuple(x:int, y:int));
B = foreach A generate t.x, t.$1;

When you project fields in a bag, you are creating a new bag with only those fields:
A = load 'input' as (b:bag{t:(x:int, y:int)});
B = foreach A generate b.x;

We can also project multiple fields in a bag:
A = load 'input' as (b:bag{t:(x:int, y:int)});
B = foreach A generate b.(x, y);

Filters are similar to the where clause in SQL. The filter statement contains a predicate. If that predicate evaluates to true for a given record, that record will be passed down the pipeline. Otherwise, it will not. Predicates can contain the comparison operators ==, !=, >, >=, <, and <=. Of these, == and != can also be applied to maps and tuples.

A = load 'inputs' as (name:chararray, address:chararray);
B = filter A by name matches 'CM.*';
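
As a further sketch, predicates can be combined with and, or, and not. The file 'routes' and its fields are hypothetical and only serve to show the syntax.

```pig
-- Hypothetical input: one record per flight route.
routes = load 'routes' as (origin:chararray, passengers:int, distance:int);

-- Keep long routes that carried at least one passenger.
busy_long = filter routes by distance >= 1000 and passengers > 0;

-- not inverts a predicate: keep origins that do not start with CM.
not_cm = filter routes by not origin matches 'CM.*';
```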

The group statement collects together records with the same key. In SQL the group by clause creates a group that must feed directly into one or more aggregate functions. In Pig Latin there is no direct connection between group and aggregate functions:
input2 = load 'daily' as (exchanges, stocks);
grpds = group input2 by stocks;
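
Because group only collects records, an aggregate is applied in a separate foreach step when one is wanted. The sketch below counts the records per key; the relation names daily, grpd, and cnt are chosen for illustration.

```pig
daily = load 'daily' as (exchanges, stocks);
grpd = group daily by stocks;
-- Each record of grpd has two fields: group (the key) and
-- daily (a bag holding every input record with that key).
cnt = foreach grpd generate group, COUNT(daily);
```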

The order statement sorts your data for you, producing a total order of your output data. The syntax of order is similar to group. You indicate a key or set of keys by which you wish to order your data:
input2 = load 'daily' as (exchanges, stocks);
grpds = order input2 by exchanges;
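
A sort on several keys lists them separated by commas, without parentheses, and desc after a key reverses the order for that key only. A minimal sketch, with illustrative relation names:

```pig
daily = load 'daily' as (exchanges, stocks);
-- Sort by exchanges ascending, then by stocks descending within each exchange.
srtd = order daily by exchanges, stocks desc;
```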

The distinct statement is very simple. It removes duplicate records. It works only on entire records, not on individual fields, so it takes a relation name:
input2 = load 'daily' as (exchanges, stocks);
grpds = distinct input2;
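
Since distinct compares whole records, getting the unique values of a single field takes a projection first. A small sketch, with illustrative relation names:

```pig
daily = load 'daily' as (exchanges, stocks);
-- Project the single field first, then remove duplicate records.
exch = foreach daily generate exchanges;
uniq = distinct exch;
```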

Join selects records from one input and joins them with records from another input. This is done by indicating keys for each input. When those keys are equal, the two rows are joined.
input2 = load 'daily' as (exchanges, stocks);
input3 = load 'week' as (exchanges, stocks);
grpds = join input2 by stocks, input3 by stocks;

We can also join on multiple keys:
input2 = load 'daily' as (exchanges, stocks);
input3 = load 'week' as (exchanges, stocks);
grpds = join input2 by (exchanges, stocks), input3 by (exchanges, stocks);
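
The joined relation keeps the fields of both inputs. When both inputs use the same field name, as they do here, the field has to be qualified with its input relation and the :: operator. A minimal sketch, with illustrative relation names:

```pig
daily = load 'daily' as (exchanges, stocks);
week = load 'week' as (exchanges, stocks);
jnd = join daily by stocks, week by stocks;
-- Both inputs have fields named exchanges and stocks, so each
-- reference names the input relation it comes from.
pairs = foreach jnd generate daily::stocks, week::exchanges;
```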

Sometimes you want to see only a limited number of results. limit allows you to do this:

input2 = load 'daily' as (exchanges, stocks);
first10 = limit input2 10;
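
On its own, limit gives no guarantee about which records are returned. To get the first records in a particular order, sort immediately before limiting. A short sketch, with illustrative relation names:

```pig
daily = load 'daily' as (exchanges, stocks);
-- Sort first so that limit returns the top 10 in sort order.
srtd = order daily by stocks desc;
top10 = limit srtd 10;
```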
