Data transformation using Pig - Hadoop

I have a CSV file with two columns that I need to add together: salary and bonus (where the salary values contain comma separators). The addition is not working in Pig, even when I try casting. Below is the screenshot of the dataset:
I used the Pig script below:
register /home/ravimishra/piggybank-0.15.0.jar;
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
emp_details_header = LOAD 'data/employee.csv' USING CSVLoader AS (id: int, name: chararray, address: chararray, occupation: chararray,salary: chararray,bonus: double);
ranked = rank emp_details_header;
NoHeader = Filter ranked by (rank_emp_details_header > 1);
B = FOREACH NoHeader GENERATE id,name,address,occupation, (double)salary + bonus as total ;
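If salary contains thousands separators such as 1,200, casting the chararray straight to double yields null, so the sum comes out null as well. One possible fix, sketched against the relations in the script above, is to strip the commas with REPLACE before casting:

```pig
-- REPLACE takes a regex; ',' matches every comma in the field.
B = FOREACH NoHeader GENERATE id, name, address, occupation,
    (double)REPLACE(salary, ',', '') + bonus AS total;
```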

Related

How to merge rows (items) of same relation in Apache Pig

I'm new to Apache Pig.
I have data like the below.
tempdata =
(linsys4f-PORT42-0211201516244460,dnis=3007047505)
(linsys4f PORT42-0211201516244460,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC)
(linsys4f-PORT42-0211201516244460,language=ENGLISH)
(linsys4f-PORT42-0211201516244460,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019)
(linsys4f-PORT43-0211201516245465,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019)
(linsys4f-PORT44-0211201516291287,dnis=3007047505)
(linsys4f-PORT44-0211201516291287,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC)
I need to merge the rows according to the key, i.e. linsys4f-PORT42-0211201516244460, linsys4f-PORT43-0211201516245465 & linsys4f-PORT44-0211201516291287.
The output should look like:
(linsys4f-PORT42-0211201516244460,dnis=3007047505,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC,language=ENGLISH,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019)
(linsys4f-PORT43-0211201516245465,dnis=3007047505,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC,language=SPANISH)
(linsys4f-PORT43-0211201516245465,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019,dnis=3007047505,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC).
How can I merge these rows? Any help will be appreciated.
Try using the GROUP BY operator and FLATTEN to solve this.
I have separated your first field into link, port name, and port id for a clearer picture:
A = LOAD '/home/coe_user_1/del/data.txt' USING PigStorage(',') AS
(port : CHARARRAY, dnis : CHARARRAY, incoming_tfn : CHARARRAY, tfn_location : CHARARRAY, ivr_location : CHARARRAY,state : CHARARRAY, language : CHARARRAY, outcome : CHARARRAY, exitType : CHARARRAY, exitState : CHARARRAY);
B = FOREACH A GENERATE
FLATTEN(STRSPLIT(port, '-', 3)) as (link: chararray, port: chararray, pid: int),
dnis AS dnis,
incoming_tfn AS incoming_tfn,
tfn_location AS tfn_location,
ivr_location AS ivr_location,
state AS state,
language AS language,
outcome AS outcome,
exitType AS exitType,
exitState AS exitState;
C = FOREACH B GENERATE
port AS port,
--pid AS pid,
dnis AS dnis,
incoming_tfn AS incoming_tfn,
tfn_location AS tfn_location,
ivr_location AS ivr_location,
state AS state,
language AS language,
outcome AS outcome,
exitType AS exitType,
exitState AS exitState;
D = GROUP C BY port;
E = FOREACH D GENERATE
group AS port,FLATTEN(BagToTuple(C.dnis)) AS dnis, FLATTEN(BagToTuple(C.incoming_tfn)) AS incoming_tfn, FLATTEN(BagToTuple(C.tfn_location)) AS tfn_location, FLATTEN(BagToTuple(C.ivr_location)) AS ivr_location ,FLATTEN(BagToTuple(C.state)) AS state,FLATTEN(BagToTuple(C.language)) AS language, FLATTEN(BagToTuple(C.outcome)) AS outcome,FLATTEN(BagToTuple(C.exitType)) AS exitType,FLATTEN(BagToTuple(C.exitState)) AS exitState ;
DUMP E;
Output:
(PORT42,outcome=Transfer to CSR,language=ENGLISH,incoming_tfn=8778816235,dnis=3007047505,exitType=Transfer,,tfn_location=Ashburn Avaya,,exitState=SETDIR2^7990019,,ivr_location=Ashburn Avaya,,,,state=NC,,,,,,,,,,,,,,,,,,,,,)
(PORT43,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019,,,,,,)
(PORT44,incoming_tfn=8778816235,dnis=3007047505,tfn_location=Ashburn Avaya,,ivr_location=Ashburn Avaya,,state=NC,,,,,,,,,,,)
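An alternative sketch that avoids the trailing empty columns: keep each row's key/value payload as a single chararray and join the per-key bag with the built-in BagToString (this assumes the first comma always separates the key from the rest of the line):

```pig
A = LOAD 'data.txt' USING TextLoader() AS (line:chararray);
-- Split each line into (key, payload) at the first comma.
B = FOREACH A GENERATE
    REGEX_EXTRACT(line, '([^,]+),(.*)', 1) AS key,
    REGEX_EXTRACT(line, '([^,]+),(.*)', 2) AS payload;
C = GROUP B BY key;
-- BagToString joins the payload strings with the given delimiter.
D = FOREACH C GENERATE group AS key, BagToString(B.payload, ',') AS merged;
DUMP D;
```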

How to access each element of Date in Pig Latin?

Query:
records = LOAD 'input' using PigStorage(' ') as (id:int, name:chararray, desination:chararray, date:chararray, salary: long);
Sample input:
(10102,neha,developer,14/02/13,32000)
(10103,deva,admin,02/02/14,40000)
(10102,neha,developer,01/01/14,45000)
(10245,sasi,developer,01/01/14,20000)
(10109,surya,manager,01/02/2014,56000)
(10102,neha,developer,01/02/2014,45000)
(10245,sasi,developer,02/01/2014,25000)
I want to filter the above data based on the year of the date (not the entire date).
Check if this works for you.
records = LOAD '/home/abhijit/Downloads/movies.txt' using PigStorage(',') as (id:int, name:chararray, desination:chararray, date:chararray, salary:int);
todate_data = foreach records generate name,desination,ToDate(date,'dd/MM/yyyy') as (date_time:DateTime);
getyear_data = foreach todate_data generate name,desination,GetYear(date_time);
groupByYear = group getyear_data by $2;
The final output will be :
(2013,{(neha,developer,2013)})
(2014,{(deva,admin,2014),(neha,developer,2014),(sasi,developer,2014),(surya,manager,2014),(neha,developer,2014),(sasi,developer,2014)})
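If the goal is to filter (rather than group) by year, a sketch using the same ToDate/GetYear pair, reusing the relation records from the script above; note the sample data mixes dd/MM/yy and dd/MM/yyyy, so a single format string will misparse some rows:

```pig
todate_data = FOREACH records GENERATE id, name, desination, salary,
              ToDate(date, 'dd/MM/yyyy') AS date_time;
-- Keep only the rows whose year is 2014.
only_2014 = FILTER todate_data BY GetYear(date_time) == 2014;
```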

Get the count by iterating over a data bag, with a different count for each value of the field

Below is the data I have; the schema for it is:
student_name, question_number, actual_result (either false/Correct)
(b,q1,Correct)
(a,q1,false)
(b,q2,Correct)
(a,q2,false)
(b,q3,false)
(a,q3,Correct)
(b,q4,false)
(a,q4,false)
(b,q5,false)
(a,q5,false)
What I want is to get, for each student (a/b), the count of total
correct and false answers he/she has given.
For the use case shared, the Pig script below suffices.
Pig Script :
student_data = LOAD 'student_data.csv' USING PigStorage(',') AS (student_name:chararray, question_number:chararray, actual_result:chararray);
student_data_grp = GROUP student_data BY student_name;
student_correct_answer_data = FOREACH student_data_grp {
answers = student_data.actual_result;
correct_answers = FILTER answers BY actual_result=='Correct';
incorrect_answers = FILTER answers BY actual_result=='false';
GENERATE group AS student_name, COUNT(correct_answers) AS correct_ans_count, COUNT(incorrect_answers) AS incorrect_ans_count ;
};
Input : student_data.csv :
b,q1,Correct
a,q1,false
b,q2,Correct
a,q2,false
b,q3,false
a,q3,Correct
b,q4,false
a,q4,false
b,q5,false
a,q5,false
Output : DUMP student_correct_answer_data :
-- schema : (student_name, correct_ans_count, incorrect_ans_count)
(a,1,4)
(b,2,3)
Ref : For more details on nested FOR EACH
http://pig.apache.org/docs/r0.12.0/basic.html#foreach
http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach
Use this:
data = LOAD '/abc.txt' USING PigStorage(',') AS (name:chararray, number:chararray,result:chararray);
B = GROUP data by (name,result);
C = foreach B generate FLATTEN(group) as (name,result), COUNT(data) as count;
and answer will be like:
(a,false,4)
(a,Correct,1)
(b,false,3)
(b,Correct,2)
Hope this is the output you are looking for.
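If you would rather have the two counts side by side per student, the (name, result, count) rows above can be pivoted with a nested FOREACH; a sketch reusing relation C from the script above:

```pig
byName = GROUP C BY name;
pivoted = FOREACH byName {
    -- Separate the per-result count rows, then sum each side.
    correct = FILTER C BY result == 'Correct';
    wrong   = FILTER C BY result == 'false';
    GENERATE group AS name,
             SUM(correct.count) AS correct_ans_count,
             SUM(wrong.count)   AS incorrect_ans_count;
};
```

(SUM over an empty bag yields null, so a student with no correct answers would show null rather than 0.)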

Apache Pig - How to extract sets of records

I'm a new user of Apache Pig. I have the data below:
order=0012,1,23
order=0013,2,34,0015,1,45
order=0011,1,456
...
I want to extract the records below:
0012,1,23
0013,2,34
0015,1,45
0011,1,456
...
Below is the code I've tried:
a = LOAD 'a.txt' Using TextLoader() AS (line:chararray);
b = FOREACH a GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, 'order=((\\d+),(\\d+),(\\d+))+')) AS
(
order_item:chararray,
order_pid: chararray,
order_qty: chararray,
order_price: chararray
);
It doesn't work.
Another attempt, saving into a bag:
a = LOAD 'a.txt' Using TextLoader() AS (line:chararray);
b = FOREACH a GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, 'order=((\\d+),(\\d+),(\\d+))+')) AS
(
B: bag { T: tuple(
order_pid: chararray,
order_qty: chararray,
order_price: chararray
)}
);
Still doesn't work.
Can you try this?
input:
order=0012,1,23
order=0013,2,34,0015,1,45
order=0011,1,456
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(REGEX_EXTRACT(line,'order=(.*)',1),','));
C = FOREACH B GENERATE FLATTEN(TOBAG(TOTUPLE($0..$2),TOTUPLE($3..$5)));
D = FILTER C BY $0 is not null;
DUMP D;
Output:
(0012,1,23)
(0013,2,34)
(0015,1,45)
(0011,1,456)

Loading from mysqldump with PIG

I have a mysqldump of the format:
INSERT INTO `MY_TABLE` VALUES (893024968,'342903068923468','o03gj8ip234qgj9u23q59u','testing123','HTTP','1','4213883b49b74d3eb9bd57b7','blahblash','2011-04-19 00:00:00','448','206',NULL,'GG');
How do I load this data using pig? I have tried;
A = LOAD 'pig-test/test.log' USING PigStorage(',') AS (ID: chararray, USER_ID: chararray, TOKEN: chararray, NODE: chararray, CHANNEL: chararray, CODE: float, KEY: chararray, AGENT: chararray, TIME: chararray, DURATION: float, RESPONSE: chararray, MESSAGE: chararray, TARGET: chararray);
Using , as a delimiter works fine, but I want the ID to be an int, and I cannot figure out how to chop off the leading "INSERT INTO `MY_TABLE` VALUES (" and the trailing ");" when loading.
Also how should I load datetime information so that I can query it?
Any help you can give would be great.
You could load each record as a single line of text and then extract the fields with a regex, using MyRegExLoader or REGEX_EXTRACT_ALL:
A = LOAD 'data' AS (record: CHARARRAY);
B = FOREACH A GENERATE REGEX_EXTRACT_ALL(record, 'INSERT INTO...., \'(\\d+)\', ...');
It is kind of a hack, but you can use REPLACE to chop off the extra text too. Note that REPLACE takes a regular expression, so the parentheses must be escaped:
B = FOREACH A
GENERATE
(int) REPLACE(ID, 'INSERT INTO MY_TABLE VALUES \\(', ''),
...
REPLACE(TARGET, '\\);', '');
Currently there is a problem with semicolons, so you might need to do your own REPLACE for those as well.
There is no native date type in Pig, but you can juggle the date utilities in PiggyBank, or build your own UDF to convert the dates to a Unix long.
Another way would be a simple script (Python, ...) that prepares the data for loading.
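A sketch combining these ideas for this dump format (the column handling is abbreviated to the first two fields; REPLACE takes a regex, hence the escaped parentheses):

```pig
A = LOAD 'pig-test/test.log' USING TextLoader() AS (line:chararray);
-- Strip the 'INSERT INTO `MY_TABLE` VALUES (' prefix and the ');' suffix.
B = FOREACH A GENERATE
    REPLACE(line, 'INSERT INTO `MY_TABLE` VALUES \\(|\\);', '') AS vals;
-- Split on commas; quoted fields keep their single quotes.
C = FOREACH B GENERATE FLATTEN(STRSPLIT(vals, ','));
-- The leading field is unquoted in the dump, so it can now be cast.
D = FOREACH C GENERATE (int)$0 AS ID, $1 AS USER_ID;
```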
