Apache Pig - How to extract sets of records - hadoop

I'm a new Apache Pig user. I have the data below:
order=0012,1,23
order=0013,2,34,0015,1,45
order=0011,1,456
...
I am trying to extract the following records:
0012,1,23
0013,2,34
0015,1,45
0011,1,456
...
Below is the code that I've tried:
a = LOAD 'a.txt' Using TextLoader() AS (line:chararray);
b = FOREACH a GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, 'order=((\\d+),(\\d+),(\\d+))+')) AS
(
order_item:chararray,
order_pid: chararray,
order_qty: chararray,
order_price: chararray
);
It doesn't work.
I also tried capturing the matches into a bag:
a = LOAD 'a.txt' Using TextLoader() AS (line:chararray);
b = FOREACH a GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, 'order=((\\d+),(\\d+),(\\d+))+')) AS
(
B: bag { T: tuple(
order_pid: chararray,
order_qty: chararray,
order_price: char array
)}
);
This still doesn't work.

Can you try this?
input:
order=0012,1,23
order=0013,2,34,0015,1,45
order=0011,1,456
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(REGEX_EXTRACT(line,'order=(.*)',1),','));
C = FOREACH B GENERATE FLATTEN(TOBAG(TOTUPLE($0..$2),TOTUPLE($3..$5)));
D = FILTER C BY $0 is not null;
DUMP D;
Output:
(0012,1,23)
(0013,2,34)
(0015,1,45)
(0011,1,456)
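Note that the script above builds exactly two tuples per line, so it handles at most two items per order. One possible generalization, offered as a sketch rather than a tested solution, is to mark the end of each triple with a separator and then tokenize on it. It assumes every item is a digits,digits,digits triple, that '|' never occurs in the data, and that your Pig version supports TOKENIZE with a delimiter argument. The aliases and field names (items, item, order_pid, order_qty, order_price) are illustrative.

```pig
A = LOAD 'input' AS (line:chararray);
-- Keep everything after 'order='.
B = FOREACH A GENERATE REGEX_EXTRACT(line, 'order=(.*)', 1) AS items;
-- Append '|' after each triple, then split on '|' so each triple becomes one row.
C = FOREACH B GENERATE
    FLATTEN(TOKENIZE(REPLACE(items, '(\\d+,\\d+,\\d+),?', '$1|'), '|')) AS item;
-- Split each triple into its three fields.
D = FOREACH C GENERATE
    FLATTEN(STRSPLIT(item, ',', 3))
    AS (order_pid:chararray, order_qty:chararray, order_price:chararray);
DUMP D;
```

Because every triple becomes its own row, no null filter is needed and the number of items per order is not fixed in advance.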

Related

I am getting an error while using FILTER in Pig; when I dump the result, it fails

The code used in Pig is:
studentsR = LOAD 'hdfs://quickstart.cloudera:8020/students/students' using PigStorage() as (name:chararray,rollno:int);
resultR = LOAD 'hdfs://quickstart.cloudera:8020/students/results' using PigStorage() as (rollno:int,result:chararray);
joniR = JOIN studentsR BY rollno,resultR BY rollno;
filterR = FOREACH joniR GENERATE (studentsR::name,studentsR::rollno,resultR::result) ;
filterRPass = FILTER filterR BY resultR.result == 'pass';
dump filterRPass;
The error is as follows:
ERROR 0: Scalar has more than one row in the output. 1st : (1,fail), 2nd :(2,fail)
Try DUMP and DESCRIBE on every intermediate result to see the output and schema of each alias used.
Refer : scalar-has-more-than-one-row-in-the-output
studentsR = LOAD '/home/user/students' using PigStorage(' ') as (name:chararray,rollno:int);
dump studentsR;
resultR = LOAD '/home/user/results' using PigStorage(' ') as (rollno:int,result:chararray);
dump resultR;
joniR = JOIN studentsR BY rollno,resultR BY rollno;
dump joniR;
filterR = FOREACH joniR GENERATE studentsR::name,studentsR::rollno,resultR::result;
dump filterR;
filterRPass = FILTER filterR BY resultR::result == 'pass';
dump filterRPass;
Modifications:
I used a space as the delimiter in the input files, so I used PigStorage(' ').
In filterR I removed the opening and closing parentheses () around studentsR::name,studentsR::rollno,resultR::result, because they wrap the fields in an extra tuple, which shows up as additional parentheses in the dump output.
grunt> filterR = FOREACH joniR GENERATE (studentsR::name,studentsR::rollno,resultR::result);
grunt> describe filterR;
filterR: {org.apache.pig.builtin.totuple_studentsR::name_100: (studentsR::name: chararray,studentsR::rollno: int,resultR::result: chararray)}
grunt> filterR = FOREACH joniR GENERATE studentsR::name,studentsR::rollno,resultR::result;
grunt> describe filterR;
filterR: {studentsR::name: chararray,studentsR::rollno: int,resultR::result: chararray}
I used resultR::result instead of resultR.result in filterRPass.
I have used a local set of files and executed pig in local mode for testing.
cat students
a 1
b 2
c 3
cat results
3 pass
2 fail
5 pass
Dump results:
dump studentsR
(a,1)
(b,2)
(c,3)
dump resultR
(3,pass)
(2,fail)
(5,pass)
dump joniR
(b,2,2,fail)
(c,3,3,pass)
dump filterR --filterR = FOREACH joniR GENERATE (studentsR::name,studentsR::rollno,resultR::result);
((b,2,fail))
((c,3,pass))
dump filterR --filterR = FOREACH joniR GENERATE studentsR::name,studentsR::rollno,resultR::result;
(b,2,fail)
(c,3,pass)
dump filterRPass; --filterRPass = FILTER filterR BY resultR::result == 'pass'; --or-- filterRPass = FILTER filterR BY $2 == 'pass';
(c,3,pass)
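For context: writing resultR.result inside a statement over a different relation is Pig's relation-to-scalar projection, which is only valid when the referenced relation has exactly one row. That is why the error message quotes two rows of resultR. A minimal sketch of a legitimate scalar use follows; the aliases allResults, resultCount and withCount are hypothetical and not part of the original question.

```pig
studentsR = LOAD '/home/user/students' USING PigStorage(' ') AS (name:chararray, rollno:int);
resultR = LOAD '/home/user/results' USING PigStorage(' ') AS (rollno:int, result:chararray);
-- Collapse resultR to a single row so it can be used as a scalar.
allResults = GROUP resultR ALL;
resultCount = FOREACH allResults GENERATE COUNT(resultR) AS n;
-- resultCount has exactly one row, so resultCount.n is a valid scalar projection.
withCount = FOREACH studentsR GENERATE name, rollno, resultCount.n AS total_results;
DUMP withCount;
```

To refer to a column that came from a join, use the :: form (resultR::result) or a positional reference such as $2, as in the answer above.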

How to merge rows (items) of same relation in Apache Pig

I'm new to apache pig.
I have data like below.
tempdata =
(linsys4f-PORT42-0211201516244460,dnis=3007047505)
(linsys4f-PORT42-0211201516244460,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC)
(linsys4f-PORT42-0211201516244460,language=ENGLISH)
(linsys4f-PORT42-0211201516244460,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019)
(linsys4f-PORT43-0211201516245465,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019)
(linsys4f-PORT44-0211201516291287,dnis=3007047505)
(linsys4f-PORT44-0211201516291287,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC)
I need to merge the rows by key, that is, linsys4f-PORT42-0211201516244460, linsys4f-PORT43-0211201516245465 and linsys4f-PORT44-0211201516291287.
The output should look like:
(linsys4f-PORT42-0211201516244460,dnis=3007047505,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC,language=ENGLISH,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019)
(linsys4f-PORT43-0211201516245465,dnis=3007047505,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC,language=SPANISH)
(linsys4f-PORT43-0211201516245465,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019,dnis=3007047505,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC).
How can I merge these rows? Any help will be appreciated.
Try using the GROUP BY operator and FLATTEN to solve this.
I have separated your first field into link, port name, and port ID for a clearer picture.
A = LOAD '/home/coe_user_1/del/data.txt' USING PigStorage(',') AS
(port : CHARARRAY, dnis : CHARARRAY, incoming_tfn : CHARARRAY, tfn_location : CHARARRAY, ivr_location : CHARARRAY,state : CHARARRAY, language : CHARARRAY, outcome : CHARARRAY, exitType : CHARARRAY, exitState : CHARARRAY);
B = FOREACH A GENERATE
FLATTEN(STRSPLIT(port, '-', 3)) as (link: chararray, port: chararray, pid: int),
dnis AS dnis,
incoming_tfn AS incoming_tfn,
tfn_location AS tfn_location,
ivr_location AS ivr_location,
state AS state,
language AS language,
outcome AS outcome,
exitType AS exitType,
exitState AS exitState;
C = FOREACH B GENERATE
port AS port,
--pid AS pid,
dnis AS dnis,
incoming_tfn AS incoming_tfn,
tfn_location AS tfn_location,
ivr_location AS ivr_location,
state AS state,
language AS language,
outcome AS outcome,
exitType AS exitType,
exitState AS exitState;
D = GROUP C BY port;
E = FOREACH D GENERATE
group AS port,
FLATTEN(BagToTuple(C.dnis)) AS dnis,
FLATTEN(BagToTuple(C.incoming_tfn)) AS incoming_tfn,
FLATTEN(BagToTuple(C.tfn_location)) AS tfn_location,
FLATTEN(BagToTuple(C.ivr_location)) AS ivr_location,
FLATTEN(BagToTuple(C.state)) AS state,
FLATTEN(BagToTuple(C.language)) AS language,
FLATTEN(BagToTuple(C.outcome)) AS outcome,
FLATTEN(BagToTuple(C.exitType)) AS exitType,
FLATTEN(BagToTuple(C.exitState)) AS exitState;
DUMP E;
Output:
(PORT42,outcome=Transfer to CSR,language=ENGLISH,incoming_tfn=8778816235,dnis=3007047505,exitType=Transfer,,tfn_location=Ashburn Avaya,,exitState=SETDIR2^7990019,,ivr_location=Ashburn Avaya,,,,state=NC,,,,,,,,,,,,,,,,,,,,,)
(PORT43,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019,,,,,,)
(PORT44,incoming_tfn=8778816235,dnis=3007047505,tfn_location=Ashburn Avaya,,ivr_location=Ashburn Avaya,,state=NC,,,,,,,,,,,)
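The output above contains many empty fields because each row fills only some of the positional columns. An alternative sketch, under stated assumptions, keeps the full key and joins the attribute strings of each key into one comma-separated value. It assumes the raw file has lines of the form key,attr=value[,attr=value...] without the surrounding parentheses, and that BagToString is available in your Pig version. The aliases call_id and attrs are illustrative.

```pig
A = LOAD '/home/coe_user_1/del/data.txt' USING TextLoader() AS (line:chararray);
-- Split each line at the first comma into the key and the remaining attributes.
B = FOREACH A GENERATE
    REGEX_EXTRACT(line, '([^,]+),(.*)', 1) AS call_id,
    REGEX_EXTRACT(line, '([^,]+),(.*)', 2) AS attrs;
C = GROUP B BY call_id;
-- Join all attribute strings of one key with commas.
-- The order of rows inside a bag is not guaranteed.
D = FOREACH C GENERATE group AS call_id, BagToString(B.attrs, ',') AS attrs;
DUMP D;
```

This yields one row per key with two fields, the key and the merged attribute string, which is closer in shape to the output requested in the question.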

Data transformation using pig

I have a CSV file with two variables, salary and bonus, that I need to add together. The salary values contain commas, and the addition does not work in Pig. I also tried casting. Below is the screenshot of the dataset:
I used the Pig script below:
register /home/ravimishra/piggybank-0.15.0.jar;
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
emp_details_header = LOAD 'data/employee.csv' USING CSVLoader AS (id: int, name: chararray, address: chararray, occupation: chararray,salary: chararray,bonus: double);
ranked = rank emp_details_header;
NoHeader = Filter ranked by (rank_emp_details_header > 1);
B = FOREACH NoHeader GENERATE id,name,address,occupation, (double)salary + bonus as total ;
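A likely cause is that a string such as 1,200 cannot be cast to double, so the cast yields null and the sum is null as well. The screenshot is not available here, so this is an assumption about the data. A minimal sketch that strips the commas with REPLACE before casting, reusing the aliases from the script above:

```pig
register /home/ravimishra/piggybank-0.15.0.jar;
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
emp_details_header = LOAD 'data/employee.csv' USING CSVLoader AS (id: int, name: chararray, address: chararray, occupation: chararray, salary: chararray, bonus: double);
ranked = rank emp_details_header;
NoHeader = Filter ranked by (rank_emp_details_header > 1);
-- Remove the thousands separators from salary, then cast and add.
B = FOREACH NoHeader GENERATE id, name, address, occupation,
    (double)REPLACE(salary, ',', '') + bonus AS total;
DUMP B;
```

If bonus is null for a row, total will still be null for that row.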

How to access each element of Date in Pig Latin?

Query:
records = LOAD 'input' using PigStorage(' ') as (id:int, name:chararray, desination:chararray, date:chararray, salary: long);
Sample input:
(10102,neha,developer,14/02/13,32000)
(10103,deva,admin,02/02/14,40000)
(10102,neha,developer,01/01/14,45000)
(10245,sasi,developer,01/01/14,20000)
(10109,surya,manager,01/02/2014,56000)
(10102,neha,developer,01/02/2014,45000)
(10245,sasi,developer,02/01/2014,25000)
I want to filter the above data based on the year of the date (not the entire date).
Check if this works for you.
records = LOAD '/home/abhijit/Downloads/movies.txt' using PigStorage(',') as (id:int, name:chararray, desination:chararray, date:chararray, salary:int);
-- Parse the day/month/year strings into DateTime values.
todate_data = foreach records generate name,desination,ToDate(date,'dd/MM/yyyy') as (date_time:DateTime );
getyear_data = foreach todate_data generate name,desination,GetYear(date_time);
-- The year is the third field of getyear_data, i.e. $2 (there is no $3).
groupByYear = group getyear_data by $2;
The final output will be :
(2013,{(neha,developer,2013)})
(2014,{(deva,admin,2014),(neha,developer,2014),(sasi,developer,2014),(surya,manager,2014),(neha,developer,2014),(sasi,developer,2014)})
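The script above groups by year. To filter on a single year instead, as the question asks, GetYear can be used directly in a FILTER. This is a sketch that assumes comma-delimited input and dates in dd/MM/yyyy form. Two-digit years such as 14/02/13 may not be read as 2013 under this pattern and may need normalizing first. The aliases with_date and only_2014 are illustrative.

```pig
records = LOAD 'input' USING PigStorage(',') AS (id:int, name:chararray, desination:chararray, date:chararray, salary:long);
-- Convert the date string to a DateTime so that year functions can be applied.
with_date = FOREACH records GENERATE id, name, desination, ToDate(date, 'dd/MM/yyyy') AS date_time, salary;
-- Keep only the rows whose year is 2014.
only_2014 = FILTER with_date BY GetYear(date_time) == 2014;
DUMP only_2014;
```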

SUM, AVG, in Pig are not working

I am analyzing cluster user log files with the following Pig code:
t_data = load 'log_flies/*' using PigStorage(',');
A = foreach t_data generate $0 as (jobid:int),
$1 as (indexid:int), $2 as (clusterid:int), $6 as (user:chararray),
$7 as (stat:chararray), $13 as (queue:chararray), $32 as (projectName:chararray), $52 as (cpu_used:float), $55 as (efficiency:float), $59 as (numThreads:int),
$61 as (numNodes:int), $62 as (numCPU:int),$72 as (comTime:int),
$73 as (penTime:int), $75 as (runTime:int), $52/($62*$75) as (allEff: float), SUBSTRING($68, 0, 11) as (endTime: chararray);
---describe A;
A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
B = group A by user;
f_data = foreach B {
grp = group;
count = COUNT(A);
avg = AVG(A.cpu_used);
generate FLATTEN(grp), count, avg;
};
f_data = limit f_data 10;
dump f_data;
The code works for GROUP and COUNT, but when I include AVG and SUM it shows this error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias f_data
I checked the data types and they all look fine. Do you have any suggestions about what I missed? Thank you in advance for your help.
It's a syntax error. Read http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach (section: Nested foreach) for details.
Pig Script
A = LOAD 'a.csv' USING PigStorage(',') AS (user:chararray, cpu_used:float);
B = GROUP A BY user;
C = FOREACH B {
cpu_used_bag = A.cpu_used;
GENERATE group AS user, AVG(cpu_used_bag) AS avg_cpu_used, SUM(cpu_used_bag) AS total_cpu_used;
};
Input : a.csv
a,3
a,4
b,5
Output :
(a,3.5,7.0)
(b,5.0,5.0)
Your Pig script has several problems:
Do not use the same alias on both sides of =.
Declare your schema at load time: using PigStorage(',') as (your schema).
A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
Change this to:
F = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
f_data = limit f_data 10;
Change the f_data on the left-hand side to some other name.
Keep the script simple.
General rules for debugging a Pig script:
run in local mode
dump after every line
I wrote a sample Pig script to mimic yours (working):
t_data = load './file' using PigStorage(',') as (jobid:int,cpu_used:float);
C = foreach t_data generate jobid, cpu_used ;
B = group C by jobid ;
f_data = foreach B {
count = COUNT(C);
sum = SUM(C.cpu_used);
avg = AVG(C.cpu_used);
generate FLATTEN(group), count,sum,avg;
};
never_f_data = limit f_data 10;
dump never_f_data;
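Applied to the original log data, the same pattern might look like the sketch below. It assumes the problem is type-related: fields loaded without a schema are bytearrays, so they are cast explicitly before aggregating. This is a guess at the cause, not a confirmed diagnosis. The column positions $6 and $52 come from the question; the aliases jobs, avg_cpu, total_cpu and top10 are illustrative.

```pig
t_data = LOAD 'log_flies/*' USING PigStorage(',');
-- Cast only the fields that are needed, since no schema was declared at load time.
A = FOREACH t_data GENERATE (chararray)$6 AS user, (float)$52 AS cpu_used;
B = GROUP A BY user;
f_data = FOREACH B GENERATE
    group AS user,
    COUNT(A) AS jobs,
    AVG(A.cpu_used) AS avg_cpu,
    SUM(A.cpu_used) AS total_cpu;
top10 = LIMIT f_data 10;
DUMP top10;
```

Running DESCRIBE on A before grouping is a quick way to confirm that cpu_used really has a numeric type.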
