Apache Pig - How to extract sets of records


I'm a new Apache Pig user. I have the data below:
order=0012,1,23
order=0013,2,34,0015,1,45
order=0011,1,456
...
I want to extract it into the records below:
0012,1,23
0013,2,34
0015,1,45
0011,1,456
...
Below is the code that I've tried:
a = LOAD 'a.txt' Using TextLoader() AS (line:chararray);
b = FOREACH a GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, 'order=((\\d+),(\\d+),(\\d+))+')) AS
(
order_item:chararray,
order_pid: chararray,
order_qty: chararray,
order_price: chararray
);
It doesn't work.
Another attempt, saving into a bag:
a = LOAD 'a.txt' Using TextLoader() AS (line:chararray);
b = FOREACH a GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, 'order=((\\d+),(\\d+),(\\d+))+')) AS
(
B: bag { T: tuple(
order_pid: chararray,
order_qty: chararray,
order_price: chararray
)}
);
Still doesn't work.

Can you try this?
input:
order=0012,1,23
order=0013,2,34,0015,1,45
order=0011,1,456
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(REGEX_EXTRACT(line,'order=(.*)',1),','));
C = FOREACH B GENERATE FLATTEN(TOBAG(TOTUPLE($0..$2),TOTUPLE($3..$5)));
D = FILTER C BY $0 is not null;
DUMP D;
Output:
(0012,1,23)
(0013,2,34)
(0015,1,45)
(0011,1,456)
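
Note that the script above builds exactly two tuples per line, so only the first two orders on a line are kept. The underlying idea, splitting the comma-separated values into consecutive groups of three, can be checked outside Pig. Below is a minimal Python sketch of that chunking logic (not Pig code). The function name extract_orders is made up for illustration, and it assumes every order has exactly three fields.

```python
def extract_orders(line, width=3):
    """Split 'order=a,b,c,d,e,f' into [('a','b','c'), ('d','e','f')]."""
    prefix = "order="
    if not line.startswith(prefix):
        return []
    fields = line[len(prefix):].split(",")
    # Keep only complete groups of `width` fields (assumption: 3 per order).
    usable = len(fields) - len(fields) % width
    return [tuple(fields[i:i + width]) for i in range(0, usable, width)]


lines = ["order=0012,1,23", "order=0013,2,34,0015,1,45", "order=0011,1,456"]
records = [rec for line in lines for rec in extract_orders(line)]
for rec in records:
    print(",".join(rec))
```

Because the loop steps through the fields in groups of three, a line with three or more orders is handled the same way as a line with one.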

Related

I am getting an error while using FILTER in Pig; when I dump the result it gives an error

The code used in Pig is:
studentsR = LOAD 'hdfs://quickstart.cloudera:8020/students/students' using PigStorage() as (name:chararray,rollno:int);
resultR = LOAD 'hdfs://quickstart.cloudera:8020/students/results' using PigStorage() as (rollno:int,result:chararray);
joniR = JOIN studentsR BY rollno,resultR BY rollno;
filterR = FOREACH joniR GENERATE (studentsR::name,studentsR::rollno,resultR::result) ;
filterRPass = FILTER filterR BY resultR.result == 'pass';
dump filterRPass;
The error is as below:
ERROR 0: Scalar has more than one row in the output. 1st : (1,fail), 2nd :(2,fail)

Try DUMP and DESCRIBE on every result set to see the output of each alias used.
Refer to: scalar-has-more-than-one-row-in-the-output
studentsR = LOAD '/home/user/students' using PigStorage(' ') as (name:chararray,rollno:int);
dump studentsR;
resultR = LOAD '/home/user/results' using PigStorage(' ') as (rollno:int,result:chararray);
dump resultR;
joniR = JOIN studentsR BY rollno,resultR BY rollno;
dump joniR;
filterR = FOREACH joniR GENERATE studentsR::name,studentsR::rollno,resultR::result;
dump filterR;
filterRPass = FILTER filterR BY resultR::result == 'pass';
dump filterRPass;
Modifications:
I used a space as the delimiter in the input files, so I used PigStorage(' ').
In filterR I removed the opening and closing parentheses () around studentsR::name,studentsR::rollno,resultR::result, since the dump output otherwise had an additional level of parentheses.
grunt> filterR = FOREACH joniR GENERATE (studentsR::name,studentsR::rollno,resultR::result);
grunt> describe filterR;
filterR: {org.apache.pig.builtin.totuple_studentsR::name_100: (studentsR::name: chararray,studentsR::rollno: int,resultR::result: chararray)}
grunt> filterR = FOREACH joniR GENERATE studentsR::name,studentsR::rollno,resultR::result;
grunt> describe filterR;
filterR: {studentsR::name: chararray,studentsR::rollno: int,resultR::result: chararray}
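I used resultR::result instead of resultR.result in filterRPass. The dot form resultR.result is treated as a scalar projection of the whole resultR relation, which must contain exactly one row; that is what triggers the "Scalar has more than one row" error.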
I have used a local set of files and executed pig in local mode for testing.
cat students
a 1
b 2
c 3
cat results
3 pass
2 fail
5 pass
Dump results:
dump studentsR
(a,1)
(b,2)
(c,3)
dump resultR
(3,pass)
(2,fail)
(5,pass)
dump joniR
(b,2,2,fail)
(c,3,3,pass)
dump filterR --filterR = FOREACH joniR GENERATE (studentsR::name,studentsR::rollno,resultR::result);
((b,2,fail))
((c,3,pass))
dump filterR --filterR = FOREACH joniR GENERATE studentsR::name,studentsR::rollno,resultR::result;
(b,2,fail)
(c,3,pass)
dump filterRPass; --filterRPass = FILTER filterR BY resultR::result == 'pass'; --or-- filterRPass = FILTER filterR BY $2 == 'pass';
(c,3,pass)

How to merge rows (items) of same relation in Apache Pig

I'm new to Apache Pig.
I have data like below.
tempdata =
(linsys4f-PORT42-0211201516244460,dnis=3007047505)
(linsys4f-PORT42-0211201516244460,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC)
(linsys4f-PORT42-0211201516244460,language=ENGLISH)
(linsys4f-PORT42-0211201516244460,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019)
(linsys4f-PORT43-0211201516245465,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019)
(linsys4f-PORT44-0211201516291287,dnis=3007047505)
(linsys4f-PORT44-0211201516291287,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC)
I need to merge the rows according to the key, that is linsys4f-PORT42-0211201516244460, linsys4f-PORT43-0211201516245465 and linsys4f-PORT44-0211201516291287.
The output should look like:
(linsys4f-PORT42-0211201516244460,dnis=3007047505,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC,language=ENGLISH,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019)
(linsys4f-PORT43-0211201516245465,dnis=3007047505,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC,language=SPANISH)
(linsys4f-PORT43-0211201516245465,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019,dnis=3007047505,incoming_tfn=8778816235,tfn_location=Ashburn Avaya,ivr_location=Ashburn Avaya,state=NC).
How can I merge these? Any help will be appreciated.

Try using the GROUP BY operator and FLATTEN to solve this.
I have separated your first field into link, port name, and port ID for a clearer picture:
A = LOAD '/home/coe_user_1/del/data.txt' USING PigStorage(',') AS
(port : CHARARRAY, dnis : CHARARRAY, incoming_tfn : CHARARRAY, tfn_location : CHARARRAY, ivr_location : CHARARRAY,state : CHARARRAY, language : CHARARRAY, outcome : CHARARRAY, exitType : CHARARRAY, exitState : CHARARRAY);
B = FOREACH A GENERATE
FLATTEN(STRSPLIT(port, '-', 3)) as (link: chararray, port: chararray, pid: int),
dnis AS dnis,
incoming_tfn AS incoming_tfn,
tfn_location AS tfn_location,
ivr_location AS ivr_location,
state AS state,
language AS language,
outcome AS outcome,
exitType AS exitType,
exitState AS exitState;
C = FOREACH B GENERATE
port AS port,
--pid AS pid,
dnis AS dnis,
incoming_tfn AS incoming_tfn,
tfn_location AS tfn_location,
ivr_location AS ivr_location,
state AS state,
language AS language,
outcome AS outcome,
exitType AS exitType,
exitState AS exitState;
D = GROUP C BY port;
E = FOREACH D GENERATE
group AS port,FLATTEN(BagToTuple(C.dnis)) AS dnis, FLATTEN(BagToTuple(C.incoming_tfn)) AS incoming_tfn, FLATTEN(BagToTuple(C.tfn_location)) AS tfn_location, FLATTEN(BagToTuple(C.ivr_location)) AS ivr_location ,FLATTEN(BagToTuple(C.state)) AS state,FLATTEN(BagToTuple(C.language)) AS language, FLATTEN(BagToTuple(C.outcome)) AS outcome,FLATTEN(BagToTuple(C.exitType)) AS exitType,FLATTEN(BagToTuple(C.exitState)) AS exitState ;
DUMP E;
Output:
(PORT42,outcome=Transfer to CSR,language=ENGLISH,incoming_tfn=8778816235,dnis=3007047505,exitType=Transfer,,tfn_location=Ashburn Avaya,,exitState=SETDIR2^7990019,,ivr_location=Ashburn Avaya,,,,state=NC,,,,,,,,,,,,,,,,,,,,,)
(PORT43,outcome=Transfer to CSR,exitType=Transfer,exitState=SETDIR2^7990019,,,,,,)
(PORT44,incoming_tfn=8778816235,dnis=3007047505,tfn_location=Ashburn Avaya,,ivr_location=Ashburn Avaya,,state=NC,,,,,,,,,,,)
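
The output above still carries empty fields for the columns a row did not have. The merge the question asks for, one row per key followed by every key=value pair seen for that key, can be sketched in plain Python to make the target shape concrete. This is an illustration rather than a Pig solution: merge_rows is a hypothetical helper, and the sample rows are a shortened subset of the question's data.

```python
from collections import OrderedDict


def merge_rows(rows):
    """Group rows by their first field and concatenate the remaining fields."""
    merged = OrderedDict()
    for row in rows:
        key, *pairs = row.split(",")
        # Append this row's key=value pairs to whatever the key already has.
        merged.setdefault(key, []).extend(pairs)
    return [",".join([key] + pairs) for key, pairs in merged.items()]


rows = [
    "linsys4f-PORT42-0211201516244460,dnis=3007047505",
    "linsys4f-PORT42-0211201516244460,language=ENGLISH",
    "linsys4f-PORT43-0211201516245465,outcome=Transfer to CSR,exitType=Transfer",
    "linsys4f-PORT44-0211201516291287,dnis=3007047505",
]
merged = merge_rows(rows)
for line in merged:
    print(line)
```

Keeping the pairs in a list per key preserves the order in which they were seen and never emits an empty field.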

Data transformation using pig

I have a CSV file with two variables, salary and bonus, that I need to add together. The salary values contain commas, and the addition does not work in Pig. I also tried casting. Below is the screenshot of the dataset:
I used the Pig script below:
register /home/ravimishra/piggybank-0.15.0.jar;
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
emp_details_header = LOAD 'data/employee.csv' USING CSVLoader AS (id: int, name: chararray, address: chararray, occupation: chararray,salary: chararray,bonus: double);
ranked = rank emp_details_header;
NoHeader = Filter ranked by (rank_emp_details_header > 1);
B = FOREACH NoHeader GENERATE id,name,address,occupation, (double)salary + bonus as total ;
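
One likely cause, assuming the salary values carry thousands separators such as 1,200, is that a string containing a comma cannot be cast to a number directly, so the separator has to be removed before the cast. In Pig the built-in REPLACE function can strip it first. The same step is sketched in Python below so the arithmetic can be checked; the sample values and the helper name total_pay are invented for illustration.

```python
def total_pay(salary_text, bonus):
    """Strip thousands separators from the salary string, then add the bonus."""
    salary = float(salary_text.replace(",", ""))
    return salary + bonus


# Invented sample rows: (salary as text, bonus as number).
rows = [("1,200", 100.0), ("25,000", 1500.5), ("900", 0.0)]
totals = [total_pay(salary, bonus) for salary, bonus in rows]
print(totals)
```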

How to access each element of Date in Pig Latin?

Query:
records = LOAD 'input' using PigStorage(' ') as (id:int, name:chararray, desination:chararray, date:chararray, salary: long);
Sample input:
(10102,neha,developer,14/02/13,32000)
(10103,deva,admin,02/02/14,40000)
(10102,neha,developer,01/01/14,45000)
(10245,sasi,developer,01/01/14,20000)
(10109,surya,manager,01/02/2014,56000)
(10102,neha,developer,01/02/2014,45000)
(10245,sasi,developer,02/01/2014,25000)
I want to filter the above data based on the year of the date (not the entire date).

Check if this works for you.
records = LOAD '/home/abhijit/Downloads/movies.txt' using PigStorage(',') as (id:int, name:chararray, desination:chararray, date:chararray, salary:int);
todate_data = foreach records generate id,name,desination,date, salary,ToDate(date,'yyyy/MM/dd HH:mm:ss') as (date_time:DateTime );
todate_data = foreach records generate name,desination,ToDate(date,'dd/MM/yyyy') as (date_time:DateTime );
getyear_data = foreach todate_data generate name,desination,GetYear(date_time);
groupByYear = group getyear_data by $2;
The final output will be:
(2013,{(neha,developer,2013)})
(2014,{(deva,admin,2014),(neha,developer,2014),(sasi,developer,2014),(surya,manager,2014),(neha,developer,2014),(sasi,developer,2014)})
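
The sample dates mix two-digit and four-digit years (14/02/13 and 01/02/2014), so a single pattern may not fit every row. The Python sketch below (not Pig code) shows one way to normalise both forms before taking the year. It relies on Python's %y convention, which maps 13 and 14 to 2013 and 2014, and year_of is a hypothetical helper.

```python
from datetime import datetime


def year_of(date_text):
    """Return the year of a dd/MM/yyyy or dd/MM/yy date string."""
    # Try the four-digit pattern first; fall back to the two-digit one.
    for pattern in ("%d/%m/%Y", "%d/%m/%y"):
        try:
            return datetime.strptime(date_text, pattern).year
        except ValueError:
            continue
    raise ValueError("unrecognised date: " + date_text)


records = [
    ("neha", "developer", "14/02/13"),
    ("deva", "admin", "02/02/14"),
    ("surya", "manager", "01/02/2014"),
]
in_2014 = [r for r in records if year_of(r[2]) == 2014]
print(in_2014)
```

Filtering on the extracted year, rather than grouping, is what the question asks for; grouping by the same value gives the output shown above.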

SUM, AVG, in Pig are not working

I am analyzing Cluster user log files with the following code in pig:
t_data = load 'log_flies/*' using PigStorage(',');
A = foreach t_data generate $0 as (jobid:int),
$1 as (indexid:int), $2 as (clusterid:int), $6 as (user:chararray),
$7 as (stat:chararray), $13 as (queue:chararray), $32 as (projectName:chararray), $52 as (cpu_used:float), $55 as (efficiency:float), $59 as (numThreads:int),
$61 as (numNodes:int), $62 as (numCPU:int),$72 as (comTime:int),
$73 as (penTime:int), $75 as (runTime:int), $52/($62*$75) as (allEff: float), SUBSTRING($68, 0, 11) as (endTime: chararray);
---describe A;
A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
B = group A by user;
f_data = foreach B {
grp = group;
count = COUNT(A);
avg = AVG(A.cpu_used);
generate FLATTEN(grp), count, avg;
};
f_data = limit f_data 10;
dump f_data;
The code works for GROUP and COUNT, but when I include AVG and SUM it shows this error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
iterator for alias f_data
I checked the data types and all are fine. Do you have any suggestions about what I missed? Thank you in advance for your help.

It's a syntax error. Read http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach (section: Nested foreach) for details.
Pig Script
A = LOAD 'a.csv' USING PigStorage(',') AS (user:chararray, cpu_used:float);
B = GROUP A BY user;
C = FOREACH B {
cpu_used_bag = A.cpu_used;
GENERATE group AS user, AVG(cpu_used_bag) AS avg_cpu_used, SUM(cpu_used_bag) AS total_cpu_used;
};
Input : a.csv
a,3
a,4
b,5
Output :
(a,3.5,7.0)
(b,5.0,5.0)
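
As a cross-check of the expected output, the same grouping can be reproduced in plain Python. The sketch below only illustrates what GROUP followed by AVG and SUM computes on the a.csv rows above; the variable names are made up.

```python
from collections import defaultdict

rows = [("a", 3.0), ("a", 4.0), ("b", 5.0)]  # contents of a.csv

# Collect the cpu_used values per user, like GROUP A BY user.
groups = defaultdict(list)
for user, cpu_used in rows:
    groups[user].append(cpu_used)

# One (user, average, total) tuple per group, like the GENERATE clause.
summary = [
    (user, sum(values) / len(values), sum(values))
    for user, values in sorted(groups.items())
]
print(summary)
```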

Your Pig script has several errors:
Do not use the same alias on both sides of =.
Declare your schema when loading: using PigStorage(',') as (your schema).
A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
Change this to:
F = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
f_data = limit f_data 10;
Change the f_data on the left to some other name.
Stop making your life complex.
General rules for debugging a Pig script:
Run in local mode.
Dump after every line.
I wrote a sample Pig script to mimic yours (working):
t_data = load './file' using PigStorage(',') as (jobid:int,cpu_used:float);
C = foreach t_data generate jobid, cpu_used ;
B = group C by jobid ;
f_data = foreach B {
count = COUNT(C);
sum = SUM(C.cpu_used);
avg = AVG(C.cpu_used);
generate FLATTEN(group), count,sum,avg;
};
never_f_data = limit f_data 10;
dump never_f_data;
