Pig: Pivoting & Sum 3 relations - hadoop

I have 3 different relations, shown below. I can get the output using UDFs, but I am looking for an implementation in plain Pig. I have looked at other forum posts but could not get a concrete idea of how to solve this problem.
Proc:
FN1,10
FN2,20
FN3,23
FN4,25
FN5,15
FN7,40
FN10,56
Rej:
FN1,12
FN2,13
FN3,33
FN6,60
FN8,23
FN9,44
FN10,4
AllFN:
FN1
FN2
FN3
FN4
FN5
FN6
FN7
FN8
FN9
FN10
Output required is:
FN1,10,12,22
FN2,20,13,33
FN3,23,33,56
FN4,25,0,25
FN5,15,0,15
FN6,0,60,60
FN7,40,0,40
FN8,0,23,23
FN9,0,44,44
FN10,56,4,60

Assuming your relations are in test.txt, test2.txt and test3.txt:
A = LOAD 'test.txt' using PigStorage(',');
B = LOAD 'test2.txt' using PigStorage(',');
C = LOAD 'test3.txt' using PigStorage(',');
D = COGROUP A by $0, B by $0;
E = COGROUP C by $0, D by $0;
F = FOREACH E generate $0, FLATTEN(D.A), FLATTEN(D.B);
G = FOREACH F generate $0, $1.$1, $2.$1;
H = FOREACH G generate $0, FLATTEN((IsEmpty($1)?null:$1)), FLATTEN((IsEmpty($2)?null:$2));
I = foreach H generate $0, ($1 is null?0:$1),($2 is null?0:$2),($1 is null?0:$1)+($2 is null?0:$2);
dump I;
Output
(FN1,10,12,22)
(FN2,20,13,33)
(FN3,23,33,56)
(FN4,25,0,25)
(FN5,15,0,15)
(FN6,0,60,60)
(FN7,40,0,40)
(FN8,0,23,23)
(FN9,0,44,44)
(FN10,56,4,60)

You can use COGROUP to achieve this
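To make clear what the COGROUP-based pivot computes, here is a minimal Python sketch of the same logic. It is a model, not Pig code. The dictionaries `proc` and `rej`, the list `all_fn`, and the function `pivot` are hypothetical stand-ins for the three relations in the question; a missing key plays the role of the empty bag that the Pig script replaces with 0.

```python
# Python model of the pivot: one row per FN with proc, rej, and their sum.
# proc, rej, all_fn, and pivot are illustrative names, not part of the Pig script.
proc = {"FN1": 10, "FN2": 20, "FN3": 23, "FN4": 25, "FN5": 15, "FN7": 40, "FN10": 56}
rej = {"FN1": 12, "FN2": 13, "FN3": 33, "FN6": 60, "FN8": 23, "FN9": 44, "FN10": 4}
all_fn = ["FN%d" % i for i in range(1, 11)]


def pivot(keys, left, right):
    """Return (key, left, right, left + right), treating missing values as 0."""
    rows = []
    for key in keys:
        p = left.get(key, 0)   # 0 when the key has no Proc row
        r = right.get(key, 0)  # 0 when the key has no Rej row
        rows.append((key, p, r, p + r))
    return rows


rows = pivot(all_fn, proc, rej)
for row in rows:
    print(",".join(str(value) for value in row))
```

Driving the loop from `all_fn` is what guarantees a row for every FN, including those that appear in only one of the other two relations.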

Related

Issue with output format in PIG

This is the code I have written in PIG.
I want to print the output like:
John, 3.850000023841858
Mary, 3.925000011920929
Instead of below output
DUMP C;
({(John),(John),(John),(John)},3.850000023841858)
({(Mary),(Mary),(Mary),(Mary)},3.925000011920929)
A = LOAD 'student.txt' AS (name:chararray, term:chararray, gpa:float);
DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
B = GROUP A BY name;
DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
C = FOREACH B GENERATE A.name, AVG(A.gpa);
DUMP C;
({(John),(John),(John),(John)},3.850000023841858)
({(Mary),(Mary),(Mary),(Mary)},3.925000011920929)
Use group instead of A.name:
C = FOREACH B GENERATE group, AVG(A.gpa);
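The difference comes from what each name refers to after GROUP: `group` is the key, while `A` is a bag holding every row for that key, so `A.name` yields one name per row. A rough Python model of this, using the rows from the question (the names `students`, `grouped`, `names_bags`, and `averages` are illustrative only):

```python
from collections import defaultdict

# Rows from the question's student.txt: (name, term, gpa).
students = [
    ("John", "fl", 3.9), ("John", "wt", 3.7), ("John", "sp", 4.0), ("John", "sm", 3.8),
    ("Mary", "fl", 3.8), ("Mary", "wt", 3.9), ("Mary", "sp", 4.0), ("Mary", "sm", 4.0),
]

# Model of B = GROUP A BY name: key -> bag of full rows.
grouped = defaultdict(list)
for row in students:
    grouped[row[0]].append(row)

names_bags = {}  # models A.name: a bag with one name per row
averages = {}    # models AVG(A.gpa)
for key, bag in grouped.items():
    names_bags[key] = [row[0] for row in bag]
    averages[key] = sum(row[2] for row in bag) / len(bag)

for key in grouped:
    # Printing the key (like `group`) gives one clean value per output row.
    print(key, averages[key])
```

The printed averages use Python's double precision, so their trailing digits differ from the Pig output in the question, which was computed from float values.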

PIG latin Script - using group & TOBAG

I have a file with following contents
INPUT:
TOYID;TOYSeries;ModuleID;ID;PART_NUMBER;SUPPLIER;LAND
394107;C204; 731305; 69807402;A0001532122;ABC;AT
394107;C204; 731307; 69807402;A0001532122;ABC;AT
394107;C204; 731315; 69807402;A0001532122;ABC;AT
394107;C204; 731325; 69807402;A0001532122;ABC;AT
394107;C204; 731335; 69807402;A0001532122;ABC;AT
394107;C204; 731345; 69807402;A0001532122;ABC;AT
I want output like this
Output:
SUPPLIER;LAND; COUNT(SUPPLIER,LAND); TOYID TOYSeries; ModuleID; ID; PART_NUMBER
ABC;AT; 6 ; 394107 C204; 731305; 69807402; A0001532122
ABC;AT; 6 ; 394107 C204; 731307; 69807402; A0001532122
I tried:
A = LOAD 'hdfs://localhost:8020/BigData_POC/....../TOY_Detail.txt' USING PigStorage(';') AS (TOYID:chararray,TOYSeries:chararray,ModuleID:chararray,ID:chararray,DESCRIPTION:chararray,PART_NUMBER:chararray,SUPPLIER:chararray,LAND:chararray);
B = FOREACH A GENERATE TOYID,ModuleID,DESCRIPTION,PART_NUMBER,SUPPLIER,LAND;
C = GROUP B by (SUPPLIER,LAND);
D = foreach C generate group, COUNT(B) as cnt, B.TOYID,B.ModuleID,B.PART_NUMBER;
I am getting output like this:
(SUPPLIER,LAND) COUNT {(TOYID) (TOYID) (TOYID)...(TOYID) (MODULEID) (MODULEID) (MODULEID)... (MODULEID)(PARTNUMBER) (PARTNUMBER)... (PARTNUMBER)}
Do you know any pig latin script available for this?
Based on your comment, could you try this as a solution? I have not verified it myself, so it may need some tweaking. Note that field names are case-sensitive, so the group fields must be referenced as SUPPLIER and LAND.
D = foreach C generate group, COUNT(B) as cnt;
E = foreach D generate group.SUPPLIER as supplier, group.LAND as land, cnt;
F = JOIN B by (SUPPLIER,LAND), E by (supplier,land);
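The idea in this answer is to compute the count per (SUPPLIER, LAND) pair separately and then join it back onto every detail row. A minimal Python sketch of that two-step approach follows; it models the logic only, and the names `rows`, `counts`, and `joined` are hypothetical. The sample values are taken from the question's input.

```python
from collections import Counter

# Detail rows built from the question's input; only the fields used here are kept.
module_ids = ["731305", "731307", "731315", "731325", "731335", "731345"]
rows = [
    {"TOYID": "394107", "ModuleID": module_id, "PART_NUMBER": "A0001532122",
     "SUPPLIER": "ABC", "LAND": "AT"}
    for module_id in module_ids
]

# Step 1: count rows per (SUPPLIER, LAND), like GROUP ... COUNT.
counts = Counter((row["SUPPLIER"], row["LAND"]) for row in rows)

# Step 2: attach the count to every detail row, like the JOIN on (SUPPLIER, LAND).
joined = [
    (row["SUPPLIER"], row["LAND"], counts[(row["SUPPLIER"], row["LAND"])],
     row["TOYID"], row["ModuleID"], row["PART_NUMBER"])
    for row in rows
]

for record in joined:
    print(";".join(str(value) for value in record))
```

Each detail row keeps its own ModuleID while carrying the shared count, which is the shape the question asks for.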

Pig Latin - foreach generate method does not work without the first field

I am facing a strange problem with Pig's FOREACH ... GENERATE: if I do not include the first field, the generated data seems to be wrong. Is this the expected behaviour?
a = load '/input/temp2.txt' using PigStorage(' ','-tagFile') as (fname:chararray,line:chararray) ;
grunt> b = foreach a generate $1;
grunt> dump b;
(temp2.txt)
(temp2.txt)
grunt> c = foreach a generate $0,$1;
grunt> dump c;
(temp2.txt,field1,field2)
(temp2.txt,field1,field22)
$cat temp2.txt
field1,field2
field1,field22
pig -version
Apache Pig version 0.15.0 (r1682971)
compiled Jun 01 2015, 11:44:35
In this example I was expecting dump b to return the values from the data file instead of the file name.
In your example you use PigStorage(' ','-tagFile'), so each line is split on spaces. Your data contains no spaces, so:
$0 -> field1,field2
$1 -> nothing
Just use PigStorage(',','-tagFile') instead.
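The effect of the delimiter choice can be seen with a plain string split. This Python snippet is only a model of field splitting; it is not PigStorage and it ignores the extra file-name column that -tagFile adds.

```python
# Model of field splitting only; this is not PigStorage and ignores -tagFile.
line = "field1,field2"  # first line of temp2.txt from the question

split_on_space = line.split(" ")  # delimiter used in the question
split_on_comma = line.split(",")  # delimiter suggested in the answer

print(split_on_space)  # ['field1,field2']
print(split_on_comma)  # ['field1', 'field2']
```

With a space delimiter the whole line stays in a single field, so there is no second data field to select.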

SUM, AVG, in Pig are not working

I am analyzing Cluster user log files with the following code in pig:
t_data = load 'log_flies/*' using PigStorage(',');
A = foreach t_data generate $0 as (jobid:int),
$1 as (indexid:int), $2 as (clusterid:int), $6 as (user:chararray),
$7 as (stat:chararray), $13 as (queue:chararray), $32 as (projectName:chararray), $52 as (cpu_used:float), $55 as (efficiency:float), $59 as (numThreads:int),
$61 as (numNodes:int), $62 as (numCPU:int),$72 as (comTime:int),
$73 as (penTime:int), $75 as (runTime:int), $52/($62*$75) as (allEff: float), SUBSTRING($68, 0, 11) as (endTime: chararray);
---describe A;
A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
B = group A by user;
f_data = foreach B {
grp = group;
count = COUNT(A);
avg = AVG(A.cpu_used);
generate FLATTEN(grp), count, avg;
};
f_data = limit f_data 10;
dump f_data;
The code works for group and COUNT, but when I include AVG or SUM, it shows this error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
iterator for alias f_data
I checked the data types and they all look fine. Do you have any suggestions about what I missed? Thank you in advance for your help.
It's a syntax error. Read http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach (section: Nested foreach) for details.
Pig Script
A = LOAD 'a.csv' USING PigStorage(',') AS (user:chararray, cpu_used:float);
B = GROUP A BY user;
C = FOREACH B {
cpu_used_bag = A.cpu_used;
GENERATE group AS user, AVG(cpu_used_bag) AS avg_cpu_used, SUM(cpu_used_bag) AS total_cpu_used;
};
Input : a.csv
a,3
a,4
b,5
Output :
(a,3.5,7.0)
(b,5.0,5.0)
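As a cross-check of the expected numbers, the same per-user average and sum can be sketched in Python. This models the aggregation only; `A_CSV` and `aggregate` are illustrative names, and the input is the a.csv content shown above.

```python
import csv
import io
from collections import defaultdict

# Contents of a.csv from the answer above.
A_CSV = "a,3\na,4\nb,5\n"


def aggregate(text):
    """Return (user, avg_cpu_used, total_cpu_used) per user, sorted by user."""
    bags = defaultdict(list)  # user -> list of cpu_used values (the "bag")
    for user, cpu_used in csv.reader(io.StringIO(text)):
        bags[user].append(float(cpu_used))
    return [(user, sum(values) / len(values), sum(values))
            for user, values in sorted(bags.items())]


result = aggregate(A_CSV)
print(result)  # [('a', 3.5, 7.0), ('b', 5.0, 5.0)]
```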
Your Pig script is full of errors.
Do not use the same alias on both sides of =.
Declare your schema in the LOAD statement, using PigStorage(',') as (your schema).
For example, this line:
A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
should become:
F = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
Likewise, in this line:
f_data = limit f_data 10;
change the f_data on the left to some other name.
Stop making your life complex.
General rules for debugging a Pig script:
run in local mode
dump after every line
I wrote a sample script to mimic yours (it works):
t_data = load './file' using PigStorage(',') as (jobid:int,cpu_used:float);
C = foreach t_data generate jobid, cpu_used ;
B = group C by jobid ;
f_data = foreach B {
count = COUNT(C);
sum = SUM(C.cpu_used);
avg = AVG(C.cpu_used);
generate FLATTEN(group), count,sum,avg;
};
never_f_data = limit f_data 10;
dump never_f_data;

Apache Pig - How to extract sets of records

I'm a new user of Apache Pig. I have the data below:
order=0012,1,23
order=0013,2,34,0015,1,45
order=0011,1,456
...
I am trying to extract the records below:
0012,1,23
0013,2,34
0015,1,45
0011,1,456
...
Below is the code that I've tried:
a = LOAD 'a.txt' Using TextLoader() AS (line:chararray);
b = FOREACH a GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, 'order=((\\d+),(\\d+),(\\d+))+')) AS
(
order_item:chararray,
order_pid: chararray,
order_qty: chararray,
order_price: chararray
);
It doesn't work.
Another attempt, saving into a bag:
a = LOAD 'a.txt' Using TextLoader() AS (line:chararray);
b = FOREACH a GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, 'order=((\\d+),(\\d+),(\\d+))+')) AS
(
B: bag { T: tuple(
order_pid: chararray,
order_qty: chararray,
order_price: chararray
)}
);
Still doesn't work.
Can you try this?
input:
order=0012,1,23
order=0013,2,34,0015,1,45
order=0011,1,456
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(REGEX_EXTRACT(line,'order=(.*)',1),','));
C = FOREACH B GENERATE FLATTEN(TOBAG(TOTUPLE($0..$2),TOTUPLE($3..$5)));
D = FILTER C BY $0 is not null;
DUMP D;
Output:
(0012,1,23)
(0013,2,34)
(0015,1,45)
(0011,1,456)
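The Pig script above builds exactly two tuples per line ($0..$2 and $3..$5) and filters out the empty one, so it handles at most two records per line. For comparison, here is a small Python sketch of the same extraction that chunks the fields in groups of three. It is a model of the logic rather than Pig code, and `extract_orders` is a hypothetical helper name.

```python
import re

# Input lines from the question.
lines = [
    "order=0012,1,23",
    "order=0013,2,34,0015,1,45",
    "order=0011,1,456",
]


def extract_orders(line):
    """Split the text after 'order=' into (id, qty, price) triples."""
    match = re.match(r"order=(.*)", line)
    if match is None:
        return []
    fields = match.group(1).split(",")
    usable = len(fields) - len(fields) % 3  # ignore an incomplete trailing group
    return [tuple(fields[i:i + 3]) for i in range(0, usable, 3)]


records = [record for line in lines for record in extract_orders(line)]
for record in records:
    print(",".join(record))
```

Chunking by three means the number of records per line does not need to be known in advance.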
