Sum up values in Pig Tuple - hadoop

I have the following output of Pig tuple:
dump g:
()
(97)
(245)
(870)
(480)
describe g:
g: {long}
I want to sum the numbers above, so I tried this:
h = foreach g generate SUM($0);
I received this error:
Please use an explicit cast.
I then tried casting the value to (int), but that did not work either.
The output I'm looking for is like this:
1692
Here is the code leading up to g:
a = LOAD 'tellers' using TextLoader() AS line;
-- convert a to chararray
b = foreach a generate (chararray)line;
-- run through my UDF to create tuples
c = foreach b generate myudfs.TellerParser5(line); -- ({(20),(5),(5),(10)(1),(1),(1),(1),(1),(5),(10),(10),(10)})....
d = foreach c generate flatten(number);
e = group d by number; -- {group: chararray,d: {(number: chararray)}}
f = foreach e generate group, COUNT(d); -- f: {group: chararray,long}
g = foreach f generate (long)$0 * $1;

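SUM is an aggregate function that operates on a bag, so you need to group the relation first. Note that aliases are case-sensitive, so the relation is g, not G. Something like this:
h = GROUP g ALL;
i = FOREACH h GENERATE SUM(g.$0);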
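To illustrate why the grouping step matters, here is a small plain-Python model of the behaviour (not Pig itself). The function name `sum_all` is made up for this sketch, and the empty tuple `()` from the dump is modelled as `None`, on the assumption that it represents a null value, which Pig's SUM skips.

```python
from typing import Iterable, Optional


def sum_all(rows: Iterable[Optional[int]]) -> Optional[int]:
    """Model of GROUP g ALL followed by SUM(g.$0)."""
    bag = list(rows)  # GROUP ... ALL: collect every row into a single bag
    values = [v for v in bag if v is not None]  # SUM ignores null values
    return sum(values) if values else None  # nothing to add up -> null


g = [None, 97, 245, 870, 480]  # the rows shown by "dump g"
print(sum_all(g))  # 1692
```

The point of the sketch is only that the aggregate sees one collection of values, which is what GROUP ... ALL provides.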

Related

NOT IN , MATCHES in pig

I have two relations in Pig, A and B:
DUMP A;
Sandeep Rohan Mohan
DUMP B;
MOHAN
I need the output A - B, that is, relation C should give me
Sandeep,Rohan
since they are not present in B.
Try this:
A1 = LOAD 'Sandeep Rohan Mohan' USING PigStorage() AS (line:chararray);
B1 = LOAD 'MOHAN' USING PigStorage() AS (line:chararray);
A = FOREACH A1 GENERATE UPPER(line) AS line;
B = FOREACH B1 GENERATE UPPER(line) AS line;
C = COGROUP A BY line, B BY line;
D = FILTER C BY IsEmpty(B);
E = FOREACH D GENERATE group AS name;
DUMP E;
(ROHAN) (SANDEEP)
Also see set operations in Apache Pig.
I achieved it with a left outer join, keeping only the tuples that had null in $1.
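The COGROUP plus IsEmpty approach above can be modelled in plain Python as follows. This is only a sketch of the idea, not Pig code; the function name `anti_join` is hypothetical, and the inputs are the names from the question.

```python
from collections import defaultdict


def anti_join(a, b):
    """Upper-cased names from a that have no case-insensitive match in b."""
    # COGROUP A BY line, B BY line: one entry per key with a bag from each side
    groups = defaultdict(lambda: ([], []))
    for name in a:
        groups[name.upper()][0].append(name)
    for name in b:
        groups[name.upper()][1].append(name)
    # FILTER ... BY IsEmpty(B): keep only keys whose B-side bag is empty
    return sorted(key for key, (_, bag_b) in groups.items() if not bag_b)


print(anti_join(["Sandeep", "Rohan", "Mohan"], ["MOHAN"]))  # ['ROHAN', 'SANDEEP']
```

Upper-casing both sides before grouping is what makes "Mohan" and "MOHAN" land in the same group.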

Compare two variables in PIG

I have two documents and need to filter the words of the second document against the words of the first.
This is what I tried, but it does not work:
lines = LOAD 'abc_doc1.txt';
words = FOREACH lines GENERATE word;
C = GROUP words all;
lines1 = LOAD 'abc_doc2.txt';
words1 = FOREACH lines GENERATE word;
C1 = GROUP words1 all;
D = foreach C1 generate $0 as searchwrd
E= Filter D by (searchwrd!=(foreach C generate $0))
Instead of filtering I used join
a. Inner join:
A = load '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);
B = load '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);
words1 = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) as word;
words2 = FOREACH B GENERATE FLATTEN(TOKENIZE(line)) as wordz;
x = JOIN words1 by word , words2 by wordz;
grouped = group x BY word;
D = foreach grouped generate COUNT(x), group;
Dump D;
b. Cross join:
A = load '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);
B = load '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);
words1 = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) as word;
words2 = FOREACH B GENERATE FLATTEN(TOKENIZE(line)) as word;
C= CROSS words1,words2;
CC = foreach C generate $0 as first ,$1 as second;
R = FILTER CC by first==second;
grouped = group R BY first;
D = foreach grouped generate group, COUNT(R);
Dump D;
Your requirement seems to be:
You have two files, A and B, and you want to exclude from file B all words that are present in file A. You can use a left outer join for this.
The script will look like this:
file1 = load 'A' using PigStorage() as (word1:chararray);
file2 = load 'B' using PigStorage() as (word2:chararray);
joined = join file2 by word2 left outer , file1 by word1 ;
filtered = filter joined by word1 is null ;
dump filtered ;
Explanation: the left outer join keeps every word from file2. Words that also appear in file1 get a non-null word1, while unmatched words get a null word1. Keeping only the rows where word1 is null therefore leaves the words that are present in file2 but not in file1.
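The explanation above can be traced with a small plain-Python model of a left outer join followed by the null filter. It is a sketch only; the helper `left_outer_join` and the sample words are invented for illustration.

```python
def left_outer_join(left, right):
    """Return (left_word, right_word) pairs; right_word is None when unmatched."""
    rows = []
    for lw in left:
        matches = [rw for rw in right if rw == lw]
        if matches:
            rows.extend((lw, rw) for rw in matches)
        else:
            rows.append((lw, None))  # unmatched left row is kept with a null
    return rows


file1 = ["apple", "banana"]          # words in A (made-up sample data)
file2 = ["apple", "cherry", "date"]  # words in B (made-up sample data)

joined = left_outer_join(file2, file1)
filtered = [row for row in joined if row[1] is None]  # word1 is null
print(filtered)  # [('cherry', None), ('date', None)]
```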

PIG - Get Highest & Lowest Medal Winning Nations , GROUPed by Year

I am pretty new to Pig. I have a dataset that consists of Olympics data
for 4-5 years. I am trying to find the highest and lowest medal-winning
countries for each year. Here's a sample with the header:
ATHLETE,COUNTRY,YEAR, SPORT,GOLD,SILVER,BRONZE,TOTAL
Yang Yilin,China,2008,Gymnastics,1,0,2,3
Leisel Jones,Australia,2000,Swimming,0,2,0,2
Go Gi-Hyeon,South Korea,2002,Short-Track Speed Skating,1,1,0,2
Chen Ruolin,China,2008,Diving,2,0,0,2
Katie Ledecky,United States,2012,Swimming,1,0,0,1
Ruta Meilutyte,Lithuania,2012,Swimming,1,0,0,1
Dániel Gyurta,Hungary,2004,Swimming,0,1,0,1
Arianna Fontana,Italy,2006,Short-Track Speed Skating,0,0,1,1
Olga Glatskikh,Russia,2004,Rhythmic Gymnastics,1,0,0,1
Kharikleia Pantazi,Greece,2000,Rhythmic Gymnastics,0,0,1,1
I tried everything I know to get this, but with little success.
This is what I have now. Any help on solving this will be
appreciated!
DEFINE MYOVER org.apache.pig.piggybank.evaluation.Over;
DEFINE MYSTITCH org.apache.pig.piggybank.evaluation.Stitch;
A = LOAD 'MortDataSite/MyPigExercise/OlympicMedals.csv' using PigStorage(',') as (ATHLETE:CHARARRAY,COUNTRY:CHARARRAY,YEAR:INT,SPORT:CHARARRAY,GOLD:INT,SILVER:INT,BRONZE:INT,TOTAL:INT);
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,SUM(B.TOTAL) as TOT;
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E {
E1 = ORDER D BY TOT DESC;
GENERATE FLATTEN(MYSTITCH(E1, MYOVER(E1,'dense_rank',0,1,1)));
};
G = FOREACH F GENERATE stitched::YEAR,stitched::COUNTRY ,stitched::TOT,$3;
My output (since many nations have the same total number of medals,
I expect that more than one country may share a rank):
(2000,Cuba,65,1)
(2000,Iran,4,1)
(2000,Chile,17,1)
(2000,China,79,1)
(2000,India,7,1)
(2000,Italy,65,1)
(2000,Japan,42,1)
(2000,Kenya,7,1)
(2000,Qatar,1,1)
(2000,Spain,42,1)
(2000,Brazil,48,1)
Expected output 1:
YEAR COUNTRY MAX(TOTAL)
2001 India 50
2003 UK 90
2006 Japan 56
and
Expected output 2:
YEAR COUNTRY MIN(TOTAL)
2001 India 5
2003 UK 10
2006 Japan 6
Updated query (working as expected):
Here's the updated query which gave me my desired result.
DEFINE MYOVER org.apache.pig.piggybank.evaluation.Over;
DEFINE MYSTITCH org.apache.pig.piggybank.evaluation.Stitch;
A = LOAD 'MortDataSite/MyPigExercise/OlympicMedals.csv' using PigStorage(',') as (ATHLETE:CHARARRAY,COUNTRY:CHARARRAY,YEAR:INT,SPORT:CHARARRAY,GOLD:INT,SILVER:INT,BRONZE:INT,TOTAL:INT);
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,SUM(B.TOTAL) as TOT;
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,MAX(D.TOT) as MTOT;
G = GROUP F BY YEAR;
H = FOREACH G {
G1 = ORDER F BY MTOT DESC;
GENERATE FLATTEN(MYSTITCH(G1, MYOVER(G1,'dense_rank',0,1,1)));
};
J = FOREACH H GENERATE stitched::YEAR,stitched::COUNTRY ,stitched::MTOT,$3;
Output:
YEAR COUNTRY MAX(TOTAL) RANKING
(2000,United States,242,1)
(2000,Russia,187,2)
(2000,Australia,182,3)
(2002,United States,84,1)
(2002,Canada,74,2)
(2002,Germany,61,3)
(2004,United States,265,1)
(2004,Russia,190,2)
(2004,Australia,156,3)
If you would like to get the MAX and MIN total medals by country and year, just use MAX and MIN.
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,SUM(B.TOTAL) as TOTAL;
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E GENERATE FLATTEN(group) as (YEAR,COUNTRY),MAX(D.TOTAL);
G = FOREACH E GENERATE FLATTEN(group) as (YEAR,COUNTRY),MIN(D.TOTAL);
DUMP F;
DUMP G;
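As a cross-check of what the question asks for (the highest and lowest medal totals per year), here is a plain-Python sketch that sums TOTAL per (YEAR, COUNTRY) and then groups by YEAR only. It is not a translation of any script in this thread; the function name `max_min_by_year` is hypothetical, and the rows are the COUNTRY, YEAR and TOTAL columns of the sample in the question.

```python
from collections import defaultdict

# (country, year, total) columns copied from the sample rows in the question
SAMPLE = [
    ("China", 2008, 3), ("Australia", 2000, 2), ("South Korea", 2002, 2),
    ("China", 2008, 2), ("United States", 2012, 1), ("Lithuania", 2012, 1),
    ("Hungary", 2004, 1), ("Italy", 2006, 1), ("Russia", 2004, 1),
    ("Greece", 2000, 1),
]


def max_min_by_year(rows):
    """Return {year: (top, bottom)}, each side being (total, [countries])."""
    # Step 1: sum medals per (year, country)
    totals = defaultdict(int)
    for country, year, total in rows:
        totals[(year, country)] += total

    # Step 2: regroup by year only, so countries compete within a year
    by_year = defaultdict(list)
    for (year, country), total in totals.items():
        by_year[year].append((total, country))

    # Step 3: pick the highest and lowest totals, keeping ties
    result = {}
    for year, entries in by_year.items():
        high = max(total for total, _ in entries)
        low = min(total for total, _ in entries)
        result[year] = (
            (high, sorted(c for t, c in entries if t == high)),
            (low, sorted(c for t, c in entries if t == low)),
        )
    return result


for year, (top, bottom) in sorted(max_min_by_year(SAMPLE).items()):
    print(year, top, bottom)
```

The second grouping is by year alone; grouping by (year, country) again would leave exactly one row per group, which is why every rank in the first attempt came out as 1.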

what is the purpose of FLATTEN operator in PIG Latin

A = load 'data' as (x, y);
B = load 'data' as (x, z);
C = cogroup A by x, B by x;
D = foreach C generate flatten(A), flatten(B);
E = group D by A::x;
What exactly do the statements above do, and where would FLATTEN be used in a real-world scenario?
A = load 'input1' USING PigStorage(',') as (x, y);
-- (x,y) --> (1,2)(1,3)(2,3)
B = load 'input2' USING PigStorage(',') as (x, z);
-- (x,z) --> (1,4)(1,2)(3,2)
C = cogroup A by x, B by x;
result:
(1,{(1,2),(1,3)},{(1,4),(1,2)})
(2,{(2,3)},{})
(3,{},{(3,2)})
D = foreach C generate group, flatten(A), flatten(B);
When both bags are flattened, the cross product of their tuples is returned. A group whose bag is empty on either side (here 2 and 3) produces no rows.
result:
(1,1,2,1,4)
(1,1,2,1,2)
(1,1,3,1,4)
(1,1,3,1,2)
E = group D by A::x;
Here you are grouping D by the x column that came from relation A, so all four rows end up in a single group:
(1,{(1,1,2,1,4),(1,1,2,1,2),(1,1,3,1,4),(1,1,3,1,2)})
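The COGROUP and double FLATTEN steps above can be reproduced with a short plain-Python model. It is a sketch of the semantics rather than Pig code; `cogroup` and `flatten_both` are hypothetical helper names, and the data is the sample from this answer.

```python
from collections import defaultdict
from itertools import product

A = [(1, 2), (1, 3), (2, 3)]  # (x, y)
B = [(1, 4), (1, 2), (3, 2)]  # (x, z)


def cogroup(a, b):
    """Group both relations by their first field, one bag per relation."""
    groups = defaultdict(lambda: ([], []))
    for t in a:
        groups[t[0]][0].append(t)
    for t in b:
        groups[t[0]][1].append(t)
    return [(key, bags[0], bags[1]) for key, bags in sorted(groups.items())]


def flatten_both(cogrouped):
    """Flatten both bags: emit the cross product of their tuples per group."""
    out = []
    for key, bag_a, bag_b in cogrouped:
        # product() yields nothing when either bag is empty
        for ta, tb in product(bag_a, bag_b):
            out.append((key, *ta, *tb))
    return out


C = cogroup(A, B)
D = flatten_both(C)
for row in D:
    print(row)
```

Only key 1 has tuples on both sides, so it is the only key that survives the double flatten.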

Select one relations all fields & one or two from other relation on PIG JOIN, how?

A = load '$input1' using PigStorage() AS (a,b,c,d,e);
B = load '$input2' using PigStorage() AS (a,b1,c1,d1,e1);
C = JOIN A by a, B by a;
D = do something;
D should have the format (a,b,c,d,e,b1).
How can I achieve this?
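Use a project-range expression for A's fields and pick the one field you need from B (it is named b1 in B's schema):
D = FOREACH C GENERATE A::a .. A::e, B::b1 AS b1;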
