Compare two variables in PIG - hadoop

I have two documents, and I need to filter the words of the second document using the words of the first.
I tried the following, but it does not work:
lines = LOAD 'abc_doc1.txt';
words = FOREACH lines GENERATE word;
C = GROUP words all;
lines1 = LOAD 'abc_doc2.txt';
words1 = FOREACH lines GENERATE word;
C1 = GROUP words1 all;
D = foreach C1 generate $0 as searchwrd
E= Filter D by (searchwrd!=(foreach C generate $0))

Instead of filtering, I used a join.
a. Inner join:
A = load '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);
B = load '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);
words1 = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) as word;
words2 = FOREACH B GENERATE FLATTEN(TOKENIZE(line)) as wordz;
x = JOIN words1 by word , words2 by wordz;
grouped = group x BY word;
D = foreach grouped generate COUNT(x), group;
Dump D;
b. Cross join:
A = load '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);
B = load '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);
words1 = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) as word;
words2 = FOREACH B GENERATE FLATTEN(TOKENIZE(line)) as word;
C= CROSS words1,words2;
CC = foreach C generate $0 as first ,$1 as second;
R = FILTER CC by first==second;
grouped = group R BY first;
D = foreach grouped generate group, COUNT(R);
Dump D;

Your requirement seems to be this: you have two files, A and B, and you want to exclude from file B all words that are present in file A. You can use a left outer join for this.
The script will look like this:
file1 = load 'A' using PigStorage() as (word1:chararray);
file2 = load 'B' using PigStorage() as (word2:chararray);
joined = join file2 by word2 left outer , file1 by word1 ;
filtered = filter joined by word1 is null ;
dump filtered ;
Explanation: The left outer join keeps every word from file2. Words that also appear in file1 get a non-null word1, while words with no match get a null word1. Keeping only the rows where word1 is null therefore leaves the words that are present in file2 but not in file1.
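A slightly fuller sketch of the same approach, assuming each line of A and B holds exactly one word. The aliases only_in_b, result, and unique_words are illustrative names and not part of the original answer. The script projects only the file2 column and removes duplicates:

```pig
-- Assumption: 'A' and 'B' each contain one word per line.
file1 = LOAD 'A' USING PigStorage() AS (word1:chararray);
file2 = LOAD 'B' USING PigStorage() AS (word2:chararray);

-- Keep every row of file2; rows without a match get a null word1.
joined = JOIN file2 BY word2 LEFT OUTER, file1 BY word1;

-- Rows with a null word1 had no match in file1.
only_in_b = FILTER joined BY word1 IS NULL;

-- Drop the (always null) word1 column and collapse repeated words.
result = FOREACH only_in_b GENERATE word2 AS word;
unique_words = DISTINCT result;
DUMP unique_words;
```

If the input files hold whole lines of text rather than single words, the TOKENIZE and FLATTEN step from the question would be needed before the join.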

Related

NOT IN , MATCHES in pig

I have two relations in Pig, A and B:
DUMP A;
Sandeep Rohan Mohan
DUMP B;
MOHAN
I need to get A - B as output.
Relation C should give me
Sandeep,Rohan
since they are not present in B.
Try this:
A1 = LOAD 'Sandeep Rohan Mohan' USING PigStorage() AS (line:chararray);
B1 = LOAD 'MOHAN' USING PigStorage() AS (line:chararray);
A = FOREACH A1 GENERATE UPPER(line) AS line;
B = FOREACH B1 GENERATE UPPER(line) AS line;
C = COGROUP A BY line, B BY line;
D = FILTER C BY IsEmpty(B);
E = FOREACH D GENERATE group AS name;
DUMP E;
(ROHAN) (SANDEEP)
Also refer to set operations in Apache Pig.
I achieved it with a left outer join, keeping only those tuples that had nulls in $1.
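The left outer join variant might look like the following sketch. The file names 'a.txt' and 'b.txt' and the aliases J, missing, and diff_names are assumptions made for illustration:

```pig
-- Assumption: 'a.txt' and 'b.txt' are hypothetical files with one name per line.
A1 = LOAD 'a.txt' USING PigStorage() AS (line:chararray);
B1 = LOAD 'b.txt' USING PigStorage() AS (line:chararray);

-- Normalise case so that Mohan and MOHAN compare equal.
A = FOREACH A1 GENERATE UPPER(line) AS line;
B = FOREACH B1 GENERATE UPPER(line) AS line;

-- After the join, $0 is A::line and $1 is B::line.
-- $1 is null whenever the name from A has no match in B.
J = JOIN A BY line LEFT OUTER, B BY line;
missing = FILTER J BY $1 IS NULL;
diff_names = FOREACH missing GENERATE $0 AS name;
DUMP diff_names;
```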

PIG - Get Highest & Lowest Medal Winning Nations , GROUPed by Year

I am pretty new to Pig. I have a dataset which consists of Olympics data
for 4-5 years. I am trying to generate the highest and lowest medal-winning
countries for each year. Here's a sample with header.
ATHLETE,COUNTRY,YEAR,SPORT,GOLD,SILVER,BRONZE,TOTAL
Yang Yilin,China,2008,Gymnastics,1,0,2,3
Leisel Jones,Australia,2000,Swimming,0,2,0,2
Go Gi-Hyeon,South Korea,2002,Short-Track Speed Skating,1,1,0,2
Chen Ruolin,China,2008,Diving,2,0,0,2
Katie Ledecky,United States,2012,Swimming,1,0,0,1
Ruta Meilutyte,Lithuania,2012,Swimming,1,0,0,1
Dániel Gyurta,Hungary,2004,Swimming,0,1,0,1
Arianna Fontana,Italy,2006,Short-Track Speed Skating,0,0,1,1
Olga Glatskikh,Russia,2004,Rhythmic Gymnastics,1,0,0,1
Kharikleia Pantazi,Greece,2000,Rhythmic Gymnastics,0,0,1,1
I tried the options I know of to get this, but with little
success.
This is what I have now. Any help on solving this will be
appreciated!
DEFINE MYOVER org.apache.pig.piggybank.evaluation.Over;
DEFINE MYSTITCH org.apache.pig.piggybank.evaluation.Stitch;
A = LOAD 'MortDataSite/MyPigExercise/OlympicMedals.csv' using PigStorage(',') as (ATHLETE:CHARARRAY,COUNTRY:CHARARRAY,YEAR:INT,SPORT:CHARARRAY,GOLD:INT,SILVER:INT,BRONZE:INT,TOTAL:INT);
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,SUM(B.TOTAL) as TOT;
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E {
E1 = ORDER D BY TOT DESC;
GENERATE FLATTEN(MYSTITCH(E1, MYOVER(E1,'dense_rank',0,1,1)));
};
G = FOREACH F GENERATE stitched::YEAR,stitched::COUNTRY ,stitched::TOT,$3;
My output (since many nations have the same TOTAL medals, I expect that more than one country may share a RANK):
(2000,Cuba,65,1)
(2000,Iran,4,1)
(2000,Chile,17,1)
(2000,China,79,1)
(2000,India,7,1)
(2000,Italy,65,1)
(2000,Japan,42,1)
(2000,Kenya,7,1)
(2000,Qatar,1,1)
(2000,Spain,42,1)
(2000,Brazil,48,1)
Expected Output 1:
YEAR COUNTRY MAX(TOTAL)
2001 India 50
2003 UK 90
2006 Japan 56
and
Expected Output 2:
YEAR COUNTRY MIN(TOTAL)
2001 India 5
2003 UK 10
2006 Japan 6
Updated query (working as expected)
Here is the updated query, which gave me the desired result.
DEFINE MYOVER org.apache.pig.piggybank.evaluation.Over;
DEFINE MYSTITCH org.apache.pig.piggybank.evaluation.Stitch;
A = LOAD 'MortDataSite/MyPigExercise/OlympicMedals.csv' using PigStorage(',') as (ATHLETE:CHARARRAY,COUNTRY:CHARARRAY,YEAR:INT,SPORT:CHARARRAY,GOLD:INT,SILVER:INT,BRONZE:INT,TOTAL:INT);
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,SUM(B.TOTAL) as TOT;
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,MAX(D.TOT) as MTOT;
G = GROUP F BY YEAR;
H = FOREACH G {
G1 = ORDER F BY MTOT DESC;
GENERATE FLATTEN(MYSTITCH(G1, MYOVER(G1,'dense_rank',0,1,1)));
};
J = FOREACH H GENERATE stitched::YEAR,stitched::COUNTRY ,stitched::MTOT,$3;
Output:
YEAR COUNTRY MAX(TOTAL) RANKING
(2000,United States,242,1)
(2000,Russia,187,2)
(2000,Australia,182,3)
(2002,United States,84,1)
(2002,Canada,74,2)
(2002,Germany,61,3)
(2004,United States,265,1)
(2004,Russia,190,2)
(2004,Australia,156,3)
If you would like to get the MAX and MIN total medals by country by year, just use MAX and MIN.
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,SUM(B.TOTAL) as TOTAL;
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E GENERATE group as (YEAR,COUNTRY),MAX(D.TOTAL);
G = FOREACH E GENERATE group as (YEAR,COUNTRY),MIN(D.TOTAL);
DUMP F;
DUMP G;
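To get one highest and one lowest country per year, as in the expected outputs of the question, one option is to group the per-country totals by YEAR only and pick the first row of each ordering. This is a sketch rather than the original answer's method. The aliases A_clean, totals, by_year, highest, and lowest are illustrative, and it assumes the header row fails the int cast on YEAR and so loads as null:

```pig
A = LOAD 'MortDataSite/MyPigExercise/OlympicMedals.csv' USING PigStorage(',')
    AS (ATHLETE:chararray, COUNTRY:chararray, YEAR:int, SPORT:chararray,
        GOLD:int, SILVER:int, BRONZE:int, TOTAL:int);

-- Assumption: the header line yields a null YEAR, so this drops it.
A_clean = FILTER A BY YEAR IS NOT NULL;

-- Total medals per (YEAR, COUNTRY).
totals = FOREACH (GROUP A_clean BY (YEAR, COUNTRY))
         GENERATE FLATTEN(group) AS (YEAR, COUNTRY), SUM(A_clean.TOTAL) AS TOT;

-- Group by YEAR alone so that countries compete within a year.
by_year = GROUP totals BY YEAR;

highest = FOREACH by_year {
    by_tot_desc = ORDER totals BY TOT DESC;
    top1 = LIMIT by_tot_desc 1;
    GENERATE FLATTEN(top1);
};

lowest = FOREACH by_year {
    by_tot_asc = ORDER totals BY TOT ASC;
    bottom1 = LIMIT by_tot_asc 1;
    GENERATE FLATTEN(bottom1);
};

DUMP highest;
DUMP lowest;
```

LIMIT 1 returns a single country even when several are tied. If ties should all be reported, the dense_rank approach from the updated query is the better fit.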

Sum up values in Pig Tuple

I have the following output of Pig tuple:
dump g:
()
(97)
(245)
(870)
(480)
describe g:
g: {long}
I'm looking to sum up the numbers above, so I tried this:
h = foreach g generate SUM($0);
I received this error:
Please use an explicit cast.
I then tried to cast the value to (int), and it still did not work.
The output I'm looking for is like this:
1692
Here is the code leading up to it:
a = LOAD 'tellers' using TextLoader() AS line;
-- convert a to chararray
b = foreach a generate (chararray)line;
-- run through my UDF to create tuples
c = foreach b generate myudfs.TellerParser5(line); -- ({(20),(5),(5),(10)(1),(1),(1),(1),(1),(5),(10),(10),(10)})....
d = foreach c generate flatten(number);
e = group d by number; -- {group: chararray,d: {(number: chararray)}}
f = foreach e generate group, COUNT(d); -- f: {group: chararray,long}
g = foreach f generate (long)$0 * $1;
SUM operates on a bag, so you first need to collect all tuples of g into a single bag. You would need to do something like this:
h = GROUP g ALL;
i = FOREACH h GENERATE SUM(g.$0);

Select one relations all fields & one or two from other relation on PIG JOIN, how?

A = load '$input1' using PigStorage() AS (a,b,c,d,e);
B = load '$input2' using PigStorage() AS (a,b1,c1,d1,e1);
C = JOIN A by a, B by a;
D = do something;
D should have the format (a,b,c,d,e,b1).
How can I achieve this?
D = FOREACH C GENERATE A::a .. A::e, B::b1 AS b1;

(hadoop.pig) multiple counts in single table

I have data with two values, a string and a number:
data(string:chararray, number:int)
and I am counting according to 5 different rules:
1: int being 0~1.
2: int being 1~2.
...
5: int being 4~5.
I was able to count them individually:
zero_to_one = filter avg_user by average_stars >= 0 and average_stars <= 1;
A = GROUP zero_to_one ALL;
zto_count = FOREACH A GENERATE COUNT(zero_to_one);
one_to_two = filter avg_user by average_stars > 1 and average_stars <= 2;
B = GROUP one_to_two ALL;
ott_count = FOREACH B GENERATE COUNT(one_to_two);
two_to_three = filter avg_user by average_stars > 2 and average_stars <= 3;
C = GROUP two_to_three ALL;
ttt_count = FOREACH C GENERATE COUNT( two_to_three);
three_to_four = filter avg_user by average_stars > 3 and average_stars <= 4;
D = GROUP three_to_four ALL;
ttf_count = FOREACH D GENERATE COUNT( three_to_four);
four_to_five = filter avg_user by average_stars > 4 and average_stars <= 5;
E = GROUP four_to_five ALL;
ftf_count = FOREACH E GENERATE COUNT( four_to_five);
This works, but it results in 5 individual tables.
I want to know if there is any way (fancy is fine, I love fancy stuff)
I can get the result in a single table.
That is, if
zto_count = 1
ott_count = 3
. = 2
. = 3
. = 5
then the table would be {1,3,2,3,5}.
That form is just easier to parse and organize.
Is there any way to do this?
Using this as input:
foo 2
foo 3
foo 2
foo 3
foo 5
foo 4
foo 0
foo 4
foo 4
foo 5
foo 1
foo 5
(0 and 1 each appear once, 2 and 3 each appear twice, 4 and 5 each appear thrice)
This script:
A = LOAD 'myData' USING PigStorage(' ') AS (name: chararray, number: int);
B = FOREACH (GROUP A BY number) GENERATE group AS number, COUNT(A) AS count ;
C = FOREACH (GROUP B ALL) {
    zto = FOREACH B GENERATE (number==0?count:0) + (number==1?count:0) ;
    ott = FOREACH B GENERATE (number==1?count:0) + (number==2?count:0) ;
    ttt = FOREACH B GENERATE (number==2?count:0) + (number==3?count:0) ;
    ttf = FOREACH B GENERATE (number==3?count:0) + (number==4?count:0) ;
    ftf = FOREACH B GENERATE (number==4?count:0) + (number==5?count:0) ;
    GENERATE SUM(zto) AS zto,
             SUM(ott) AS ott,
             SUM(ttt) AS ttt,
             SUM(ttf) AS ttf,
             SUM(ftf) AS ftf ;
}
Produces this output:
C: {zto: long,ott: long,ttt: long,ttf: long,ftf: long}
(2,3,4,5,6)
The number of FOREACHs in C shouldn't really matter, because B will only have a handful of rows (one per distinct number), but if it is a concern, they can be combined like this:
C = FOREACH (GROUP B ALL) {
    total = FOREACH B GENERATE (number==0?count:0) + (number==1?count:0) AS zto,
                               (number==1?count:0) + (number==2?count:0) AS ott,
                               (number==2?count:0) + (number==3?count:0) AS ttt,
                               (number==3?count:0) + (number==4?count:0) AS ttf,
                               (number==4?count:0) + (number==5?count:0) AS ftf ;
    GENERATE SUM(total.zto) AS zto,
             SUM(total.ott) AS ott,
             SUM(total.ttt) AS ttt,
             SUM(total.ttf) AS ttf,
             SUM(total.ftf) AS ftf ;
}
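The scripts above count integer values, and neighbouring ranges share their boundary number. If the ranges should instead follow the question's filters on average_stars exactly (non-overlapping, possibly fractional values), a similar single-pass sketch could flag each row and sum the flags. The file name 'avg_user', the space delimiter, and the double type are assumptions; the aliases flags and counts are illustrative:

```pig
-- Assumption: 'avg_user' is a hypothetical space-delimited file; field names
-- follow the question, and the double type is a guess.
avg_user = LOAD 'avg_user' USING PigStorage(' ')
           AS (name:chararray, average_stars:double);

-- One 0/1 flag per range, using the same bounds as the question's filters.
flags = FOREACH avg_user GENERATE
    ((average_stars >= 0 AND average_stars <= 1) ? 1 : 0) AS zto,
    ((average_stars > 1 AND average_stars <= 2) ? 1 : 0) AS ott,
    ((average_stars > 2 AND average_stars <= 3) ? 1 : 0) AS ttt,
    ((average_stars > 3 AND average_stars <= 4) ? 1 : 0) AS ttf,
    ((average_stars > 4 AND average_stars <= 5) ? 1 : 0) AS ftf;

-- Summing the flags over a single group yields all five counts in one tuple.
counts = FOREACH (GROUP flags ALL) GENERATE
    SUM(flags.zto) AS zto,
    SUM(flags.ott) AS ott,
    SUM(flags.ttt) AS ttt,
    SUM(flags.ttf) AS ttf,
    SUM(flags.ftf) AS ftf;
DUMP counts;
```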

Resources