I had two documents, where I need to filter second document words with first document words
I had tried but not working
lines = LOAD 'abc_doc1.txt';
words = FOREACH lines GENERATE word;
C = GROUP words all;
lines1 = LOAD 'abc_doc2.txt';
words1 = FOREACH lines GENERATE word;
C1 = GROUP words1 all;
D = foreach C1 generate $0 as searchwrd
E= Filter D by (searchwrd!=(foreach C generate $0))
Instead of filtering I used join
a. Inner join:
A = load '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);
B = load '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);
words1 = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) as word;
words2 = FOREACH B GENERATE FLATTEN(TOKENIZE(line)) as wordz;
x = JOIN words1 by word , words2 by wordz;
grouped = group x BY word;
D = foreach grouped generate COUNT(x), group;
Dump D;
b.Cross Join:
A = load '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);
B = load '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);
words1 = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) as word;
words2 = FOREACH B GENERATE FLATTEN(TOKENIZE(line)) as word;
C= CROSS words1,words2;
CC = foreach C generate $0 as first ,$1 as second;
R = FILTER CC by first==second;
grouped = group R BY first;
D = foreach grouped generate group, COUNT(R);
Dump D;
Your requirement seems to be :-
You have 2 files A and B. You want exclude all words from file B which are present in file A. You can use left outer join for this.
Script will look like :-
file1 = load 'A' using PigStorage() as (word1:chararray);
file2 = load 'B' using PigStorage() as (word2:chararray);
joined = join file2 by word2 left outer , file1 by word1 ;
filtered = filter joined by word1 is null ;
dump filtered ;
Explanation :- Left outer will make sure to include all words from file2. So all matched words in file1 and file2 will have non-null value. If you filter out NULL value word1 they are the remaining words which are present in file2 but not in file1
Pretty new to Pig , I have a dataset which consists of Olympics data
for 4-5 years. I am trying to generate highest and lowest medal
winning countries split by every year. Hers's a sample with header.
ATHLETE,COUNTRY,YEAR, SPORT,GOLD,SILVER,BRONZE,TOTAL
Yang Yilin,China,2008,Gymnastics,1,0,2,3
Leisel Jones,Australia,2000,Swimming,0,2,0,2
Go Gi-Hyeon,South Korea,2002,Short-Track Speed Skating,1,1,0,2
Chen Ruolin,China,2008,Diving,2,0,0,2
Katie Ledecky,United States,2012,Swimming,1,0,0,1
Ruta Meilutyte,Lithuania,2012,Swimming,1,0,0,1
Dániel Gyurta,Hungary,2004,Swimming,0,1,0,1
Arianna Fontana,Italy,2006,Short-Track Speed Skating,0,0,1,1
Olga Glatskikh,Russia,2004,Rhythmic Gymnastics,1,0,0,1
Kharikleia Pantazi,Greece,2000,Rhythmic Gymnastics,0,0,1,1
I tried my options as per my knowledge to get this , but with little
sucess.
This is what i have now. Any help on solving this will be
appreciated !
DEFINE MYOVER org.apache.pig.piggybank.evaluation.Over;
DEFINE MYSTITCH org.apache.pig.piggybank.evaluation.Stitch;
A = LOAD 'MortDataSite/MyPigExercise/OlympicMedals.csv' using PigStorage(',') as (ATHLETE:CHARARRAY,COUNTRY:CHARARRAY,YEAR:INT,SPORT:CHARARRAY,GOLD:INT,SILVER:INT,BRONZE:INT,TOTAL:INT);
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,SUM(B.TOTAL);
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E {
E1 = ORDER D BY TOT DESC;
GENERATE FLATTEN(MYSTITCH(E1, MYOVER(E1,'dense_rank',0,1,1)));
};
G = FOREACH F GENERATE stitched::YEAR,stitched::COUNTRY ,stitched::TOT,$3;
MyOutput : ( Considering there are many nations with same TOTAL Medals
, I expect more than one country may share one RANK )
(2000,Cuba,65,1)
(2000,Iran,4,1)
(2000,Chile,17,1)
(2000,China,79,1)
(2000,India,7,1)
(2000,Italy,65,1)
(2000,Japan,42,1)
(2000,Kenya,7,1)
(2000,Qatar,1,1)
(2000,Spain,42,1)
(2000,Brazil,48,1)
Expected Ouput : 1
YEAR COUNTRY MAX(TOTAL)
2001 India 50
2003 UK 90
2006 Japan 56
&
Expected Ouput : 2
YEAR COUNTRY MIN(TOTAL)
2001 India 5
2003 UK 10
2006 Japan 6
********* Updated Query ( Working Well as expected ) ****
Here's the updated query which gave me my desired result.
DEFINE MYOVER org.apache.pig.piggybank.evaluation.Over;
DEFINE MYSTITCH org.apache.pig.piggybank.evaluation.Stitch;
A = LOAD 'MortDataSite/MyPigExercise/OlympicMedals.csv' using PigStorage(',') as (ATHLETE:CHARARRAY,COUNTRY:CHARARRAY,YEAR:INT,SPORT:CHARARRAY,GOLD:INT,SILVER:INT,BRONZE:INT,TOTAL:INT);
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,SUM(B.TOTAL);
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,MAX(D.TOT) as MTOT;
G = GROUP F BY YEAR;
H = FOREACH G {
G1 = ORDER F BY MTOT DESC;
GENERATE FLATTEN(MYSTITCH(G1, MYOVER(G1,'dense_rank',0,1,1)));
};
J = FOREACH H GENERATE stitched::YEAR,stitched::COUNTRY ,stitched::MTOT,$3;
**Ouput : **
YEAR COUNTRY MAX(TOTAL).RANKING
(2000,United States,242,1)
(2000,Russia,187,2)
(2000,Australia,182,3)
(2002,United States,84,1)
(2002,Canada,74,2)
(2002,Germany,61,3)
(2004,United States,265,1)
(2004,Russia,190,2)
(2004,Australia,156,3)
If you would like to get the MAX and MIN total medals by country by year,just use MAX and MIN.
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,SUM(B.TOTAL) as TOTAL;
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E GENERATE group as (YEAR,COUNTRY),MAX(D.TOTAL);
G = FOREACH E GENERATE group as (YEAR,COUNTRY),MIN(D.TOTAL);
DUMP F;
DUMP G;
A = load 'data' as (x, y);
B = load 'data' as (x, z);
C = cogroup A by x, B by x;
D = foreach C generate flatten(A), flatten(b);
E = group D by A::x
what exactly done in the above statements and where we use flatten in realtime scenario.
A = load 'input1' USING PigStorage(',') as (x, y);
(x,y) --> (1,2)(1,3)(2,3)
B = load 'input2' USING PigStorage(',') as (x, z);`
(x,z) --> (1,4)(1,2)(3,2)*/
C = cogroup A by x, B by x;`
result:
(1,{(1,2),(1,3)},{(1,4),(1,2)})
(2,{(2,3)},{})
(3,{},{(3,2)})
D = foreach C generate group, flatten(A), flatten(B);`
when both bags flattened, the cross product of tuples are returned.
result:
(1,1,2,1,4)
(1,1,2,1,2)
(1,1,3,1,4)
(1,1,3,1,2)
E = group D by A::x`
here your are grouping with x column of relation A.
(1,1,2,1,4)
(1,1,2,1,2)
(1,1,3,1,4)
(1,1,3,1,2)