NOT IN, MATCHES in Pig - Hadoop

I have two relations in Pig, A and B:
DUMP A;
Sandeep Rohan Mohan
DUMP B;
MOHAN
I need the output A - B.
Relation C should give me
Sandeep,Rohan
since they are not present in B.

Try this:
A1 = LOAD 'Sandeep Rohan Mohan' USING PigStorage() AS (line:chararray);
B1 = LOAD 'MOHAN' USING PigStorage() AS (line:chararray);
A = FOREACH A1 GENERATE UPPER(line) AS line;
B = FOREACH B1 GENERATE UPPER(line) AS line;
C = COGROUP A BY line, B BY line;
D = FILTER C BY IsEmpty(B);
E = FOREACH D GENERATE group AS name;
DUMP E;
(ROHAN) (SANDEEP)
Also see set operations in Apache Pig.

I achieved it with a left outer join, keeping only the tuples that had nulls in $1.
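The left outer join approach could look like the following minimal sketch. The file names 'a.txt' and 'b.txt' and the aliases J, K and C are assumptions for illustration; the names are upper-cased as in the COGROUP answer above so that Mohan and MOHAN match.

```pig
-- Hypothetical inputs: one name per line in each file.
A1 = LOAD 'a.txt' USING PigStorage() AS (line:chararray);
B1 = LOAD 'b.txt' USING PigStorage() AS (line:chararray);
-- Normalise case so the comparison is case-insensitive.
A = FOREACH A1 GENERATE UPPER(line) AS line;
B = FOREACH B1 GENERATE UPPER(line) AS line;
-- Keep every row of A; rows without a match in B get a null in B's column.
J = JOIN A BY line LEFT OUTER, B BY line;
-- $1 is B's column in the joined relation.
K = FILTER J BY $1 IS NULL;
-- Project only the column that came from A.
C = FOREACH K GENERATE $0 AS name;
DUMP C;
```

Compared with the COGROUP version, this returns one row per row of A, so duplicates in A are preserved unless a DISTINCT is added.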

Related

Compare two variables in PIG

I have two documents, and I need to filter the words of the second document against the words of the first.
I tried the following, but it is not working:
lines = LOAD 'abc_doc1.txt';
words = FOREACH lines GENERATE word;
C = GROUP words all;
lines1 = LOAD 'abc_doc2.txt';
words1 = FOREACH lines GENERATE word;
C1 = GROUP words1 all;
D = foreach C1 generate $0 as searchwrd
E= Filter D by (searchwrd!=(foreach C generate $0))
Instead of filtering I used join
a. Inner join:
A = load '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);
B = load '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);
words1 = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) as word;
words2 = FOREACH B GENERATE FLATTEN(TOKENIZE(line)) as wordz;
x = JOIN words1 by word , words2 by wordz;
grouped = group x BY word;
D = foreach grouped generate COUNT(x), group;
Dump D;
b. Cross join:
A = load '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);
B = load '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);
words1 = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) as word;
words2 = FOREACH B GENERATE FLATTEN(TOKENIZE(line)) as word;
C= CROSS words1,words2;
CC = foreach C generate $0 as first ,$1 as second;
R = FILTER CC by first==second;
grouped = group R BY first;
D = foreach grouped generate group, COUNT(R);
Dump D;
Your requirement seems to be:
You have two files, A and B, and you want to exclude from file B all words that are present in file A. You can use a left outer join for this.
The script will look like this:
file1 = load 'A' using PigStorage() as (word1:chararray);
file2 = load 'B' using PigStorage() as (word2:chararray);
joined = join file2 by word2 left outer , file1 by word1 ;
filtered = filter joined by word1 is null ;
dump filtered ;
Explanation: the left outer join keeps every word from file2. Words that also appear in file1 get a non-null word1, while words with no match get a null word1. Keeping only the rows where word1 is null therefore leaves the words that are present in file2 but not in file1.

Sum up values in Pig Tuple

I have the following output of Pig tuple:
dump g:
()
(97)
(245)
(870)
(480)
describe g:
g: {long}
I'm looking to sum up the numbers above, so I tried this:
h = foreach g generate SUM($0);
I received this error:
Please use an explicit cast.
I then tried casting the value to (int), but it still did not work.
The output I'm looking for is like this:
1692
Here is the code leading up to it:
a = LOAD 'tellers' using TextLoader() AS line;
-- convert a to chararray
b = foreach a generate (chararray)line;
-- run through my UDF to create tuples
c = foreach b generate myudfs.TellerParser5(line); -- ({(20),(5),(5),(10)(1),(1),(1),(1),(1),(5),(10),(10),(10)})....
d = foreach c generate flatten(number);
e = group d by number; -- {group: chararray,d: {(number: chararray)}}
f = foreach e generate group, COUNT(d); -- f: {group: chararray,long}
g = foreach f generate (long)$0 * $1;
You would need to do something like this (aliases are case-sensitive in Pig, so they must match the lowercase g defined above):
h = GROUP g ALL;
i = FOREACH h GENERATE SUM(g.$0);
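The reason for the extra GROUP step is that SUM is an aggregate over a bag, while each row of g holds a single long. A self-contained sketch of the same pattern, assuming a hypothetical file 'numbers.txt' with one number per line (the aliases nums, all_nums and total are illustrative):

```pig
-- Hypothetical input: one number per line; empty lines load as null.
nums = LOAD 'numbers.txt' AS (n:long);
-- GROUP ... ALL collects every tuple into a single bag under the key 'all'.
all_nums = GROUP nums ALL;
-- SUM now receives a bag of longs; null values are ignored.
total = FOREACH all_nums GENERATE SUM(nums.n);
DUMP total;
```

Because SUM skips nulls, an empty tuple like the () in the dump of g should not need to be filtered out first.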

What is the purpose of the FLATTEN operator in Pig Latin?

A = load 'data' as (x, y);
B = load 'data' as (x, z);
C = cogroup A by x, B by x;
D = foreach C generate flatten(A), flatten(B);
E = group D by A::x;
What exactly do the statements above do, and where would FLATTEN be used in a real-world scenario?
A = load 'input1' USING PigStorage(',') as (x, y);
(x,y) --> (1,2)(1,3)(2,3)
B = load 'input2' USING PigStorage(',') as (x, z);
(x,z) --> (1,4)(1,2)(3,2)
C = cogroup A by x, B by x;
result:
(1,{(1,2),(1,3)},{(1,4),(1,2)})
(2,{(2,3)},{})
(3,{},{(3,2)})
D = foreach C generate group, flatten(A), flatten(B);
When both bags are flattened, the cross product of their tuples is returned for each group. Flattening an empty bag produces no rows, so groups 2 and 3 drop out.
result:
(1,1,2,1,4)
(1,1,2,1,2)
(1,1,3,1,4)
(1,1,3,1,2)
E = group D by A::x;
Here you are grouping by the x column of relation A, so the four tuples above end up in a single bag under the key 1:
(1,{(1,1,2,1,4),(1,1,2,1,2),(1,1,3,1,4),(1,1,3,1,2)})
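As for where FLATTEN shows up in practice, a common case is un-nesting the bag returned by TOKENIZE, as in a word count. A minimal sketch, assuming a hypothetical text file 'input.txt' (all alias names are illustrative):

```pig
-- Hypothetical input: free text, one line per record.
lines = LOAD 'input.txt' AS (line:chararray);
-- TOKENIZE returns a bag of words; FLATTEN turns each word into its own row.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words and count the rows in each group.
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
```

Without the FLATTEN, each row of words would hold a whole bag of tokens, and grouping by word would not be possible.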

Pig: Create new column based off of two other columns

I'm wondering if it's possible to do something like this in Pig:
There are three columns:
A "type1","type2","type3"
B 101 , 159 , 74
I want to define columns C as such:
If A == "type1" then C = B; else C = 0
Is this possible in pig?
Yes, this is possible. You would write it as below:
data = LOAD '$dataSource' using AvroStorage();
-- data = {A, B}
data2 = FOREACH data
    GENERATE
        A,
        B,
        (A == 'type1' ? B : 0) AS C;
dump data2;
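If more than two branches are needed, nested bincond expressions get hard to read. Pig 0.12 and later also offer a CASE expression, which could be used as in the sketch below. The input file 'data.csv', its comma delimiter and the declared types are assumptions for illustration, not part of the original question.

```pig
-- Hypothetical input: comma-separated rows of (A, B).
data = LOAD 'data.csv' USING PigStorage(',') AS (A:chararray, B:int);
-- CASE plays the same role as the bincond above and needs Pig 0.12 or later.
data2 = FOREACH data GENERATE
    A,
    B,
    (CASE A WHEN 'type1' THEN B ELSE 0 END) AS C;
DUMP data2;
```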

How to select all fields of one relation and one or two from the other in a Pig JOIN?

A = load '$input1' using PigStorage() AS (a,b,c,d,e);
B = load '$input2' using PigStorage() AS (a,b1,c1,d1,e1);
C = JOIN A by a, B by a;
D = do something;
D should have the format (a,b,c,d,e,b1).
How can I achieve this?
D = FOREACH C GENERATE A::a .. A::e, B::b1 AS b1;
