NOT IN clause in PIG - hadoop

I'm trying to write the Pig equivalent of this SQL query:
select * from A where A.ID NOT IN (select id from B)
sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
c= FOREACH destnew GENERATE ID;
D=FILTER sourcenew BY NOT ID (c.ID);
org.apache.pig.tools.pigscript.parser.ParseException: Encountered " <PATH> "D=FILTER "" at line 1, column 1.
Was expecting one of:
<EOF>
"cat" ...
"clear" ...<EOF>
Any help resolving this error? I get it when executing the last line.

Use a LEFT OUTER JOIN, then FILTER to keep only the rows where the right-hand side is null.
sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
c = FOREACH destnew GENERATE ID;
d = JOIN sourcenew BY ID LEFT OUTER,destnew by ID;
e = FILTER d BY destnew::ID IS NULL;
NOTE
I wrote a sample script with a couple of test files; the working solution is below. In your case, check that you are loading the data correctly from your files.
test1.txt
1 abc
2 def
3 ghi
4 jkl
5 mno
6 pqr
7 stu
8 vwx
1 abc
2 def
3 ghi
4 jkl
1 abc
2 def
3 ghi
1 abc
2 def
test2.txt
1
2
3
4
Script
A = LOAD 'test1.txt' USING PigStorage('\t') AS (aid:int,name:chararray);
B = LOAD 'test2.txt' USING PigStorage('\t') AS (bid:int);
C = JOIN A BY aid LEFT OUTER,B BY bid;
D = FILTER C BY bid is null;
DUMP D;
So in the above example, records 5, 6, 7 and 8 should be in the result, since those IDs are not in test2.txt.
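To make the logic of the join-then-filter pattern concrete, here is a minimal plain-Python sketch of what the script computes. It is not Pig code; the helper name left_outer_join and the in-memory lists are assumptions made for illustration, and the rows mirror the distinct lines of test1.txt and test2.txt.

```python
# Plain-Python model of:
#   C = JOIN A BY aid LEFT OUTER, B BY bid;
#   D = FILTER C BY bid is null;
# The lists below are illustrative stand-ins for test1.txt and test2.txt.
a_rows = [(1, "abc"), (2, "def"), (3, "ghi"), (4, "jkl"),
          (5, "mno"), (6, "pqr"), (7, "stu"), (8, "vwx")]
b_ids = [1, 2, 3, 4]


def left_outer_join(left, right_ids):
    """Yield (aid, name, bid); bid is None when no right-side row matches."""
    for aid, name in left:
        matches = [bid for bid in right_ids if bid == aid]
        if not matches:
            yield (aid, name, None)
        for bid in matches:
            yield (aid, name, bid)


joined = list(left_outer_join(a_rows, b_ids))
# Keeping only rows whose right side is null gives the NOT IN result.
anti_joined = [row for row in joined if row[2] is None]

for row in anti_joined:
    print(row)
```

Rows that find a partner carry a non-null bid and are dropped by the filter, which is why only the unmatched IDs survive.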

Related

How to join two relations in pig with multiple fields

I have two CSV files:
1- Fertiltiy.csv :
2- Life Expectency.csv :
I want to join them in pig so that the result will be like this:
I am new to Pig and couldn't get the correct result, but here is my code:
fertility = LOAD 'fertility' USING org.apache.hcatalog.pig.HCatLoader();
lifeExpectency = LOAD 'lifeExpectency' USING org.apache.hcatalog.pig.HCatLoader();
A = JOIN fertility by country, lifeExpectency by country;
B = JOIN fertility by year, lifeExpectency by year;
C = UNION A,B;
DUMP C;
Here is the result of my code:
You have to join on both country and year in a single JOIN, then select the columns needed for your final output.
fertility = LOAD 'fertility' USING org.apache.hcatalog.pig.HCatLoader();
lifeExpectency = LOAD 'lifeExpectency' USING org.apache.hcatalog.pig.HCatLoader();
A = JOIN fertility by (country,year), lifeExpectency by (country,year);
B = FOREACH A GENERATE fertility::country,fertility::year,fertility::fertility,lifeExpectency::lifeExpectency;
DUMP B;
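A small plain-Python sketch may help show why a single join on the pair (country, year) differs from joining on one field at a time. The sample rows, the country names "A" and "B", and the helper join_on are all hypothetical; they are not taken from the original CSV files.

```python
# Hypothetical sample rows: (country, year, value). Not the asker's real data.
fertility = [("A", 2000, 2.1), ("A", 2001, 2.0), ("B", 2000, 3.4)]
life_expectancy = [("A", 2000, 70.5), ("A", 2001, 70.9), ("B", 2000, 61.2)]


def join_on(left, right, key):
    """Inner join: pair every left row with each right row sharing its key."""
    index = {}
    for r in right:
        index.setdefault(key(r), []).append(r)
    return [(l, r) for l in left for r in index.get(key(l), [])]


# Joining on country alone pairs every year of a country with every other year.
by_country_only = join_on(fertility, life_expectancy, key=lambda row: row[0])

# Joining on (country, year) keeps exactly one pair per country and year.
by_country_year = join_on(fertility, life_expectancy,
                          key=lambda row: (row[0], row[1]))

# Project the columns wanted in the final output.
result = [(l[0], l[1], l[2], r[2]) for l, r in by_country_year]
for row in result:
    print(row)
```

With these sample rows the country-only join yields five pairs while the composite-key join yields three, one per (country, year) combination.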

Combination of Union and Join in apache pig

I have two files in hdfs containing data as follows, File1:
id,name,age
1,x1,15
2,x2,14
3,x3,16
File2:
id,name,grades
1,x1,A
2,x2,B
4,y1,A
5,y2,C
I want to produce the following output :
id,name,age,grades
1,x1,15,A
2,x2,14,B
3,x3,16,
4,y1,,A
5,y2,,C
I am using Apache Pig to perform the operation. Is it possible to get the above output in Pig? This is a kind of union and join combined.
Since Pig supports both unions and joins, this is certainly possible.
Without digging into the exact syntax, I can tell you this should work (I have used similar solutions in the past).
Suppose we have relations A and B.
Project the first two columns (id, name) of A and B into A2 and B2.
Union A2 and B2 into M2.
Apply DISTINCT to M2.
Now you have your 'index' relation, and we just need to add the extra columns.
Left join M2 with A and with B.
Generate the relevant columns.
That's it!
A = load 'pdemo/File1' using PigStorage(',') as(id:int,name:chararray,age:chararray);
B = load 'pdemo/File2' using PigStorage(',') as(id:int,name:chararray,grades:chararray);
lj = join A by id left outer,B by id;
rj = join A by id right outer,B by id;
lj1 = foreach lj generate A::id as id,A::name as name,A::age as age,B::grades as grades;
rj1 = foreach rj generate B::id as id,B::name as name,A::age as age,B::grades as grades;
res = union lj1,rj1;
FinalResult = distinct res;
The second approach is better in terms of performance:
A1 = foreach A generate id,name;
B1 = foreach B generate id,name;
M2 = union A1,B1;
M2 = distinct M2;
M2A = JOIN M2 by id left outer,A by id;
M2AB = JOIN M2A by M2::id left outer, B by id;
Res = foreach M2AB generate M2A::M2::id as id,M2A::M2::name as name,M2A::A::age as age,B::grades as grades;
Hope this will help!!
u1 = load 'PigDir/u1' using PigStorage(',') as (id:int,name:chararray,age:int);
u2 = load 'PigDir/u2' using PigStorage(',') as (id:int, name:chararray,grades:chararray);
uj = join u2 by id full outer,u1 by id;
uif = foreach uj generate ($0 is null ? $3 : $0) as id, ($1 is null ? $4 : $1) as name, $5 as age, $2 as grades;
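The full outer join with a null-coalescing projection can be modelled in plain Python as follows. This is only a sketch of the semantics, not Pig code; it assumes each id appears at most once per file, and the function name full_outer_merge is made up for illustration. The rows are the sample data from File1 and File2 in the question, without the header lines.

```python
# Sample rows from the question (header lines omitted).
file1 = [(1, "x1", "15"), (2, "x2", "14"), (3, "x3", "16")]               # id, name, age
file2 = [(1, "x1", "A"), (2, "x2", "B"), (4, "y1", "A"), (5, "y2", "C")]  # id, name, grades


def full_outer_merge(ages, grades):
    """Full outer join on id; take id and name from whichever side is present."""
    # Assumption: ids are unique within each input.
    age_by_id = {row[0]: row for row in ages}
    grade_by_id = {row[0]: row for row in grades}
    merged = []
    for key in sorted(age_by_id.keys() | grade_by_id.keys()):
        a = age_by_id.get(key)
        g = grade_by_id.get(key)
        name = g[1] if g is not None else a[1]
        age = a[2] if a is not None else None
        grade = g[2] if g is not None else None
        merged.append((key, name, age, grade))
    return merged


merged = full_outer_merge(file1, file2)
for row in merged:
    print(row)
```

The `name = g[1] if g is not None else a[1]` line plays the role of the `($1 is null ? $4 : $1)` expression in the Pig script.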

How to join bag in pig

First I have two data files.
largefile.txt:
1001 {(1,-1),(2,-1),(3,-1),(4,-1)}
smallfile.txt:
1002 {(1,0.04),(2,0.02),(4,0.03)}
and I want smallfile.txt like this:
1002 {(1,0.04),(2,0.02),(3,-1),(4,0.03)}
What type of join can I use to do something like this?
A = LOAD './largefile.txt' USING PigStorage('\t') AS (id:int, a:bag{tuple(time:int,value:float)});
B = LOAD './smallfile.txt' USING PigStorage('\t') AS (id:int, b:bag{tuple(time:int,value:float)});
Can you clarify your requirement a bit? Do you want to join largefile.txt and smallfile.txt on the first field, where the values match (e.g. 1002)? If that is the case, you can simply do this:
A = LOAD './largefile.txt' USING PigStorage('\t') AS (id:int, a:bag{tuple(time:int,value:float)});
A = FOREACH A GENERATE id, FLATTEN(a) AS (time, value);
B = LOAD './smallfile.txt' USING PigStorage('\t') AS (id:int, b:bag{tuple(time:int,value:float)});
B = FOREACH B GENERATE id, FLATTEN(b) AS (time, value);
joined = JOIN A BY id, B BY id;

Top 3 records in inside group by query in pig

Sample data for my problem :
1 12 1234
2 12 1233
1 13 5555
1 15 4444
2 34 2222
7 89 1111
Field Description :
col1: cust_id, col2: zip_code, col3: transaction_id.
Using a Pig script, I need to answer the following question:
For each cust_id, find the zip code used most often in the last 3 transactions.
Approach I used so far :
1) Group records with cust_id :
(1,{(1,12,1234),(1,13,5555),(1,15,4444),(1,12,3333),(1,13,2323),(1,13,3434),(1,13,5755),(1,18,4424),(1,12,3383),(1,13,2823)})
(2,{(2,34,2222),(2,12,1233),(2,34,6666),(2,34,6666),(2,34,2422)})
(6,{(6,14,2312),(6,15,8888),(6,14,4634),(6,14,2712),(6,15,8288)})
(7,{(7,45,4244),(7,89,1111),(7,45,4544),(7,89,1121)})
2) Sort them and restrict them to the latest 3 transactions.
Using a nested FOREACH, I sorted by transaction ID and limited the result to 3:
nested = foreach group_by { sor = order zip by $2 desc ; limi = limit sor 3 ; generate limi; };
After grouping data is :
({(1,12,1234),(1,13,2323),(1,13,2823)})
({(2,12,1233),(2,34,2222),(2,34,2422)})
({(6,14,2312),(6,14,2712),(6,14,4634)})
({(7,89,1111),(7,89,1121),(7,45,4244)})
Why is the data above not sorted in descending order?
Even with ascending order, how do I now find the most-used zip code for the last 3 transactions?
Result should be
1) 13
2) 34
3) 14
4) 89
Can you try this?
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS(CustomerId:int,ZipCode:int,TransactionId:int);
B = GROUP A BY CustomerId;
C = FOREACH B {
SortTxnId = ORDER A BY $2 DESC;
TxnIdLimit = LIMIT SortTxnId 3;
GENERATE group,TxnIdLimit;
}
D = FOREACH C GENERATE FLATTEN($1);
E = GROUP D BY ($0,$1);
F = FOREACH E GENERATE group,COUNT(D);
G = GROUP F BY group.$0;
I = FOREACH G {
SortZipCode = ORDER F BY $1 DESC;
ZipCodeLimit = LIMIT SortZipCode 1;
GENERATE FLATTEN(ZipCodeLimit.group);
}
J = FOREACH I GENERATE FLATTEN($0.TxnIdLimit::ZipCode);
DUMP J;
Output:
(13)
(34)
(14)
(89)
input.txt
1,12,1234
1,13,5555
1,15,4444
1,12,3333
1,13,5755
1,18,4424
2,34,2222
2,12,1233
2,33,6666
2,34,6666
2,34,2422
6,14,2312
6,15,8888
6,14,4634
6,14,2712
7,45,4244
7,89,1111
7,89,3111
7,89,1121
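As a cross-check of the steps in the script, here is a minimal plain-Python sketch of the same computation on the input.txt rows above. It is not Pig code. It assumes, as the script does, that "last" means the highest transaction IDs, and the function name most_used_zip_in_last_n is invented for illustration. If two zip codes tie for the top count, which one wins is not specified by this sketch.

```python
from collections import Counter, defaultdict

# Rows from input.txt: (customer_id, zip_code, transaction_id).
rows = [
    (1, 12, 1234), (1, 13, 5555), (1, 15, 4444), (1, 12, 3333),
    (1, 13, 5755), (1, 18, 4424),
    (2, 34, 2222), (2, 12, 1233), (2, 33, 6666), (2, 34, 6666), (2, 34, 2422),
    (6, 14, 2312), (6, 15, 8888), (6, 14, 4634), (6, 14, 2712),
    (7, 45, 4244), (7, 89, 1111), (7, 89, 3111), (7, 89, 1121),
]


def most_used_zip_in_last_n(rows, n=3):
    """For each customer, return the most frequent zip among the n highest txn ids."""
    by_customer = defaultdict(list)
    for cust, zip_code, txn in rows:
        by_customer[cust].append((txn, zip_code))

    result = {}
    for cust, txns in by_customer.items():
        # ORDER ... DESC followed by LIMIT n
        latest = sorted(txns, key=lambda t: t[0], reverse=True)[:n]
        # GROUP by zip and COUNT, then keep the zip with the largest count
        counts = Counter(zip_code for _, zip_code in latest)
        result[cust] = counts.most_common(1)[0][0]
    return result


top_zip = most_used_zip_in_last_n(rows)
for cust in sorted(top_zip):
    print(cust, top_zip[cust])
```

The per-customer values agree with the (13), (34), (14), (89) output shown for the Pig script.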

Count frequency of values in a column in PIG?

I have something like this:
ColA ColB
a xxx
b yyy
c xxx
d yyy
e xxx
I need to find out the number of times each value of ColB occurs.
Output:
xxx 3
yyy 2
Here's what I've been trying:
Considering A has my data,
grunt> B = GROUP A by ColB;
grunt> DESCRIBE B;
B: {group: chararray,A: {(ColA: chararray,ColB: chararray)}}
Now I'm confused, do I do something like this?
grunt> C = FOREACH B GENERATE COUNT(B.ColB)
So I need the output to be like this,
xxx 3
yyy 2
I figured it out.
C = FOREACH B GENERATE GROUP AS ColB, COUNT(A) as count;
Use lower-case for 'group as', it works for me:
C = FOREACH B GENERATE group as ColB, COUNT(A) as count;
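For comparison, the GROUP followed by COUNT pattern corresponds to a frequency count. A minimal plain-Python sketch on the sample data from the question (the list name rows is an assumption for illustration, and this is not Pig code):

```python
from collections import Counter

# Sample data from the question: (ColA, ColB).
rows = [("a", "xxx"), ("b", "yyy"), ("c", "xxx"), ("d", "yyy"), ("e", "xxx")]

# Equivalent in spirit to: B = GROUP A BY ColB; C = FOREACH B GENERATE group, COUNT(A);
counts = Counter(col_b for _, col_b in rows)

for value, n in sorted(counts.items()):
    print(value, n)
```

Each distinct ColB value becomes one key, just as each distinct value becomes one `group` row in Pig.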

Resources