First I have two data files.
largefile.txt:
1001 {(1,-1),(2,-1),(3,-1),(4,-1)}
smallfile.txt:
1002 {(1,0.04),(2,0.02),(4,0.03)}
and I want smallfile.txt like this:
1002 {(1,0.04),(2,0.02),(3,-1),(4,0.03)}
What type of join can I use to do this?
A = LOAD './largefile.txt' USING PigStorage('\t') AS (id:int, a:bag{tuple(time:int,value:float)});
B = LOAD './smallfile.txt' USING PigStorage('\t') AS (id:int, b:bag{tuple(time:int,value:float)});
Can you clarify your requirement a bit? Do you want to join largefile.txt and smallfile.txt on the first field where the values match (e.g. 1002)? If so, you can simply do this:
A = LOAD './largefile.txt' USING PigStorage('\t') AS (id:int, a:bag{tuple(time:int,value:float)});
A = FOREACH A GENERATE id, FLATTEN(a) AS (time, value);
B = LOAD './smallfile.txt' USING PigStorage('\t') AS (id:int, b:bag{tuple(time:int,value:float)});
B = FOREACH B GENERATE id, FLATTEN(b) AS (time, value);
joined = JOIN A BY id, B BY id;
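If the aim is instead to fill the time slots missing from smallfile.txt with the defaults from largefile.txt, a join on id alone will not help, because the ids in the sample data differ (1001 vs 1002). Here is a minimal, untested sketch, assuming largefile.txt holds a single row of default (time, value) pairs. The aliases slots, small, ids, crossed, grid, and dflt are made up for illustration.

```pig
A = LOAD './largefile.txt' USING PigStorage('\t') AS (id:int, a:bag{tuple(time:int,value:float)});
B = LOAD './smallfile.txt' USING PigStorage('\t') AS (id:int, b:bag{tuple(time:int,value:float)});

-- One row per default (time, value) pair; the id of the large file is dropped.
slots = FOREACH A GENERATE FLATTEN(a) AS (time:int, dflt:float);
-- One row per (id, time, value) actually present in the small file.
small = FOREACH B GENERATE id, FLATTEN(b) AS (time:int, value:float);

-- Pair every small-file id with every default time slot.
ids = FOREACH B GENERATE id;
ids = DISTINCT ids;
crossed = CROSS ids, slots;
grid = FOREACH crossed GENERATE ids::id AS id, slots::time AS time, slots::dflt AS dflt;

-- Keep the small-file value where one exists, otherwise fall back to the default.
joined = JOIN grid BY (id, time) LEFT OUTER, small BY (id, time);
filled = FOREACH joined GENERATE grid::id AS id, grid::time AS time,
         (small::value IS NULL ? grid::dflt : small::value) AS value;

-- Rebuild one bag per id (tuple order inside the bag is not guaranteed).
grouped = GROUP filled BY id;
result = FOREACH grouped GENERATE group AS id, filled.(time, value) AS b;
```

The CROSS is cheap only while the default row stays small. If largefile.txt has many rows, this sketch would need a real join key instead.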
I've two CSV files:
1- Fertility.csv :
2- Life Expectency.csv :
I want to join them in pig so that the result will be like this:
I am new to Pig and couldn't get the correct result, but here is my code:
fertility = LOAD 'fertility' USING org.apache.hcatalog.pig.HCatLoader();
lifeExpectency = LOAD 'lifeExpectency' USING org.apache.hcatalog.pig.HCatLoader();
A = JOIN fertility by country, lifeExpectency by country;
B = JOIN fertility by year, lifeExpectency by year;
C = UNION A,B;
DUMP C;
Here is the result of my code:
You have to join on both country and year, then select the columns needed for your final output.
fertility = LOAD 'fertility' USING org.apache.hcatalog.pig.HCatLoader();
lifeExpectency = LOAD 'lifeExpectency' USING org.apache.hcatalog.pig.HCatLoader();
A = JOIN fertility by (country,year), lifeExpectency by (country,year);
B = FOREACH A GENERATE fertility::country,fertility::year,fertility::fertility,lifeExpectency::lifeExpectency;
DUMP B;
Below is the data
col1,col2,col3,col4,col5
------------------------
10,20,30,40,dollar
20,30,40,50,dollar
20,30,10,50,dollar
61,62,63,64,dollar
61,62,63,64,pound
col1,col2,col3 will form the combination of unique keys. The use case is to filter the data based on col5.
For each key combination, we need to filter out the record whose col5 value is "dollar", but only if the same combination also has a record with the "pound" value.
The expected output is
col1,col2,col3,col4,col5
------------------------
10,20,30,40,dollar
20,30,40,50,dollar
20,30,10,50,dollar
61,62,63,64,pound
How should I proceed, since Pig has no special operators for this as Hive does?
A = load 'test1.csv' using PigStorage(',') as (col1:int,col2:int,col3:int,col4:int,col5:chararray);
B = FILTER A BY col5 == 'pound';
Get all the records with 'pound', then get all the records whose key combination does not match any record with 'pound' in col5. Finally, combine the two sets with UNION.
B = FILTER A BY col5 == 'pound';
C = JOIN A BY (col1,col2,col3) LEFT OUTER,B BY (col1,col2,col3);
D = FILTER C BY (B::col1 is null);
E = FOREACH D GENERATE A::col1,A::col2,A::col3,A::col4,A::col5;
F = UNION B,E;
DUMP F;
Output
I'm trying to do the equivalent of this SQL query:
select * from A where A.ID NOT IN (select id from B)
sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
c= FOREACH destnew GENERATE ID;
D=FILTER sourcenew BY NOT ID (c.ID);
org.apache.pig.tools.pigscript.parser.ParseException: Encountered " <PATH> "D=FILTER "" at line 1, column 1.
Was expecting one of:
<EOF>
"cat" ...
"clear" ...<EOF>
Any help resolving this error? I get it when executing the last line.
Pig does not support a NOT IN subquery inside FILTER. Use a LEFT OUTER JOIN and then FILTER on the nulls:
sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
c = FOREACH destnew GENERATE ID;
d = JOIN sourcenew BY ID LEFT OUTER,destnew by ID;
e = FILTER d BY destnew::ID IS NULL;
NOTE
I wrote a sample script with a couple of test files; the working solution is below. In your case, check that you are loading the data correctly from your files.
test1.txt
1 abc
2 def
3 ghi
4 jkl
5 mno
6 pqr
7 stu
8 vwx
1 abc
2 def
3 ghi
4 jkl
1 abc
2 def
3 ghi
1 abc
2 def
test2.txt
1
2
3
4
Script
A = LOAD 'test1.txt' USING PigStorage('\t') AS (aid:int,name:chararray);
B = LOAD 'test2.txt' USING PigStorage('\t') AS (bid:int);
C = JOIN A BY aid LEFT OUTER,B BY bid;
D = FILTER C BY bid is null;
DUMP D;
So in the above example records 5,6,7,8 should be in the result since those Ids are not in test2.txt.
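The same "not in" result can also be obtained without a join by using COGROUP and IsEmpty. This is a minimal sketch against the same test files, not a tested script. It reuses the aliases from the script above and introduces E only for illustration.

```pig
A = LOAD 'test1.txt' USING PigStorage('\t') AS (aid:int, name:chararray);
B = LOAD 'test2.txt' USING PigStorage('\t') AS (bid:int);
-- One row per distinct id, holding the bag of matching rows from each side.
C = COGROUP A BY aid, B BY bid;
-- Keep only the ids that have no row in B.
D = FILTER C BY IsEmpty(B);
-- Unnest the surviving A rows back into (aid, name) records.
E = FOREACH D GENERATE FLATTEN(A);
DUMP E;
```

Unlike the join version, the output here contains only the columns of A, so there is no trailing null column to project away.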
I have two files in hdfs containing data as follows, File1:
id,name,age
1,x1,15
2,x2,14
3,x3,16
File2:
id,name,grades
1,x1,A
2,x2,B
4,y1,A
5,y2,C
I want to produce the following output :
id,name,age,grades
1,x1,15,A
2,x2,14,B
3,x3,16,
4,y1,,A
5,y2,,C
I am using Apache Pig to perform the operation. Is it possible to get the above output in Pig? It is a kind of union and join combined.
Since Pig supports both unions and joins, this is certainly possible.
Without digging into the exact syntax, I can tell you this should work (have used similar solutions in the past).
Suppose we have A and B.
Take the first two columns of A and B to be A2 and B2
Union A2 and B2 into M2
Distinct M2
Now you have your 'index' matrix, and we just need to add the extra columns.
Left join M2 with A and B
Generate the relevant columns
That's it!
A = load 'pdemo/File1' using PigStorage(',') as(id:int,name:chararray,age:chararray);
B = load 'pdemo/File2' using PigStorage(',') as(id:int,name:chararray,grades:chararray);
lj = join A by id left outer,B by id;
rj = join A by id right outer,B by id;
lj1 = foreach lj generate A::id as id,A::name as name,A::age as age,B::grades as grades;
rj1 = foreach rj generate B::id as id,B::name as name,A::age as age,B::grades as grades;
res = union lj1,rj1;
FinalResult = distinct res;
The second approach is better in terms of performance:
A1 = foreach A generate id,name;
B1 = foreach B generate id,name;
M2 = union A1,B1;
M2 = distinct M2;
M2A = JOIN M2 by id left outer,A by id;
M2AB = JOIN M2A by M2::id left outer, B by id;
Res = foreach M2AB generate M2A::M2::id as id,M2A::M2::name as name,M2A::A::age as age,B::grades as grades;
Hope this will help!!
u1 = load 'PigDir/u1' using PigStorage(',') as (id:int,name:chararray,age:int);
u2 = load 'PigDir/u2' using PigStorage(',') as (id:int, name:chararray,grades:chararray);
uj = join u2 by id full outer,u1 by id;
uif = foreach uj generate ($0 is null ?$3:$0) as id,($1 is null ? $4 : $1) as name,$5 as age,$2 as grades;
I have two tables:
A: (feature:chararray, value:float)
B: (multiplier:chararray, value:float)
where A is a table with thousands of rows and B has only one row.
What I want to do is take all the rows in A and multiply A.value by B.value.
e.g.
A:[('f1', 1.5) , ('f2', 2.3)]
B:[('mul', 2)]
I'd like to produce a table C
C: [('f1', 3), ('f2', 4.6)]
Is there an easy way to do so?
You can do a CROSS and a FOREACH ... GENERATE.
X = CROSS A, B;
Y = FOREACH X GENERATE A::feature, A::value * B::value;
The above code has not been tested.
If you are sure that the second table has only one row, take the value in its first column,
hardcode that same value as an extra last column in the first table, and then
do an inner join on it. After that you can easily multiply.
Let's say the first file is plain.txt:
f1,1.5
f2,2
and the second file is multi.txt:
mul,2
A = load '/user/cloudera/inputfiles/plain.txt' USING PigStorage(',') AS(feature:chararray,value:double);
B = load '/user/cloudera/inputfiles/multi.txt' USING PigStorage(',') AS(operation:chararray,no:int);
C = foreach A generate feature,value,'mul' as ope;
D = join C by ope, B by operation;
E = foreach D generate feature,(value*no) as multiplied_value;
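When the second relation is guaranteed to hold exactly one row, Pig can also read one of its fields as a scalar, which avoids both the CROSS and the hardcoded join key. This is a minimal, untested sketch reusing the paths and field names from the script above. It assumes a Pig version with scalar projection (0.8 or later, as far as I know).

```pig
A = LOAD '/user/cloudera/inputfiles/plain.txt' USING PigStorage(',') AS (feature:chararray, value:double);
B = LOAD '/user/cloudera/inputfiles/multi.txt' USING PigStorage(',') AS (operation:chararray, no:int);
-- B.no is read as a scalar; Pig raises a runtime error if B has more than one row.
C = FOREACH A GENERATE feature, value * B.no AS multiplied_value;
```

The runtime error on a multi-row B acts as a safety check here, whereas the CROSS version would silently multiply the row count.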