Combination of Union and Join in Apache Pig - Hadoop

I have two files in HDFS containing data as follows. File1:
id,name,age
1,x1,15
2,x2,14
3,x3,16
File2:
id,name,grades
1,x1,A
2,x2,B
4,y1,A
5,y2,C
I want to produce the following output :
id,name,age,grades
1,x1,15,A
2,x2,14,B
3,x3,16,
4,y1,,A
5,y2,,C
I am using Apache Pig to perform the operation. Is it possible to get the above output in Pig? It is a kind of union and join combined.

Since you can do unions and joins in Pig, this is of course possible.
Without digging into the exact syntax, I can tell you this should work (I have used similar solutions in the past).
Suppose we have A and B.
Take the first two columns of A and B to be A2 and B2
Union A2 and B2 into M2
Distinct M2
Now you have your 'index' relation, and we just need to add the extra columns.
Left join M2 with A and B
Generate the relevant columns
That's it!
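A minimal Pig sketch of those steps, assuming A and B are already loaded with the schemas from the question (untested):
A2 = FOREACH A GENERATE id, name;
B2 = FOREACH B GENERATE id, name;
M2 = UNION A2, B2;
M2 = DISTINCT M2;
-- left join the index with A and then with B to pick up age and grades
M2A = JOIN M2 BY id LEFT OUTER, A BY id;
M2AB = JOIN M2A BY M2::id LEFT OUTER, B BY id;
OUT = FOREACH M2AB GENERATE M2A::M2::id AS id, M2A::M2::name AS name, M2A::A::age AS age, B::grades AS grades;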

A = load 'pdemo/File1' using PigStorage(',') as(id:int,name:chararray,age:chararray);
B = load 'pdemo/File2' using PigStorage(',') as(id:int,name:chararray,grades:chararray);
lj = join A by id left outer,B by id;
rj = join A by id right outer,B by id;
lj1 = foreach lj generate A::id as id,A::name as name,A::age as age,B::grades as grades;
rj1 = foreach rj generate B::id as id,B::name as name,A::age as age,B::grades as grades;
res = union lj1,rj1;
FinalResult = distinct res;
The second approach is better in terms of performance:
A1 = foreach A generate id,name;
B1 = foreach B generate id,name;
M2 = union A1,B1;
M2 = distinct M2;
M2A = JOIN M2 by id left outer,A by id;
M2AB = JOIN M2A by M2::id left outer, B by id;
Res = foreach M2AB generate M2A::M2::id as id,M2A::M2::name as name,M2A::A::age as age,B::grades as grades;
Hope this will help!!

u1 = load 'PigDir/u1' using PigStorage(',') as (id:int,name:chararray,age:int);
u2 = load 'PigDir/u2' using PigStorage(',') as (id:int, name:chararray,grades:chararray);
uj = join u2 by id full outer,u1 by id;
uif = foreach uj generate ($0 is null ? $3 : $0) as id, ($1 is null ? $4 : $1) as name, $5 as age, $2 as grades;
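For readability, the same full outer join can also be written with field names instead of positional references (a sketch under the same u1/u2 schemas, not tested):
uj = join u2 by id full outer, u1 by id;
uif = foreach uj generate (u2::id is null ? u1::id : u2::id) as id, (u2::name is null ? u1::name : u2::name) as name, u1::age as age, u2::grades as grades;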

Related

How to query names from a record with multiple IDs in LINQ

I have a table [A] that has columns such as CreatedBy(ID), AuthorizedBy(ID), SentTo(ID) and I need to join them to a table [B] containing user names (UserID, FullName). How can I write a join that connects each record of table A to multiple records in table B to fill in the CreatedBy/AuthorizedBy/SentTo names using LINQ?
You can give it a try as below; basically you have to join B with A three times:
from a in A
join b in B on a.CreatedBy equals b.Id
join b1 in B on a.AuthorizedBy equals b1.Id
join b2 in B on a.SentTo equals b2.Id
select new {
    a.Id,
    CreatedBy = b.FullName,
    AuthorizedBy = b1.FullName,
    SentTo = b2.FullName
};
or
from a in A
select new {
    a.Id,
    CreatedBy = B.FirstOrDefault(b => b.Id == a.CreatedBy).FullName,
    AuthorizedBy = B.FirstOrDefault(b => b.Id == a.AuthorizedBy).FullName,
    SentTo = B.FirstOrDefault(b => b.Id == a.SentTo).FullName
}

How to join two relations in pig with multiple fields

I've two CSV files:
1- Fertility.csv
2- Life Expectency.csv
I want to join them in Pig so that the result looks like this:
I am new to Pig and couldn't get the correct answer, but here is my code:
fertility = LOAD 'fertility' USING org.apache.hcatalog.pig.HCatLoader();
lifeExpectency = LOAD 'lifeExpectency' USING org.apache.hcatalog.pig.HCatLoader();
A = JOIN fertility by country, lifeExpectency by country;
B = JOIN fertility by year, lifeExpectency by year;
C = UNION A,B;
DUMP C;
Here is the result of my code:
You have to join by country and year and select the necessary columns needed for your final output.
fertility = LOAD 'fertility' USING org.apache.hcatalog.pig.HCatLoader();
lifeExpectency = LOAD 'lifeExpectency' USING org.apache.hcatalog.pig.HCatLoader();
A = JOIN fertility by (country,year), lifeExpectency by (country,year);
B = FOREACH A GENERATE fertility::country,fertility::year,fertility::fertility,lifeExpectency::lifeExpectency;
DUMP B;
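If you also need countries/years that appear in only one of the two relations, a full outer join on the same keys works the same way (a sketch assuming the schemas above, not tested):
AB = JOIN fertility BY (country,year) FULL OUTER, lifeExpectency BY (country,year);
OUT = FOREACH AB GENERATE (fertility::country IS NULL ? lifeExpectency::country : fertility::country) AS country, (fertility::year IS NULL ? lifeExpectency::year : fertility::year) AS year, fertility::fertility, lifeExpectency::lifeExpectency;
DUMP OUT;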

How to join bag in pig

First I have two data files.
largefile.txt:
1001 {(1,-1),(2,-1),(3,-1),(4,-1)}
smallfile.txt:
1002 {(1,0.04),(2,0.02),(4,0.03)}
and I want smallfile.txt to end up like this:
1002 {(1,0.04),(2,0.02),(3,-1),(4,0.03)}
What type of join can I use to achieve something like this?
A = LOAD './largefile.txt' USING PigStorage('\t') AS (id:int, a:bag{tuple(time:int,value:float)});
B = LOAD './smallfile.txt' USING PigStorage('\t') AS (id:int, b:bag{tuple(time:int,value:float)});
Can you clarify your requirement a bit? Do you want to join on the first column/field from largefile.txt and smallfile.txt with the same value (for example 1002)? If that is the case, you can simply do this:
A = LOAD './largefile.txt' USING PigStorage('\t') AS (id:int, a:bag{tuple(time:int,value:float)});
A = Foreach A generate id, FLATTEN(a) as (time, value);
B = LOAD './smallfile.txt' USING PigStorage('\t') AS (id:int, b:bag{tuple(time:int,value:float)});
B = Foreach B generate id, FLATTEN(b) as (time, value);
joined = join A by id, B by id;

Calculating an Average across multiple columns in Hadoop Hive

I am trying to calculate an average of three columns in Hive but with no luck. Below is my code.
select c.university_name, c.country, AVG(c.world_rank) as AvgC, AVG(s.world_rank) as AvgS, AVG(t.world_rank) as AvgT, SUM(AvgC+AvgS+AvgT)/3 as TotalAvg
from cwur c
join shanghai s on (c.university_name = s.university_name and c.year = s.year)
join times t on (c.university_name = t.university_name and c.year = t.year)
Is Hive even capable of averaging across three calculated columns?
You are missing the GROUP BY clause. Also, Hive will not let you reference the aliases AvgC, AvgS and AvgT within the same SELECT, so repeat the aggregate expressions:
select
c.university_name,
c.country,
AVG(c.world_rank) as AvgC,
AVG(s.world_rank) as AvgS,
AVG(t.world_rank) as AvgT,
(AVG(c.world_rank) + AVG(s.world_rank) + AVG(t.world_rank))/3 as TotalAvg
from cwur c
join shanghai s on (c.university_name = s.university_name and c.year = s.year)
join times t on (c.university_name = t.university_name and c.year = t.year)
group by c.university_name, c.country

Pig multiply a number from a table to all the values of another table

I have two tables:
A: (feature:chararray, value:float)
B: (multiplier:chararray, value:float)
where A is a table with thousands of rows and B has only one row.
What I want to do is take all the rows in A and multiply A.value by B.value.
e.g.
A:[('f1', 1.5) , ('f2', 2.3)]
B:[('mul', 2)]
I'd like to produce a table C:
C: [('f1', 3), ('f2', 4.6)]
Is there an easy way to do so?
You can do a CROSS and a FOREACH ... GENERATE.
X = CROSS A, B;
Y = FOREACH X GENERATE A::feature, A::value * B::value;
The above code has not been tested.
If you are very sure that the second table has only one row, then take the key from the second table, hardcode that same value as an extra column in the first table, do an inner join, and then you can easily multiply.
Let's say the first file is plain.txt:
(f1,1.5)
(f2,2)
and here is the second file, multi.txt:
(mul,2)
A = load '/user/cloudera/inputfiles/plain.txt' USING PigStorage(',') AS(feature:chararray,value:double);
B = load '/user/cloudera/inputfiles/multi.txt' USING PigStorage(',') AS(operation:chararray,no:int);
C = foreach A generate feature,value,'mul' as ope;
D = join C by ope, B by operation;
E = foreach D generate feature,(value*no) as multiplied_value;
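Alternatively, if you are on Pig 0.8 or later and B is guaranteed to hold exactly one row, you can skip the join and read B as a scalar (a sketch reusing the A and B relations loaded above; Pig fails at runtime if B contains more than one row):
-- B.no is read as a scalar because B is expected to contain a single tuple
C = foreach A generate feature, (value * (double)B.no) as multiplied_value;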
