How to join two relations in pig with multiple fields - hadoop

I've two CSV files:
1- Fertility.csv :
2- Life Expectency.csv :
I want to join them in Pig so that the result will look like this:
I am new to Pig and couldn't get the correct result, but here is my code:
fertility = LOAD 'fertility' USING org.apache.hcatalog.pig.HCatLoader();
lifeExpectency = LOAD 'lifeExpectency' USING org.apache.hcatalog.pig.HCatLoader();
A = JOIN fertility by country, lifeExpectency by country;
B = JOIN fertility by year, lifeExpectency by year;
C = UNION A,B;
DUMP C;
Here is the result of my code:

You have to join by both country and year, then select the columns needed for your final output.
fertility = LOAD 'fertility' USING org.apache.hcatalog.pig.HCatLoader();
lifeExpectency = LOAD 'lifeExpectency' USING org.apache.hcatalog.pig.HCatLoader();
A = JOIN fertility by (country,year), lifeExpectency by (country,year);
B = FOREACH A GENERATE fertility::country,fertility::year,fertility::fertility,lifeExpectency::lifeExpectency;
DUMP B;
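If the tables are not registered in HCatalog, a minimal sketch of the same multi-field join straight from the CSV files (the paths and schemas below are assumptions, not taken from the question):
fertility = LOAD '/data/fertility.csv' USING PigStorage(',') AS (country:chararray, year:int, fertility:float);
lifeExpectency = LOAD '/data/lifeExpectency.csv' USING PigStorage(',') AS (country:chararray, year:int, lifeExpectency:float);
-- join on the composite (country, year) key, exactly as above
A = JOIN fertility BY (country,year), lifeExpectency BY (country,year);
B = FOREACH A GENERATE fertility::country, fertility::year, fertility::fertility, lifeExpectency::lifeExpectency;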

Related

Linq query not returning expected results even when using DefaultIfEmpty

I have the following query in one of my Entity Framework Core API controllers:
var plotData = await (from nl in _context.BookList
join ql in _context.PlotList on nl.PlotId equals ql.PlotId
join qc in _context.PlotChoices on ql.PlotId equals qc.PlotId
join nk in _context.BookLinks.DefaultIfEmpty() on qc.ChoiceId equals nk.ChoiceId
where nl.Id == ID
select new
{ .. }
I need it to return all rows even if data doesn't exist in the BookLinks table.
However, it's not returning rows if there is no data in the BookLinks table for that row.
But this SQL query, which I'm trying to model mine on, does return data: it returns nulls if there is no data in BookLinks.
select * from BookList bl
left join PlotList pl ON bl.plotId = pl.plotId
left join PlotChoices pc ON pl.plotId = pc.plotId
left join BookLinks bk ON pc.choiceID = bk.choiceID
where bl.caseID = '2abv1'
From what I read online, adding 'DefaultIfEmpty()' to the end of BookLinks should fix that, but it hasn't.
What am I doing wrong?
Thanks!
For a left join, you can try the code sample below:
var plotData = (from nl in _context.BookList
join ql in _context.PlotList on nl.PlotId equals ql.PlotId
join qc in _context.PlotChoices on ql.PlotId equals qc.PlotId
join nk in _context.BookLinks on qc.ChoiceId equals nk.ChoiceId into Details
from m in Details.DefaultIfEmpty()
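// m is the matched BookLinks row, or null when there is no match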
where nl.Id == ID
select new
{
}).ToList();
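The key difference from the original query is the group join: join ... into Details followed by from m in Details.DefaultIfEmpty() is the pattern LINQ providers translate to a SQL LEFT JOIN. Calling DefaultIfEmpty() directly on the table inside the join clause, as in the question, does not produce a left join.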

Pig script to find the max, min, avg, sum of salary in each department

I get stuck after grouping the data by department number. The steps I followed:
grunt> A = load '/home/cloudera/naveen1/hive_data/emp_data.txt' using PigStorage(',') as (eno:int, ename:chararray, job:chararray, sal:float, comm:float, dno:int);
grunt> B = group A by dno;
grunt> describe B;
B: {group: int,A: {(eno: int,ename: chararray,job: chararray,sal: float,comm: float,dno: int)}}
Please let me know the steps after this. I am a bit confused about nested FOREACH statement execution.
The data contains eno, ename, sal, job, commission, deptno, and I want to extract the max sal in each dept along with the employee earning the highest salary.
Similarly for the min sal.
Use the aggregate functions after grouping.
C = FOREACH B GENERATE group,MAX(A.sal),MIN(A.sal),AVG(A.sal),SUM(A.sal);
DUMP C;
To get the name, eno, and max sal in each dept, sort the records and take the top row:
C = FOREACH B {
max_sal = ORDER A BY sal DESC;
max_limit = LIMIT max_sal 1;
GENERATE FLATTEN(max_limit);
}
DUMP C;
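Similarly for the minimum salary, a sketch following the same pattern, just ordering ascending instead:
D = FOREACH B {
min_sal = ORDER A BY sal ASC;
min_limit = LIMIT min_sal 1;
GENERATE FLATTEN(min_limit);
}
DUMP D;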

Combination of Union and Join in apache pig

I have two files in hdfs containing data as follows, File1:
id,name,age
1,x1,15
2,x2,14
3,x3,16
File2:
id,name,grades
1,x1,A
2,x2,B
4,y1,A
5,y2,C
I want to produce the following output :
id,name,age,grades
1,x1,15,A
2,x2,14,B
3,x3,16,
4,y1,,A
5,y2,,C
I am using Apache Pig to perform the operation. Is it possible to get the above output in Pig? It is a kind of union and join combined.
Since you can do unions and joins in Pig, this is of course possible.
Without digging into the exact syntax, I can tell you this approach should work (I have used similar solutions in the past).
Suppose we have A and B.
Take the first two columns of A and B to be A2 and B2
Union A2 and B2 into M2
Distinct M2
Now you have your 'index' matrix, and we just need to add the extra columns.
Left join M2 with A and B
Generate the relevant columns
That's it!
A = load 'pdemo/File1' using PigStorage(',') as (id:int, name:chararray, age:chararray);
B = load 'pdemo/File2' using PigStorage(',') as (id:int, name:chararray, grades:chararray);
lj = join A by id left outer, B by id;
rj = join A by id right outer, B by id;
lj1 = foreach lj generate A::id as id, A::name as name, A::age as age, B::grades as grades;
rj1 = foreach rj generate B::id as id, B::name as name, A::age as age, B::grades as grades;
res = union lj1, rj1;
FinalResult = distinct res;
The 2nd approach is better in terms of performance:
A1 = foreach A generate id, name;
B1 = foreach B generate id, name;
M2 = union A1, B1;
M2 = distinct M2;
M2A = JOIN M2 by id left outer, A by id;
M2AB = JOIN M2A by M2::id left outer, B by id;
Res = foreach M2AB generate M2A::M2::id as id, M2A::M2::name as name, M2A::A::age as age, B::grades as grades;
Hope this will help!!
u1 = load 'PigDir/u1' using PigStorage(',') as (id:int, name:chararray, age:int);
u2 = load 'PigDir/u2' using PigStorage(',') as (id:int, name:chararray, grades:chararray);
uj = join u2 by id full outer, u1 by id;
uif = foreach uj generate ($0 is null ? $3 : $0) as id, ($1 is null ? $4 : $1) as name, $5 as age, $2 as grades;
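Here the bincond ($0 is null ? $3 : $0) coalesces the id from whichever side of the full outer join is non-null: positions $0-$2 come from u2 (id, name, grades) and $3-$5 from u1 (id, name, age). A single full outer join replaces the union of a left and a right outer join used in the first approach.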

How to join bag in pig

First I have two data files.
largefile.txt:
1001 {(1,-1),(2,-1),(3,-1),(4,-1)}
smallfile.txt:
1002 {(1,0.04),(2,0.02),(4,0.03)}
and I want the result for smallfile.txt to look like this:
1002 {(1,0.04),(2,0.02),(3,-1),(4,0.03)}
What type of join can do something like this?
A = LOAD './largefile.txt' USING PigStorage('\t') AS (id:int, a:bag{tuple(time:int,value:float)});
B = LOAD './smallfile.txt' USING PigStorage('\t') AS (id:int, b:bag{tuple(time:int,value:float)});
Can you clarify your requirement a bit? Do you want to join on the first column/field from largefile.txt and smallfile.txt having the same value (e.g. 1002)? If that is the case you can simply do this:
A = LOAD './largefile.txt' USING PigStorage('\t') AS (id:int, a:bag{tuple(time:int,value:float)});
A = FOREACH A GENERATE id, FLATTEN(a) AS (time, value);
B = LOAD './smallfile.txt' USING PigStorage('\t') AS (id:int, b:bag{tuple(time:int,value:float)});
B = FOREACH B GENERATE id, FLATTEN(b) AS (time, value);
joined = JOIN A BY id, B BY id;
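Judging by the expected output, though, the match seems to be on time within the bags (taking smallfile's value where present and largefile's default otherwise) rather than on id, since the ids 1001 and 1002 differ. Reusing the flattened A and B from above, a sketch under that assumption:
-- keep every time from largefile (A); B's fields are null where smallfile has no entry
merged = JOIN A BY time LEFT OUTER, B BY time;
filled = FOREACH merged GENERATE A::time AS time, (B::value is null ? A::value : B::value) AS value;
-- collapse back into a single bag (re-attaching smallfile's id, e.g. 1002, is omitted here)
result = FOREACH (GROUP filled ALL) GENERATE filled;
DUMP result;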

Calculating an Average across multiple columns in Hadoop Hive

I am trying to calculate an average of three columns in Hive but with no luck. Below is my code.
select c.university_name, c.country, AVG(c.world_rank) as AvgC, AVG(s.world_rank) as AvgS, AVG(t.world_rank) as AvgT, SUM(AvgC+AvgS+AvgT)/3 as TotalAvg
from cwur c
join shanghai s on (c.university_name = s.university_name and c.year = s.year)
join times t on (c.university_name = t.university_name and c.year = t.year)
Is Hive even capable of averaging across three calculated columns?
You are missing the GROUP BY clause. Also, Hive cannot reference a column alias such as AvgC inside another expression of the same SELECT list, so the average expressions have to be repeated:
select
c.university_name,
c.country,
AVG(c.world_rank) as AvgC,
AVG(s.world_rank) as AvgS,
AVG(t.world_rank) as AvgT,
(AVG(c.world_rank) + AVG(s.world_rank) + AVG(t.world_rank))/3 as TotalAvg
from cwur c
join shanghai s on (c.university_name = s.university_name and c.year = s.year)
join times t on (c.university_name = t.university_name and c.year = t.year)
group by c.university_name, c.country
