Pig join two Relations only with join partner - hadoop

I'm new to programming in Pig Latin and I have a question.
Let's say I have the following two relations (A and B):
Relation A: http://i.stack.imgur.com/Aa5Rd.png
Relation B: http://i.stack.imgur.com/m467q.png
Now the relations should be joined, but only for keys (id) that exist in A; rows of B without a matching id in A should be dropped. The result should look like this:
Relation Result: i.stack.imgur.com/3elgh.png (I cannot post more than 2 links)
How can I solve that?
My approach, result = JOIN A BY id, B BY id;, does not work because it creates a result relation with all ids and texts :/
Thank you very much in advance,
Stefanos

Your approach is right: JOIN is an inner join by default, so only ids present in both relations survive. I got the output you expect, and I am not sure why you didn't. Can you cross-check your Pig script against the one below?
input1:
1
4
6
input2:
1,peter
2,jay
3,dan
4,knut
5,Gnu
6,rafael
7,hans
PigScript:
A = LOAD 'input1' AS (id:int);
B = LOAD 'input2' USING PigStorage(',') AS (id:int,text:chararray);
C = JOIN A BY id,B BY id;
D = FOREACH C GENERATE A::id AS id,B::text as text;
DUMP D;
Output:
(1,peter)
(4,knut)
(6,rafael)

Related

How to perform Group by then use DISTINCT on other column in pig

I have just started learning Pig and need a little help with the question below. Thanks in advance!
For example, I have input like this:
Occupation      Category    Name
Actress         Acting      Marion Cotillard
Actor           Acting      Liam Nelson
Tennis Plyr     Athletics   Roger Federer
Football Plyr   Athletics   Neymar
Actor           Acting      Tom Hanks
Actress         Acting      Elizabeth Banks
US Senator      Politics    Elizabeth Warren
Football Plyr   Athletics   Mesut Ozil
I want to know how many distinct occupation types there are in each category.
For example, Acting has two types, Actress and Actor, so the result for Acting should be 2.
Problem: I am not able to apply DISTINCT on the 'Occupation' column to the output of 'group by Category'.
Try this:
x= load '<data>' using PigStorage('\t') as (occupation:chararray,category:chararray,name:chararray);
x_grouped= group x by category;
x_grouped_distinct= foreach x_grouped { cat= distinct $1.occupation; generate $0, cat, COUNT(cat);};
dump x_grouped_distinct;
Apply DISTINCT first and then group by Category. Assuming you have already loaded the data into relation A:
1. Select the two columns after the load.
2. Apply DISTINCT to the relation.
3. Group by Category.
4. Count the occupations for each category.
B = FOREACH A GENERATE Occupation as Occupation,Category as Category;
C = DISTINCT B;
D = GROUP C BY $1;
E = FOREACH D GENERATE group,COUNT(C.Occupation);
DUMP E;
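If only the number is needed, the nested-FOREACH approach can also be written so that it emits just the category and its count. This is a sketch rather than tested code: the input path 'people.tsv' and the aliases by_category, occ and n_occupations are placeholders, and the file is assumed to be tab-separated.

```pig
-- Hypothetical input path; columns assumed to be tab-separated.
x = LOAD 'people.tsv' USING PigStorage('\t')
    AS (occupation:chararray, category:chararray, name:chararray);

-- One tuple per category, carrying a bag named x with that category's rows.
by_category = GROUP x BY category;

-- Inside the nested block, reduce the bag to its distinct occupations
-- and count them; only the category and the count are emitted.
counts = FOREACH by_category {
    occ = DISTINCT x.occupation;
    GENERATE group AS category, COUNT(occ) AS n_occupations;
};

DUMP counts;
```

For the sample rows in the question this should give 2 for Acting, 2 for Athletics and 1 for Politics.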

Aggregate data grouping by two columns in Pig

I have this data, which I need to group by two columns and then sum up two other fields.
Suppose the names of the four columns are OS, device, view, click. For each combination of OS and device, I want to know how many views and how many clicks it has.
(2,3346,1,)
(3,3953,1,1)
(25,4840,1,1)
(2,94840,1,1)
(14,0526,1,1)
(37,4864,1,)
(2,7353,1,)
This is what I have so far
A is data: OS,device,view,click
B = GROUP A BY (OS,device);
Result = FOREACH B {
GENERATE group AS OS,device, SUM(view) AS visits, SUM(click) AS clicks;};
dump Result;
This doesn't work; the error message is: Projected field [OS] does not exist in schema: group:tuple(OS:int,device:long),B:bag{:tuple(OS:int,device:long,view:int,click:int)}.
Here is code that I have tested. You are missing FLATTEN:
A = LOAD '/user/root/pig_data' using PigStorage(',') AS (OS, device, view, click);
B = GROUP A BY (OS, device);
RESULT = FOREACH B GENERATE FLATTEN(group) AS (OS, device), SUM(A.view) as views, SUM(A.click) as clicks;
dump RESULT;
I think you meant B in your example instead of J2 or J3, which may be in your actual code. Try:
B = GROUP A BY (OS, device);
Result = FOREACH B GENERATE
group.OS AS OS:int,
group.device AS device:long,
SUM(A.view) AS visits:int,
SUM(A.click) AS clicks:int;
dump Result;
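A point that is easy to trip over in both answers: the bag inside a grouped relation is named after the relation that was grouped (A here), not after the alias of the GROUP result, and the key is a tuple called group. DESCRIBE shows both names. The following is a minimal sketch under assumptions: the path 'os_device.csv' is hypothetical, and the column types are taken from the schema in the question's error message.

```pig
-- Hypothetical input path; types follow the schema in the error message.
A = LOAD 'os_device.csv' USING PigStorage(',')
    AS (OS:int, device:long, view:int, click:int);

B = GROUP A BY (OS, device);

-- Prints the schema of B: a tuple named group holding (OS, device),
-- plus a bag named A holding the original rows for that key.
DESCRIBE B;

-- Project the key fields out of the group tuple and aggregate over bag A.
Result = FOREACH B GENERATE
    group.OS      AS OS,
    group.device  AS device,
    SUM(A.view)   AS visits,
    SUM(A.click)  AS clicks;

DUMP Result;
```

Using group.OS and group.device here is equivalent to FLATTEN(group) AS (OS, device); which one to use is a matter of taste.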

How to pass the value from one load statement into another load statement in pig script

Hi, I have two load statements, A and B. I want to pass particular column values from A to B. I tried the following code:
A = LOAD '/user/bangalore/part-m-00000-bangalore' using PigStorage ('\t') as (generatekey:chararray,PropertyID:chararray,ssk:chararray,ptsk:chararray,ptid:chararray,BuiltUpArea:chararray,Price:chararray,pn:chararray,NoOfBedRooms:chararray,NoOfBathRooms:chararray,balconies:chararray,Furnished:chararray,TowerNo:chararray,NoOfTowers:chararray,UnitsOntheFloor:chararray,FloorNoOfProperty:chararray,TotalFloors:chararray,NumberOfLifts:chararray,Facing:chararray,Description:chararray,NewResale:chararray,Possession:chararray,Age:chararray,Ownership:chararray,Type:chararray,PropertyAddress:chararray,Property_Address2:chararray,city:chararray,state:chararray,Property_PinCode:chararray,Locality:chararray,Landmark:chararray,PropertyFeatures:chararray,NearByFacilities:chararray,ReferenceURL:chararray,Flooring:chararray,OverLooking:chararray,ListedOn:chararray,Sellerinfo:chararray,CompanyAddress:chararray,Agency_Address2:chararray,city2:chararray,state2:chararray,Agency_Pincode:chararray,Agency_Phone1:chararray,Agency_Phone2:chararray,ContactName:chararray,Agency_Email:chararray,Agency_WebSite:chararray);
B = foreach A generate Locality;
C = LOAD '/user/april_data/bangalore' using PigStorage ('\t') as (SourceWebSite:chararray,PropertyID:chararray,ListedOn:chararray,ContactName:chararray,TotalViews:int,Price:chararray,PriceperArea:chararray,NoOfBedRooms:int,NoOfBathRooms:int,FloorNoOfProperty:chararray,TotalFloors:int,Possession:chararray,BuiltUpArea:chararray,Furnished:chararray,Ownership:chararray,NewResale:chararray,Facing:chararray,title:chararray,PropertyAddress:chararray,NearByFacilities:chararray,PropertyFeatures:chararray,Sellerinfo:chararray,Description:chararray,emp:chararray);
D = FOREACH C generate title;
E = join B by Locality,D by title;
The Locality column is empty. I want to pass the values from the title column to the Locality column. The above code prints only nulls. Any help will be appreciated.

How does Pig's COGROUP operator work?

How does the COGROUP operator work here?
How and why do we get an empty bag in each of the last two lines of the output? (No website explains the data arrangement in COGROUP in detail.)
A = load 'student' as (name:chararray, age:int, gpa:float);
B = load 'student' as (name:chararray, age:int, gpa:float);
dump B;
(joe,18,2.5)
(sam,,3.0)
(bob,,3.5)
X = cogroup A by age, B by age;
dump X;
(18,{(joe,18,2.5)},{(joe,18,2.5)})
(,{(sam,,3.0),(bob,,3.5)},{})
(,{},{(sam,,3.0),(bob,,3.5)})
There is a very clear example in the Definitive Guide book. I hope the snippet below helps you understand the COGROUP concept.
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
grunt> D = COGROUP A BY $0, B BY $1;
grunt> DUMP D;
(0,{},{(Ali,0)})
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})
COGROUP generates a tuple for each unique grouping key. The first field of each tuple
is the key, and the remaining fields are bags of tuples from the relations with a matching
key. The first bag contains the matching tuples from relation A with the same key.
Similarly, the second bag contains the matching tuples from relation B with the same
key.
If for a particular key a relation has no matching key, then the bag for that relation is
empty. For example, since no one has bought a scarf (with ID 1), the second bag in the
tuple for that row is empty. This is an example of an outer join, which is the default
type for COGROUP.
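The quoted explanation does not cover null keys, which is what produces the last two lines in the question's output. When COGROUP is given several relations, Pig treats null keys coming from different relations as different, so the null-age tuples of A and of B land in separate output tuples, each with an empty bag for the other relation. A minimal sketch, assuming the rows without a key are simply not wanted, filters them out before cogrouping (the aliases A_known and B_known are made up for illustration):

```pig
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
B = LOAD 'student' AS (name:chararray, age:int, gpa:float);

-- Keep only the rows whose grouping key is present.
A_known = FILTER A BY age IS NOT NULL;
B_known = FILTER B BY age IS NOT NULL;

-- With no null keys left, every output tuple is built from real key matches.
X = COGROUP A_known BY age, B_known BY age;
DUMP X;
```

With the three sample rows from the question, only the tuple for age 18 should remain.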

get things out of bag in pig

In the pig example:
A = LOAD 'student.txt' AS (name:chararray, term:chararray, gpa:float);
DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
B = GROUP A BY name;
DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
C = FOREACH B GENERATE A.name, AVG(A.gpa);
DUMP C;
({(John),(John),(John),(John)},3.850000023841858)
({(Mary),(Mary),(Mary),(Mary)},3.925000011920929)
In the last output, A.name is a bag. How can I get the values out of the bag, so that the result looks like this:
(John, 3.850000023841858)
(Mary, 3.925000011920929)
GROUP creates a special field called group, which holds the value you grouped on. It exists for exactly this purpose.
B = GROUP A BY name;
C = FOREACH B GENERATE group AS name, AVG(A.gpa);
Check out DESCRIBE B;, you'll see that group is in there. It is a single value that represents what was in the BY ... part of the GROUP.
