I have three variables, namely redcount, greencount and bluecount (already calculated in the same Pig program).
I want to compare their values and display the largest of the three together with its name.
The existing code is below:
countryflags = LOAD '/home/rahul/countryprojectdata/Country.txt' USING PigStorage(',') AS (country:chararray,landmass:int,zone:int,area_1ksqmtr:int,popoulation_million:int,language:int,religion:int,n_vbars:int,n_stripes:int,n_colors:int,redcolour:int,greencolour:int,bluecolour:int,goldcolour:int,whitecolour:int,blackcolour:int,orangecolour:int,mainhue:chararray,n_circles:int,n_upcrosses:int,n_digonalcrosses:int,n_quarteredsections:int,n_sunstars:int,crescent:int,triangle:int,icon:int,animate:int,text:int,topleftcolour:chararray,bottomrightcolour:chararray);
grpred = GROUP countryflags BY redcolour;
redcount = FOREACH grpred GENERATE SUM(countryflags.redcolour);
grpgreen = GROUP countryflags BY greencolour;
greencount = FOREACH grpgreen GENERATE SUM(countryflags.greencolour);
grpblue = GROUP countryflags BY bluecolour;
bluecount = FOREACH grpblue GENERATE SUM(countryflags.bluecolour);
Please help.
UNION the three relations, sort them and take the top record. This assumes you just want the name of the colour and the largest count.
grpred = GROUP countryflags BY redcolour;
redcount = FOREACH grpred GENERATE 'red' as name,SUM(countryflags.redcolour) as red_sum;
grpgreen = GROUP countryflags BY greencolour;
greencount = FOREACH grpgreen GENERATE 'green' as name,SUM(countryflags.greencolour) as green_sum;
grpblue = GROUP countryflags BY bluecolour;
bluecount = FOREACH grpblue GENERATE 'blue' as name,SUM(countryflags.bluecolour) as blue_sum;
A = UNION redcount,greencount,bluecount;
B = ORDER A BY $1 DESC; -- Note: $1 is the sum column; $0 is the colour name added in the FOREACH statements
C = LIMIT B 1;
DUMP C;
I need help discarding nulls in the result of a full outer join in Pig Latin. Below are the two data sets:
A:
(BOS,2)
(BUR,81)
(LAS,8)
B:
(BUR,56)
(EWR,2)
(LAS,88)
After full outer join :
C :
(BOS,2,,)
(BUR,81,BUR,56)
(,,EWR,2)
(LAS,8,LAS,88)
I need to get the output in below format :
(BOS,2)
(BUR,137)
(EWR,2)
(LAS,96)
I tried different combinations of GROUP BY, FLATTEN and BagToTuple, but was not able to figure out a solution. Many thanks for any help.
airline = load '/demo/data/airline/airline.csv' using PigStorage(',') as (Origin: chararray, Dest: chararray);
traffic_in = GROUP airline by Origin;
traffic_in_count= FOREACH traffic_in generate group as Origin , COUNT(airline) as count ;
traffic_out = GROUP airline by Dest;
traffic_out_count = FOREACH traffic_out generate group as Dest ,COUNT (airline) as count;
traffic_top = JOIN traffic_in_count by Origin FULL OUTER , traffic_out_count by Dest ;
EDIT
Instead of using a FULL OUTER JOIN, UNION the two relations, then GROUP by the first column and SUM the second column's values.
A = LOAD 'test1.txt' using PigStorage(',') as (A1:chararray, A2:int);
B = LOAD 'test2.txt' using PigStorage(',') as (B1:chararray, B2:int);
C = UNION A,B;
D = GROUP C BY $0;
E = FOREACH D GENERATE group,SUM(C.$1);
DUMP E;
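The UNION, GROUP, SUM idea can be checked outside Pig. Below is a minimal Python sketch, not part of the original answer, that applies the same three steps to the sample rows from the question; the function name union_group_sum is made up for illustration.

```python
from collections import defaultdict

# Sample rows from the question (relations A and B).
a = [("BOS", 2), ("BUR", 81), ("LAS", 8)]
b = [("BUR", 56), ("EWR", 2), ("LAS", 88)]


def union_group_sum(*relations):
    """Concatenate the relations (UNION), then sum the values per key (GROUP + SUM)."""
    totals = defaultdict(int)
    for relation in relations:
        for key, value in relation:
            totals[key] += value
    # Sort by key only to make the printed result stable.
    return sorted(totals.items())


result = union_group_sum(a, b)
print(result)
```

Keys that occur in only one relation (BOS, EWR) pass through with their single value, which is why no null handling is needed with this approach.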
After performing multi-level filtering in Pig, I get the results below:
(2343433,Argentina,2015,Sci-Fi)
(2343433,France,2015,Sci-Fi)
(2343433,Germany,2015,Sci-Fi)
(2343433,Netherlands,2015,Sci-Fi)
(2343433,Argentina,2015,Drama)
(2343433,France,2015,Drama)
(2343433,Germany,2015,Drama)
(2343433,Netherlands,2015,Drama)
(2343433,Argentina,2015,Family)
(2343433,France,2015,Family)
(2343433,Germany,2015,Family)
(2343433,Netherlands,2015,Family)
The column names are movieid, country, year and genre, respectively. I need to aggregate these results and produce something like this:
(2343433,France,2015,Sci-Fi,Drama,Family)
(2343433,Germany,2015,Sci-Fi,Drama,Family)
(2343433,Netherlands,2015,Sci-Fi,Drama,Family)
(2343433,Argentina,2015,Sci-Fi,Drama,Family)
Either that or something like this -
(2343433,France,Germany,Netherlands,Argentina,2015,Sci-Fi,Drama,Family)
Below is my code to get the above results -
A = LOAD '/user/a1.csv' USING PigStorage('|') as (movie_id,movie_name,prod_year);
B = LOAD '/user/a2.csv' USING PigStorage('|') as (g_movieid,genres);
C = LOAD '/user/a3.csv' USING PigStorage('|') as (c_movieid,country_released);
D = JOIN A by movie_id, B by g_movieid;
E = JOIN D by g_movieid, C by c_movieid;
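-- Rename the loaded fields so that F has the columns movie_id, country, year, genre
F = FOREACH E GENERATE movie_id, country_released AS country, prod_year AS year, genres AS genre;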
Any idea on how to achieve this using Pig?
Try this:
Dump F;
(2343433,Argentina,2015,Sci-Fi)
(2343433,France,2015,Sci-Fi)
(2343433,Germany,2015,Sci-Fi)
(2343433,Netherlands,2015,Sci-Fi)
(2343433,Argentina,2015,Drama)
(2343433,France,2015,Drama)
(2343433,Germany,2015,Drama)
(2343433,Netherlands,2015,Drama)
(2343433,Argentina,2015,Family)
(2343433,France,2015,Family)
(2343433,Germany,2015,Family)
(2343433,Netherlands,2015,Family)
G = GROUP F BY (movie_id, country, year);
H = foreach G generate FLATTEN(group) as (movie_id, country, year), $1.$3 AS (genre:{T:(value:chararray)});
I = foreach H generate movie_id, country, year, FLATTEN(BagToTuple(genre.value));
Dump I;
(2343433,France,2015,Sci-Fi,Drama,Family)
(2343433,Germany,2015,Sci-Fi,Drama,Family)
(2343433,Argentina,2015,Sci-Fi,Drama,Family)
(2343433,Netherlands,2015,Sci-Fi,Drama,Family)
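To make the grouping step concrete, here is a minimal Python sketch (not from the original answer) that rebuilds the rows of F from the question and collapses the genres per (movie_id, country, year) key; the helper name collapse_genres is hypothetical.

```python
# Rows of F as shown in the question: (movie_id, country, year, genre).
countries = ["Argentina", "France", "Germany", "Netherlands"]
genres = ["Sci-Fi", "Drama", "Family"]
rows = [(2343433, country, 2015, genre) for genre in genres for country in countries]


def collapse_genres(rows):
    """Group by (movie_id, country, year) and append every genre to the key."""
    grouped = {}
    for movie_id, country, year, genre in rows:
        grouped.setdefault((movie_id, country, year), []).append(genre)
    # Equivalent of FLATTEN(BagToTuple(...)): key fields followed by the genres.
    return [key + tuple(values) for key, values in grouped.items()]


result = collapse_genres(rows)
for row in result:
    print(row)
```

The Python version keeps the genres in input order. A Pig bag is unordered, so the order of the genre columns in the Pig output should not be relied on.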
I have been trying to find an equivalent of union_map() in Pig. I know that the TOMAP function produces the MAP datatype,
but the requirement is to combine all the MAPs for a given ID, as shown below.
select I1,UNION_MAP(MAP(Key,Val)) as new_val group by I1;
Sample Input and result is provided below.
Input
ID,Key,Val
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
select ID,UNION_MAP(TO_MAP(Key,VAL)) from table group by ID;
Result
ID1,(K1#V7,K2#V4)
ID2,(K1#V2,K3#V3)
I would like to get similar output in Pig.
Download piggybank.jar from http://www.java2s.com/Code/Jar/p/Downloadpiggybankjar.htm, put it on your classpath and try the approach below.
input
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
PigScript:
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input' USING PigStorage(',') AS (ID:chararray,Key:chararray,Val:chararray);
B = RANK A;
C = GROUP B BY (ID,Key);
D = FOREACH C {
    sortByRank = ORDER B BY rank_A DESC;
    top1 = LIMIT sortByRank 1;
    GENERATE FLATTEN(top1);
}
E = GROUP D BY top1::ID;
F = FOREACH E {
    ToMap = FOREACH D GENERATE TOMAP(top1::Key,top1::Val);
    GENERATE group,BagToTuple(ToMap) AS myMap;
}
DUMP F;
Output:
(ID1,([K1#V7],[K2#V4]))
(ID2,([K1#V2],[K3#V3]))
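The script above keeps, for each (ID, Key) pair, the row with the highest rank, i.e. the one that appears last in the input, and then collects the surviving pairs per ID. A minimal Python sketch of that "last value wins" merge, using the sample input from the question (the function name union_maps is made up here):

```python
# Sample input from the question: (ID, Key, Val), in file order.
rows = [
    ("ID1", "K1", "V1"),
    ("ID2", "K1", "V2"),
    ("ID2", "K3", "V3"),
    ("ID1", "K2", "V4"),
    ("ID1", "K1", "V7"),
]


def union_maps(rows):
    """Merge key/value pairs per ID; a later row overwrites an earlier one."""
    merged = {}
    for row_id, key, value in rows:
        merged.setdefault(row_id, {})[key] = value
    return merged


result = union_maps(rows)
print(result)
```

For ID1 the pair K1#V1 is replaced by K1#V7 because V7 comes later in the file, which matches the expected result in the question.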
I have a file as:
1,Mary,5
1,Tom,5
2,Bill,5
2,Sue,4
2,Theo,5
3,Mary,5
3,Cindy,5
4,Andrew,4
4,Katie,4
4,Scott,5
5,Jeff,3
5,Sara,4
5,Ryan,5
6,Bob,5
6,Autumn,4
7,Betty,5
7,Janet,5
7,Scott,5
8,Andrew,4
8,Katie,4
8,Scott,5
9,Mary,5
9,Tom,5
10,Bill,5
10,Sue,4
10,Theo,5
11,Mary,5
11,Cindy,5
12,Andrew,4
12,Katie,4
12,Scott,5
13,Jeff,3
13,Sara,4
13,Ryan,5
14,Bob,5
14,Autumn,4
15,Betty,5
15,Janet,5
15,Scott,5
16,Andrew,4
16,Katie,4
16,Scott,5
I want the name that appears most often, i.e. the maximum count:
(Scott,6)
There's some ambiguity in your question.
What exactly do you want?
Do you want a list of user counts in descending order?
OR
Do you want just (Scott,6), i.e. only the one user with the maximum count?
I have solved both cases on the sample data you gave.
If the question is of the first type, then:
a = load '/file.txt' using PigStorage(',') as (id:int,name:chararray,number:int);
g = group a by name;
g1 = foreach g {
    generate group as g, COUNT(a) as cnt;
};
toptemp = group g1 all;
final = foreach toptemp {
    sorted = order g1 by cnt desc;
    GENERATE flatten(sorted);
};
This will give you the list of users in descending order of count:
(Scott,6)
(Katie,4)
(Andrew,4)
(Mary,4)
(Bob,2)
(Sue,2)
(Tom,2)
(Bill,2)
(Jeff,2)
(Ryan,2)
(Sara,2)
(Theo,2)
(Betty,2)
(Cindy,2)
(Janet,2)
(Autumn,2)
If the question is of the second type, then:
a = load '/file.txt' using PigStorage(',') as (id:int,name:chararray,number:int);
g = group a by name;
g1 = foreach g {
    generate group as g, COUNT(a) as cnt;
};
toptemp = group g1 all;
final = foreach toptemp {
    sorted = order g1 by cnt desc;
    top = limit sorted 1;
    GENERATE flatten(top);
};
This gives only one result:
(Scott,6)
Thanks. I hope it helps.
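The counts above can be cross-checked outside Pig. A minimal Python sketch, relying on the observation that in the sample file the names for ids 9 to 16 repeat those for ids 1 to 8, so only the name column of the first half is typed in and then doubled:

```python
from collections import Counter

# Name column for ids 1..8 of the sample file, in file order.
base = [
    "Mary", "Tom",
    "Bill", "Sue", "Theo",
    "Mary", "Cindy",
    "Andrew", "Katie", "Scott",
    "Jeff", "Sara", "Ryan",
    "Bob", "Autumn",
    "Betty", "Janet", "Scott",
    "Andrew", "Katie", "Scott",
]
# Ids 9..16 carry the same names again.
names = base * 2

# Equivalent of GROUP BY name + COUNT.
counts = Counter(names)
# Equivalent of ORDER BY cnt DESC + LIMIT 1.
top = counts.most_common(1)[0]
print(top)
```

Counter.most_common(1) plays the role of the order-then-limit step; with ties for first place it would return only one of the tied names, just as LIMIT 1 does.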