select count distinct using pig latin - hadoop

I need help with this Pig script. I am getting only a single record back. I am selecting two columns and doing a count(distinct) on another, with a where clause that filters on a particular description (desc).
Here is the SQL I am trying to reproduce in Pig:
/*
For example in sql:
select domain, count(distinct(segment)) as segment_cnt
from table
where desc='ABC123'
group by domain
order by segment_cnt desc;
*/
A = LOAD 'myoutputfile' USING PigStorage('\u0005')
AS (
domain:chararray,
segment:chararray,
desc:chararray
);
B = filter A by (desc=='ABC123');
C = foreach B generate domain, segment;
D = DISTINCT C;
E = group D all;
F = foreach E generate group, COUNT(D) as segment_cnt;
G = order F by segment_cnt DESC;

You could GROUP on each domain and then count the number of distinct elements in each group with a nested FOREACH syntax:
D = group C by domain;
E = foreach D {
unique_segments = DISTINCT C.segment;
generate group, COUNT(unique_segments) as segment_cnt;
};
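To check the expected result without a cluster, the filter, per-domain grouping and distinct count can be mirrored in plain Python. This is only a sketch of the same logic, not Pig itself; the sample rows and the function name distinct_segments_per_domain are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: (domain, segment, desc)
rows = [
    ("a.com", "s1", "ABC123"),
    ("a.com", "s1", "ABC123"),  # duplicate segment, counted once
    ("a.com", "s2", "ABC123"),
    ("b.com", "s1", "ABC123"),
    ("b.com", "s3", "OTHER"),   # filtered out by desc
]

def distinct_segments_per_domain(rows, wanted_desc):
    """Mirror of: FILTER by desc, GROUP by domain, COUNT(DISTINCT segment)."""
    segments = defaultdict(set)
    for domain, segment, desc in rows:
        if desc == wanted_desc:
            segments[domain].add(segment)
    counts = [(domain, len(segs)) for domain, segs in segments.items()]
    # Equivalent of ORDER ... BY segment_cnt DESC
    return sorted(counts, key=lambda pair: pair[1], reverse=True)

result = distinct_segments_per_domain(rows, "ABC123")
print(result)
```

The key difference from the script in the question is that the grouping is per domain rather than GROUP ... ALL, which is why the original returned a single record.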

You can also define this as a reusable macro:
DEFINE DISTINCT_COUNT(A, c) RETURNS dist {
temp = FOREACH $A GENERATE $c;
dist = DISTINCT temp;
groupAll = GROUP dist ALL;
$dist = FOREACH groupAll GENERATE COUNT(dist);
}
Usage:
X = LOAD 'data' AS (x: int);
Y = DISTINCT_COUNT(X, x);
If you need to use it inside a FOREACH instead, the easiest way is something like:
...GENERATE COUNT(Distinct(x))...
Tested on Pig 0.12.

If you don't want a count per group, but a single distinct count over the whole relation, use this:
G = FOREACH (GROUP A ALL){
unique = DISTINCT A.field;
GENERATE COUNT(unique) AS ct;
};
This will just give you a number.

Related

compare variables in Pig

I have three relations, redcount, greencount and bluecount, already calculated in the same Pig program.
I want to compare their values and display the largest of the three along with its name.
My existing code is below:
countryflags = LOAD '/home/rahul/countryprojectdata/Country.txt' USING PigStorage(',') AS (country:chararray,landmass:int,zone:int,area_1ksqmtr:int,popoulation_million:int,language:int,religion:int,n_vbars:int,n_stripes:int,n_colors:int,redcolour:int,greencolour:int,bluecolour:int,goldcolour:int,whitecolour:int,blackcolour:int,orangecolour:int,mainhue:chararray,n_circles:int,n_upcrosses:int,n_digonalcrosses:int,n_quarteredsections:int,n_sunstars:int,crescent:int,triangle:int,icon:int,animate:int,text:int,topleftcolour:chararray,bottomrightcolour:chararray);
grpred = GROUP countryflags BY redcolour;
redcount = FOREACH grpred GENERATE SUM(countryflags.redcolour);
grpgreen = GROUP countryflags BY greencolour;
greencount = FOREACH grpgreen GENERATE SUM(countryflags.greencolour);
grpblue = GROUP countryflags BY bluecolour;
bluecount = FOREACH grpblue GENERATE SUM(countryflags.bluecolour);
Please help.
UNION the three relations, sort, and take the top record. This assumes you just want the name of the colour and the largest count.
grpred = GROUP countryflags BY redcolour;
redcount = FOREACH grpred GENERATE 'red' as name,SUM(countryflags.redcolour) as red_sum;
grpgreen = GROUP countryflags BY greencolour;
greencount = FOREACH grpgreen GENERATE 'green' as name,SUM(countryflags.greencolour) as green_sum;
grpblue = GROUP countryflags BY bluecolour;
bluecount = FOREACH grpblue GENERATE 'blue' as name,SUM(countryflags.bluecolour) as blue_sum;
A = UNION redcount,greencount,bluecount;
B = ORDER A BY $1 DESC; -- $1 is the sum; $0 is the colour name added above
C = LIMIT B 1;
DUMP C;
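The same "label each total, combine, take the largest" idea can be sketched in plain Python to sanity-check the logic. The flag rows below are hypothetical (1 means the colour is present), and note that on a tie max() returns the first entry, whereas ORDER plus LIMIT in Pig gives no such guarantee.

```python
# Hypothetical rows: (country, redcolour, greencolour, bluecolour)
flags = [
    ("CountryA", 1, 0, 1),
    ("CountryB", 1, 1, 0),
    ("CountryC", 1, 0, 0),
]

# One (name, total) pair per colour, like the three labelled relations
totals = [
    ("red", sum(row[1] for row in flags)),
    ("green", sum(row[2] for row in flags)),
    ("blue", sum(row[3] for row in flags)),
]

# Equivalent of UNION + ORDER BY total DESC + LIMIT 1
largest = max(totals, key=lambda pair: pair[1])
print(largest)
```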

Discarding nulls after full outer join in PIG

I need help discarding nulls in the result of a full outer join in Pig Latin. Below are the two data sets:
A:
(BOS,2)
(BUR,81)
(LAS,8)
B:
(BUR,56)
(EWR,2)
(LAS,88)
After full outer join :
C :
(BOS,2,,)
(BUR,81,BUR,56)
(,,EWR,2)
(LAS,8,LAS,88)
I need to get the output in below format :
(BOS,2)
(BUR,137)
(EWR,2)
(LAS,96)
I tried different combinations of GROUP BY, FLATTEN, BagToTuple ... but was not able to figure out the solution. Many thanks for your help.
airline = load '/demo/data/airline/airline.csv' using PigStorage(',') as (Origin: chararray, Dest: chararray);
traffic_in = GROUP airline by Origin;
traffic_in_count= FOREACH traffic_in generate group as Origin , COUNT(airline) as count ;
traffic_out = GROUP airline by Dest;
traffic_out_count = FOREACH traffic_out generate group as Dest ,COUNT (airline) as count;
traffic_top = JOIN traffic_in_count by Origin FULL OUTER , traffic_out_count by Dest ;
EDIT
Instead of using a FULL OUTER JOIN, use UNION and then SUM the values of the second column per key.
A = LOAD 'test1.txt' using PigStorage(',') as (A1:chararray, A2:int);
B = LOAD 'test2.txt' using PigStorage(',') as (B1:chararray, B2:int);
C = UNION A,B;
D = GROUP C BY $0;
E = FOREACH D GENERATE group,SUM(C.$1);
DUMP E;
This yields one (key, sum) tuple per key, for example (BUR,137) for the sample data above.
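The UNION, GROUP and SUM steps can be mirrored in plain Python using the two sample data sets from the question. This is a sketch of the logic only; the variable names are illustrative.

```python
from collections import defaultdict

# Sample data sets A and B from the question
a = [("BOS", 2), ("BUR", 81), ("LAS", 8)]
b = [("BUR", 56), ("EWR", 2), ("LAS", 88)]

# UNION: concatenate the rows; GROUP BY $0 + SUM($1): add counts per key
totals = defaultdict(int)
for key, count in a + b:
    totals[key] += count

result = sorted(totals.items())
print(result)
```

Because the rows are concatenated rather than joined, keys that exist on only one side never produce null columns in the first place.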

Aggregate values in Pig Latin

After performing multi-level filtering inside Pig, I get the below results -
(2343433,Argentina,2015,Sci-Fi)
(2343433,France,2015,Sci-Fi)
(2343433,Germany,2015,Sci-Fi)
(2343433,Netherlands,2015,Sci-Fi)
(2343433,Argentina,2015,Drama)
(2343433,France,2015,Drama)
(2343433,Germany,2015,Drama)
(2343433,Netherlands,2015,Drama)
(2343433,Argentina,2015,Family)
(2343433,France,2015,Family)
(2343433,Germany,2015,Family)
(2343433,Netherlands,2015,Family)
The column names are movie_id, country, year and genre respectively. I need to aggregate these results and produce something like this -
(2343433,France,2015,Sci-Fi,Drama,Family)
(2343433,Germany,2015,Sci-Fi,Drama,Family)
(2343433,Netherlands,2015,Sci-Fi,Drama,Family)
(2343433,Argentina,2015,Sci-Fi,Drama,Family)
Either that or something like this -
(2343433,France,Germany,Netherlands,Argentina,2015,Sci-Fi,Drama,Family)
Below is my code to get the above results -
A = LOAD '/user/a1.csv' USING PigStorage('|') as (movie_id,movie_name,prod_year);
B = LOAD '/user/a2.csv' USING PigStorage('|') as (g_movieid,genres);
C = LOAD '/user/a3.csv' USING PigStorage('|') as (c_movieid,country_released);
D = JOIN A by movie_id, B by g_movieid;
E = JOIN D by g_movieid, C by c_movieid;
F = FOREACH E GENERATE movie_id, country_released AS country, prod_year AS year, genres AS genre;
Any idea on how to achieve this using Pig?
Try this:
Dump F;
(2343433,Argentina,2015,Sci-Fi)
(2343433,France,2015,Sci-Fi)
(2343433,Germany,2015,Sci-Fi)
(2343433,Netherlands,2015,Sci-Fi)
(2343433,Argentina,2015,Drama)
(2343433,France,2015,Drama)
(2343433,Germany,2015,Drama)
(2343433,Netherlands,2015,Drama)
(2343433,Argentina,2015,Family)
(2343433,France,2015,Family)
(2343433,Germany,2015,Family)
(2343433,Netherlands,2015,Family)
G = GROUP F BY (movie_id, country, year);
H = foreach G generate FLATTEN(group) as (movie_id, country, year), $1.$3 AS (genre:{T:(value:chararray)});
I = foreach H generate movie_id, country, year, FLATTEN(BagToTuple(genre.value));
Dump I;
(2343433,France,2015,Sci-Fi,Drama,Family)
(2343433,Germany,2015,Sci-Fi,Drama,Family)
(2343433,Argentina,2015,Sci-Fi,Drama,Family)
(2343433,Netherlands,2015,Sci-Fi,Drama,Family)
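As a cross-check of the GROUP plus BagToTuple approach, the same aggregation can be sketched in plain Python on the sample rows. One assumption to be aware of: the Python dict keeps insertion order, so genres come out in input order, while Pig does not guarantee the order of tuples inside a bag.

```python
# Sample rows from the question: (movie_id, country, year, genre)
countries = ["Argentina", "France", "Germany", "Netherlands"]
genres = ["Sci-Fi", "Drama", "Family"]
rows = [(2343433, c, 2015, g) for g in genres for c in countries]

# GROUP BY (movie_id, country, year), collecting genres per group
grouped = {}
for movie_id, country, year, genre in rows:
    grouped.setdefault((movie_id, country, year), []).append(genre)

# FLATTEN(group) followed by the flattened genre values
result = [key + tuple(values) for key, values in grouped.items()]
for line in result:
    print(line)
```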

Equivalent of Union_map in pig

I have been trying to find the equivalent of union_map() in Pig. I know that the TOMAP function produces a MAP datatype,
but the requirement is to merge all the MAPs for a given ID, as shown below.
select I1,UNION_MAP(MAP(Key,Val)) as new_val group by I1;
Sample Input and result is provided below.
Input
ID,Key,Val
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
select ID,UNION_MAP(TO_MAP(Key,VAL)) from table group by ID;
Result
ID1,(K1#V7,K2#V4)
ID2,(K1#V2,K3#V3)
I would like to get similar output in Pig.
Download piggybank.jar from http://www.java2s.com/Code/Jar/p/Downloadpiggybankjar.htm, add it to your classpath, and try the approach below.
input
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
PigScript:
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input' USING PigStorage(',') AS (ID:chararray,Key:chararray,Val:chararray);
B = RANK A;
C = GROUP B BY (ID,Key);
D = FOREACH C {
sortByRank = ORDER B BY rank_A DESC;
top1 = LIMIT sortByRank 1;
GENERATE FLATTEN(top1);
}
E = GROUP D BY top1::ID;
F = FOREACH E {
ToMap = FOREACH D GENERATE TOMAP(top1::Key,top1::Val);
GENERATE group,BagToTuple(ToMap) AS myMap;
}
DUMP F;
Output:
(ID1,([K1#V7],[K2#V4]))
(ID2,([K1#V2],[K3#V3]))
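The script above keeps, for each (ID, Key) pair, the row with the highest rank, i.e. the last occurrence in the input. A plain Python sketch of that "last value wins" merge on the same sample input looks like this; it only illustrates the semantics and is not a Pig UDF.

```python
# Sample input from the question: (ID, Key, Val)
rows = [
    ("ID1", "K1", "V1"),
    ("ID2", "K1", "V2"),
    ("ID2", "K3", "V3"),
    ("ID1", "K2", "V4"),
    ("ID1", "K1", "V7"),
]

# One map per ID; a later row with the same key overwrites the earlier one
merged = {}
for row_id, key, val in rows:
    merged.setdefault(row_id, {})[key] = val

for row_id, mapping in merged.items():
    print(row_id, mapping)
```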

To find maximum occurrence names in a list of tuples in PIG

I have a file as:
1,Mary,5
1,Tom,5
2,Bill,5
2,Sue,4
2,Theo,5
3,Mary,5
3,Cindy,5
4,Andrew,4
4,Katie,4
4,Scott,5
5,Jeff,3
5,Sara,4
5,Ryan,5
6,Bob,5
6,Autumn,4
7,Betty,5
7,Janet,5
7,Scott,5
8,Andrew,4
8,Katie,4
8,Scott,5
9,Mary,5
9,Tom,5
10,Bill,5
10,Sue,4
10,Theo,5
11,Mary,5
11,Cindy,5
12,Andrew,4
12,Katie,4
12,Scott,5
13,Jeff,3
13,Sara,4
13,Ryan,5
14,Bob,5
14,Autumn,4
15,Betty,5
15,Janet,5
15,Scott,5
16,Andrew,4
16,Katie,4
16,Scott,5
I want the name that appears most often, i.e. the one with the maximum count:
(Scott,6)
Your question is ambiguous.
Do you want a list of names with their counts in descending order, or just (Scott,6), i.e. only the single name with the maximum count?
Both cases are solved below, using the sample data you gave.
For the first case:
a = load '/file.txt' using PigStorage(',') as (id:int,name:chararray,number:int);
g = group a by name;
g1 = foreach g{
generate group as g , COUNT(a) as cnt;
};
toptemp = group g1 all;
final = foreach toptemp{
sorted = order g1 by cnt desc;
GENERATE flatten(sorted);
};
This gives you the list of names with their counts in descending order:
(Scott,6)
(Katie,4)
(Andrew,4)
(Mary,4)
(Bob,2)
(Sue,2)
(Tom,2)
(Bill,2)
(Jeff,2)
(Ryan,2)
(Sara,2)
(Theo,2)
(Betty,2)
(Cindy,2)
(Janet,2)
(Autumn,2)
For the second case:
a = load '/file.txt' using PigStorage(',') as (id:int,name:chararray,number:int);
g = group a by name;
g1 = foreach g{
generate group as g , COUNT(a) as cnt;
};
toptemp = group g1 all;
final = foreach toptemp{
sorted = order g1 by cnt desc;
top = limit sorted 1;
GENERATE flatten(top);
};
This gives only one result:
(Scott,6)
I hope it helps.
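To verify the expected counts without running Pig, the group-and-count step can be sketched in plain Python with collections.Counter. The sample file repeats the same eight groups of names twice (ids 9 to 16 match ids 1 to 8), so the sketch builds the list that way.

```python
from collections import Counter

# Names for ids 1..8 from the sample file; ids 9..16 repeat the same groups
names_by_id = [
    ["Mary", "Tom"],
    ["Bill", "Sue", "Theo"],
    ["Mary", "Cindy"],
    ["Andrew", "Katie", "Scott"],
    ["Jeff", "Sara", "Ryan"],
    ["Bob", "Autumn"],
    ["Betty", "Janet", "Scott"],
    ["Andrew", "Katie", "Scott"],
]
names = [name for group in names_by_id * 2 for name in group]

# GROUP BY name + COUNT
counts = Counter(names)

# ORDER BY cnt DESC + LIMIT 1
top_name, top_count = counts.most_common(1)[0]
print(top_name, top_count)
```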
