Perform count on distinct values of a bag in Pig - hadoop

I have a question about Pig when performing what seems like two levels of grouping. As an example, say I have input data like this:
email_id:chararray from:chararray to:bag{recipients:tuple(recipient:chararray)}
e1 user1#example.com {(friend1#example.com),(friend2#example.com),(friend3#myusers.com)}
e2 user1#example.com {(friend1#example.com),(friend4#example.com)}
e3 user1#example.com {(friend5#example.com)}
e4 user2#example.com {(friend2#example.com),(friend4#example.com)}
So each line is an email from a user "from" to user(s) "to".
And I ultimately want a list of all senders and all the people they've sent emails to, including the # of emails sent for each person, sorted from highest to lowest, for example:
user1#example.com {(friend1#example.com, 2), (friend2#example.com, 1), (friend3#myusers.com, 1), (friend4#example.com, 1), (friend5#example.com, 1)}
user2#example.com {(friend2#example.com, 1), (friend4#example.com, 1)}
Ideas on the best way to tackle this in Pig would be appreciated!

Here is one version of the script:
inpt = load '/pig_data/pig_fun/input/from_senders.txt' as (email_id:chararray, from:chararray, to:bag{recipients:tuple(recipient:chararray)});
pivot = foreach inpt generate from, FLATTEN(to);
pivot = foreach pivot generate from, to::recipient as recipient;
dump pivot;
/*
(user1#example.com,friend1#example.com)
(user1#example.com,friend2#example.com)
(user1#example.com,friend3#myusers.com)
(user1#example.com,friend1#example.com)
(user1#example.com,friend4#example.com)
(user1#example.com,friend5#example.com)
(user2#example.com,friend2#example.com)
(user2#example.com,friend4#example.com)
*/
grp = group pivot by (from, recipient);
with_count = foreach grp generate FLATTEN(group), COUNT(pivot) as count;
dump with_count;
/*
(user1#example.com,friend1#example.com,2)
(user1#example.com,friend2#example.com,1)
(user1#example.com,friend3#myusers.com,1)
(user1#example.com,friend4#example.com,1)
(user1#example.com,friend5#example.com,1)
(user2#example.com,friend2#example.com,1)
(user2#example.com,friend4#example.com,1)
*/
to_bag = group with_count by from;
result = foreach to_bag {
order_by_count = order with_count by count desc;
generate group as from, order_by_count.(recipient, count);
};
dump result;
/*
(user1#example.com,{(friend1#example.com,2),(friend2#example.com,1),(friend3#myusers.com,1),(friend4#example.com,1),(friend5#example.com,1)})
(user2#example.com,{(friend2#example.com,1),(friend4#example.com,1)})
*/
Hope it helps.
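To sanity-check the logic outside Pig, the same three steps (flatten, count per (from, recipient) pair, regroup per sender sorted by count) can be modeled in plain Python. This is only an illustrative sketch, not part of the Pig script: the `emails` list and the function name `recipient_counts` are made up for the example, and ties are broken alphabetically here, which Pig's ORDER does not promise.

```python
from collections import Counter, defaultdict

# Sample rows from the question: (email_id, from, to-bag)
emails = [
    ("e1", "user1#example.com", ["friend1#example.com", "friend2#example.com", "friend3#myusers.com"]),
    ("e2", "user1#example.com", ["friend1#example.com", "friend4#example.com"]),
    ("e3", "user1#example.com", ["friend5#example.com"]),
    ("e4", "user2#example.com", ["friend2#example.com", "friend4#example.com"]),
]

def recipient_counts(rows):
    # FLATTEN(to), then GROUP BY (from, recipient) and COUNT
    pair_counts = Counter((sender, rcpt) for _, sender, to in rows for rcpt in to)
    # GROUP BY from
    by_sender = defaultdict(list)
    for (sender, rcpt), n in pair_counts.items():
        by_sender[sender].append((rcpt, n))
    # ORDER BY count DESC (ties broken by recipient name to keep output deterministic)
    return {s: sorted(v, key=lambda t: (-t[1], t[0])) for s, v in by_sender.items()}

result = recipient_counts(emails)
for sender, recipients in result.items():
    print(sender, recipients)
```

The intermediate `pair_counts` corresponds to `with_count` in the script, and the final dictionary corresponds to `result`.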

Related

Removing duplicates using PigLatin and retaining the last element

I am using Pig Latin. I want to remove duplicates from the bags and keep only the last element for each key.
Input:
User1 7 LA
User1 8 NYC
User1 9 NYC
User2 3 NYC
User2 4 DC
Output:
User1 9 NYC
User2 4 DC
Here the first field is the key, and I want the last record for each key to be retained in the output.
I know how to retain the first element (see below), but I am not able to retain the last one.
inpt = load '......' ......;
user_grp = GROUP inpt BY $0;
filtered = FOREACH user_grp {
top_rec = LIMIT inpt 1;
GENERATE FLATTEN(top_rec);
};
Can anybody help me on this? Thanks in advance!
@Anil: If you order by one of the fields in descending order, you will be able to get the last record. In the code below, I have ordered by the second field of the input (named "no" in the script).
Input :
User1,7,LA
User1,8,NYC
User1,9,NYC
User2,3,NYC
User2,4,DC
Pig snippet :
user_details = LOAD 'user_details.csv' USING PigStorage(',') AS (user_name:chararray,no:long,city:chararray);
user_details_grp_user = GROUP user_details BY user_name;
required_user_details = FOREACH user_details_grp_user {
user_details_sorted_by_no = ORDER user_details BY no DESC;
top_record = LIMIT user_details_sorted_by_no 1;
GENERATE FLATTEN(top_record);
};
Output (DUMP required_user_details):
(User1,9,NYC)
(User2,4,DC)
You can also use the RANK operator.
Hope the code below helps.
rec = LOAD '/user/cloudera/inputfiles/sample.txt' USING PigStorage(',') AS(user:chararray,no:int,loc:chararray);
rec_rank = rank rec;
rec_rank_each = FOREACH rec_rank GENERATE $0 as rank_key, user, no, loc;
rec_rank_grp = GROUP rec_rank_each by user;
rec_rank_max = FOREACH rec_rank_grp GENERATE group as temp_user, MAX(rec_rank_each.rank_key) as max_rank;
rec_join = JOIN rec_rank_each BY (user,rank_key), rec_rank_max BY (temp_user,max_rank);
rec_output = FOREACH rec_join GENERATE user,no,loc;
dump rec_output;
Make sure you run this on Pig 0.11 or later, since the RANK operator was introduced in Pig 0.11.
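Note that the two answers define "last" differently: ORDER ... DESC with LIMIT 1 keeps the row with the largest "no" per user, while the RANK-based join keeps the row with the highest rank per user, which matches input position only as far as the rank numbering follows input order. A small Python model of both readings, purely as a sketch (the function names are made up for illustration):

```python
# Rows from the question: (user, no, city), in input order
rows = [
    ("User1", 7, "LA"),
    ("User1", 8, "NYC"),
    ("User1", 9, "NYC"),
    ("User2", 3, "NYC"),
    ("User2", 4, "DC"),
]

def last_by_position(records):
    # Later rows overwrite earlier ones, so the last row seen per key wins
    last = {}
    for rec in records:
        last[rec[0]] = rec
    return list(last.values())

def last_by_max_no(records):
    # Keep the row with the largest `no` per key
    best = {}
    for rec in records:
        if rec[0] not in best or rec[1] > best[rec[0]][1]:
            best[rec[0]] = rec
    return list(best.values())

print(last_by_position(rows))
print(last_by_max_no(rows))
```

On this sample both readings agree, because "no" increases with position inside each user. They would differ if the input were not already sorted by "no".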

Pig - MAX is not working after grouping

I am working with Pig 0.12.1 and MapR. I am trying to find the max of a field after grouping the relation on another field. See the following Pig script, with the structure of each relation in the comments:
r1 = foreach SomeRelation generate flatten(group) as (c1 , c2);
-- r1: {c1: biginteger,c2: biginteger}
r2 = group r1 by c1;
-- r2: {group: chararray,r1: {(c1: chararray,c2: biginteger)}}
DUMP r2;
/* output -
1234|{(1234,9876)}
2345|{(2345,8765)}
3456|{(3456,7654)}
4567|{(4567,6543)}
*/
r3 = foreach r2 generate group as c1, MAX(r1.c2) as c2;
I am getting the following error
Could not infer the matching function for org.apache.pig.builtin.MAX as multiple or none of them fit. Please use an explicit cast.
Script Explained-
I am flattening group of SomeRelation into c1, c2 and then regrouping
on c1 to generate max of c2 with each c1 group.
Please suggest.
I'm not sure you can use the group keyword inside flatten. Also, have you considered tokenizing the group before flattening it? See this for example:
load_data = LOAD '/PIG_TESTS_ALL/WordCount' as (line);
tokenizing_data = FOREACH load_data generate flatten(TOKENIZE(line)) as word;
group_data = GROUP tokenizing_data by word;
Result = FOREACH group_data generate group,COUNT(tokenizing_data);
dump Result;
This is actually a word count, but you can probably build on it to find the max value, depending on what you want to do.
Well, it looks like the problem is that Pig doesn't allow MAX (or, for that matter, other aggregate functions such as SUM) on biginteger. I had to use long as the datatype for this to work:
r1 = foreach SomeRelation generate flatten(group) as (c1 , c2:long);
-- r1: {c1: biginteger,c2: long}
Strangely, the documentation doesn't highlight this limitation for the biginteger and bigdecimal datatypes.
We now know the problem was the inability of MAX to handle biginteger.
You should be able to group and get the max like this, and compare the result with a combination of ORDER + LIMIT:
r1 = FOREACH SomeRelation GENERATE FLATTEN(group) AS (c1, c2);
r3 = FOREACH (group r1 by c1) {
-- you may want to apply a function on a single column
-- or compare sort + limit to MAX
list = ORDER $1 BY c2 DESC;
list_max = LIMIT list 1;
GENERATE group AS c1, MAX(r1.c2) AS c2, list_max;
};

SUM function in Pig script

I am a student learning how to write Pig scripts using the Hortonworks sandbox. My problem is that I am not able to use the SUM function properly. I have successfully separated the fields of a firewall log, and I am able to perform several queries and use the COUNT function, but I have had no luck with the SUM function, which I really need in one case. This is the code I used:
A = FOREACH logs_base GENERATE device_id,src,src_port,dst,dst_port,tran_ip,tran_port,service,duration,sent,rcvd,sent_pkt,rcvd_pkt,SN,user,group1, REGEX_EXTRACT(date, '\\d{3}-(\\d{2})-\\d{2}', 1) AS(month:chararray);
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
counter = foreach grpd1 {
sum1 = SUM(A.rcvd);
sum2 = SUM(A.sent);
generate sum1, sum2;
};
dump counter;
C = foreach F1 generate rcvd, sent;
dump C;
When I dump just the variable C I get a result displaying many records indicating the amount of data received/sent for the filter applied. eg:
(223,123)
(334,444)
(21,12344)
(...,...)
All I really want to do is add all those records together and show that total amount of received and sent: (?,?).
Note: I have tried changing the variable type to int, long, and chararray with no success either.
Some of the errors I am getting while trying to solve this are:
Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
First make sure that the fields that you are summing up are of type int
Use - DESCRIBE A; to check the data type
After that, I think since you have used filter condition and then used group by on F1 -
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
So, while summing up you should use F1 instead of A -
counter = foreach grpd1 {
sum1 = SUM(F1.rcvd);
sum2 = SUM(F1.sent);
generate sum1, sum2;
};
Use DESCRIBE grpd1; and you will see what I mean: there is no 'A' in it.
I guess this should solve the error. Finally, check that the logic gives the result you want; I have not checked that. Hope this helps.
PS - I am also a student and new to PIG.
A lucky guess here, I'm new to Pig too :)
I'm not sure SUM can work on a chararray (that would explain the error), so make rcvd and sent type int and then generate the two sums from the grpd1 bag:
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
C1 = foreach grpd1 generate SUM(F1.rcvd);
dump C1;
C2 = foreach grpd1 generate SUM(F1.sent);
dump C2;
Hope I helped a little!
Please try the following
A = FOREACH logs_base GENERATE device_id,src,src_port,dst,dst_port,tran_ip,tran_port,service,duration,sent,rcvd,sent_pkt,rcvd_pkt,SN,user,group1, REGEX_EXTRACT(date, '\\d{3}-(\\d{2})-\\d{2}', 1) AS(month:chararray);
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
C = foreach grpd1 generate group, SUM(F1.rcvd), SUM(F1.sent);
dump C;
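The point the answers keep coming back to is that after GROUP F1 BY user, each row holds the group key plus a bag named F1, so the sums have to run over that bag. A plain-Python model of that shape, as a sketch only; the three rows reuse the (rcvd, sent) pairs shown in the question, and attaching them to user 'PR11MS1120' is an assumption based on the filter:

```python
# Hypothetical (user, rcvd, sent) rows standing in for the filtered relation F1
F1 = [
    ("PR11MS1120", 223, 123),
    ("PR11MS1120", 334, 444),
    ("PR11MS1120", 21, 12344),
]

# GROUP F1 BY user -> one entry per user: (group, bag named F1)
grpd1 = {}
for row in F1:
    grpd1.setdefault(row[0], []).append(row)

# FOREACH grpd1 GENERATE group, SUM(F1.rcvd), SUM(F1.sent)
counter = [
    (user, sum(r[1] for r in bag), sum(r[2] for r in bag))
    for user, bag in grpd1.items()
]
print(counter)
```

Because the filter leaves a single user, the result is a single row holding the two totals.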

To find maximum occurrence names in a list of tuple in PIG

I have a file as:
1,Mary,5
1,Tom,5
2,Bill,5
2,Sue,4
2,Theo,5
3,Mary,5
3,Cindy,5
4,Andrew,4
4,Katie,4
4,Scott,5
5,Jeff,3
5,Sara,4
5,Ryan,5
6,Bob,5
6,Autumn,4
7,Betty,5
7,Janet,5
7,Scott,5
8,Andrew,4
8,Katie,4
8,Scott,5
9,Mary,5
9,Tom,5
10,Bill,5
10,Sue,4
10,Theo,5
11,Mary,5
11,Cindy,5
12,Andrew,4
12,Katie,4
12,Scott,5
13,Jeff,3
13,Sara,4
13,Ryan,5
14,Bob,5
14,Autumn,4
15,Betty,5
15,Janet,5
15,Scott,5
16,Andrew,4
16,Katie,4
16,Scott,5
I want the name that appears most often, i.e. the max:
(Scott,6)
There's some ambiguity in your question.
What exactly do you want?
Do you want a list of name counts in descending order?
OR
Do you want just (Scott,6), i.e. only the one name with the maximum count?
I have solved both on the sample data you gave.
If it is the first, then:
a = load '/file.txt' using PigStorage(',') as (id:int,name:chararray,number:int);
g = group a by name;
g1 = foreach g{
generate group as g , COUNT(a) as cnt;
};
toptemp = group g1 all;
final = foreach toptemp{
sorted = order g1 by cnt desc;
GENERATE flatten(sorted);
};
This gives you the list of names in descending order of count:
(Scott,6)
(Katie,4)
(Andrew,4)
(Mary,4)
(Bob,2)
(Sue,2)
(Tom,2)
(Bill,2)
(Jeff,2)
(Ryan,2)
(Sara,2)
(Theo,2)
(Betty,2)
(Cindy,2)
(Janet,2)
(Autumn,2)
If it is the second, then:
a = load '/file.txt' using PigStorage(',') as (id:int,name:chararray,number:int);
g = group a by name;
g1 = foreach g{
generate group as g , COUNT(a) as cnt;
};
toptemp = group g1 all;
final = foreach toptemp{
sorted = order g1 by cnt desc;
top = limit sorted 1;
GENERATE flatten(top);
};
This gives only one result:
(Scott,6)
Thanks. I hope it helps.
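If you want to double-check the expected counts without running Pig, the same group, count, order, limit pipeline can be modeled in Python. This is just a sketch: the `base` list is my own compact rebuild of the sample file (ids 9-16 repeat the names of ids 1-8), and the order among names with equal counts is not meaningful.

```python
from collections import Counter

# Rebuild the sample file: ids 1-8 and 9-16 carry the same names
base = [
    ["Mary", "Tom"],
    ["Bill", "Sue", "Theo"],
    ["Mary", "Cindy"],
    ["Andrew", "Katie", "Scott"],
    ["Jeff", "Sara", "Ryan"],
    ["Bob", "Autumn"],
    ["Betty", "Janet", "Scott"],
    ["Andrew", "Katie", "Scott"],
]
names = [name for group in base * 2 for name in group]

counts = Counter(names)        # GROUP BY name, then COUNT
ranked = counts.most_common()  # ORDER BY cnt DESC
top = ranked[0]                # LIMIT 1
print(top)
```

Here `ranked` plays the role of the first script's output and `top` the role of the second.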

How to handle spill memory in pig

My code looks like this:
pymt = LOAD 'pymt' USING PigStorage('|') AS ($pymt_schema);
pymt_grp = GROUP pymt BY key;
results = FOREACH pymt_grp {
/*
* some kind of logic, filter, count, distinct, sum, etc.
*/
};
But now I find many log entries like this:
org.apache.pig.impl.util.SpillableMemoryManager: Spilled an estimate of 207012796 bytes from 1 objects. init = 5439488(5312K) used = 424200488(414258K) committed = 559284224(546176K) max = 559284224(546176K)
I have found the cause: the main reason is a "hot" key, something like key=0 as an IP address, but I don't want to filter this key out. Is there any solution? I have implemented the Algebraic and Accumulator interfaces in my UDF.
I had similar issues with heavily skewed data and with DISTINCT nested in FOREACH (as Pig will do an in-memory distinct). The solution was to take the DISTINCT out of the FOREACH; as an example, see my answer to How to optimize a group by statement in PIG latin?
If you do not want to do DISTINCT before your SUM and COUNT, then I would suggest using two GROUP BYs. The first one groups on the Key column plus another column or a random number mod 100, which acts as a salt (to spread the data of a single key across multiple reducers). The second GROUP BY is on the Key column only and calculates the final SUM of the COUNT or SUM values from the first group.
Ex:
inpt = load '/data.csv' using PigStorage(',') as (Key, Value);
view = foreach inpt generate Key, Value, ((int)(RANDOM() * 100)) as Salt;
group_1 = group view by (Key, Salt);
group_1_count = foreach group_1 generate group.Key as Key, COUNT(view) as count;
group_2 = group group_1_count by Key;
final_count = foreach group_2 generate group as Key, SUM(group_1_count.count) as count;
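To see why the salted two-stage grouping gives the same totals while splitting the hot key, here is a small Python model. It is a sketch under assumptions: the function name `salted_count`, the fixed seed, and the skewed sample keys are all made up for illustration, and real reducer placement in Pig depends on the partitioner.

```python
import random
from collections import Counter

def salted_count(keys, buckets=100, seed=0):
    rng = random.Random(seed)
    # Stage 1: GROUP BY (Key, Salt) and COUNT; the salt spreads a hot key over many groups
    partial = Counter((key, rng.randrange(buckets)) for key in keys)
    # Stage 2: GROUP BY Key and SUM the partial counts
    total = Counter()
    for (key, _salt), n in partial.items():
        total[key] += n
    return partial, total

# Hypothetical skewed input: key "0" is the hot key
keys = ["0"] * 10_000 + ["10.0.0.1"] * 5 + ["10.0.0.2"] * 3
partial, total = salted_count(keys)
print(total["0"], total["10.0.0.1"], total["10.0.0.2"])
```

The largest stage-1 group for the hot key is only a fraction of its rows, which is what keeps any single reducer's bag small. This trick applies directly to decomposable aggregates such as COUNT and SUM; a distinct count needs a different approach.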

Resources