Pig: Pulling individual fields out after a GROUP - hadoop

In Pig Latin, I want to pull the other fields out of a record that was selected by an aggregate such as MAX.
I'm having trouble explaining the problem, so here is an example. Let's say I want to grab the name of the oldest person in a household:
Relation A has four columns: (name, address, zipcode, age)
B = GROUP A BY (address, zipcode); -- group by the address
-- generate the address and the oldest person's age, but how do I grab that person's name?
C = FOREACH B GENERATE FLATTEN(group), MAX(A.age), ??? Name ???;
How do I generate the name of the person with the MAX age?

The problem with your logic is that more than one person can have the MAX(age). In that case you would have to GROUP BY (name, address, age). As a quick answer, though, here is a version that returns only one of the people with the maximum age (I am not sure it is the optimal way):
C = FOREACH B {
    DA = ORDER A BY age DESC;
    DB = LIMIT DA 1;
    GENERATE FLATTEN(group), FLATTEN(DB.age), FLATTEN(DB.name);
}
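For readers who want to see what the nested FOREACH does to each group, here is a minimal plain-Python sketch of the same idea: group by (address, zipcode), then keep the row with the greatest age. It is not Pig code; the function name oldest_per_household and the sample rows are made up for illustration.

```python
from collections import defaultdict

def oldest_per_household(rows):
    """rows: iterable of (name, address, zipcode, age) tuples.

    Returns one (address, zipcode, age, name) tuple per household,
    which is roughly what ORDER BY age DESC plus LIMIT 1 yields per group.
    """
    groups = defaultdict(list)
    for name, address, zipcode, age in rows:
        groups[(address, zipcode)].append((name, age))

    result = []
    for (address, zipcode), members in sorted(groups.items()):
        # max() keeps the first maximal element it sees, so ties follow
        # input order here; Pig gives no such guarantee for ties.
        name, age = max(members, key=lambda m: m[1])
        result.append((address, zipcode, age, name))
    return result

# Hypothetical sample data
rows = [
    ("Ann", "1 Main St", "12345", 70),
    ("Bob", "1 Main St", "12345", 42),
    ("Cid", "9 Oak Ave", "54321", 33),
]
top = oldest_per_household(rows)
```

Each household contributes exactly one row, and when two people share the maximum age only one of them is returned.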

Be careful with frail's answer (the accepted one): it has undesirable behavior if the number in the LIMIT command is higher than 1. In that case the output is a cross-product between all ages and all names, because of the last two FLATTEN calls. If the value in the LIMIT is N, there would be N^2 output rows instead of the intended N.
It is much safer to use the following GENERATE line, which gives exactly the same result as the accepted answer when 'LIMIT 1' is used:
GENERATE FLATTEN(group) AS (address, zipcode), FLATTEN(DB.(age, name)) AS (age, name);
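To see where the N^2 rows come from, here is a small plain-Python sketch that imitates the two variants for LIMIT 2. The contents of db are hypothetical; itertools.product stands in for the cross-product that two independently flattened bags produce.

```python
from itertools import product

# Hypothetical contents of DB after ORDER ... LIMIT 2: (name, age) tuples
db = [("Ann", 70), ("Bob", 42)]

# Like FLATTEN(DB.age), FLATTEN(DB.name): two bags flattened independently,
# so every age is paired with every name.
ages = [age for _, age in db]
names = [name for name, _ in db]
separate = list(product(ages, names))

# Like FLATTEN(DB.(age, name)): one bag of (age, name) tuples,
# so each age stays attached to its own name.
together = [(age, name) for name, age in db]
```

Here separate has four rows, including the wrong pairing (70, "Bob"), while together has the intended two.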

Related

Wrong sorting while using Query function

I've been trying to build a report on the number of breakdowns of products in our company. The QUERY function runs without errors, but the sort order is a bit strange.
The data I'm trying to sort is as follows (quantities are blacked out since I cannot share that information):
Raw data
First column: name of the product; second: its EAN code; third: breakdown rate for last year; last column: average breakdown rate. "b/d" means "brak danych", or no data.
What I want to achieve is a final table with values sorted by average breakdown rate.
My query is as follows:
=query(Robocze!A2:D;"select A where A is not null and NOT D contains 'b/d' order by D desc")
Final result
As you can see, the order is descending, but there are strange artifacts, like 33,33% appearing after 4,00% and before 3,92%.
Why is that?
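The values in column D are most likely being compared as text rather than as numbers, which would explain the out-of-place rows. Multiplying the column by 1 converts it to numbers before sorting. Try: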
=INDEX(LAMBDA(x; SORT(x; INDEX(x;; 4)*1; 0))
(QUERY(Robocze!A2:D; "where A is not null and NOT D contains 'b/d'"; 0));; 4)
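The effect can be reproduced outside Sheets. This is a small Python sketch, assuming the percentages are being compared as text; the helper parse_percent and the value 12,50% are made up for illustration, the other three values come from the question.

```python
def parse_percent(text):
    """Convert a string such as '33,33%' or '33.33%' to a float (33.33)."""
    return float(text.strip().rstrip("%").replace(",", "."))

values = ["3,92%", "33,33%", "4,00%", "12,50%"]

# Text comparison goes character by character, so "4..." beats "33...".
as_text = sorted(values, reverse=True)

# Numeric comparison gives the order the report actually needs.
as_numbers = sorted(values, key=parse_percent, reverse=True)
```

Sorted as text, 33,33% lands between 4,00% and 3,92%, the same artifact as in the question; sorted as numbers it comes first.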

Find Maximum Columns in a grouped row. [using PIG]

I have to find the maximum number of posts created by one person in a given data set, where each post comes with user id, display name, age, comments count, view count, date, score and title.
To get the maximum number of posts, I think we can group by user id. After grouping, I need to find the id whose group has the largest number of entries. I don't understand how to solve that latter part. Please help.
Based on my understanding of your question, try this code, which counts the posts per user:
a = load '<path>' using PigStorage(',') as (userId,displayName,age,commentsCount,viewCount,date,score,title);
b = group a by userId;
c = foreach b generate group,COUNT(a.title);
dump c;
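The script above stops at the per-user counts; the remaining step is to pick the largest one (in Pig that would typically be an ORDER on the count in descending order followed by a LIMIT 1). A minimal Python sketch of both steps, with made-up (userId, title) pairs:

```python
from collections import Counter

# Hypothetical (userId, title) pairs standing in for the loaded relation
posts = [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (1, "e"), (2, "f")]

# Equivalent of GROUP BY userId followed by COUNT
counts = Counter(user_id for user_id, _ in posts)

# Equivalent of sorting by the count in descending order and keeping one row
top_user, top_count = counts.most_common(1)[0]
```

With this sample data, user 1 has the most posts (3).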

Pig - how to select only some values from the list (not just simple distinct)?

Let's say I have intput_file.txt (user_id, event_code, event_date):
1,a,1
1,b,2
2,a,3
2,b,4
2,b,5
2,b,6
2,c,7
2,b,8
As you can see, user_id = 2 has events like this: abbbcb
I'd like to have a result like this:
1,{(a,1),(b,2)}
2,{(a,2),(b,6),(c,7),(b,8)}
So when there are several events with the same code, I'd like to keep only the last one.
Can you please share any hints?
Regards
Pawel
The main thing you are describing is what GROUP BY does.
In this case:
B = GROUP A BY user_id;
This gets your records together by user_id. Your data will now look like this (each bag holds the full original tuples, in no guaranteed order):
1,{(1,a,1),(1,b,2)}
2,{(2,a,3),(2,b,4),(2,b,5),(2,b,6),(2,c,7),(2,b,8)}
You say you only want the last one (I assume you mean the one with the greatest event_date). To do this, you can do a nested FOREACH with an ORDER BY to sort by date, and then take the first one with LIMIT. Note that this has arbitrary behavior when there are ties.
C = FOREACH B {
    DA = ORDER A BY event_date DESC;
    DB = LIMIT DA 1;
    GENERATE FLATTEN(group), FLATTEN(DB.event_code), FLATTEN(DB.event_date);
}
Your data should now look like this:
1,b,2
2,b,8
Another option would be to use a UDF to write some custom behavior on the groups given by GROUP BY:
B = GROUP A BY user_id;
C = FOREACH B GENERATE YourUDFThatYouBuilt(group, A);
In that UDF you'd write whatever custom behavior you want (in this case, return the tuple with the greatest date).
It seems like you could use the DistinctBy UDF from Apache DataFu to achieve this. This UDF, given a bag, returns the first instance found for a given field. In your case the field you care about is event_code. But we have to reverse the order, as you actually want the last instance.
One clarification though. Correct me if I'm wrong, but I think the intended output is:
1,{(a,1),(b,2)}
2,{(a,3),(b,6),(c,7),(b,8)}
That is, the a event for user 2 has date 3. The input contains no (a,2) event; the (a,1) event belongs to user 1.
Here's how you can do it:
-- pass in 1 because we want distinct by event code (position 1)
define DistinctBy datafu.pig.bags.DistinctBy('1');
FOREACH (GROUP A BY user_id) {
    -- reverse so we can take the last event code occurrence
    A_reversed = ORDER A BY event_date DESC;
    -- use DistinctBy to get the first tuple having an occurrence of a field value
    A_distinct_by_code = DistinctBy(A_reversed);
    -- put back in order again
    A_ordered = ORDER A_distinct_by_code BY event_date ASC;
    GENERATE group as user_id, A_ordered.(event_code,event_date);
}
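It may help to compare two readings of "take only the last one" on user 2's events. This plain-Python sketch is not DataFu; latest_per_code imitates the reverse-then-distinct steps above under the assumption that the first instance per code is kept, and last_of_each_run is a hypothetical alternative that keeps the last event of every run of consecutive equal codes.

```python
from itertools import groupby

def latest_per_code(events):
    """Keep only the latest event for each distinct code."""
    seen = set()
    kept = []
    # Walk from newest to oldest and keep the first event seen per code.
    for code, date in sorted(events, key=lambda e: e[1], reverse=True):
        if code not in seen:
            seen.add(code)
            kept.append((code, date))
    # Put the survivors back in ascending date order.
    return sorted(kept, key=lambda e: e[1])

def last_of_each_run(events):
    """Keep the last event of every run of consecutive equal codes."""
    ordered = sorted(events, key=lambda e: e[1])
    return [list(run)[-1] for _, run in groupby(ordered, key=lambda e: e[0])]

# (event_code, event_date) pairs for user_id = 2, taken from the question
user2 = [("a", 3), ("b", 4), ("b", 5), ("b", 6), ("c", 7), ("b", 8)]
distinct_result = latest_per_code(user2)
run_result = last_of_each_run(user2)
```

In this sketch the distinct-by-code reading keeps a single b event, (b,8), while the run-based reading keeps both (b,6) and (b,8), which is what the sample output in the question shows. Which one is wanted depends on what the asker meant.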

how to create set of values, after group function in Pig (Hadoop)

Let's say I have a set of values in file.txt
a,b,c
a,b,d
k,l,m
k,l,n
k,l,o
And my code is:
file = LOAD 'file.txt' using PigStorage(',');
events = foreach file generate session_id, user_id, code, type;
gr = group events by (session_id, user_id);
and I get this set of values:
((a,b),{(a,b,c),(a,b,d)})
((k,l),{(k,l,m),(k,l,n),(k,l,o)})
And I'd like to have:
(a,b,(c,d))
(k,l,(m,n,o))
Have you got any idea how to do it?
Regards
Pawel
Note: your question is inconsistent. The FOREACH line refers to session_id, user_id, code and type, but your LOAD with PigStorage does not declare a schema that provides those names. Also, that FOREACH has 4 fields, while your sample data only has 3. I'll assume that type doesn't exist in order to answer your question.
After your gr relation, the group-by key (in this case (session_id, user_id)) is held in an automatically generated tuple called group.
So, first step: gr2 = FOREACH gr GENERATE FLATTEN(group);
This will give you the tuples (a,b) and (k,l). You need to use FLATTEN because group is a tuple and you are asking for session_id and user_id to be individual columns. FLATTEN does that for you.
Ok, so now modify the gr2 line to also use a projection to tease out the third value:
gr2 = FOREACH gr GENERATE FLATTEN(group), events.code;
events.code creates a bag out of all the code values. events is the name of the bag of grouped tuples (it's named after the original relation).
This should give you:
(a,b,{(c),(d)})
(k,l,{(m),(n),(o)})
It's very important to note that the values are in a bag, not in a tuple as you asked for. Keeping them in a bag is the right idea because a bag is a variable-length collection, while a tuple is not.
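If it helps to see the same shape outside Pig, here is a minimal Python sketch of the group-then-project step on the sample data. A Python list stands in for the bag; unlike a Pig bag, it keeps input order.

```python
from collections import defaultdict

# Sample rows from the question: (session_id, user_id, code)
rows = [
    ("a", "b", "c"),
    ("a", "b", "d"),
    ("k", "l", "m"),
    ("k", "l", "n"),
    ("k", "l", "o"),
]

# Equivalent of GROUP events BY (session_id, user_id)
grouped = defaultdict(list)
for session_id, user_id, code in rows:
    # Equivalent of projecting events.code inside each group
    grouped[(session_id, user_id)].append(code)

# Equivalent of FLATTEN(group) followed by the collected codes
result = [(s, u, codes) for (s, u), codes in grouped.items()]
```

Each group key is spread into two columns and the third column is a variable-length collection of codes.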
Additional advice: Understanding how GROUP BY outputs data is something I see a lot of people struggle with when first using Pig. If you think my answer doesn't make much sense, I'd recommend spending some time to really get to understand GROUP BY. Understanding versus thinking it is magic will pay off in the long run.

Efficient algorithm that takes a Twitter user and finds top users by order of how many of his followers they follow

The title is very wordy. So I'll explain with an example.
We have a database of 10,000 Twitter users, each following up to 2000 users. The algorithm takes as input one random, never-before-seen user (including the people that follow him) and returns the Twitter users from the database, ordered by how many of his followers they follow.
i.e.
We have:
User A follows 1,2,3,4
User B follows 3,4,5,6
User C follows 4,8,9
We enter user X who has users 3,4,5 following him.
The algorithm should return:
B: 3 matches (3,4,5)
A: 2 matches (3,4)
C: 1 match (4)
Store the data as a sparse integer matrix A of size 10^5x10^5, with A[i,k] = 1 when user k follows user i. Then, given a user i, compute A[i,] * A (matrix multiplication); entry j of the result is the number of i's followers that user j follows. Then sort.
Assuming you have a table structure similar to this:
Table Users
Id (PK, uniqueidentifier, not null)
Username (nvarchar(50), not null)
Table UserFollowers
UserId (FK, uniqueidentifier, not null)
FollowerId (uniqueidentifier, not null)
You can use the following query to count, for each user, how many followers of the queried user they follow:
SELECT Users_Inner.Username, COUNT(Users_Inner.Id) AS [Total Common Parents]
FROM Users
INNER JOIN UserFollowers ON Users.Id = UserFollowers.FollowerId
INNER JOIN UserFollowers AS UserFollowers_Inner ON UserFollowers.FollowerId = UserFollowers_Inner.UserId
INNER JOIN Users AS Users_Inner ON UserFollowers_Inner.FollowerId = Users_Inner.Id
WHERE UserFollowers.UserId = 'BD34A1FF-FCF5-4D35-B8A3-EFFB1587A874'
GROUP BY Users_Inner.Username
ORDER BY COUNT(Users_Inner.Id) DESC;
Would something like this work?
for f in followers(x)
    for ff in followers(f)
        count[ff]++ // assume it is initially 0
sort the ff-s by their counts
Unlike the matrix solution, the complexity of this is proportional to the number of people involved rather than the number of users on Twitter.
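The counting loop can be written out as a short runnable Python sketch. The function name rank_by_common_follows and the dict-of-sets layout are assumptions made for illustration; the data is the A/B/C example from the question.

```python
from collections import Counter

def rank_by_common_follows(x_followers, follows):
    """follows: dict mapping each database user to the set of accounts
    that user follows. Returns (user, matches) pairs, best match first.
    Users with zero matches are left out.
    """
    # Invert the mapping once: account -> set of users who follow it.
    followers_of = {}
    for user, followed in follows.items():
        for account in followed:
            followers_of.setdefault(account, set()).add(user)

    counts = Counter()
    for f in x_followers:                    # each follower of X
        for ff in followers_of.get(f, ()):   # each user who follows f
            counts[ff] += 1

    # Sort by count descending, then by user name for a stable result.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

follows = {"A": {1, 2, 3, 4}, "B": {3, 4, 5, 6}, "C": {4, 8, 9}}
ranking = rank_by_common_follows({3, 4, 5}, follows)
```

For user X with followers 3, 4 and 5 this gives B with 3 matches, A with 2 and C with 1, as in the question. Only the followers of X and the users who follow them are touched, which is the point made above about complexity.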
