Find Maximum Columns in a grouped row. [using PIG] - hadoop

I have to find maximum number of posts created by person with some given set of data, where I am provided with user id, display name, age, comments count, view count, date, score and title of each post.
To get the number of maximum post, I think, we can group by user id.Now, after grouping, I need to check the id which has the most no. of columns. I don't understand how would I solve the latter part. Please help.

As What, I understand from your question. I am giving you answer Accordingly.
Let be try this code :
a = load '<path>' using PigStorage(',') as(userId,displayName,age,commentsCount,viewCount,date,score,title)
b = group a by userId;
c = foreach b generate group,COUNT(a.title);
dump c;

Related

how to create set of values, after group function in Pig (Hadoop)

Lets say I have set of values in file.txt
a,b,c
a,b,d
k,l,m
k,l,n
k,l,o
And my code is:
file = LOAD 'file.txt' using PigStorage(',');
events = foreach file generate session_id, user_id, code, type;
gr = group events by (session_id, user_id);
and I have set of value:
((a,b),{(a,b,c),(a,b,d)})
((k,l),{(k,l,m),(k,l,n),(k,l,o)})
And I'd like to have:
(a,b,(c,d))
(k,l,(m,n,o))
Have you got any idea how to do it?
Regards
Pawel
Note: you are inconsistent in your question. You say session_id, user_id, code, type in the FOREACH line, but your have a PigStorage not providing values. Also, that FOREACH has 4 values, while your sample data only has 3. I'll assume that type doesn't exist in order to answer your question.
After your gr relation, you are left with the group by key (in this case (session_id, user_id)) in a automatically generated tuple called group.
So, first step: gr2 = FOREACH gr GENERATE FLATTEN(group);
This will give you the tuples (a,b) and (k,l). You need to use FLATTEN because group is a tuple and you are asking for session_id and user_id to be individual columns. FLATTEN does that for you.
Ok, so now modify the gr2 line to also use a projection to tease out the third value:
gr2 = FOREACH gr GENERATE FLATTEN(group), events.code;
events.code creates a bag out of all the code values. events is the name of the bag of grouped tuples (it's named after the original relation).
This should give you:
(a, b, {c, d})
(k, l, {m, n, o})
It's very important to note that the values in the list are in a bag not a tuple, like you asked for. Keeping it in a bag is the right idea because the bag is a variable list, while a tuple is not.
Additional advice: Understanding how GROUP BY outputs data is something I see a lot of people struggle with when first using Pig. If you think my answer doesn't make much sense, I'd recommend spending some time to really get to understand GROUP BY. Understanding versus thinking it is magic will pay off in the long run.

Linq - group by a name, but still get the id

I'm trying to find duplicates in linq by a particular column (the name column), but I also wish to return the unique id, as I wish to bind to the ID to display additional information about the row.
I've dug around on stackoverflow, but can only find ways of finding duplicates in the fashion off:
By the whole object
By a particular property
Getting the number of duplicates
The closest thing I could find was by specifying "Key" in my group by, but I'm ensure if that is working.
Ideally I'm hoping to output something that has the ID, Number of Duplicates.
Thanks
Assume you have people collection:
from p in people
group p by p.Name into g
select new {
Name = g.Key,
NumberOfDuplicates = g.Count(),
IDs = g.Select(x => x.ID)
}

SOQL - single row per each group

I have the following SOQL query to display List of ABCs in my Page block table.
Public List<ABC__c> getABC(){
List<ABC__c> ListABC = [Select WB1__c, WB2__c, WB3__c, Number, tentative__c, Actual__c, PrepTime__c, Forecast__c from ABC__c ORDER BY WB3__c];
return ListABC;
}
As you can see in the above image, WB3 has number of records for A, B and C. But I want to display only 1 record for each WB3 group based on Actual__c. Only latest Actual__c must be displayed for each WB3 Group.
i.e., Ideally I want to display only 3 rows(one each for A,B,C) in this example.
For this, I have used GROUPBY and displayed the result using AggregateResults. Here is the result.
I got the Latest Actual Date for each WB3 as shown above. But the Tentative date is not corresponding to it. The Tentative Date is also the MAX in the list.
Here is the code I used
public List<SiteMonitoringOverview> getSPM(){
AggregateResult[] AgR = [Select WB_3__c, MAX(Tentaive_Date__c) dtTentativeDate , MAX(Actual_Date__c) LatestCDate FROM Site_progress_Monitoring__c GROUP BY WBS_3__c];
if(AgR.size()>0){
for(AggregateResult SalesList : AgR){
CustSumList.add(new SiteMonitoringOverview(String.ValueOf(SalesList.ge​t('WB_3__c')), String.valueOf(SalesList.get('dtTentativeDate')), String.valueOF(SalesList.get('LatestCDate')) ));
}
}
return CustSumList;
}
I am forced to use MAX() for tentative date. I want the corresponding Tentative date of the MAX Actual Date. Not the Max Tentative Date.
For group A, the Tentative Date of Max Actual Date is 12/09/2012. But it is displaying the MAX tentative date: 27/02/2013. It should display 12/09/2012. This is because I am using MAX(Tentative_Date__c) in my code. Every column in the SOQL query must be either GROUPED or AGGREGATED. That's weird.
How do I get the required 3 rows in this example?
Any suggestions? Any different approach (looping within in groups)? how?
Just ran into this issue myself. The solution I came up with only works if you want the oldest or newest record from each grouping. Unfortunately it probably won't work in your case. I'll still leave this here incase it does happen to help someone searching for a solution to this issue.
AggregateResult[] groupedResults = [Select Max(Id), WBS_3__c FROM Site_progress_Monitoring__c GROUP BY WBS_3__c];
Calling MAX or MIN on the Id will let you get 1 record per group condition. You can then query other information. I my case I just need 1 record from each group and didn't really care which one it was.

Pig: Pulling individual fields out after a GROUP

In PigLatin, I want to pull the other fields out of a record I want to select because of an aggregate, such as MAX.
I'm having trouble explaining the problem, so here is an example. Let's say I want to grab the name of the oldest person at a household:
Relation A is four columns, (name, address, zipcode, age)
B = GROUP A BY (address, zipcode); # group by the address
# generate the address, the person's age, but how do I grab that person's name?
C = FOREACH B GENERATE FLATTEN(group), MAX(age), ??? Name ???;
How do I generate the name of the person with the MAX age?
The problem with your logic is there can be more then 1 people with the MAX(age). Then you have to GROUP BY (name, address, age). But to give you a quick answer I will write that gets only one of the max ages. (I am not sure its the optimum way though)
C = FOREACH B {
DA = ORDER A BY age DESC;
DB = LIMIT DA 1;
GENERATE FLATTEN(group), FLATTEN(DB.age), FLATTEN(DB.name);
}
Be careful with frail's answer which is accepted, as it would have undesirable behavior if the number in the LIMIT command is higher than 1. In particular, in that case the output would be a cross-product between all ages and names due to the last two FLATTEN calls. Then, if the value in the LIMIT is N, there would be N^2 output rows instead of intended N.
Much safer is to do the following in the GENERATE line, which would give exactly the same result as the accepted answer when 'LIMIT 1' is used:
GENERATE FLATTEN(group) AS (address, zipcode), FLATTEN(DB.(age, name)) AS (age, name);

Max/Min for whole sets of records in PIG

I have a set set of records that I am loading from a file and the first thing I need to do is get the max and min of a column.
In SQL I would do this with a subquery like this:
select c.state, c.population,
(select max(c.population) from state_info c) as max_pop,
(select min(c.population) from state_info c) as min_pop
from state_info c
I assume there must be an easy way to do this in PIG as well but I'm having trouble finding it. It has a MAX and MIN function but when I tried doing the following it didn't work:
records=LOAD '/Users/Winter/School/st_incm.txt' AS (state:chararray, population:int);
with_max = FOREACH records GENERATE state, population, MAX(population);
This didn't work. I had better luck adding an extra column with the same value to each row and then grouping them on that column. Then getting the max on that new group. This seems like a convoluted way of getting what I want so I thought I'd ask if anyone knows a simpler way.
Thanks in advance for the help.
As you said you need to group all the data together but no extra column is required if you use GROUP ALL.
Pig
records = LOAD 'states.txt' AS (state:chararray, population:int);
records_group = GROUP records ALL;
with_max = FOREACH records_group
GENERATE
FLATTEN(records.(state, population)), MAX(records.population);
Input
CA 10
VA 5
WI 2
Output
(CA,10,10)
(VA,5,10)
(WI,2,10)

Resources