Pig grouping users while maintaining other fields


I guess this question is similar to this one:
Selecting fields after grouping in Pig
but here is my question for the following made up sample data:
user_name, movie_name, company, rating
Jim, Jaws, A, 4
Jim, Baseball, B, 4
Matt, Halo, A, 5
Matt, Baseball, B, 4
Matt, History of Chairs, B, 3.5
Pat, History of Chairs, B, 3
John, History of Chairs, B, 2
Frank, Battle Tanks, A, 3
Frank, History of Chairs, B, 5
How can I group together all the movies a user has seen without losing the other information, such as company and rating?
I want to cross each user's company A movies with their company B movies and add the two ratings.
Jim, Jaws, Baseball, 8
Matt, Halo, Baseball, 9
Frank, Battle Tanks, History of Chairs, 8
would be the output in the format:
user, companyA, companyB, rating
I started with a load followed by
r1 = LOAD 'data.csv' USING PigStorage(',') as (user_name:chararray, movie_name:chararray, company_name:chararray, rating:int);
r2 = group r1 by user_name;
r3 = foreach r2 generate group as user_name, flatten(r1);
r4A = filter r3 by company_name == 'A';
r4B = filter r3 by company_name == 'B';
but then I have something like
(Frank,Frank,Battle Tanks,A,3)
I then plan to cross r4A and r4B and sum the ratings, but I'm not sure whether the repeated user_name will make this less efficient.
Is this the proper approach? Any ideas to make this better?
Any help would be appreciated!

Can you try this?
input:
Jim,Jaws,A,4
Jim,Baseball,B,4
Matt,Halo,A,5
Matt,Baseball,B,4
Matt,History of Chairs,B,3.5
Pat,History of Chairs,B,3
John,History of Chairs,B,2
Frank,Battle Tanks,A,3
Frank,History of Chairs,B,5
PigScript:
A = LOAD 'input' USING PigStorage(',') AS (user_name:chararray, movie_name:chararray, company:chararray, rating:float);
B = GROUP A BY user_name;
C = FOREACH B {
    filterCompanyA = FILTER A BY company == 'A';
    sumA = SUM(filterCompanyA.rating);
    filterCompanyB = FILTER A BY company == 'B';
    sumB = SUM(filterCompanyB.rating);
    GENERATE group AS user,
        FLATTEN(REPLACE(BagToString(filterCompanyA.movie_name),'_',',')) AS companyA,
        FLATTEN(REPLACE(BagToString(filterCompanyB.movie_name),'_',',')) AS companyB,
        (((sumA is null)?0:sumA)+((sumB is null)?0:sumB)) AS Rating;
}
D = FOREACH C GENERATE user,companyA,companyB,Rating;
DUMP D;
Output:
(Jim,Jaws,Baseball,8.0)
(Pat,,History of Chairs,3.0)
(John,,History of Chairs,2.0)
(Matt,Halo,Baseball,History of Chairs,12.5)
(Frank,Battle Tanks,History of Chairs,8.0)
In the output above, Pat and John have not seen any company A movie, so their companyA field is null (printed as empty).
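The grouping logic of the script above can be sketched in Python to check the expected totals by hand. This is only an illustration of the same idea, not a replacement for the Pig script; the names `ROWS` and `summarize` are made up for this sketch, and it assumes every row belongs to company A or B.

```python
from collections import defaultdict

# Sample rows from the question: (user_name, movie_name, company, rating)
ROWS = [
    ("Jim", "Jaws", "A", 4.0),
    ("Jim", "Baseball", "B", 4.0),
    ("Matt", "Halo", "A", 5.0),
    ("Matt", "Baseball", "B", 4.0),
    ("Matt", "History of Chairs", "B", 3.5),
    ("Pat", "History of Chairs", "B", 3.0),
    ("John", "History of Chairs", "B", 2.0),
    ("Frank", "Battle Tanks", "A", 3.0),
    ("Frank", "History of Chairs", "B", 5.0),
]


def summarize(rows):
    """Group rows by user, collect movie names per company, and total the ratings."""
    movies = defaultdict(lambda: {"A": [], "B": []})
    totals = defaultdict(float)
    for user, movie, company, rating in rows:
        movies[user][company].append(movie)  # assumes company is "A" or "B"
        totals[user] += rating
    return {
        user: (",".join(m["A"]), ",".join(m["B"]), totals[user])
        for user, m in movies.items()
    }


summary = summarize(ROWS)
for user, (company_a, company_b, total) in summary.items():
    print(user, company_a, company_b, total)
```

As in the Pig output, a user with no company A movie gets an empty companyA field, and Matt's total covers all three of his ratings.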

Related

Using COUNTIFS to find only unique values

I currently have a table with five columns:
A = Campaign
B = Person
C = Opportunity Name
D = Total Cost of Campaign
E = Date
I'm trying to use COUNTIFS to count the rows whose value in column A exactly matches cell H2 and whose date in column E is later than the value in cell I2.
I have something like this so far:
=countifs($A$2:$A, $H$2, $E$2:$E, ">"&$I$2)
However, I'm having a tough time deduplicating this: it should count only unique values in column C, where duplicate names exist. Please refer to my data table:
Campaign  Person  Opportunity Name  Total Cost of Campaign  Date
A         Bob     Airbnb            5000                    3/2/2017
B         Jim     Sony              10000                   3/2/2017
B         Jane    Coca-Cola         10000                   3/2/2017
C         Jim     Sony              200                     3/2/2017
B         Daniel  Sony              10000                   3/2/2017
B         April   Coca-Cola         10000                   3/5/2017
For example:
=countifs($A$2:$A, $H$2, $E$2:$E, ">"&$I$2)
with B in H2 and 3/1/2017 in I2 gives me a result of 4, but I'm really trying to get 2, given that there are only two unique names in column C (Sony and Coca-Cola).
How could I do this?

You need to include column C in your formula and use the COUNTUNIQUE function, as @Jeeped suggested. Here is the final formula that you can use:
=COUNTUNIQUE(IFERROR(FILTER(C:C,A:A=H2,E:E>I2)))

Use COUNTUNIQUE with QUERY
=countunique(QUERY(A:E,"Select C where A = '"&H2&"' and E > date '" & text(I2,"yyyy-mm-dd") & "'",0))
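Both formulas filter the rows first and then count the distinct values left in column C. A small Python sketch of that filter-then-count-unique logic, using the sample table from the question, shows why the result is 2 rather than 4. The names `ROWS` and `count_unique_opportunities` are hypothetical and exist only for this illustration.

```python
from datetime import date

# Sample table from the question: (campaign, person, opportunity, cost, date)
ROWS = [
    ("A", "Bob", "Airbnb", 5000, date(2017, 3, 2)),
    ("B", "Jim", "Sony", 10000, date(2017, 3, 2)),
    ("B", "Jane", "Coca-Cola", 10000, date(2017, 3, 2)),
    ("C", "Jim", "Sony", 200, date(2017, 3, 2)),
    ("B", "Daniel", "Sony", 10000, date(2017, 3, 2)),
    ("B", "April", "Coca-Cola", 10000, date(2017, 3, 5)),
]


def count_unique_opportunities(rows, campaign, after):
    """Count distinct opportunity names for one campaign, dated strictly after `after`."""
    matching = {
        opportunity
        for camp, _person, opportunity, _cost, day in rows
        if camp == campaign and day > after
    }
    return len(matching)


# H2 = "B", I2 = 3/1/2017, as in the question
result = count_unique_opportunities(ROWS, "B", date(2017, 3, 1))
print(result)  # 2 (Sony and Coca-Cola)
```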

How to perform Group by then use DISTINCT on other column in pig

I have just started learning Pig and need a little help with the question below. Thanks in advance!
For example, I have input like:
Occupation     Category   Name
Actress        Acting     Marion Cotillard
Actor          Acting     Liam Nelson
Tennis Plyr    Athletics  Roger Federer
Football Plyr  Athletics  Neymar
Actor          Acting     Tom Hanks
Actress        Acting     Elizabeth Banks
US Senator     Politics   Elizabeth Warren
Football Plyr  Athletics  Mesut Ozil
I want to know how many occupation types there are in a single category.
For example, Acting has two types, Actress and Actor, so the result will be 2.
Problem: I am not able to apply DISTINCT on the 'Occupation' column to the output of 'group by Category'.

Try this:
x= load '<data>' using PigStorage('\t') as (occupation:chararray,category:chararray,name:chararray);
x_grouped= group x by category;
x_grouped_distinct= foreach x_grouped { cat= distinct $1.occupation; generate $0, cat, COUNT(cat);};
dump x_grouped_distinct;

Apply DISTINCT first and then group by Category, assuming you have already loaded the data into relation A:
Select the two columns after the load.
Apply DISTINCT to the relation.
Group by Category.
Count the occupations for each category.
B = FOREACH A GENERATE Occupation as Occupation,Category as Category;
C = DISTINCT B;
D = GROUP C BY $1;
E = FOREACH D GENERATE group,COUNT(C.Occupation);
DUMP E;
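The distinct-then-group steps above can be sketched in Python to see what counts to expect from the sample input. This is an illustration of the logic only; `ROWS` and `occupations_per_category` are names invented for the sketch.

```python
from collections import defaultdict

# Sample input from the question: (occupation, category, name)
ROWS = [
    ("Actress", "Acting", "Marion Cotillard"),
    ("Actor", "Acting", "Liam Nelson"),
    ("Tennis Plyr", "Athletics", "Roger Federer"),
    ("Football Plyr", "Athletics", "Neymar"),
    ("Actor", "Acting", "Tom Hanks"),
    ("Actress", "Acting", "Elizabeth Banks"),
    ("US Senator", "Politics", "Elizabeth Warren"),
    ("Football Plyr", "Athletics", "Mesut Ozil"),
]


def occupations_per_category(rows):
    """Count the distinct occupations that appear in each category."""
    seen = defaultdict(set)
    for occupation, category, _name in rows:
        seen[category].add(occupation)  # a set keeps each occupation once
    return {category: len(occupations) for category, occupations in seen.items()}


counts = occupations_per_category(ROWS)
print(counts)
```

Using a set per category plays the role of DISTINCT, and taking its length plays the role of COUNT.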

Pig join two Relations only with join partner

I'm new to programming in Pig Latin and I have a question.
Let's say I have the following two relations (A and B):
Relation A: http://i.stack.imgur.com/Aa5Rd.png
Relation B: http://i.stack.imgur.com/m467q.png
Now the relations should be joined, but only for keys (id) that exist in A. So the result should look like:
Relation Result: i.stack.imgur.com/3elgh.png (I cannot post more than 2 links)
How can I solve that?
My approach, result = JOIN A BY id, B BY id;, does not work because it creates a result relation with all ids and texts.
Thank you very much in advance,
Stefanos

Your approach is right. I got the output you described, so I'm not sure why you didn't. Can you cross-check your Pig script against the one below?
input1:
1
4
6
input2:
1,peter
2,jay
3,dan
4,knut
5,Gnu
6,rafael
7,hans
PigScript:
A = LOAD 'input1' AS (id:int);
B = LOAD 'input2' USING PigStorage(',') AS (id:int,text:chararray);
C = JOIN A BY id,B BY id;
D = FOREACH C GENERATE A::id AS id,B::text as text;
DUMP D;
Output:
(1,peter)
(4,knut)
(6,rafael)
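An inner join keeps only the rows whose key appears on both sides, which is why ids 2, 3, 5 and 7 drop out. A minimal Python sketch of the same behaviour follows; the names `A_IDS`, `B_ROWS` and `inner_join` are hypothetical, and the sketch assumes the ids in A are unique (a real join would repeat rows for duplicate keys).

```python
# input1 and input2 from the answer above
A_IDS = [1, 4, 6]
B_ROWS = [
    (1, "peter"),
    (2, "jay"),
    (3, "dan"),
    (4, "knut"),
    (5, "Gnu"),
    (6, "rafael"),
    (7, "hans"),
]


def inner_join(ids, rows):
    """Keep only the (id, text) rows whose id also appears in `ids`."""
    wanted = set(ids)  # assumes ids are unique, so no row multiplication is needed
    return [(row_id, text) for row_id, text in rows if row_id in wanted]


joined = inner_join(A_IDS, B_ROWS)
print(joined)  # [(1, 'peter'), (4, 'knut'), (6, 'rafael')]
```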

LINQ to entities Query group by and count from different tables

I have two tables:
1) Logs
2) Jobs
Their structure is as follows:
Logs: id, Emailid, LogDate
Sample data:
1, a@a.com, Jan 24 1999
2, b@a.com, Jan 25 1999
3, a@a.com, Jan 25 1999
4, c@a.com, Jan 26 1999
5, a@a.com, Jan 27 1999
Jobs: jid, job_name, job_viewed_by
Sample data:
j01, painter, a@a.com
j02, teacher, a@a.com
j01, painter, b@a.com
job_viewed_by is a foreign key in the Jobs table and refers to Emailid in the Logs table.
Now I want a LINQ to Entities query that gives me
all Emailids from the Logs table with their most recent login date, along with the number of jobs each has viewed.
So for the sample data above, my requirement is:
a@a.com last logged in on 27 Jan 1999 and has viewed 2 jobs so far
b@a.com last logged in on 25 Jan 1999 and has viewed 1 job so far
c@a.com last logged in on 26 Jan 1999 and has viewed no jobs
I know how to write it in SQL, but I need to write it with LINQ to Entities.
I tried a query, but it gives me the number of recent logins rather than the job counts.
var q = (from p in context.Logs
         from x in context.ViewedJobs.Where(v => p.EmailId == v.ViewedBy)
         group p by p.EmailId into grp
         select new
         {
             EmailId = grp.Key,
             LastDate = grp.Max(g => g.LogDate),
             Count = grp.Count()
         }).OrderByDescending(m => m.LastDate);

A simple thing to try:
var q = from p in context.Logs
        group p by p.Emailid into g
        select new
        {
            EmailId = g.Key,
            LastDate = g.Max(x => x.LogDate),
            Count = context.ViewedJobs.Count(v => v.ViewedBy == g.Key)
        };
Updated version:
var q = from p in context.Logs
        group p by p.Emailid into g
        join j in context.ViewedJobs
            on g.Key equals j.ViewedBy into leftGroup
        select new
        {
            EmailId = g.Key,
            LastDate = g.Max(x => x.LogDate),
            Count = leftGroup.Any() ? leftGroup.Count() : 0
        };
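The key point in both versions is that the job count comes from the jobs table, not from the number of log rows in each group. A Python sketch of that logic on the question's sample data may help to check the expected result; it is not LINQ, the names `LOGS`, `JOBS` and `last_login_with_job_counts` are made up, and the email addresses are written with `@`.

```python
from collections import Counter
from datetime import date

# Sample data from the question
LOGS = [
    (1, "a@a.com", date(1999, 1, 24)),
    (2, "b@a.com", date(1999, 1, 25)),
    (3, "a@a.com", date(1999, 1, 25)),
    (4, "c@a.com", date(1999, 1, 26)),
    (5, "a@a.com", date(1999, 1, 27)),
]
JOBS = [
    ("j01", "painter", "a@a.com"),
    ("j02", "teacher", "a@a.com"),
    ("j01", "painter", "b@a.com"),
]


def last_login_with_job_counts(logs, jobs):
    """Return (email, last login date, jobs viewed), most recent login first."""
    views = Counter(viewed_by for _jid, _name, viewed_by in jobs)
    last_login = {}
    for _log_id, email, day in logs:
        if email not in last_login or day > last_login[email]:
            last_login[email] = day
    rows = [(email, day, views[email]) for email, day in last_login.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


report = last_login_with_job_counts(LOGS, JOBS)
for email, day, job_count in report:
    print(email, day.isoformat(), job_count)
```

A `Counter` returns 0 for an email that never appears in the jobs table, which matches the "no jobs viewed" case for c@a.com.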

Regroup By in PigLatin

In Pig Latin, I want to group twice, so as to select rows using two different criteria.
I'm having trouble explaining the problem, so here is an example. Let's say I want to get the details of the people whose age is nearest to mine ($my_age) and who have a lot of money.
Relation A has five columns: (name, address, zipcode, age, money)
B = GROUP A BY (address, zipcode); -- group by the address
-- generate the address, the person's age ...
C = FOREACH B GENERATE group, MIN($my_age - age) AS min_age, FLATTEN(A);
D = FILTER C BY min_age == age;
-- Then group again to select the richest; the group by fails:
E = GROUP D BY group; or E = GROUP D BY (address, zipcode);
-- The end would work
D = FOREACH E GENERATE group, MAX(money) AS max_money, FLATTEN(A);
F = FILTER C BY max_money == money;
I've tried to filter for the nearest and the richest at the same time, but it doesn't work, because the richest people may not be the ones closest to my age.
Another, more realistic example:
You have a demands file with columns: iddem, idopedem, datedem
You have an operations file with columns: idope, labelope, dateope, idoftheday, infope
I want to return the operations that match the demands, where:
idopedem matches idope.
The dateope must be the nearest to datedem.
If datedem - dateope >= 0, then I must select the operation with max(idoftheday); otherwise I must select the operation with min(idoftheday).
Relation A has 5 columns (idope,labelope,dateope,idoftheday,infope)
Relation B has 3 columns (iddem, idopedem, datedem)
C = JOIN A BY idope, B BY idopedem;
D = FOREACH C GENERATE iddem, idope, datedem, dateope, ABS(datedem - dateope) AS datedelta, idoftheday, infope;
E = GROUP D BY iddem;
F = FOREACH E GENERATE group, MIN(D.datedelta) AS deltamin, FLATTEN(D);
G = FILTER F BY deltamin == datedelta;
--Then I must group by another time as to select the min or max idoftheday
H = GROUP G BY group; --Does not work when dump
H = GROUP G BY iddem; --Does not work when dump
I = FOREACH H GENERATE group, (datedem - dateope >= 0 ? max(idoftheday) as idofdaysel : min(idoftheday) as idofdaysel), FLATTEN(D);
J = FILTER F BY idofdaysel == idoftheday;
DUMP J;
Data for the second example (note that dates are already in Unix timestamp format):
Demands file:
1, 'ctr1', 1359460800000
2, 'ctr2', 1354363200000
Operations file:
idope,labelope,dateope,idoftheday,infope
'ctr0','toto',1359460800000,1,'blabla0'
'ctr0','tata',1359460800000,2,'blabla1'
'ctr1','toto',1359460800000,1,'blabla2'
'ctr1','tata',1359460800000,2,'blabla3'
'ctr2','toto',1359460800000,1,'blabla4'
'ctr2','tata',1359460800000,2,'blabla5'
'ctr3','toto',1359460800000,1,'blabla6'
'ctr3','tata',1359460800000,2,'blabla7'
The result must be:
1, 'ctr1', 'tata',1359460800000,2,'blabla3'
2, 'ctr2', 'toto',1359460800000,1,'blabla4'
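The selection rule for this second example can be sketched in Python to confirm that it yields the result listed above. This is a sketch of the rule only, not Pig; the names `DEMANDS`, `OPERATIONS` and `match_operations` are invented here, and it assumes all nearest operations for a demand fall on the same side of the demand date, so one sign decides between max and min.

```python
# Sample data from the question
DEMANDS = [
    (1, "ctr1", 1359460800000),
    (2, "ctr2", 1354363200000),
]
OPERATIONS = [
    ("ctr0", "toto", 1359460800000, 1, "blabla0"),
    ("ctr0", "tata", 1359460800000, 2, "blabla1"),
    ("ctr1", "toto", 1359460800000, 1, "blabla2"),
    ("ctr1", "tata", 1359460800000, 2, "blabla3"),
    ("ctr2", "toto", 1359460800000, 1, "blabla4"),
    ("ctr2", "tata", 1359460800000, 2, "blabla5"),
    ("ctr3", "toto", 1359460800000, 1, "blabla6"),
    ("ctr3", "tata", 1359460800000, 2, "blabla7"),
]


def match_operations(demands, operations):
    """For each demand, keep the nearest-dated operations, then pick by idoftheday."""
    results = []
    for iddem, idopedem, datedem in demands:
        candidates = [op for op in operations if op[0] == idopedem]
        if not candidates:
            continue  # demand with no matching operation
        best_delta = min(abs(datedem - op[2]) for op in candidates)
        nearest = [op for op in candidates if abs(datedem - op[2]) == best_delta]
        # Assumption: every nearest operation lies on the same side of datedem.
        pick = max if datedem - nearest[0][2] >= 0 else min
        chosen = pick(nearest, key=lambda op: op[3])  # op[3] is idoftheday
        results.append((iddem,) + chosen)
    return results


matched = match_operations(DEMANDS, OPERATIONS)
for row in matched:
    print(row)
```

The two stages, nearest date first and then min or max of idoftheday, correspond to the two successive group-and-filter steps the Pig script is trying to express.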

Sample input and output would help greatly, but from what you have posted it appears to me that the problem is not so much in writing the Pig script but in specifying what exactly it is you hope to accomplish. It's not clear to me why you're grouping at all. What is the purpose of grouping by address, for example?
Here's how I would solve your problem:
First, design an optimization function that will induce an ordering on your dataset that reflects your own prioritization of money vs. age. For example, to severely penalize large age differences but prefer more money with small ones, you could try:
scored = FOREACH A GENERATE *, money / POW(1 + ABS($my_age - age) / 10.0, 2) AS score; -- 10.0 avoids integer division when age is an int
ordered = ORDER scored BY score DESC;
top10 = LIMIT ordered 10;
That gives you the 10 best people according to your optimization function.
Then the only work is to design a function that matches your own judgments. For example, with the function I chose, a person with $100,000 who is your age would be preferred to someone with $350,000 who is 10 years older (or younger). But someone with $500,000 who is 20 years older or younger is preferred to someone your age with just $50,000. If either of those doesn't fit your intuition, then modify the formula. Likely a simple quadratic factor won't be sufficient. But with a little experimentation you can hit upon something that works for you.
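The two comparisons quoted above can be verified with a short Python sketch of the same scoring formula. The function names `score` and `top_n`, the dictionary layout, and the reference age of 30 are assumptions made for this illustration.

```python
def score(money, age, my_age):
    """Money divided by a quadratic penalty on the age difference (in decades)."""
    return money / (1 + abs(my_age - age) / 10) ** 2


def top_n(people, my_age, n=10):
    """Return the n highest-scoring people, best first."""
    ranked = sorted(
        people,
        key=lambda person: score(person["money"], person["age"], my_age),
        reverse=True,
    )
    return ranked[:n]


MY_AGE = 30  # hypothetical reference age

# Same age with $100,000 beats 10 years apart with $350,000
print(score(100_000, 30, MY_AGE), score(350_000, 40, MY_AGE))

# 20 years apart with $500,000 beats same age with $50,000
print(score(500_000, 50, MY_AGE), score(50_000, 30, MY_AGE))
```

Changing the exponent or the divisor of 10 shifts how quickly an age gap outweighs extra money, which is the tuning the answer suggests.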
