Hadoop Pig Max Command

I have one file that contains airport data for countries all over the world.
I want to find the country that has the maximum number of airports.
I have written the code below:
A = LOAD 'airports.dat' USING PigStorage(',') AS (AirportID:int, Name:chararray, City:chararray, Country:chararray, IATA:chararray, IATAothers:chararray, Latitude:float, Longitude:float, Altitude:float, Timezone:float, DST:chararray, Zone:chararray);
B = GROUP A BY Country;
C = FOREACH B GENERATE A.Country, COUNT(A) AS Count;
but after this I am not sure how to find the maximum.
Can anybody please help?

You have computed the number of airports per country. What you need to do now is take the row with the highest count:
D = ORDER C BY $1 DESC;
E = LIMIT D 1;
DUMP E;
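If your Pig build includes the built-in TOP UDF (Pig 0.8 and later), an alternative sketch is to group everything into a single bag and keep the tuple with the highest count; the aliases D2 and E2 below are new names introduced for the sketch, and column index 1 refers to the count field of C:
-- Alternative sketch (assumes Pig 0.8+ where TOP is built in); column 1 of C is the count.
D2 = GROUP C ALL;
E2 = FOREACH D2 GENERATE FLATTEN(TOP(1, 1, C));
DUMP E2;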

Related

Error when using MAX in Apache Pig (Hadoop)

I am trying to calculate maximum values for different groups in a relation in Pig. The relation has three columns patientid, featureid and featurevalue (all int).
I group the relation based on featureid and want to calculate the max feature value of each group; here's the code:
grpd = GROUP features BY featureid;
DUMP grpd;
temp = FOREACH grpd GENERATE $0 as featureid, MAX($1.featurevalue) as val;
It's giving me an "Invalid scalar projection: grpd" exception. I read on different forums that MAX takes a bag as input for such functions, but when I dump grpd, it already shows bags. Here's a small part of the output from the dump:
(5662,{(22579,5662,1)})
(5663,{(28331,5663,1),(2624,5663,1)})
(5664,{(27591,5664,1)})
(5665,{(30217,5665,1),(31526,5665,1)})
(5666,{(27783,5666,1),(30983,5666,1),(32424,5666,1),(28064,5666,1),(28932,5666,1)})
(5667,{(31257,5667,1),(27281,5667,1)})
(5669,{(31041,5669,1)})
What's the issue?
The issue was with column addressing; here's the correct working code:
grpd = GROUP features BY featureid;
temp = FOREACH grpd GENERATE group AS featureid, MAX(features.featurevalue) AS val;
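Positional addressing of the bag works as well; as a sketch, assuming featurevalue is the third field (position $2) of features, the same statement can be written as:
-- Sketch: positional equivalent ($2 is featurevalue in the features schema).
temp = FOREACH grpd GENERATE group AS featureid, MAX(features.$2) AS val;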

Count and find maximum number in Hadoop using pig

I have a table containing sample CDR data, in which column A holds the calling number and column B holds the called number.
I need to find which caller (column A) made the maximum number of calls,
and also which number (column B) was called the most.
The table structure is like below:
calling called
889578226 77382596
889582256 77382596
889582256 7736368296
7785978214 782987522
In the above table, 889578226 has the most outgoing calls and 77382596 is the most-called number; I need to get the output in that form.
In Hive I run it like below:
SELECT calling_a, called_b, COUNT(called_b) FROM cdr_data GROUP BY calling_a, called_b;
What might be the equivalent code for the above query in Pig?
Anas, could you please let me know whether this is what you are expecting, or something different?
input.txt
a,100
a,101
a,101
a,101
a,103
b,200
b,201
b,201
c,300
c,300
c,301
d,400
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, phone:long);
B = GROUP A BY (name, phone);
C = FOREACH B GENERATE FLATTEN(group), COUNT(A) AS cnt;
D = GROUP C BY $0;
E = FOREACH D {
    SortedList = ORDER C BY cnt DESC;
    top = LIMIT SortedList 1;
    GENERATE FLATTEN(top);
};
DUMP E;
Output:
(a,101,3)
(b,201,2)
(c,300,2)
(d,400,1)
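That script answers the first half of the question (for each caller, the number it called most). For the second half, the single most-called number overall (column B), a sketch in the same style, reusing relation A loaded above (the aliases calls, cnts, srt and mostcalled are introduced here), could be:
-- Sketch: overall most-called number; reuses A from the script above.
calls = GROUP A BY phone;
cnts = FOREACH calls GENERATE group AS phone, COUNT(A) AS cnt;
srt = ORDER cnts BY cnt DESC;
mostcalled = LIMIT srt 1;
DUMP mostcalled;
-- With input.txt above this yields (101,3).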

To find maximum occurrence names in a list of tuples in Pig

I have a file as:
1,Mary,5
1,Tom,5
2,Bill,5
2,Sue,4
2,Theo,5
3,Mary,5
3,Cindy,5
4,Andrew,4
4,Katie,4
4,Scott,5
5,Jeff,3
5,Sara,4
5,Ryan,5
6,Bob,5
6,Autumn,4
7,Betty,5
7,Janet,5
7,Scott,5
8,Andrew,4
8,Katie,4
8,Scott,5
9,Mary,5
9,Tom,5
10,Bill,5
10,Sue,4
10,Theo,5
11,Mary,5
11,Cindy,5
12,Andrew,4
12,Katie,4
12,Scott,5
13,Jeff,3
13,Sara,4
13,Ryan,5
14,Bob,5
14,Autumn,4
15,Betty,5
15,Janet,5
15,Scott,5
16,Andrew,4
16,Katie,4
16,Scott,5
I want as the answer the name that appears the most, i.e. the maximum:
(Scott,6)
There's some ambiguity in your question.
What exactly do you want?
Do you want a list of name counts in descending order?
OR
Do you want just (Scott,6), i.e. only the one name with the maximum count?
I have solved both cases on the sample data you gave.
If the question is of the first type, then:
a = LOAD '/file.txt' USING PigStorage(',') AS (id:int, name:chararray, number:int);
g = GROUP a BY name;
g1 = FOREACH g GENERATE group AS name, COUNT(a) AS cnt;
toptemp = GROUP g1 ALL;
final = FOREACH toptemp {
    sorted = ORDER g1 BY cnt DESC;
    GENERATE FLATTEN(sorted);
};
This will give you the list of names with their counts in descending order:
(Scott,6)
(Katie,4)
(Andrew,4)
(Mary,4)
(Bob,2)
(Sue,2)
(Tom,2)
(Bill,2)
(Jeff,2)
(Ryan,2)
(Sara,2)
(Theo,2)
(Betty,2)
(Cindy,2)
(Janet,2)
(Autumn,2)
If the question is of the second type, then:
a = LOAD '/file.txt' USING PigStorage(',') AS (id:int, name:chararray, number:int);
g = GROUP a BY name;
g1 = FOREACH g GENERATE group AS name, COUNT(a) AS cnt;
toptemp = GROUP g1 ALL;
final = FOREACH toptemp {
    sorted = ORDER g1 BY cnt DESC;
    top = LIMIT sorted 1;
    GENERATE FLATTEN(top);
};
This gives us only one result:
(Scott,6)
Thanks. I hope it helps.

Regroup By in PigLatin

In Pig Latin, I want to group twice, so as to select rows according to two different criteria.
I'm having trouble explaining the problem, so here is an example. Let's say I want to grab the details of the people whose age is nearest to mine ($my_age) and who have a lot of money.
Relation A has five columns: (name, address, zipcode, age, money)
B = GROUP A BY (address, zipcode); -- group by the address
-- generate the address, the person's age ...
C = FOREACH B GENERATE group, MIN($my_age - age) AS min_age, FLATTEN(A);
D = FILTER C BY min_age == age;
-- Then group again so as to select the richest; this GROUP BY fails:
E = GROUP D BY group; -- or: E = GROUP D BY (address, zipcode);
-- The end would work:
D = FOREACH E GENERATE group, MAX(money) AS max_money, FLATTEN(A);
F = FILTER C BY max_money == money;
I've tried to filter on the nearest age and the most money at the same time, but it doesn't work, because the richest people are not necessarily the ones whose age is closest to mine.
Another, more realistic, example is:
You have a demands file like: iddem, idopedem, datedem
You have an operations file like: idope, labelope, dateope, idoftheday, infope
I want to return the operations that match the demands as follows:
idopedem matches idope.
dateope must be the nearest to datedem.
If datedem - dateope > 0, then I must select the operation with the max(idoftheday); otherwise I must select the operation with the min(idoftheday).
Relation A has 5 columns: (idope, labelope, dateope, idoftheday, infope)
Relation B has 3 columns: (iddem, idopedem, datedem)
C = JOIN A BY idope, B BY idopedem;
D = FOREACH E GENERATE iddem, idope, datedem, dateope, ABS(datedem - dateope) AS datedelta, idoftheday, infope;
E = GROUP C BY iddem;
F = FOREACH D GENERATE group, MIN(C.datedelta) AS deltamin, FLATTEN(D);
G = FILTER F BY deltamin == datedelta;
-- Then I must group another time so as to select the min or max idoftheday
H = GROUP G BY group; -- does not work when dumped
H = GROUP G BY iddem; -- does not work when dumped
I = FOREACH H GENERATE group, (datedem - dateope >= 0 ? max(idoftheday) as idofdaysel : min(idoftheday) as idofdaysel), FLATTEN(D);
J = FILTER F BY idofdaysel == idoftheday;
DUMP J;
Data in the 2nd example (note: dates are already Unix timestamps):
The demands file:
1, 'ctr1', 1359460800000
2, 'ctr2', 1354363200000
The operations file:
idope,labelope,dateope,idoftheday,infope
'ctr0','toto',1359460800000,1,'blabla0'
'ctr0','tata',1359460800000,2,'blabla1'
'ctr1','toto',1359460800000,1,'blabla2'
'ctr1','tata',1359460800000,2,'blabla3'
'ctr2','toto',1359460800000,1,'blabla4'
'ctr2','tata',1359460800000,2,'blabla5'
'ctr3','toto',1359460800000,1,'blabla6'
'ctr3','tata',1359460800000,2,'blabla7'
The result must be like:
1, 'ctr1', 'tata',1359460800000,2,'blabla3'
2, 'ctr2', 'toto',1359460800000,1,'blabla4'
Sample input and output would help greatly, but from what you have posted it appears to me that the problem is not so much in writing the Pig script but in specifying what exactly it is you hope to accomplish. It's not clear to me why you're grouping at all. What is the purpose of grouping by address, for example?
Here's how I would solve your problem:
First, design an optimization function that will induce an ordering on your dataset that reflects your own prioritization of money vs. age. For example, to severely penalize large age differences but prefer more money with small ones, you could try:
scored = FOREACH A GENERATE *, money / POW(1+ABS($my_age-age)/10, 2) AS score;
ordered = ORDER scored BY score DESC;
top10 = LIMIT ordered 10;
That gives you the 10 best people according to your optimization function.
Then the only work is to design a function that matches your own judgments. For example, with the function I chose, a person with $100,000 who is your age would be preferred to someone with $350,000 who is 10 years older (or younger). But someone with $500,000 who is 20 years older or younger is preferred to someone your age with just $50,000. If either of those doesn't fit your intuition, then modify the formula. Likely a simple quadratic factor won't be sufficient, but with a little experimentation you can hit upon something that works for you.
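To make the trade-off concrete, here is the scoring formula from the snippet above worked through on those examples:
-- score = money / POW(1 + ABS($my_age - age)/10, 2)
-- $100,000 at your age:        100000 / (1 + 0/10)^2  = 100000
-- $350,000, 10 years apart:    350000 / (1 + 10/10)^2 =  87500   (loses to the first)
-- $500,000, 20 years apart:    500000 / (1 + 20/10)^2 ≈  55556
-- $50,000 at your age:          50000 / (1 + 0/10)^2  =  50000   (loses to the third)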

How can I select record with minimum value in pig latin

I have timestamped samples and I'm processing them using Pig. I want to find, for each day, the minimum sample value and the time at which that minimum occurred. So I need to select the record that contains the sample with the minimum value.
In the following, for simplicity, I'll represent time as two fields: the first is the day and the second the "time" within the day.
1,1,4.5
1,2,3.4
1,5,5.6
To find the minimum the following works:
samples = LOAD 'testdata' USING PigStorage(',') AS (day:int, time:int, samp:float);
g = GROUP samples BY day;
dailyminima = FOREACH g GENERATE group as day, MIN(samples.samp) as samp;
But then I've lost the exact time at which the minimum happened. I hoped I could use nested expressions. I tried the following:
dailyminima = FOREACH g {
    minsample = MIN(samples.samp);
    mintuple = FILTER samples BY samp == minsample;
    GENERATE group AS day, mintuple.time, mintuple.samp;
};
But with that I receive the error message:
2012-11-12 12:08:40,458 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000:
<line 5, column 29> Invalid field reference. Referenced field [samp] does not exist in schema: .
Details at logfile: /home/hadoop/pig_1352722092997.log
If I set minsample to a constant, it doesn't complain:
dailyminima = FOREACH g {
    minsample = 3.4F;
    mintuple = FILTER samples BY samp == minsample;
    GENERATE group AS day, mintuple.time, mintuple.samp;
};
And indeed produces a sensible result:
(1,{(2)},{(3.4)})
While writing this I thought of using a separate JOIN:
dailyminima = FOREACH g GENERATE group as day, MIN(samples.samp) as minsamp;
dailyminima = JOIN samples BY (day, samp), dailyminima BY (day, minsamp);
That works, but it results (in the real case) in a join between two large data sets instead of a search through a single day's values, which doesn't seem healthy.
In the real case I actually want to find max and min and associated times. I hoped that the nested expression approach would allow me to do both at once.
Suggestions of ways to approach this would be appreciated.
Thanks to alexeipab for the link to another SO question.
One working solution (finding both min and max and the associated time) is:
dailyminima = FOREACH g {
    minsamples = ORDER samples BY samp;
    minsample = LIMIT minsamples 1;
    maxsamples = ORDER samples BY samp DESC;
    maxsample = LIMIT maxsamples 1;
    GENERATE group AS day, FLATTEN(minsample), FLATTEN(maxsample);
};
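This yields one tuple per day containing the group followed by the full min and max tuples. If you only want the day, the times and the values, a positional projection on top of that result is one option; this is a sketch (the alias trimmed is introduced here), with positions taken from the schema produced above (group, min day, min time, min samp, max day, max time, max samp):
-- Sketch: trim to day, min time/value, max time/value by position.
trimmed = FOREACH dailyminima GENERATE $0 AS day, $2 AS mintime, $3 AS minsamp, $5 AS maxtime, $6 AS maxsamp;
DUMP trimmed;
-- On the sample data above this gives (1,2,3.4,5,5.6).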
Another way to do it, which has the advantage that it doesn't sort the entire relation, and only keeps the (potential) min in memory, is to use the PiggyBank ExtremalTupleByNthField. This UDF implements Accumulator and Algebraic and is pretty efficient.
Your code would look something like this:
-- Register PiggyBank first; the path to piggybank.jar depends on your installation.
REGISTER piggybank.jar;
DEFINE TupleByNthField org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField('3', 'min');
samples = LOAD 'testdata' USING PigStorage(',') AS (day:int, time:int, samp:float);
g = GROUP samples BY day;
bagged = FOREACH g GENERATE TupleByNthField(samples);
flattened = FOREACH bagged GENERATE FLATTEN($0);
min_result = FOREACH flattened GENERATE $1 ..;
Keep in mind that the choice to sort on the samp field is made in the DEFINE statement, by passing '3' (the position of samp in the schema) as the first parameter.
