Pig Latin: LIMIT and FLATTEN produce wrong results

B = GROUP A BY state;
C = FOREACH B {
DA = ORDER A BY population DESC;
DB = LIMIT DA 5;
GENERATE FLATTEN(group), FLATTEN(DB.name), FLATTEN(DB.population);
}
The problem is that each city name appears 5 times instead of once. I get something like:
(ALASKA,M,27257)
(ALASKA,M,23696)
(ALASKA,M,19949)
(ALASKA,M,19926)
(ALASKA,M,19833)
(ALASKA,H,27257)
(ALASKA,H,23696)
(ALASKA,H,19949)
(ALASKA,H,19926)
(ALASKA,H,19833)
And the output I need is:
(ALASKA,M,27257)
(ALASKA,H,23696)

Two FLATTENs in the same GENERATE, FLATTEN(DB.name) and FLATTEN(DB.population), produce a Cartesian product of the two bags. Replace them with a single FLATTEN:
B = GROUP A BY state;
C = FOREACH B {
DA = ORDER A BY population DESC;
DB = LIMIT DA 5;
GENERATE FLATTEN(group), FLATTEN(DB.(name, population));
}
Or, since the bags created by GROUP BY carry all of the original tuples with all of their columns, you can flatten the whole bag:
B = GROUP A BY state;
C = FOREACH B {
DA = ORDER A BY population DESC;
DB = LIMIT DA 5;
GENERATE FLATTEN(DB);
}
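To see why two separate FLATTENs multiply the rows, the behaviour can be mimicked outside Pig. The following is a minimal Python sketch, not Pig code: the list db stands in for the limited bag DB, and only two rows are used (values borrowed from the question's output) to keep it short.

```python
from itertools import product

# Stand-in for the bag DB of one group; values borrowed from the question.
state = "ALASKA"
db = [("M", 27257), ("H", 23696)]

names = [name for name, _ in db]        # plays the role of DB.name
populations = [pop for _, pop in db]    # plays the role of DB.population

# Two independent flattens: every name is paired with every population.
crossed = [(state, n, p) for n, p in product(names, populations)]

# One flatten over (name, population) tuples: one output row per input row.
paired = [(state, n, p) for n, p in db]

print(crossed)  # 4 rows for 2 input rows
print(paired)   # 2 rows, name and population stay together
```

With 5 rows per group the crossed version yields 25 rows per state, which matches the repeated names in the question.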

Related

PIG - Get Highest & Lowest Medal Winning Nations , GROUPed by Year

I'm pretty new to Pig. I have a dataset of Olympics data covering
4-5 years. I am trying to find the highest and lowest medal-winning
countries for each year. Here's a sample with header.
ATHLETE,COUNTRY,YEAR, SPORT,GOLD,SILVER,BRONZE,TOTAL
Yang Yilin,China,2008,Gymnastics,1,0,2,3
Leisel Jones,Australia,2000,Swimming,0,2,0,2
Go Gi-Hyeon,South Korea,2002,Short-Track Speed Skating,1,1,0,2
Chen Ruolin,China,2008,Diving,2,0,0,2
Katie Ledecky,United States,2012,Swimming,1,0,0,1
Ruta Meilutyte,Lithuania,2012,Swimming,1,0,0,1
Dániel Gyurta,Hungary,2004,Swimming,0,1,0,1
Arianna Fontana,Italy,2006,Short-Track Speed Skating,0,0,1,1
Olga Glatskikh,Russia,2004,Rhythmic Gymnastics,1,0,0,1
Kharikleia Pantazi,Greece,2000,Rhythmic Gymnastics,0,0,1,1
I tried everything I could think of to get this, but with little
success.
This is what I have now. Any help on solving this will be
appreciated!
DEFINE MYOVER org.apache.pig.piggybank.evaluation.Over;
DEFINE MYSTITCH org.apache.pig.piggybank.evaluation.Stitch;
A = LOAD 'MortDataSite/MyPigExercise/OlympicMedals.csv' using PigStorage(',') as (ATHLETE:CHARARRAY,COUNTRY:CHARARRAY,YEAR:INT,SPORT:CHARARRAY,GOLD:INT,SILVER:INT,BRONZE:INT,TOTAL:INT);
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,SUM(B.TOTAL) as TOT;
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E {
E1 = ORDER D BY TOT DESC;
GENERATE FLATTEN(MYSTITCH(E1, MYOVER(E1,'dense_rank',0,1,1)));
};
G = FOREACH F GENERATE stitched::YEAR,stitched::COUNTRY ,stitched::TOT,$3;
My output (since many nations have the same TOTAL medals, I expect
that more than one country may share a RANK):
(2000,Cuba,65,1)
(2000,Iran,4,1)
(2000,Chile,17,1)
(2000,China,79,1)
(2000,India,7,1)
(2000,Italy,65,1)
(2000,Japan,42,1)
(2000,Kenya,7,1)
(2000,Qatar,1,1)
(2000,Spain,42,1)
(2000,Brazil,48,1)
Expected output 1:
YEAR COUNTRY MAX(TOTAL)
2001 India 50
2003 UK 90
2006 Japan 56
and
Expected output 2:
YEAR COUNTRY MIN(TOTAL)
2001 India 5
2003 UK 10
2006 Japan 6
Updated query (working as expected):
Here's the updated query which gave me my desired result.
DEFINE MYOVER org.apache.pig.piggybank.evaluation.Over;
DEFINE MYSTITCH org.apache.pig.piggybank.evaluation.Stitch;
A = LOAD 'MortDataSite/MyPigExercise/OlympicMedals.csv' using PigStorage(',') as (ATHLETE:CHARARRAY,COUNTRY:CHARARRAY,YEAR:INT,SPORT:CHARARRAY,GOLD:INT,SILVER:INT,BRONZE:INT,TOTAL:INT);
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,SUM(B.TOTAL) as TOT;
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,MAX(D.TOT) as MTOT;
G = GROUP F BY YEAR;
H = FOREACH G {
G1 = ORDER F BY MTOT DESC;
GENERATE FLATTEN(MYSTITCH(G1, MYOVER(G1,'dense_rank',0,1,1)));
};
J = FOREACH H GENERATE stitched::YEAR,stitched::COUNTRY ,stitched::MTOT,$3;
Output:
YEAR COUNTRY MAX(TOTAL) RANKING
(2000,United States,242,1)
(2000,Russia,187,2)
(2000,Australia,182,3)
(2002,United States,84,1)
(2002,Canada,74,2)
(2002,Germany,61,3)
(2004,United States,265,1)
(2004,Russia,190,2)
(2004,Australia,156,3)
If you would like to get the MAX and MIN total medals by country by year, just use MAX and MIN.
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,SUM(B.TOTAL) as TOTAL;
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E GENERATE group as (YEAR,COUNTRY),MAX(D.TOTAL);
G = FOREACH E GENERATE group as (YEAR,COUNTRY),MIN(D.TOTAL);
DUMP F;
DUMP G;

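As a rough illustration of the two-step logic (sum per year and country, then regroup by year alone and take the extremes), here is a minimal Python sketch rather than Pig code. It uses only the sample rows from the question; the variable names are made up for the example. After the first step each (YEAR, COUNTRY) pair occurs once, so the per-year highest and lowest need the second grouping to be on YEAR only, as in the updated query's G = GROUP F BY YEAR.

```python
from collections import defaultdict

# (country, year, total) taken from the sample rows in the question.
rows = [
    ("China", 2008, 3), ("Australia", 2000, 2), ("South Korea", 2002, 2),
    ("China", 2008, 2), ("United States", 2012, 1), ("Lithuania", 2012, 1),
    ("Hungary", 2004, 1), ("Italy", 2006, 1), ("Russia", 2004, 1),
    ("Greece", 2000, 1),
]

# Step 1: total medals per (year, country), like GROUP BY (YEAR, COUNTRY) + SUM.
totals = defaultdict(int)
for country, year, total in rows:
    totals[(year, country)] += total

# Step 2: regroup by year only, then pick the extremes within each year.
by_year = defaultdict(list)
for (year, country), total in totals.items():
    by_year[year].append((country, total))

# Ties are resolved by whichever country was seen first; a ranking
# function such as dense_rank would be needed to keep all tied countries.
highest = {y: max(cs, key=lambda c: c[1]) for y, cs in by_year.items()}
lowest = {y: min(cs, key=lambda c: c[1]) for y, cs in by_year.items()}

print(highest[2000], lowest[2000])
```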
Select one relations all fields & one or two from other relation on PIG JOIN, how?

A = load '$input1' using PigStorage() AS (a,b,c,d,e);
B = load '$input2' using PigStorage() AS (a,b1,c1,d1,e1);
C = JOIN A by a, B by a;
D = do something;
'D' should be of format (a,b,c,d,e,b1)
How to achieve this?
D = FOREACH C GENERATE A::a .. A::e, B::b1 AS b1;

(hadoop.pig) multiple counts in single table

I have data with two fields, a string and a number:
data(string:chararray, number:int)
and I am counting rows under 5 different rules:
1: int being 0~1.
2: int being 1~2.
~
5: int being 4~5.
So I was able to count them individually,
zero_to_one = filter avg_user by average_stars >= 0 and average_stars <= 1;
A = GROUP zero_to_one ALL;
zto_count = FOREACH A GENERATE COUNT(zero_to_one);
one_to_two = filter avg_user by average_stars > 1 and average_stars <= 2;
B = GROUP one_to_two ALL;
ott_count = FOREACH B GENERATE COUNT(one_to_two);
two_to_three = filter avg_user by average_stars > 2 and average_stars <= 3;
C = GROUP two_to_three ALL;
ttt_count = FOREACH C GENERATE COUNT( two_to_three);
three_to_four = filter avg_user by average_stars > 3 and average_stars <= 4;
D = GROUP three_to_four ALL;
ttf_count = FOREACH D GENERATE COUNT( three_to_four);
four_to_five = filter avg_user by average_stars > 4 and average_stars <= 5;
E = GROUP four_to_five ALL;
ftf_count = FOREACH E GENERATE COUNT( four_to_five);
So this can be done, but
it results in 5 individual tables.
I want to see if there is any way (it's OK to be fancy, I love fancy stuff)
I can get the result in a single table.
That is, if
zto_count = 1
ott_count = 3
. = 2
. = 3
. = 5
then the table will be {1,3,2,3,5}
It is just easier to parse and organize the data that way.
Is there any way to do this?
Using this as input:
foo 2
foo 3
foo 2
foo 3
foo 5
foo 4
foo 0
foo 4
foo 4
foo 5
foo 1
foo 5
(0 and 1 each appear once, 2 and 3 each appear twice, 4 and 5 each appear thrice)
This script:
A = LOAD 'myData' USING PigStorage(' ') AS (name: chararray, number: int);
B = FOREACH (GROUP A BY number) GENERATE group AS number, COUNT(A) AS count ;
C = FOREACH (GROUP B ALL) {
zto = FOREACH B GENERATE (number==0?count:0) + (number==1?count:0) ;
ott = FOREACH B GENERATE (number==1?count:0) + (number==2?count:0) ;
ttt = FOREACH B GENERATE (number==2?count:0) + (number==3?count:0) ;
ttf = FOREACH B GENERATE (number==3?count:0) + (number==4?count:0) ;
ftf = FOREACH B GENERATE (number==4?count:0) + (number==5?count:0) ;
GENERATE SUM(zto) AS zto,
SUM(ott) AS ott,
SUM(ttt) AS ttt,
SUM(ttf) AS ttf,
SUM(ftf) AS ftf ;
}
Produces this output:
C: {zto: long,ott: long,ttt: long,ttf: long,ftf: long}
(2,3,4,5,6)
The number of FOREACHs in C shouldn't really matter, because the bag being iterated only has a handful of elements (one per distinct number), but if it is a concern they can be combined like this:
C = FOREACH (GROUP B ALL) {
total = FOREACH B GENERATE (number==0?count:0) + (number==1?count:0) AS zto,
(number==1?count:0) + (number==2?count:0) AS ott,
(number==2?count:0) + (number==3?count:0) AS ttt,
(number==3?count:0) + (number==4?count:0) AS ttf,
(number==4?count:0) + (number==5?count:0) AS ftf ;
GENERATE SUM(total.zto) AS zto,
SUM(total.ott) AS ott,
SUM(total.ttt) AS ttt,
SUM(total.ttf) AS ttf,
SUM(total.ftf) AS ftf ;
}
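As a quick cross-check of the output above, the same bucket sums can be reproduced in a few lines of Python. This is only a sketch that mirrors the script's bucket definitions, in which each boundary value is counted in two neighbouring buckets; the dictionary buckets is a name invented for the example.

```python
from collections import Counter

# The "number" column of the sample input, in file order.
numbers = [2, 3, 2, 3, 5, 4, 0, 4, 4, 5, 1, 5]

# Equivalent of B: one count per distinct number.
counts = Counter(numbers)

# Each bucket adds the counts of its two endpoints, as the Pig script does.
buckets = {"zto": (0, 1), "ott": (1, 2), "ttt": (2, 3), "ttf": (3, 4), "ftf": (4, 5)}
row = tuple(counts[a] + counts[b] for a, b in buckets.values())

print(row)  # (2, 3, 4, 5, 6)
```

If the buckets should not overlap, as in the question's original filters, each number would have to be assigned to exactly one bucket instead.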

finding mean using pig or hadoop

I have a huge text file of the following form
(the data is saved in a directory as data/data1.txt, data2.txt and so on):
merchant_id, user_id, amount
1234, 9123, 299.2
1233, 9199, 203.2
1234, 0124, 230
and so on..
What I want to do is find the average amount for each merchant,
and in the end save the output to a file,
something like
merchant_id, average_amount
1234, avg_amt_1234 a
and so on.
How do I calculate the standard deviation as well?
Sorry for asking such a basic question. :(
Any help would be appreciated. :)
Apache PIG is well adapted for such tasks. See example:
inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray,c2:chararray);
grp = group inpt by id;
mean = foreach grp {
sum = SUM(inpt.amnt);
count = COUNT(inpt);
generate group as id, sum/count as mean, sum as sum, count as count;
};
Pay special attention to the data type of the amnt column as it will influence which implementation of the SUM function PIG is going to invoke.
PIG can also do something that a plain SQL GROUP BY cannot: it can put the mean against each input row without using any inner joins. That is useful if you are calculating z-scores using the standard deviation.
mean = foreach grp {
sum = SUM(inpt.amnt);
count = COUNT(inpt);
generate FLATTEN(inpt), sum/count as mean, sum as sum, count as count;
};
FLATTEN(inpt) does the trick: you now have access to each original amount that contributed to the group's average, sum and count.
UPDATE 1:
Calculating variance and standard deviation:
inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
grp = group inpt by id;
mean = foreach grp {
sum = SUM(inpt.amnt);
count = COUNT(inpt);
generate flatten(inpt), sum/count as avg, count as count;
};
tmp = foreach mean {
dif = (amnt - avg) * (amnt - avg) ;
generate *, dif as dif;
};
grp = group tmp by id;
standard_tmp = foreach grp generate flatten(tmp), SUM(tmp.dif) as sqr_sum;
standard = foreach standard_tmp generate *, sqr_sum / count as variance, SQRT(sqr_sum / count) as standard;
It will use 2 jobs. I have not yet figured out how to do it in one; I need to spend more time on it.
So what do you want: running Java code, or the abstract map-reduce process? For the second:
The map step:
record -> (merchant_id as key, amount as value)
The reduce step:
(merchant_id, amount) -> (merchant_id, aggregate the value you want)
In the reduce step you are given a stream of records that share the same key, so you can compute almost any aggregate over them, including the average and the variance.
You can calculate the standard deviation in just one step, using the formula
var = E(x^2) - (E(x))^2
inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
grp = group inpt by id;
mean = foreach grp {
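-- Pig has no ** operator, so square each amount in a nested FOREACH first
sq = foreach inpt generate amnt * amnt as amnt2;
sum = SUM(inpt.amnt);
sum2 = SUM(sq.amnt2);
count = COUNT(inpt);
-- variance = E(x^2) - (E(x))^2; the standard deviation is its square root
generate flatten(inpt), sum/count as avg, count as count, SQRT(sum2/count - (sum/count) * (sum/count)) as std;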
};
that's it!
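The identity behind this one-pass approach can be checked numerically. The sketch below is plain Python, not Pig, and uses the two amounts listed for merchant 1234 in the question. It computes the population variance both ways (two passes around the mean, and one pass with the sums of x and x^2) and compares them.

```python
import math

# Amounts for merchant 1234 from the question's sample data.
amounts = [299.2, 230.0]

n = len(amounts)
mean = sum(amounts) / n

# Two-pass form: average squared distance from the mean.
var_two_pass = sum((x - mean) ** 2 for x in amounts) / n

# One-pass form: E(x^2) - (E(x))^2, needs only sum, sum of squares and count.
var_one_pass = sum(x * x for x in amounts) / n - mean ** 2

# Guard against a tiny negative value caused by floating-point rounding.
std = math.sqrt(max(var_one_pass, 0.0))

print(var_two_pass, var_one_pass, std)
```

One design caveat: the one-pass form subtracts two large, nearly equal numbers when the values are large relative to their spread, so it can lose precision; the two-pass form is numerically safer at the cost of a second scan.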
I calculated all the stats (min, max, mean and standard deviation) in just one FOREACH. FILTER_DATA contains the data set.
GROUP_SYMBOL_YEAR = GROUP FILTER_DATA BY (SYMBOL, SUBSTRING(TIMESTAMP,0,4));
STATS_ALL = FOREACH GROUP_SYMBOL_YEAR {
MINIMUM = MIN(FILTER_DATA.CLOSE);
MAXIMUM = MAX(FILTER_DATA.CLOSE);
MEAN = AVG(FILTER_DATA.CLOSE);
CNT = COUNT(FILTER_DATA.CLOSE);
CSQ = FOREACH FILTER_DATA GENERATE CLOSE * CLOSE AS (CC:DOUBLE);
GENERATE group.$0 AS (SYMBOL:CHARARRAY), MINIMUM AS (MIN:DOUBLE), MAXIMUM AS (MAX:DOUBLE), ROUND_TO(MEAN,6) AS (MEAN:DOUBLE), ROUND_TO(SQRT(SUM(CSQ.CC) / (CNT * 1.0) - (MEAN * MEAN)),6) AS (STDDEV:DOUBLE), group.$1 AS (YEAR:INT);
};

How to write order by in linq

This code's output looks like this:
a 1
b 12
I want to get output like this:
b 12
a 1
Query:
var x1 = (from v in db3.VoteRecords
join v2 in db3.Partis on v.PartiID equals v2.ID
where v.ProvinceID == (int)cmbProvience.SelectedValue
&& v.DistrictID == (int)cmbDistrict.SelectedValue
group v by new { v2.PartiName } into g
select new
{
Parti = g.Key.PartiName,
Votes = (from vt in g
select g.Key.PartiName).Count()
});
dataGridView1.DataSource = x1;
You can add an OrderByDescending call at the end of the query:
{
Parti = g.Key.PartiName,
Votes = (from vt in g
select g.Key.PartiName).Count()
}).OrderByDescending(l =>l.Parti);
If you want to order by the Votes column, do this:
{
Parti = g.Key.PartiName,
Votes = (from vt in g
select g.Key.PartiName).Count()
}).OrderByDescending(l =>l.Votes);
Or if you first want to order by Parti and then by Votes do this:
{
Parti = g.Key.PartiName,
Votes = (from vt in g
select g.Key.PartiName).Count()
}).OrderByDescending(l =>l.Parti).ThenByDescending (l =>l.Votes);
Or if you first want to order by Votes and then by Parti do this:
{
Parti = g.Key.PartiName,
Votes = (from vt in g
select g.Key.PartiName).Count()
}).OrderByDescending(l =>l.Votes ).ThenByDescending (l =>l.Parti);
