(hadoop.pig) multiple counts in single table - hadoop

I have a dataset in which each record has two fields, a string and a number:
data(string:chararray, number:int)
I am counting the records under 5 different rules:
1: number in the range 0~1.
2: number in the range 1~2.
...
5: number in the range 4~5.
I was able to count each range individually:
zero_to_one = filter avg_user by average_stars >= 0 and average_stars <= 1;
A = GROUP zero_to_one ALL;
zto_count = FOREACH A GENERATE COUNT(zero_to_one);
one_to_two = filter avg_user by average_stars > 1 and average_stars <= 2;
B = GROUP one_to_two ALL;
ott_count = FOREACH B GENERATE COUNT(one_to_two);
two_to_three = filter avg_user by average_stars > 2 and average_stars <= 3;
C = GROUP two_to_three ALL;
ttt_count = FOREACH C GENERATE COUNT( two_to_three);
three_to_four = filter avg_user by average_stars > 3 and average_stars <= 4;
D = GROUP three_to_four ALL;
ttf_count = FOREACH D GENERATE COUNT( three_to_four);
four_to_five = filter avg_user by average_stars > 4 and average_stars <= 5;
E = GROUP four_to_five ALL;
ftf_count = FOREACH E GENERATE COUNT( four_to_five);
This works, but it produces 5 separate tables.
Is there any way (fancy is fine, I love fancy stuff)
to get the result in a single table?
That is, if
zto_count = 1
ott_count = 3
ttt_count = 2
ttf_count = 3
ftf_count = 5
then the table would be {1,3,2,3,5}.
That would make the data much easier to parse and organize.
Is there a way to do this?

Using this as input:
foo 2
foo 3
foo 2
foo 3
foo 5
foo 4
foo 0
foo 4
foo 4
foo 5
foo 1
foo 5
(0 and 1 each appear once, 2 and 3 each appear twice, 4 and 5 each appear thrice)
This script:
A = LOAD 'myData' USING PigStorage(' ') AS (name: chararray, number: int);
B = FOREACH (GROUP A BY number) GENERATE group AS number, COUNT(A) AS count ;
C = FOREACH (GROUP B ALL) {
    zto = FOREACH B GENERATE (number==0?count:0) + (number==1?count:0) ;
    ott = FOREACH B GENERATE (number==1?count:0) + (number==2?count:0) ;
    ttt = FOREACH B GENERATE (number==2?count:0) + (number==3?count:0) ;
    ttf = FOREACH B GENERATE (number==3?count:0) + (number==4?count:0) ;
    ftf = FOREACH B GENERATE (number==4?count:0) + (number==5?count:0) ;
    GENERATE SUM(zto) AS zto,
             SUM(ott) AS ott,
             SUM(ttt) AS ttt,
             SUM(ttf) AS ttf,
             SUM(ftf) AS ftf ;
}
Produces this output:
C: {zto: long,ott: long,ttt: long,ttf: long,ftf: long}
(2,3,4,5,6)
The number of FOREACHs in C shouldn't really matter because C is only going to have 5 elements at most, but if it does, they can be combined like this:
C = FOREACH (GROUP B ALL) {
    total = FOREACH B GENERATE (number==0?count:0) + (number==1?count:0) AS zto,
                               (number==1?count:0) + (number==2?count:0) AS ott,
                               (number==2?count:0) + (number==3?count:0) AS ttt,
                               (number==3?count:0) + (number==4?count:0) AS ttf,
                               (number==4?count:0) + (number==5?count:0) AS ftf ;
    GENERATE SUM(total.zto) AS zto,
             SUM(total.ott) AS ott,
             SUM(total.ttt) AS ttt,
             SUM(total.ttf) AS ttf,
             SUM(total.ftf) AS ftf ;
}
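For comparison, the same idea (count per value first, then sum the conditional contributions into a single row) can be sketched outside Pig. This is only an illustrative Python sketch, not part of the Pig script: the names `BUCKETS` and `bucket_counts` are made up for the example, and the overlapping buckets simply mirror the expressions in the script above.

```python
from collections import Counter

# Each bucket lists the values that contribute to it, mirroring the
# (number==x?count:0) + (number==y?count:0) expressions in the Pig script.
BUCKETS = {
    "zto": (0, 1),
    "ott": (1, 2),
    "ttt": (2, 3),
    "ttf": (3, 4),
    "ftf": (4, 5),
}

def bucket_counts(numbers):
    """Return one row (a tuple) holding the count for every bucket."""
    per_value = Counter(numbers)  # plays the role of relation B
    return tuple(
        sum(per_value[v] for v in values) for values in BUCKETS.values()
    )

sample = [2, 3, 2, 3, 5, 4, 0, 4, 4, 5, 1, 5]
print(bucket_counts(sample))  # (2, 3, 4, 5, 6)
```

If the buckets should not overlap, only the value lists in `BUCKETS` would need to change.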

Related

PIG - Get Highest & Lowest Medal Winning Nations , GROUPed by Year

I'm pretty new to Pig. I have a dataset of Olympics data
for 4-5 years, and I am trying to find the highest and lowest medal-winning
countries for each year. Here's a sample with the header.
ATHLETE,COUNTRY,YEAR,SPORT,GOLD,SILVER,BRONZE,TOTAL
Yang Yilin,China,2008,Gymnastics,1,0,2,3
Leisel Jones,Australia,2000,Swimming,0,2,0,2
Go Gi-Hyeon,South Korea,2002,Short-Track Speed Skating,1,1,0,2
Chen Ruolin,China,2008,Diving,2,0,0,2
Katie Ledecky,United States,2012,Swimming,1,0,0,1
Ruta Meilutyte,Lithuania,2012,Swimming,1,0,0,1
Dániel Gyurta,Hungary,2004,Swimming,0,1,0,1
Arianna Fontana,Italy,2006,Short-Track Speed Skating,0,0,1,1
Olga Glatskikh,Russia,2004,Rhythmic Gymnastics,1,0,0,1
Kharikleia Pantazi,Greece,2000,Rhythmic Gymnastics,0,0,1,1
I tried the options I know of to get this, but with little
success.
This is what I have now. Any help with solving this would be
appreciated!
DEFINE MYOVER org.apache.pig.piggybank.evaluation.Over;
DEFINE MYSTITCH org.apache.pig.piggybank.evaluation.Stitch;
A = LOAD 'MortDataSite/MyPigExercise/OlympicMedals.csv' using PigStorage(',') as (ATHLETE:CHARARRAY,COUNTRY:CHARARRAY,YEAR:INT,SPORT:CHARARRAY,GOLD:INT,SILVER:INT,BRONZE:INT,TOTAL:INT);
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY), SUM(B.TOTAL) as TOT;
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E {
E1 = ORDER D BY TOT DESC;
GENERATE FLATTEN(MYSTITCH(E1, MYOVER(E1,'dense_rank',0,1,1)));
};
G = FOREACH F GENERATE stitched::YEAR,stitched::COUNTRY ,stitched::TOT,$3;
My output (since many nations have the same TOTAL medals,
I expect that more than one country may share a RANK):
(2000,Cuba,65,1)
(2000,Iran,4,1)
(2000,Chile,17,1)
(2000,China,79,1)
(2000,India,7,1)
(2000,Italy,65,1)
(2000,Japan,42,1)
(2000,Kenya,7,1)
(2000,Qatar,1,1)
(2000,Spain,42,1)
(2000,Brazil,48,1)
Expected output 1:
YEAR COUNTRY MAX(TOTAL)
2001 India 50
2003 UK 90
2006 Japan 56
and
Expected output 2:
YEAR COUNTRY MIN(TOTAL)
2001 India 5
2003 UK 10
2006 Japan 6
Updated query (working as expected)
Here's the updated query, which gave me the desired result.
DEFINE MYOVER org.apache.pig.piggybank.evaluation.Over;
DEFINE MYSTITCH org.apache.pig.piggybank.evaluation.Stitch;
A = LOAD 'MortDataSite/MyPigExercise/OlympicMedals.csv' using PigStorage(',') as (ATHLETE:CHARARRAY,COUNTRY:CHARARRAY,YEAR:INT,SPORT:CHARARRAY,GOLD:INT,SILVER:INT,BRONZE:INT,TOTAL:INT);
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY), SUM(B.TOTAL) as TOT;
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,MAX(D.TOT) as MTOT;
G = GROUP F BY YEAR;
H = FOREACH G {
G1 = ORDER F BY MTOT DESC;
GENERATE FLATTEN(MYSTITCH(G1, MYOVER(G1,'dense_rank',0,1,1)));
};
J = FOREACH H GENERATE stitched::YEAR,stitched::COUNTRY ,stitched::MTOT,$3;
Output:
YEAR COUNTRY MAX(TOTAL) RANKING
(2000,United States,242,1)
(2000,Russia,187,2)
(2000,Australia,182,3)
(2002,United States,84,1)
(2002,Canada,74,2)
(2002,Germany,61,3)
(2004,United States,265,1)
(2004,Russia,190,2)
(2004,Australia,156,3)
If you would like to get the MAX and MIN total medals by country by year, just use MAX and MIN.
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,SUM(B.TOTAL) as TOTAL;
E = GROUP D BY (YEAR,COUNTRY);
F = FOREACH E GENERATE group as (YEAR,COUNTRY),MAX(D.TOTAL);
G = FOREACH E GENERATE group as (YEAR,COUNTRY),MIN(D.TOTAL);
DUMP F;
DUMP G;
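To make the intended result concrete, here is a rough Python sketch of "sum per (year, country), then take the highest and lowest per year". It is not a translation of any script above: the function name `best_and_worst` is hypothetical, the rows are the ten sample rows from the question, and ties are resolved by whichever country happens to come first.

```python
from collections import defaultdict

# (country, year, total) taken from the sample rows in the question
ROWS = [
    ("China", 2008, 3),
    ("Australia", 2000, 2),
    ("South Korea", 2002, 2),
    ("China", 2008, 2),
    ("United States", 2012, 1),
    ("Lithuania", 2012, 1),
    ("Hungary", 2004, 1),
    ("Italy", 2006, 1),
    ("Russia", 2004, 1),
    ("Greece", 2000, 1),
]

def best_and_worst(rows):
    """Map each year to ((top country, medals), (bottom country, medals))."""
    totals = defaultdict(int)
    for country, year, total in rows:
        totals[(year, country)] += total        # like GROUP BY (YEAR, COUNTRY) + SUM
    by_year = defaultdict(list)
    for (year, country), total in totals.items():
        by_year[year].append((country, total))  # like GROUP BY YEAR only
    result = {}
    for year, entries in by_year.items():
        top = max(entries, key=lambda e: e[1])
        bottom = min(entries, key=lambda e: e[1])
        result[year] = (top, bottom)
    return result

for year, (top, bottom) in sorted(best_and_worst(ROWS).items()):
    print(year, top, bottom)
```

The point of the sketch is the second grouping step: the maximum and minimum are taken over a group keyed by year alone, not by (year, country).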

positional sum of 2 numbers

How can I sum 2 numbers digit by digit in pseudocode?
Note: the length of the numbers is not known in advance; they may have tens, hundreds, thousands...
Units should be added to units, tens to tens, hundreds to hundreds, and so on.
If adding the units gives a value >= 10, the ten has to be carried over to "the tens", and likewise for every higher position.
I tried
Start
Do
Add digit(x) in A to Sum(x)
Add digit(x) in B to Sum(x)
If Sum(x) > 9, then (?????)
digit(x) = digit(x+1)
while digit(x) in A and digit(x) in B is > 0
How to show the result?
I am lost with that.....
Please help!
Try this:
n = minDigit(a, b)   // digit count of the shorter number; a and b are the numbers
m = maxDigit(a, b)   // digit count of the longer number
let sum be a number; allocate m + 1 digits for it
// a[1], b[1] and sum[1] are the units digits
carry = 0
for (i = 1 to n)
    temp = a[i] + b[i] + carry
    // reset carry
    carry = 0
    if (temp >= 10)
        carry = 1
        temp = temp - 10
    sum[i] = temp
// copy the remaining digits of the longer number, still propagating the carry
if (digits(a) > digits(b))
    toCopy = a
else
    toCopy = b
for (i = n + 1 to m)
    temp = toCopy[i] + carry
    // reset carry
    carry = 0
    if (temp >= 10)
        carry = 1
        temp = temp - 10
    sum[i] = temp
// one last step to get the leftover carry
sum[m + 1] = carry
return sum
Let me know if it helps.
A and B are the integers you want to sum.
Note that the while loop ends when all three integers (A, B and carry) are equal to zero.
carry = 0
sum = 0
d = 1
while (A > 0 or B > 0 or carry > 0)
    tmp = carry + A mod 10 + B mod 10
    sum = sum + (tmp mod 10) * d
    carry = tmp / 10   // integer division
    d = d * 10
    A = A / 10         // integer division
    B = B / 10         // integer division
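The same carry logic can be written so that it works on digit strings of any length, which matches the "you don't know the length" requirement. This is a minimal Python sketch under the assumption that both inputs are non-negative decimal strings; the function name `add_digit_by_digit` is made up for the example.

```python
def add_digit_by_digit(a: str, b: str) -> str:
    """Add two non-negative decimal numbers given as digit strings."""
    result = []
    carry = 0
    i, j = len(a) - 1, len(b) - 1   # start at the units digit of each number
    while i >= 0 or j >= 0 or carry > 0:
        digit_a = int(a[i]) if i >= 0 else 0
        digit_b = int(b[j]) if j >= 0 else 0
        tmp = carry + digit_a + digit_b
        result.append(str(tmp % 10))  # digit that stays in this position
        carry = tmp // 10             # value carried to the next position
        i -= 1
        j -= 1
    # Digits were collected units-first, so reverse them for display.
    return "".join(reversed(result)) or "0"

print(add_digit_by_digit("999", "1"))    # 1000
print(add_digit_by_digit("1234", "87"))  # 1321
```

Collecting the digits in a list and reversing at the end is one way to "show the result": the units digit is produced first but printed last.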

Conditional Filter in GROUP BY in Pig

I have the following dataset in which I need to merge multiple rows into one if they have the same key. At the same time, I need to pick among the multiple tuples which gets grouped.
1 N1 1 10
1 N1 2 15
2 N1 1 10
3 N1 1 10
3 N1 2 15
4 N2 1 10
5 N3 1 10
5 N3 2 20
For example
A = LOAD 'data.txt' AS (f1:int, f2:chararray, f3:int, f4:int);
G = GROUP A BY (f1, f2);
DUMP G;
((1,N1),{(1,N1,1,10),(1,N1,2,15)})
((2,N1),{(2,N1,1,10)})
((3,N1),{(3,N1,1,10),(3,N1,2,15)})
((4,N2),{(4,N2,1,10)})
((5,N3),{(5,N3,1,10),(5,N3,2,20)})
Now, whenever the collected bag contains multiple tuples, I want to keep only those with f3==2. Here is the final data I want:
((1,N1),{(1,N1,2,15)}) -- f3==2, f3==1 is removed from this set
((2,N1),{(2,N1,1,10)})
((3,N1),{(3,N1,2,15)}) -- f3==2, f3==1 is removed from this bag
((4,N2),{(4,N2,1,10)})
((5,N3),{(5,N3,2,20)})
Any idea how to achieve this?
I solved it the way I described in the comment above. Here is how I did it:
A = LOAD 'group.txt' USING PigStorage(',') AS (f1:int, f2:chararray, f3:int, f4:int);
G = GROUP A BY (f1, f2);
CNT = FOREACH G GENERATE group, COUNT($1) AS cnt, $1;
SPLIT CNT INTO
CNT1 IF (cnt > 1),
CNT2 IF (cnt == 1);
M1 = FOREACH CNT1 {
row = FILTER $2 BY (f3 == 2);
GENERATE FLATTEN(row);
};
M2 = FOREACH CNT2 GENERATE FLATTEN($2);
O = UNION M1, M2;
DUMP O;
(2,N1,1,10)
(4,N2,1,10)
(1,N1,2,15)
(3,N1,2,15)
(5,N3,2,20)
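The rule implemented by the SPLIT/FILTER/UNION above (single-row groups pass through, larger groups keep only f3 == 2) can be illustrated with a short Python sketch. It is only an illustration of the rule, using the data from the question; the function name `pick_rows` is hypothetical.

```python
from itertools import groupby

ROWS = [
    (1, "N1", 1, 10), (1, "N1", 2, 15),
    (2, "N1", 1, 10),
    (3, "N1", 1, 10), (3, "N1", 2, 15),
    (4, "N2", 1, 10),
    (5, "N3", 1, 10), (5, "N3", 2, 20),
]

def pick_rows(rows):
    """Keep single-row groups as-is; in larger groups keep only f3 == 2."""
    key = lambda r: (r[0], r[1])           # group by (f1, f2)
    picked = []
    for _, group in groupby(sorted(rows, key=key), key=key):
        bag = list(group)
        if len(bag) > 1:                   # same test as cnt > 1
            bag = [r for r in bag if r[2] == 2]
        picked.extend(bag)
    return picked

for row in pick_rows(ROWS):
    print(row)
```

Unlike the UNION in the Pig script, this sketch returns the rows ordered by key, which is why (1,N1,2,15) comes first here.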

Piglatin limit and flatten produces wrong results

B = GROUP A BY state;
C = FOREACH B {
DA = ORDER A BY population DESC;
DB = LIMIT DA 5;
GENERATE FLATTEN(group), FLATTEN(DB.name), FLATTEN(DB.population);
}
The problem is that I get the name of the city 5 times instead of 1. I get something like:
(ALASKA,M,27257)
(ALASKA,M,23696)
(ALASKA,M,19949)
(ALASKA,M,19926)
(ALASKA,M,19833)
(ALASKA,H,27257)
(ALASKA,H,23696)
(ALASKA,H,19949)
(ALASKA,H,19926)
(ALASKA,H,19833)
And the output I need is:
(ALASKA,M,27257)
(ALASKA,H,23696)
The two flattens, FLATTEN(DB.name) and FLATTEN(DB.population), produce a Cartesian product of the two bags. Replace them with a single FLATTEN:
B = GROUP A BY state;
C = FOREACH B {
DA = ORDER A BY population DESC;
DB = LIMIT DA 5;
GENERATE FLATTEN(group), FLATTEN(DB.(name, population));
}
Or, since the bags created by the GROUP BY carry all of the original tuples with all of their columns, you can do this:
B = GROUP A BY state;
C = FOREACH B {
DA = ORDER A BY population DESC;
DB = LIMIT DA 5;
GENERATE FLATTEN(DB);
}
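Why two FLATTENs multiply the rows can be seen in a small Python analogy. This is only an analogy, not Pig semantics spelled out in full: the two rows are the ones from the expected output in the question, and `itertools.product` stands in for the cross product that two independent flattens produce.

```python
from itertools import product

# Top cities of one state, already ordered and limited (like DB in the script).
db = [("M", 27257), ("H", 23696)]

names = [name for name, _ in db]
populations = [population for _, population in db]

# Two separate flattens behave like a cross product of the projected bags:
# every name is combined with every population.
crossed = [("ALASKA", n, p) for n, p in product(names, populations)]

# One flatten over the paired columns keeps each name with its own population.
paired = [("ALASKA", n, p) for n, p in db]

print(len(crossed))  # 4
print(paired)        # [('ALASKA', 'M', 27257), ('ALASKA', 'H', 23696)]
```

With 5 rows per state the crossed version has 25 rows, which matches the repeated names in the question's output.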

Algorithm to evenly distribute items into 3 columns

I'm looking for an algorithm that will evenly distribute 1 to many items into three columns. No column can have more than one more item than any other column. I typed up an example of what I'm looking for below. Adding up Col1,Col2, and Col3 should equal ItemCount.
Edit: Also, the items are alpha-numeric and must be ordered within the column. The last item in the column has to be less than the first item in the next column.
Items        Col1,Col2,Col3
A            A
AB           A,B
ABC          A,B,C
ABCD         AB,C,D
ABCDE        AB,CD,E
ABCDEF       AB,CD,EF
ABCDEFG      ABC,DE,FG
ABCDEFGH     ABC,DEF,GH
ABCDEFGHI    ABC,DEF,GHI
ABCDEFGHIJ   ABCD,EFG,HIJ
ABCDEFGHIJK  ABCD,EFGH,IJK
Here you go, in Python:
NumCols = 3
DATA = "ABCDEFGHIJK"
for ItemCount in range(1, 12):
    subdata = DATA[:ItemCount]
    # Integer division (// in Python 3); earlier columns get the extra items
    Col1Count = (ItemCount + NumCols - 1) // NumCols
    Col2Count = (ItemCount + NumCols - 2) // NumCols
    Col3Count = (ItemCount + NumCols - 3) // NumCols
    Col1 = subdata[:Col1Count]
    Col2 = subdata[Col1Count:Col1Count+Col2Count]
    Col3 = subdata[Col1Count+Col2Count:]
    print("%2d %5s %5s %5s" % (ItemCount, Col1, Col2, Col3))
# Prints:
#  1     A
#  2     A     B
#  3     A     B     C
#  4    AB     C     D
#  5    AB    CD     E
#  6    AB    CD    EF
#  7   ABC    DE    FG
#  8   ABC   DEF    GH
#  9   ABC   DEF   GHI
# 10  ABCD   EFG   HIJ
# 11  ABCD  EFGH   IJK
This answer is now obsolete because the OP decided to simply change the question after I answered it. I’m just too lazy to delete it.
// column is zero-based (0, 1 or 2)
static int getColumnItemCount(int items, int column) {
    return items / 3 + ((items % 3) >= (column + 1) ? 1 : 0);
}
This question was the closest thing to my own that I found, so I'll post the solution I came up with. In JavaScript:
var items = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K']
var columns = [[], [], []]
for (var i=0; i<items.length; i++) {
  columns[Math.floor(i * columns.length / items.length)].push(items[i])
}
console.log(columns)
Just to give you a hint (it's pretty easy, so figure out the rest yourself):
Divide ItemCount by 3, rounding down. This is the minimum number of items in every column.
Then compute ItemCount % 3 (modulo), which is either 1 or 2 (otherwise ItemCount would be divisible by 3), and distribute that remainder, one item per column.
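The hint can be sketched in a few lines of Python. This is one possible reading of it, assuming the leftover items go to the leftmost columns (as in the table in the question); the function name `split_into_columns` is made up for the example.

```python
def split_into_columns(items, num_columns=3):
    """Split an ordered sequence into columns whose sizes differ by at most one."""
    # base = items every column gets; extra = leftover items to hand out
    base, extra = divmod(len(items), num_columns)
    columns = []
    start = 0
    for col in range(num_columns):
        size = base + (1 if col < extra else 0)  # first `extra` columns get one more
        columns.append(items[start:start + size])
        start += size
    return columns

print(split_into_columns("ABCDEFG"))  # ['ABC', 'DE', 'FG']
```

Because the columns are consecutive slices, the order within and across columns is preserved, which satisfies the ordering requirement added in the edit to the question.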
I needed a C# version so here's what I came up with (the algorithm is from Richie's answer):
// Start with 11 values
var data = "ABCDEFGHIJK";
// Split in 3 columns
var columnCount = 3;
// Find out how many values to display in each column
var columnCounts = new int[columnCount];
for (int i = 0; i < columnCount; i++)
    columnCounts[i] = (data.Count() + columnCount - (i + 1)) / columnCount;
// Allocate each value to the appropriate column
int iData = 0;
for (int i = 0; i < columnCount; i++)
    for (int j = 0; j < columnCounts[i]; j++)
        Console.WriteLine("{0} -> Column {1}", data[iData++], i + 1);
// PRINTS:
// A -> Column 1
// B -> Column 1
// C -> Column 1
// D -> Column 1
// E -> Column 2
// F -> Column 2
// G -> Column 2
// H -> Column 2
// I -> Column 3
// J -> Column 3
// K -> Column 3
It's quite simple:
If you have N elements indexed from 0 to N-1 and columns indexed from 0 to 2, the i-th element goes in column i mod 3 (where mod is the modulo operator, % in C, C++ and some other languages).
Do you just want the count of items in each column? If you have n items, then
the counts will be:
round(n/3), round(n/3), n-2*round(n/3)
where "round" rounds to the nearest integer (e.g. round(x)=(int)(x+0.5))
If you want to actually put the items there, try something like this Python-style pseudocode:
def columnize(items):
    i = 0
    answer = [[], [], []]
    for it in items:
        # round-robin: item i goes to column i mod 3
        answer[i % 3].append(it)
        i += 1
    return answer
Here's a PHP version I hacked together for all the PHP hacks out there like me (yup, guilt by association!)
function column_item_count($items, $column, $maxcolumns) {
    // floor, not round: every column gets at least the integer quotient
    return (int) floor($items / $maxcolumns) + (($items % $maxcolumns) >= $column ? 1 : 0);
}
And you can call it like this...
$cnt = sizeof($an_array_of_data);
$col1_cnt = column_item_count($cnt,1,3);
$col2_cnt = column_item_count($cnt,2,3);
$col3_cnt = column_item_count($cnt,3,3);
Credit for this should go to @Bombe, who provided it in Java (?) above.
NB: This function expects you to pass in an ordinal column number, i.e. first col = 1, second col = 2, etc...

Resources