Conditional Filter in GROUP BY in Pig - hadoop

I have the following dataset, in which I need to merge multiple rows into one if they have the same key. At the same time, I need to pick among the multiple tuples that get grouped.
1 N1 1 10
1 N1 2 15
2 N1 1 10
3 N1 1 10
3 N1 2 15
4 N2 1 10
5 N3 1 10
5 N3 2 20
For example
A = LOAD 'data.txt' AS (f1:int, f2:chararray, f3:int, f4:int);
G = GROUP A BY (f1, f2);
DUMP G;
((1,N1),{(1,N1,1,10),(1,N1,2,15)})
((2,N1),{(2,N1,1,10)})
((3,N1),{(3,N1,1,10),(3,N1,2,15)})
((4,N2),{(4,N2,1,10)})
((5,N3),{(5,N3,1,10),(5,N3,2,20)})
Now, if there are multiple tuples in the collected bag, I want to keep only those which have f3==2. Here is the final data which I want:
((1,N1),{(1,N1,2,15)}) -- f3==2, f3==1 is removed from this set
((2,N1),{(2,N1,1,10)})
((3,N1),{(3,N1,2,15)}) -- f3==2, f3==1 is removed from this bag
((4,N2),{(4,N2,1,10)})
((5,N3),{(5,N3,2,20)})
Any idea how to achieve this?

I did it my way, as specified in the comment above. Here is how I did it.
A = LOAD 'group.txt' USING PigStorage(',') AS (f1:int, f2:chararray, f3:int, f4:int);
G = GROUP A BY (f1, f2);
CNT = FOREACH G GENERATE group, COUNT($1) AS cnt, $1;
SPLIT CNT INTO
CNT1 IF (cnt > 1),
CNT2 IF (cnt == 1);
M1 = FOREACH CNT1 {
row = FILTER $2 BY (f3 == 2);
GENERATE FLATTEN(row);
};
M2 = FOREACH CNT2 GENERATE FLATTEN($2);
O = UNION M1, M2;
DUMP O;
(2,N1,1,10)
(4,N2,1,10)
(1,N1,2,15)
(3,N1,2,15)
(5,N3,2,20)
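For readers who want to check the rule outside Pig, the same logic can be sketched in plain Python. This is only an illustration, not part of the Pig script above; the function name `filter_groups` and the in-memory list of tuples are assumptions made for the example.

```python
from collections import defaultdict

def filter_groups(rows):
    """Group rows by (f1, f2); in groups with more than one row keep only f3 == 2."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row[0], row[1])].append(row)

    kept = []
    for key in sorted(groups):
        bag = groups[key]
        if len(bag) > 1:
            # Mirrors CNT1: multi-row bags are filtered on f3 == 2.
            kept.extend(r for r in bag if r[2] == 2)
        else:
            # Mirrors CNT2: single-row bags pass through unchanged.
            kept.extend(bag)
    return kept

rows = [
    (1, "N1", 1, 10), (1, "N1", 2, 15),
    (2, "N1", 1, 10),
    (3, "N1", 1, 10), (3, "N1", 2, 15),
    (4, "N2", 1, 10),
    (5, "N3", 1, 10), (5, "N3", 2, 20),
]
result = filter_groups(rows)
for row in result:
    print(row)
```

As in the Pig version, a group with several rows but none with f3 == 2 would produce no output at all, which may or may not be what is wanted.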

Related

top-K values for each Group using mapreduce

I want to write a map-reduce algorithm for finding the top N values (in ascending or descending order) for each group.
Input data
a,1
a,9
b,3
b,5
a,4
a,7
b,1
c,1
c,9
c,-2
d,1
b,1
a,10
a,19
output type 1
a 1,4,7,9,10,19
b 1,1,3,5
c -2,1,9
d 1
output type 2
a 19,10,9,7,4,1
b 5,3,1,1
c 9,1,-2
d 1
output type 1 for top 3
a 1,4,7
b 1,1,3
c -2,1,9
d 1
Please guide me.
You need to write a mapper that will split the input line by comma and produce a pair of Text, IntWritable:
Text('a,1') -> (mapper) -> Text('a'), IntWritable(1)
In reducer you will have the group and the list of values. You need to select the top K values from the list with priority queue:
// add all values to a priority queue (a min-heap, so poll() returns the smallest value first)
PriorityQueue<Integer> queue = new PriorityQueue<Integer>();
for (IntWritable value : values)
    queue.add(value.get());
// take up to K elements; stop early if the group has fewer than K values
StringBuilder topK = new StringBuilder();
for (int i = 0; i < K && !queue.isEmpty(); ++i) {
    if (i > 0) topK.append(", ");
    topK.append(queue.poll());
}
In Scalding (assuming data in tsv) it would be something like
Tsv(path, ('key, 'value)).groupBy('key)(_.sortWithTake('value -> 'value, N))
.write(Tsv(outputPath))
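To see the expected results without a cluster, the per-group selection can be sketched locally in Python with `heapq`. This is a minimal single-process illustration, not a MapReduce job; the function name `top_k_per_group` and its `descending` flag are assumptions for the example, and the last input line is read as `a,19`, which is what the expected output in the question implies.

```python
import heapq
from collections import defaultdict

def top_k_per_group(lines, k, descending=False):
    """Return {key: up to k values}, smallest first unless descending is set."""
    groups = defaultdict(list)
    for line in lines:
        # "Map" step: split each line into a key and an integer value.
        key, value = line.split(",")
        groups[key.strip()].append(int(value))
    # "Reduce" step: pick k values per key with a heap-based selection.
    pick = heapq.nlargest if descending else heapq.nsmallest
    return {key: pick(k, values) for key, values in sorted(groups.items())}

lines = ["a,1", "a,9", "b,3", "b,5", "a,4", "a,7", "b,1",
         "c,1", "c,9", "c,-2", "d,1", "b,1", "a,10", "a,19"]
print(top_k_per_group(lines, 3))
print(top_k_per_group(lines, 3, descending=True))
```

Groups with fewer than k values simply return everything they have, which is the case the reducer has to handle as well.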

Sorting a table by nested value in Lua [duplicate]

I have a program that aggregates, for every user, the total number of downloads performed along with the total amount of data downloaded in kB.
local table = {}
table[userID] = {5, 23498502}
My aim is for the printTable function to output the entire list of users in descending order of the amount of kB downloaded, v[2]:
local aUsers = {}
...
function topUsers(key, nDownloads, totalSize)
if aUsers[key] then
aUsers[key][1] = aUsers[key][1] + nDownloads
aUsers[key][2] = aUsers[key][2] + totalSize
else
aUsers[key] = {nDownloads, totalSize}
end
end
function printTable(t)
local str = ""
-- How to sort 't' so that it prints in v[2] descending order?
for k,v in pairs(t) do
str = str .. k .. ", " .. v[1] .. ", " .. v[2] .. "\n"
end
return str
end
...
Any ideas how I could do that?
You can get the keys into a separate table and then sort that table using the criteria you need:
local t = {
a = {1,2},
b = {2,3},
c = {4,1},
d = {9,9},
}
local keys = {}
for k in pairs(t) do table.insert(keys, k) end
table.sort(keys, function(a, b) return t[a][2] > t[b][2] end)
for _, k in ipairs(keys) do print(k, t[k][1], t[k][2]) end
will print:
d 9 9
b 2 3
a 1 2
c 4 1
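The same idea (collect the keys, sort them by a field of the nested value, then walk the sorted keys) carries over to other languages. Below is a rough Python analogue for comparison only; `print_table` and the sample `users` dictionary are hypothetical names that mirror the Lua example.

```python
def print_table(users):
    """Format users as 'key, downloads, size' lines, largest size first."""
    # Sort the keys by the second element of each value, descending.
    keys = sorted(users, key=lambda k: users[k][1], reverse=True)
    return "".join("%s, %d, %d\n" % (k, users[k][0], users[k][1]) for k in keys)

users = {"a": [1, 2], "b": [2, 3], "c": [4, 1], "d": [9, 9]}
report = print_table(users)
print(report, end="")
```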

(hadoop.pig) multiple counts in single table

So, I have data with two values, a string and a number.
data(string:chararray, number:int)
and I am counting in 5 different rules,
1: int being 0~1.
2: int being 1~2.
~
5: int being 4~5.
So I was able to count them individually,
zero_to_one = filter avg_user by average_stars >= 0 and average_stars <= 1;
A = GROUP zero_to_one ALL;
zto_count = FOREACH A GENERATE COUNT(zero_to_one);
one_to_two = filter avg_user by average_stars > 1 and average_stars <= 2;
B = GROUP one_to_two ALL;
ott_count = FOREACH B GENERATE COUNT(one_to_two);
two_to_three = filter avg_user by average_stars > 2 and average_stars <= 3;
C = GROUP two_to_three ALL;
ttt_count = FOREACH C GENERATE COUNT( two_to_three);
three_to_four = filter avg_user by average_stars > 3 and average_stars <= 4;
D = GROUP three_to_four ALL;
ttf_count = FOREACH D GENERATE COUNT( three_to_four);
four_to_five = filter avg_user by average_stars > 4 and average_stars <= 5;
E = GROUP four_to_five ALL;
ftf_count = FOREACH E GENERATE COUNT( four_to_five);
This works, but it produces five separate relations.
Is there any way (fancy solutions are welcome) to get the result in a single table?
For example, if
zto_count = 1
ott_count = 3
. = 2
. = 3
. = 5
then the table would be {1,3,2,3,5}.
That would make the data easier to parse and organize.
Is there a way to do this?
Using this as input:
foo 2
foo 3
foo 2
foo 3
foo 5
foo 4
foo 0
foo 4
foo 4
foo 5
foo 1
foo 5
(0 and 1 each appear once, 2 and 3 each appear twice, 4 and 5 each appear thrice)
This script:
A = LOAD 'myData' USING PigStorage(' ') AS (name: chararray, number: int);
B = FOREACH (GROUP A BY number) GENERATE group AS number, COUNT(A) AS count ;
C = FOREACH (GROUP B ALL) {
zto = FOREACH B GENERATE (number==0?count:0) + (number==1?count:0) ;
ott = FOREACH B GENERATE (number==1?count:0) + (number==2?count:0) ;
ttt = FOREACH B GENERATE (number==2?count:0) + (number==3?count:0) ;
ttf = FOREACH B GENERATE (number==3?count:0) + (number==4?count:0) ;
ftf = FOREACH B GENERATE (number==4?count:0) + (number==5?count:0) ;
GENERATE SUM(zto) AS zto,
SUM(ott) AS ott,
SUM(ttt) AS ttt,
SUM(ttf) AS ttf,
SUM(ftf) AS ftf ;
}
Produces this output:
C: {zto: long,ott: long,ttt: long,ttf: long,ftf: long}
(2,3,4,5,6)
The number of FOREACHs in C shouldn't really matter because C is going to only have 5 elements at most, but if it does, then they can be put together like this:
C = FOREACH (GROUP B ALL) {
total = FOREACH B GENERATE (number==0?count:0) + (number==1?count:0) AS zto,
(number==1?count:0) + (number==2?count:0) AS ott,
(number==2?count:0) + (number==3?count:0) AS ttt,
(number==3?count:0) + (number==4?count:0) AS ttf,
(number==4?count:0) + (number==5?count:0) AS ftf ;
GENERATE SUM(total.zto) AS zto,
SUM(total.ott) AS ott,
SUM(total.ttt) AS ttt,
SUM(total.ttf) AS ttf,
SUM(total.ftf) AS ftf ;
}
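As a cross-check on the bucketing itself, here is a small Python sketch that counts all ranges in one pass. It follows the ranges from the question (0 to 1 inclusive, then above 1 up to 2, and so on), so it is an assumption-laden illustration rather than a translation of the Pig script; `bucket_counts` and `upper_bounds` are hypothetical names.

```python
def bucket_counts(values, upper_bounds=(1, 2, 3, 4, 5)):
    """Count values per range: [0, 1], (1, 2], (2, 3], (3, 4], (4, 5]."""
    counts = [0] * len(upper_bounds)
    for v in values:
        if v < 0:
            continue  # below the first range, not counted
        for i, upper in enumerate(upper_bounds):
            if v <= upper:
                counts[i] += 1
                break  # each value lands in exactly one bucket
    return counts

stars = [2, 3, 2, 3, 5, 4, 0, 4, 4, 5, 1, 5]
print(bucket_counts(stars))
```

With the sample input this gives 2, 2, 2, 3, 3. That differs from the (2,3,4,5,6) shown above because the Pig script adds each integer's count to two neighbouring columns, whereas this sketch puts every value in exactly one range.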

Vectorize matrix operation in R

I have an R x C matrix filled to the k-th row and empty below this row. What I need to do is fill the remaining rows. To do this, I have a function that takes 2 entire rows as arguments, processes them, and outputs 2 fresh rows (these outputs fill the empty rows of the matrix, in batches of 2). I have a fixed matrix containing all 'pairs' of rows to be processed, but my for loop is not helping performance:
# the processRows function:
processRows = function(r1, r2)
{
# just change a little bit the two rows and return it in a compact way
nr1 = r1 * 0.1
nr2 = -r2 * 0.1
matrix (c(nr1, nr2), ncol = 2)
}
# M is the matrix
# nrow(M) and k are even, so nLeft is even
M = matrix(1:48, ncol = 3)
# half to fill (can be more or less, but k is always even)
k = nrow(M)/2
# simulate empty rows to be filled
M[-(1:k), ] = 0
cat('before fill')
print(M)
# number of empty rows to fill
nLeft = nrow(M) - k
nextRow = k + 1
# each row in idxList represents a 'pair' of rows to be processed
# any pairwise combination of non-empty rows could happen
# make it reproducible
set.seed(1)
idxList = matrix (sample(1:k, k), ncol = 2, byrow = TRUE)
for ( i in 1 : (nLeft / 2))
{
row1 = M[idxList[i, 1],]
row2 = M[idxList[i, 2],]
# the two columns in 'results' will become 2 rows in M
results = processRows(row1, row2)
# fill the matrix
M[nextRow, ] = results[, 1]
nextRow = nextRow + 1
M[nextRow, ] = results[, 2]
nextRow = nextRow + 1
}
cat('after fill')
print(M)
Okay, here is your code first. We run this so that we have a copy of the "true" matrix, the one we hope to reproduce, faster.
#### Original Code (aka Gold Standard) ####
M = matrix(1:48, ncol = 3)
k = nrow(M)/2
M[-(1:k), ] = 0
nLeft = nrow(M) - k
nextRow = k + 1
idxList = matrix(1:k, ncol = 2)
for ( i in 1 : (nLeft / 2))
{
row1 = M[idxList[i, 1],]
row2 = M[idxList[i, 2],]
results = matrix(c(2*row1, 3*row2), ncol = 2)
M[nextRow, ] = results[, 1]
nextRow = nextRow + 1
M[nextRow, ] = results[, 2]
nextRow = nextRow + 1
}
Now here is the vectorized code. The basic idea is that if you are processing 4 rows, you should not pass them as vectors one at a time but process them all at once. That is:
(1:3) * 2
(1:3) * 2
(1:3) * 2
(1:3) * 2
is the same (but slower) as:
c(1:3, 1:3, 1:3, 1:3) * 2
So first, we will use your same setup code, then create the rows to be processed as two long vectors (where all 4 original rows are just strung together as in my simple example above). Then, we take those results and transform them into matrices with the appropriate dimensions. The last trick is to assign the results back in just two steps. You can assign to multiple rows of a matrix at once, so we use seq() to build two interleaved sets of row indices (every other row, starting at nextRow2 and at nextRow2 + 1) and assign the first and second columns of the results to them, respectively.
#### Vectorized Code (testing) ####
M2 = matrix(1:48, ncol = 3)
k2 = nrow(M2)/2
M2[-(1:k2), ] = 0
nLeft2 = nrow(M2) - k2
nextRow2 = k2 + 1
idxList2 = matrix(1:k2, ncol = 2)
## create two long vectors of all rows to be processed
row12 <- as.vector(t(M2[idxList2[, 1],]))
row22 <- as.vector(t(M2[idxList2[, 2],]))
## get all results
results2 = matrix(c(2*row12, 3*row22), ncol = 2)
## add results back
M2[seq(nextRow2, nextRow2 + nLeft2-1, by = 2), ] <- matrix(results2[,1], nLeft2/2, byrow=TRUE)
M2[seq(nextRow2+1, nextRow2 + nLeft2, by = 2), ] <- matrix(results2[,2], nLeft2/2, byrow=TRUE)
## check that vectorized code matches your examples
all.equal(M, M2)
Which on my machine gives:
> all.equal(M, M2)
[1] TRUE

Algorithm to evenly distribute items into 3 columns

I'm looking for an algorithm that will evenly distribute 1 to many items into three columns. No column can have more than one more item than any other column. I typed up an example of what I'm looking for below. Adding up Col1,Col2, and Col3 should equal ItemCount.
Edit: Also, the items are alpha-numeric and must be ordered within the column. The last item in the column has to be less than the first item in the next column.
Items        Col1,Col2,Col3
A            A
AB           A,B
ABC          A,B,C
ABCD         AB,C,D
ABCDE        AB,CD,E
ABCDEF       AB,CD,EF
ABCDEFG      ABC,DE,FG
ABCDEFGH     ABC,DEF,GH
ABCDEFGHI    ABC,DEF,GHI
ABCDEFGHIJ   ABCD,EFG,HIJ
ABCDEFGHIJK  ABCD,EFGH,IJK
Here you go, in Python:
NumCols = 3
DATA = "ABCDEFGHIJK"
for ItemCount in range(1, 12):
    subdata = DATA[:ItemCount]
    # Integer division (//) rounds down; earlier columns get the extra items.
    Col1Count = (ItemCount + NumCols - 1) // NumCols
    Col2Count = (ItemCount + NumCols - 2) // NumCols
    Col3Count = (ItemCount + NumCols - 3) // NumCols
    Col1 = subdata[:Col1Count]
    Col2 = subdata[Col1Count:Col1Count+Col2Count]
    Col3 = subdata[Col1Count+Col2Count:]
    print("%2d %5s %5s %5s" % (ItemCount, Col1, Col2, Col3))
# Prints:
# 1 A
# 2 A B
# 3 A B C
# 4 AB C D
# 5 AB CD E
# 6 AB CD EF
# 7 ABC DE FG
# 8 ABC DEF GH
# 9 ABC DEF GHI
# 10 ABCD EFG HIJ
# 11 ABCD EFGH IJK
This answer is now obsolete because the OP decided to simply change the question after I answered it. I’m just too lazy to delete it.
// column is zero-based: 0, 1, or 2
static int getColumnItemCount(int items, int column) {
    return (items / 3) + (((items % 3) >= (column + 1)) ? 1 : 0);
}
This question was the closest thing to my own that I found, so I'll post the solution I came up with. In JavaScript:
var items = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K']
var columns = [[], [], []]
for (var i=0; i<items.length; i++) {
  columns[Math.floor(i * columns.length / items.length)].push(items[i])
}
console.log(columns)
Just to give you a hint (it's pretty easy, so figure out the rest yourself):
Divide ItemCount by 3, rounding down. This is the minimum number of items in every column.
Now compute ItemCount % 3 (modulo), which is 0, 1, or 2. That many items are left over; distribute them one per column, starting with the first.
I needed a C# version so here's what I came up with (the algorithm is from Richie's answer):
// Start with 11 values
var data = "ABCDEFGHIJK";
// Split in 3 columns
var columnCount = 3;
// Find out how many values to display in each column
var columnCounts = new int[columnCount];
for (int i = 0; i < columnCount; i++)
    columnCounts[i] = (data.Count() + columnCount - (i + 1)) / columnCount;
// Allocate each value to the appropriate column
int iData = 0;
for (int i = 0; i < columnCount; i++)
    for (int j = 0; j < columnCounts[i]; j++)
        Console.WriteLine("{0} -> Column {1}", data[iData++], i + 1);
// PRINTS:
// A -> Column 1
// B -> Column 1
// C -> Column 1
// D -> Column 1
// E -> Column 2
// F -> Column 2
// G -> Column 2
// H -> Column 2
// I -> Column 3
// J -> Column 3
// K -> Column 3
It's quite simple.
If you have N elements indexed from 0 to N-1 and columns indexed from 0 to 2, the i-th element will go in column i mod 3 (where mod is the modulo operator, % in C, C++ and some other languages).
Do you just want the count of items in each column? If you have n items, then
the counts will be:
round(n/3), round(n/3), n-2*round(n/3)
where "round" rounds to the nearest integer (e.g. round(x)=(int)(x+0.5))
If you want to actually put the items there, try something like this Python-style pseudocode:
def columnize(items):
    i = 0
    answer = [[], [], []]
    for it in items:
        # append keeps each item whole; += would split a multi-character string
        answer[i % 3].append(it)
        i += 1
    return answer
Here's a PHP version I hacked together for all the PHP hacks out there like me (yup, guilt by association!)
function column_item_count($items, $column, $maxcolumns) {
    // floor, not round: every column gets the base count, plus one if the remainder reaches it
    return (int) floor($items / $maxcolumns) + (($items % $maxcolumns) >= $column ? 1 : 0);
}
And you can call it like this...
$cnt = sizeof($an_array_of_data);
$col1_cnt = column_item_count($cnt,1,3);
$col2_cnt = column_item_count($cnt,2,3);
$col3_cnt = column_item_count($cnt,3,3);
Credit for this should go to @Bombe who provided it in Java (?) above.
NB: This function expects you to pass in an ordinal column number, i.e. first col = 1, second col = 2, etc...
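Pulling the counting answers together, here is a compact Python sketch that both sizes the columns and slices the items in order, so the ordering requirement from the question's edit is kept. It is one possible formulation, not any particular answerer's code; `split_into_columns` and `num_cols` are names invented for the example.

```python
def split_into_columns(items, num_cols=3):
    """Split items into num_cols consecutive slices whose sizes differ by at most one."""
    base, extra = divmod(len(items), num_cols)
    columns = []
    start = 0
    for col in range(num_cols):
        # The first `extra` columns each take one leftover item.
        size = base + (1 if col < extra else 0)
        columns.append(items[start:start + size])
        start += size
    return columns

print(split_into_columns("ABCDEFGHIJK"))
print(split_into_columns("ABCD"))
```

Because the slices are consecutive, the last item of each column always precedes the first item of the next one, provided the input is already sorted.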
