top-K values for each Group using mapreduce - hadoop

I want to write a MapReduce algorithm that finds the top N values (in ascending or descending order) for each group.
Input data
a,1
a,9
b,3
b,5
a,4
a,7
b,1
c,1
c,9
c,-2
d,1
b,1
a,10
a,19
output type 1
a 1,4,7,9,10,19
b 1,1,3,5
c -2,1,9
d 1
output type 2
a 19,10,9,7,4,1
b 5,3,1,1
c 9,1,-2
d 1
output type 1 for top 3
a 1,4,7
b 1,1,3
c -2,1,9
d 1
Please guide me

You need to write a mapper that will split the input line by comma and produce a pair of Text, IntWritable:
Text('a,1') -> (mapper) -> Text('a'), IntWritable(1)
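In the reducer you will have the group and the list of its values. Select the top K values from the list with a priority queue:
// add all values to the priority queue (a min-heap, so poll() returns the smallest value first;
// pass a reversed comparator such as Collections.reverseOrder() for descending order)
PriorityQueue<Integer> queue = new PriorityQueue<Integer>();
for (IntWritable value : values)
    queue.add(value.get());
// take the first K elements; stop early if the group has fewer than K values
StringBuilder topK = new StringBuilder();
for (int i = 0; i < K && !queue.isEmpty(); ++i) {
    if (i > 0) topK.append(", ");
    topK.append(queue.poll());
}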
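The mapper and reducer steps described above can be tried outside Hadoop. The following is a minimal in-memory sketch in Python, not Hadoop code: the function names (`map_line`, `reduce_group`, `top_k_per_group`) are made up for illustration, and the last sample row is read as `a,19`, which is an assumption based on the expected output in the question.

```python
import heapq
from collections import defaultdict

def map_line(line):
    """Mapper: split 'a,1' into the pair ('a', 1)."""
    key, value = line.strip().split(",")
    return key, int(value)

def reduce_group(values, k, descending=False):
    """Reducer: keep at most k values, smallest first (or largest first)."""
    if descending:
        return heapq.nlargest(k, values)
    return heapq.nsmallest(k, values)

def top_k_per_group(lines, k, descending=False):
    """Simulate the shuffle by grouping mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        key, value = map_line(line)
        groups[key].append(value)
    return {key: reduce_group(vals, k, descending)
            for key, vals in sorted(groups.items())}

# Sample rows from the question (last row assumed to be "a,19").
lines = ["a,1", "a,9", "b,3", "b,5", "a,4", "a,7", "b,1",
         "c,1", "c,9", "c,-2", "d,1", "b,1", "a,10", "a,19"]

for key, values in top_k_per_group(lines, 3).items():
    print(key, ",".join(map(str, values)))
```

`heapq.nsmallest` and `heapq.nlargest` return fewer than k items when a group is small, so groups such as `d` need no special handling.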

In Scalding (assuming data in tsv) it would be something like
Tsv(path, ('key, 'value)).groupBy('key)(_.sortWithTake('value -> 'value, N))
.write(Tsv(outputPath))

Related

MapReduce fundamentals

1)
map(nr, txt)
    words = split (txt, ' ')
    for(i=0; i< |words| - 1; i++)
        emit(words[i]+' '+words[i+1], 1)
reduce(key, vals)
    s=0
    for v : vals
        s += v
    if(s = 5)
        emit(key,s)
2)
map(nr, txt)
    words = split (txt, ' ')
    for(i=0; i < |words|; i++)
        emit(txt, length(words[i]))
reduce(key, vals)
    s=0
    c=0
    for v : vals
        s += v
        c += 1
    r = s/c
    emit(key,r)
I am new to MapReduce and cannot work out whether the if condition in code (1) will ever be satisfied.
Q1: What does the MapReduce job in each of the two code blocks do?
Could you please give any input on the above question?
The first block counts bigrams (pairs of adjacent words within a record). Reading s = 5 as an equality test, the reducer emits a bigram only when its total count is exactly 5, so the if condition is satisfied whenever some pair of adjacent words occurs exactly 5 times in the input.
The second block emits, for every word, the whole input text txt as the key and the word's length as the value. The reducer therefore computes the average word length for each distinct input text, not for each word. Identical texts share a key, but that does not change the average.
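To check this reading of the two jobs, they can be run on a small input with a toy in-memory MapReduce. This is only a sketch: `run_job` is a hypothetical helper rather than a Hadoop API, the sample records are invented, and `s = 5` is assumed to mean an equality test.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Tiny in-memory MapReduce: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for nr, txt in enumerate(records):
        for key, value in mapper(nr, txt):
            groups[key].append(value)
    out = []
    for key, vals in groups.items():
        out.extend(reducer(key, vals))
    return out

def bigram_map(nr, txt):
    words = txt.split(" ")
    for i in range(len(words) - 1):
        yield words[i] + " " + words[i + 1], 1

def bigram_reduce(key, vals):
    s = sum(vals)
    if s == 5:  # `s = 5` read as an equality test (assumption)
        yield key, s

def length_map(nr, txt):
    for word in txt.split(" "):
        yield txt, len(word)  # the key is the whole line, not the word

def length_reduce(key, vals):
    yield key, sum(vals) / len(vals)

# Invented sample input: "x y" occurs 2 + 2 + 1 = 5 times in total.
records = ["x y x y", "x y x y", "x y zzzz"]
print(run_job(records, bigram_map, bigram_reduce))
print(run_job(records, length_map, length_reduce))
```

On this input the first job emits only the bigram "x y", and the second job emits one average per distinct line.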

Unable to find error in segment tree: minimum in subarray

I am new to data structures and algorithms, and I am unable to find the error in my code for this question:
Range Minimum Query
Given an array A of size N, there are two types of queries on this array.
q l r: In this query you need to print the minimum in the sub-array A[l:r].
u x y: In this query you need to update A[x]=y.
Input: First line of the test case contains two integers, N and Q, size of array A and number of queries.
Second line contains N space separated integers, elements of A.
Next Q lines contain one of the two queries.
Output:
For each type 1 query, print the minimum element in the sub-array A[l:r].
Constraints:
1 ≤ N,Q,y ≤ 10^5
1 ≤ l,r,x≤N
#include<bits/stdc++.h>
using namespace std;
long a [100001];
//global array to store input
long tree[400004];
//global array to store tree
// FUNCTION TO BUILD SEGMENT TREE //////////
void build(long i,long start,long end) //i = tree node
{
if(start==end)
{
tree[i]=a[start];
return;
}
long mid=(start+end)/2;
build(i*2,start,mid);
build(i*2+1,mid+1,end);
tree[i] = min(tree[i*2] , tree[i*2+1]);
}
// FUNCTION TO UPDATE SEGMENT TREE //////////
void update (long i,long start,long end,long idx,long val)
//idx = index to be updated
// val = new value to be given at that index
{
if(start==end)
tree[i]=a[idx]=val;
else
{
int mid=(start+end)/2;
if(start <= idx and idx <= mid)
update(i*2,start,mid,idx,val);
else
update(i*2+1,mid+1,end,idx,val);
tree[i] = min(tree[i*2] , tree[i*2+1]);
}
}
// FUNCTION FOR QUERY
long query(long i,long start,long end,long l,long r)
{
if(start>r || end<l || start > end)
return INT_MAX;
else
if(start>=l && end<=r)
return tree[i];
long mid=(start+end)/2;
long ans1 = query(i*2,start,mid,l,r);
long ans2 = query(i*2+1,mid+1,end,l,r);
return min(ans1,ans2);
}
int main()
{
long n,q;
cin>>n>>q;
for(int i=0 ; i<n ; i++)
cin>>a[i];
//for(int i=1 ; i<2*n ; i++) cout<<tree[i]<<" "; cout<<endl;
build(1,0,n-1);
//for(int i=1 ; i<2*n ; i++) cout<<tree[i]<<" "; cout<<endl;
while(q--)
{
long l,r;
char ch;
cin>>ch>>l>>r;
if(ch=='q')
cout<<query(1,0,n-1,l-1,r-1)<<endl;
else
update(1,0,n-1,l,r);
}
return 0;
}
Example :input
5 15
1 5 2 4 3
q 1 5
q 1 3
q 3 5
q 1 5
q 1 2
q 2 4
q 4 5
u 3 1
u 3 100
u 3 6
q 1 5
q 1 5
q 1 2
q 2 4
q 4 5
Expected output:
1
1
2
1
1
2
3
1
1
1
4
3
All given indices appear to be 1-based: 1 ≤ l,r,x ≤ N
You chose to build your segment tree with 0-based indexing, so all queries and updates should use the same indexing.
Your queries already subtract 1, but the update call passes x unchanged. With 0-based indexing that sets the element after the intended one (A[x+1] in the problem's 1-based numbering), so this part is wrong:
update(1,0,n-1,l,r);
To fix it, change it to this:
update(1,0,n-1,l-1,r);
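The index conversion can be checked against the example from the question. Below is a minimal Python sketch of the same recursive min segment tree. The class and the `run` helper are illustrative names, not part of the original code, and the array is assumed to be non-empty, as the constraints guarantee.

```python
class MinSegmentTree:
    """Recursive segment tree over a 0-based array, storing range minima."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [float("inf")] * (4 * self.n)
        self._build(1, 0, self.n - 1, values)

    def _build(self, i, start, end, values):
        if start == end:
            self.tree[i] = values[start]
            return
        mid = (start + end) // 2
        self._build(2 * i, start, mid, values)
        self._build(2 * i + 1, mid + 1, end, values)
        self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])

    def update(self, idx, val, i=1, start=0, end=None):
        """Set element idx (0-based) to val."""
        if end is None:
            end = self.n - 1
        if start == end:
            self.tree[i] = val
            return
        mid = (start + end) // 2
        if idx <= mid:
            self.update(idx, val, 2 * i, start, mid)
        else:
            self.update(idx, val, 2 * i + 1, mid + 1, end)
        self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])

    def query(self, l, r, i=1, start=0, end=None):
        """Minimum of elements l..r (0-based, inclusive)."""
        if end is None:
            end = self.n - 1
        if start > r or end < l:
            return float("inf")
        if l <= start and end <= r:
            return self.tree[i]
        mid = (start + end) // 2
        return min(self.query(l, r, 2 * i, start, mid),
                   self.query(l, r, 2 * i + 1, mid + 1, end))

def run(values, commands):
    """Apply 1-based ('q', l, r) / ('u', x, y) commands to a 0-based tree."""
    st = MinSegmentTree(values)
    out = []
    for op, x, y in commands:
        if op == "q":
            out.append(st.query(x - 1, y - 1))
        else:
            st.update(x - 1, y)  # only the index shifts, not the new value
    return out

commands = [("q", 1, 5), ("q", 1, 3), ("q", 3, 5), ("q", 1, 5), ("q", 1, 2),
            ("q", 2, 4), ("q", 4, 5), ("u", 3, 1), ("u", 3, 100), ("u", 3, 6),
            ("q", 1, 5), ("q", 1, 5), ("q", 1, 2), ("q", 2, 4), ("q", 4, 5)]
print(run([1, 5, 2, 4, 3], commands))
```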

pseudocode for this program (Matlab)

I have three sets, say:
a=[1 1 1 1];
b=[2 2 2];
c=[3 3];
Now, I have to find all unique combinations obtained by taking 3 elements from the union of all sets.
So in matlab, I can do it:
>> a=[1 1 1 1];
>> b=[2 2 2];
>> c=[3 3];
>> all=[a b c];
>> nchoosek(all,3)
>> unique(nchoosek(all,3),'rows')
The output is:
1 1 1
1 1 2
1 1 3
1 2 2
1 2 3
1 3 3
2 2 2
2 2 3
2 3 3
How to write the logic behind the program in pseudocode?
Here's how I would do it:
Create a dictionary of item counts.
Recurse on this dictionary k times, taking care not to pick items that are not or no longer in the pool.
When recursing, skip items that are smaller (by some criterion) than the current item in order to get a unique list.
In pseudocode:
function ucombok_rec(count, k, lowest)
{
    if (k == 0) return [[]];
    var res = [];
    for (item in count) {
        if (item >= lowest && count[item] > 0) {
            count[item]--;
            var combo = ucombok_rec(count, k - 1, item);
            for (c in combo) res ~= [[item] ~ c];
            count[item]++;
        }
    }
    return res;
}
function ucombok(s, k)
{
    if (!s) return [];                      // nothing to do
    var count = {};
    var lowest = min(s);                    // min. value in set
    for (item in s) count[item]++;          // create item counts
    return ucombok_rec(count, k, lowest);   // recurse
}
In this code, [] denotes a list or vector, {} a dictionary or map and the tilde ~ means list concatenation. The count decrements and increments around the recursion remove an item temporarily from the item pool.
In your example, where the pool is made up of three lists, you'd call the function like this:
c = ucombok(a ~ b ~ c, 3)
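For comparison, here is one way the pseudocode might translate into runnable Python. It is a sketch under the same idea (item counts plus a non-decreasing pick order). `unique_combinations` is an illustrative name, and `collections.Counter` stands in for the dictionary of item counts.

```python
from collections import Counter

def unique_combinations(pool, k):
    """All distinct k-element combinations drawn from pool, in sorted order."""
    counts = Counter(pool)
    items = sorted(counts)

    def rec(start, remaining):
        if remaining == 0:
            return [[]]
        result = []
        # Only look at items from `start` onward so each combination
        # is generated once, in non-decreasing order.
        for idx in range(start, len(items)):
            item = items[idx]
            if counts[item] > 0:
                counts[item] -= 1          # take the item out of the pool
                for tail in rec(idx, remaining - 1):
                    result.append([item] + tail)
                counts[item] += 1          # put it back
        return result

    return rec(0, k)

a, b, c = [1, 1, 1, 1], [2, 2, 2], [3, 3]
for combo in unique_combinations(a + b + c, 3):
    print(*combo)
```

Note that 3 3 3 does not appear, because the pool contains only two 3s.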

Conditional Filter in GROUP BY in Pig

I have the following dataset in which I need to merge multiple rows into one if they have the same key. At the same time, I need to pick among the multiple tuples which gets grouped.
1 N1 1 10
1 N1 2 15
2 N1 1 10
3 N1 1 10
3 N1 2 15
4 N2 1 10
5 N3 1 10
5 N3 2 20
For example
A = LOAD 'data.txt' AS (f1:int, f2:chararray, f3:int, f4:int);
G = GROUP A BY (f1, f2);
DUMP G;
((1,N1),{(1,N1,1,10),(1,N1,2,15)})
((2,N1),{(2,N1,1,10)})
((3,N1),{(3,N1,1,10),(3,N1,2,15)})
((4,N2),{(4,N2,1,10)})
((5,N3),{(5,N3,1,10),(5,N3,2,20)})
Now, if there are multiple tuples in a collected bag, I want to keep only those which have f3==2. Here is the final data which I want:
((1,N1),{(1,N1,2,15)}) -- f3==2, f3==1 is removed from this bag
((2,N1),{(2,N1,1,10)})
((3,N1),{(3,N1,2,15)}) -- f3==2, f3==1 is removed from this bag
((4,N2),{(4,N2,1,10)})
((5,N3),{(5,N3,2,20)})
Any idea how to achieve this?
I solved it the way I described in the comment above. Here is how I did it:
A = LOAD 'group.txt' USING PigStorage(',') AS (f1:int, f2:chararray, f3:int, f4:int);
G = GROUP A BY (f1, f2);
CNT = FOREACH G GENERATE group, COUNT($1) AS cnt, $1;
SPLIT CNT INTO
CNT1 IF (cnt > 1),
CNT2 IF (cnt == 1);
M1 = FOREACH CNT1 {
row = FILTER $2 BY (f3 == 2);
GENERATE FLATTEN(row);
};
M2 = FOREACH CNT2 GENERATE FLATTEN($2);
O = UNION M1, M2;
DUMP O;
(2,N1,1,10)
(4,N2,1,10)
(1,N1,2,15)
(3,N1,2,15)
(5,N3,2,20)
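The same selection rule can be prototyped outside Pig to sanity-check the expected rows. This Python sketch only mirrors the logic of the script above (group, count, filter the multi-row groups, union). `pick_rows` is a made-up name, and the output is sorted by key, whereas Pig's DUMP order may differ.

```python
from collections import defaultdict

def pick_rows(rows):
    """Group rows by (f1, f2); in groups with more than one row keep only f3 == 2."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row[0], row[1])].append(row)
    result = []
    for key in sorted(groups):
        bag = groups[key]
        if len(bag) > 1:                      # corresponds to CNT1 (cnt > 1)
            bag = [row for row in bag if row[2] == 2]
        result.extend(bag)                    # corresponds to FLATTEN + UNION
    return result

rows = [(1, "N1", 1, 10), (1, "N1", 2, 15), (2, "N1", 1, 10),
        (3, "N1", 1, 10), (3, "N1", 2, 15), (4, "N2", 1, 10),
        (5, "N3", 1, 10), (5, "N3", 2, 20)]
for row in pick_rows(rows):
    print(row)
```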

Find all possible combinations of a String representation of a number

Given a mapping:
A: 1
B: 2
C: 3
...
...
...
Z: 26
Find all possible ways a number can be represented. E.g. For an input: "121", we can represent it as:
ABA [using: 1 2 1]
LA [using: 12 1]
AU [using: 1 21]
I tried thinking about using some sort of a dynamic programming approach, but I am not sure how to proceed. I was asked this question in a technical interview.
Here is a solution I could think of, please let me know if this looks good:
A[i]: Total number of ways to represent the sub-array number[0..i-1] using the integer to alphabet mapping.
Solution [am I missing something?]:
A[0] = 1 // there is only 1 way to represent the subarray consisting of only 1 number
for(i = 1:A.size):
A[i] = A[i-1]
if(input[i-1]*10 + input[i] < 26):
A[i] += 1
end
end
print A[A.size-1]
To just get the count, the dynamic programming approach is pretty straight-forward:
A[0] = 1
for i = 1:n
A[i] = 0
if input[i-1] > 0 // avoid 0
A[i] += A[i-1];
if i > 1 && // avoid index-out-of-bounds on i = 1
10 <= (10*input[i-2] + input[i-1]) <= 26 // check that number is 10-26
A[i] += A[i-2];
If you instead want to list all representations, dynamic programming isn't particularly well-suited for this, you're better off with a simple recursive algorithm.
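Both ideas in this answer, the counting DP and the simple recursion for listing, can be sketched in Python as follows. The function names are illustrative, and the input is assumed to be a string of digits.

```python
def count_decodings(s):
    """Number of ways to decode digit string s with A=1 ... Z=26."""
    n = len(s)
    ways = [0] * (n + 1)
    ways[0] = 1                      # empty prefix: one (empty) decoding
    for i in range(1, n + 1):
        if s[i - 1] != "0":          # single digit 1-9
            ways[i] += ways[i - 1]
        if i > 1 and 10 <= int(s[i - 2:i]) <= 26:   # two digits 10-26
            ways[i] += ways[i - 2]
    return ways[n]

def list_decodings(s):
    """All decodings of s: try a 1-digit and a 2-digit first letter, then recurse."""
    if not s:
        return [""]
    result = []
    for width in (1, 2):
        head = s[:width]
        if len(head) == width and head[0] != "0" and int(head) <= 26:
            letter = chr(ord("A") + int(head) - 1)
            result += [letter + rest for rest in list_decodings(s[width:])]
    return result

print(count_decodings("121"))
print(list_decodings("121"))
```

The count always equals the length of the listed result, but the DP gets there in linear time without building the strings.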
First off, we need an intuitive way to enumerate all the possibilities. My simple construction is given below.
Let us assume a simple way to represent your integer in string format:
a1 a2 a3 a4 ... an; for instance, in 121, a1 -> 1, a2 -> 2, a3 -> 1
Now,
we need to find the number of ways of placing a + sign between two adjacent characters. Here + means concatenation of the two characters.
a1 - a2 - a3 - .... - an, where - shows the places where a '+' can be placed. So the number of positions is n - 1, where n is the string length.
Each position either has a + symbol or does not, so it can be represented as a bit.
So this boils down to how many different bit strings of length n-1 are possible, which is 2^(n-1). To enumerate the possibilities, go through every bit string and place + signs in the corresponding positions to get every representation.
For your example, 121
Four bit strings are possible 00 01 10 11
1 2 1
1 2 + 1
1 + 2 1
1 + 2 + 1
And if you see a character followed by a +, join it with the next character, and do this sequentially to get the representation:
x + y z a + b + c d
would be (x+y) z (a+b+c) d
Hope it helps.
And of course you will have to take care of edge cases where some joined integer is greater than 26.
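The bit-string construction could look like this in Python. It is a brute-force sketch that tries all 2^(n-1) masks and discards groupings with a chunk outside 1..26 or with a leading zero. The function name is made up, and a non-empty digit string is assumed.

```python
def decodings_by_bitmask(s):
    """Try every way of joining adjacent digits; keep groupings whose chunks are all 1..26."""
    n = len(s)
    results = []
    for mask in range(1 << (n - 1)):
        chunks = [s[0]]
        for pos in range(n - 1):
            if (mask >> pos) & 1:         # bit set: a '+' joins digit pos and pos + 1
                chunks[-1] += s[pos + 1]
            else:                         # bit clear: start a new chunk
                chunks.append(s[pos + 1])
        if all(c[0] != "0" and int(c) <= 26 for c in chunks):
            results.append("".join(chr(ord("A") + int(c) - 1) for c in chunks))
    return results

print(decodings_by_bitmask("121"))
```

For "121" the mask that joins all three digits gives the chunk 121, which is rejected, leaving three decodings.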
I think, recursive traverse through all possible combinations would do just fine:
mapping = {"1":"A", "2":"B", "3":"C", "4":"D", "5":"E", "6":"F", "7":"G",
           "8":"H", "9":"I", "10":"J",
           "11":"K", "12":"L", "13":"M", "14":"N", "15":"O", "16":"P",
           "17":"Q", "18":"R", "19":"S", "20":"T", "21":"U", "22":"V", "23":"W",
           "24":"X", "25":"Y", "26":"Z"}

def represent(A, B):
    # A is the candidate code for the next letter, B is the rest of the digits
    if A == B == '':
        return [""]
    ret = []
    if A in mapping:
        ret += [mapping[A] + r for r in represent(B, '')]
    if len(A) > 1:
        # move the last digit of A to the front of B and try a shorter code
        ret += represent(A[:-1], A[-1]+B)
    return ret

print(represent("121", ""))
Assuming you only need to count the number of combinations, and assuming that 0 followed by an integer in [1,9] is not a valid concatenation, a brute-force strategy would be:
Count(s,n)
    if (n == 0) return 1   // empty prefix: exactly one way
    x=0
    if (s[n-1] is valid)
        x=Count(s,n-1)
    y=0
    if (n >= 2 and s[n-2] concat s[n-1] is valid)
        y=Count(s,n-2)
    return x+y
A better strategy would be to use divide-and-conquer:
Count(s,start,n)
if (n is even)
{
    //split s into equal left and right parts; total count is left count multiplied by right count
    x=Count(s,start,n/2) * Count(s,start+n/2,n/2);
    y=0;
    if (s[start+n/2-1] concat s[start+n/2] is valid)
    {
        //if the concatenation of the middle two characters is valid
        //count left of the middle two characters
        //count right of the middle two characters
        //multiply the two counts and add to existing count
        y=Count(s,start,n/2-1)*Count(s,start+n/2+1,n/2-1);
    }
    return x+y;
}
else
{
    //there are three cases here:
    //case 1: if the middle character is valid,
    //then count everything to the left of the middle character,
    //count everything to the right of the middle character,
    //multiply the two, assign to x
    x=...
    //case 2: if the middle character concatenated with the one to its left is valid,
    //then count everything to the left of these two characters
    //count everything to the right of these two characters
    //multiply the two, assign to y
    y=...
    //case 3: if the middle character concatenated with the one to its right is valid,
    //then count everything to the left of these two characters
    //count everything to the right of these two characters
    //multiply the two, assign to z
    z=...
    return x+y+z;
}
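The brute-force solution has a time complexity of T(n)=T(n-1)+T(n-2)+O(1), which is exponential.
The divide-and-conquer solution is polynomial. As written, though, the even case makes up to four recursive calls on halves and the odd case up to six, so the recurrence is closer to T(n)=6T(n/2)+O(1), i.e. O(n**lg6), than to T(n)=3T(n/2)+O(1), which would give O(n**lg3).
Hope this is correct.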
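To compare the two strategies on real inputs, here is a small Python sketch. The divide-and-conquer version is a simplification of the pseudocode above rather than a literal translation: it always cuts at position n // 2 and distinguishes only whether a two-digit letter spans the cut, which covers even and odd lengths alike and makes up to four recursive calls. All names are illustrative.

```python
def valid(chunk):
    """A chunk is a valid letter code if it has no leading zero and is in 1..26."""
    return chunk[0] != "0" and 1 <= int(chunk) <= 26

def count_brute(s, n=None):
    """Brute force: the last letter uses either one digit or two."""
    n = len(s) if n is None else n
    if n == 0:
        return 1
    x = count_brute(s, n - 1) if valid(s[n - 1]) else 0
    y = count_brute(s, n - 2) if n >= 2 and valid(s[n - 2:n]) else 0
    return x + y

def count_dc(s):
    """Divide and conquer around the cut between s[:mid] and s[mid:]."""
    n = len(s)
    if n == 0:
        return 1
    if n == 1:
        return 1 if valid(s) else 0
    mid = n // 2
    # Case 1: no letter spans the cut, so the halves decode independently.
    total = count_dc(s[:mid]) * count_dc(s[mid:])
    # Case 2: one letter spans the cut (s[mid-1] and s[mid] read together).
    if valid(s[mid - 1:mid + 1]):
        total += count_dc(s[:mid - 1]) * count_dc(s[mid + 1:])
    return total

for digits in ("13", "121", "1214", "1234121", "100"):
    print(digits, count_brute(digits), count_dc(digits))
```

The two cases are disjoint and cover every decoding, which is why the products can simply be added.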
Something like this?
Haskell code:
import qualified Data.Map as M
import Data.Maybe (fromJust)

combs str = f str [] where
  charMap = M.fromList $ zip (map show [1..]) ['A'..'Z']
  f [] result = [reverse result]
  f (x:xs) result
    | null xs =
        case M.lookup [x] charMap of
          Nothing -> ["The character " ++ [x] ++ " is not in the map."]
          Just a  -> [reverse $ a:result]
    | otherwise =
        case M.lookup [x,head xs] charMap of
          Just a  -> f (tail xs) (a:result)
                       ++ (f xs ((fromJust $ M.lookup [x] charMap):result))
          Nothing -> case M.lookup [x] charMap of
                       Nothing -> ["The character " ++ [x]
                                    ++ " is not in the map."]
                       Just a  -> f xs (a:result)
Output:
*Main> combs "121"
["LA","AU","ABA"]
Here is the solution based on my discussion here:
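private static int decoder2(int[] input) {
    int[] A = new int[input.length + 1];
    A[0] = 1; // the empty prefix has exactly one decoding
    for (int i = 1; i < input.length + 1; i++) {
        A[i] = 0;
        if (input[i-1] > 0) { // a single digit 1-9 maps to a letter
            A[i] += A[i-1];
        }
        int twoDigits = (i > 1) ? 10*input[i-2] + input[i-1] : 0;
        if (twoDigits >= 10 && twoDigits <= 26) { // two digits must form 10-26
            A[i] += A[i-2];
        }
        System.out.println(A[i]); // running count for the first i digits
    }
    return A[input.length];
}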
Just use breadth-first search.
For instance, with 121:
Start from the first digit.
First consume one digit: map 1 to A, leaving 21.
Then consume two digits: map 12 to L, leaving 1.
Repeat for each remainder until nothing is left.
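A breadth-first search along these lines might be written as below. This is a sketch: each queue entry is a hypothetical (decoded prefix, remaining digits) pair, and each step consumes either one or two digits.

```python
from collections import deque

def decodings_bfs(s):
    """Breadth-first search over (decoded prefix, remaining digits) states."""
    queue = deque([("", s)])
    results = []
    while queue:
        decoded, rest = queue.popleft()
        if not rest:                      # nothing left: a complete decoding
            results.append(decoded)
            continue
        for width in (1, 2):              # consume one digit, then two digits
            head = rest[:width]
            if len(head) == width and head[0] != "0" and int(head) <= 26:
                letter = chr(ord("A") + int(head) - 1)
                queue.append((decoded + letter, rest[width:]))
    return results

print(decodings_bfs("121"))
```

States are expanded level by level, so decodings with fewer letters are completed first.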
This problem can be done in O(fib(n+2)) time with a standard DP algorithm.
We have exactly n subproblems, and working bottom up we can solve each subproblem of size i in O(fib(i)) time.
Summing the series gives fib(n+2).
If you consider the question carefully, you see that it is a Fibonacci series.
I took standard Fibonacci code and just changed it to fit our conditions.
The space is obviously bounded by the size of all solutions, O(fib(n)).
Consider this pseudocode:
Map<String, String> mapping = new HashMap<String, String>(); // "1" -> "A", ..., "26" -> "Z"
List<String> iterative_fib_sequence(String input) {
    int length = input.length();
    List<String> b = new ArrayList<String>(); // decodings of the prefix ending two digits back
    b.add("");                                // the empty prefix has one (empty) decoding
    List<String> a = new ArrayList<String>(); // decodings of the prefix ending one digit back
    if (length == 0)
    {
        return b;
    }
    if (mapping.containsKey(input.substring(0, 1)))
    {
        a.add(mapping.get(input.substring(0, 1)));
    }
    for (int i = 1; i < length; ++i)
    {
        List<String> c = new ArrayList<String>();
        String dig2Prefix = input.substring(i - 1, i + 1); // letter with 2 digits (j-z)
        if (mapping.containsKey(dig2Prefix))
        {
            for (String s : b)
            {
                c.add(s + mapping.get(dig2Prefix));
            }
        }
        String dig1Prefix = input.substring(i, i + 1); // letter with 1 digit (a-i)
        if (mapping.containsKey(dig1Prefix))
        {
            for (String s : a)
            {
                c.add(s + mapping.get(dig1Prefix));
            }
        }
        b = a;
        a = c;
    }
    return a;
}
This is an old question, but I am adding an answer so that others can find help.
It took me some time to understand the solution to this problem. I referred to the accepted answer, @Karthikeyan's answer and the solution from geeksforgeeks, and wrote my own code as below.
To understand my code, first understand the examples below:
We know decodings([1, 2]) are "AB" or "L", so decoding_counts([1, 2]) == 2.
And decodings([1, 2, 1]) are "ABA", "AU", "LA", so decoding_counts([1, 2, 1]) == 3.
Using the above two examples, let's evaluate decodings([1, 2, 1, 4]):
case: "taking next digit as single digit"
Taking 4 as a single digit that decodes to the letter 'D', this case contributes decoding_counts([1, 2, 1]) == 3 decodings, because [1, 2, 1, 4] will be decoded as "ABAD", "AUD", "LAD".
case: "combining next digit with the previous digit"
Combining 4 with the previous 1 as 14, which decodes to the letter 'N', this case contributes decoding_counts([1, 2]) == 2 decodings, because [1, 2, 1, 4] will be decoded as "ABN" or "LN".
Adding the two cases gives decoding_counts([1, 2, 1, 4]) == 3 + 2 == 5.
Below is my Python code; read the comments.
def decoding_counts(digits):
    # defining counts as: counts[i] -> decoding_counts(digits[:i+1])
    counts = [0] * len(digits)
    counts[0] = 1
    for i in range(1, len(digits)):
        # case: "taking next digit as single digit"
        if digits[i] != 0:  # `0` does not map to any letter
            counts[i] = counts[i - 1]
        # case: "combining next digit with the previous digit"
        combine = 10 * digits[i - 1] + digits[i]
        if 10 <= combine <= 26:  # two-digit mappings
            counts[i] += (1 if i < 2 else counts[i - 2])
    return counts[-1]

for digits in "13", "121", "1214", "1234121":
    print(digits, "-->", decoding_counts(list(map(int, digits))))
outputs:
13 --> 2
121 --> 3
1214 --> 5
1234121 --> 9
Note: I assumed that the input does not start with 0, consists only of the digits 0-9, and has a sufficient length (at least one digit).
For Swift, this is what I came up with. Basically, I convert the string into an array and go through it, inserting a space at different positions of this array and appending each result to another array. The second part (mapping the groups to letters) should be easy after this is done.
//test case
let input = "1221"
func combination(_ input: String) {
var arr = Array(input)
var possible = [String]()
//... means inclusive range
for i in 2...arr.count {
var temp = arr
//basically goes through it backwards so
// adding the space doesn't mess up the index
for j in (1..<i).reversed() {
temp.insert(" ", at: j)
possible.append(String(temp))
}
}
print(possible)
}
combination(input)
//prints:
//["1 221", "12 21", "1 2 21", "122 1", "12 2 1", "1 2 2 1"]
Another recursive approach in Python, which prints every decoding:
def stringCombinations(digits, i=0, s=''):
    if i == len(digits):
        print(s)
        return
    alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    total = 0
    # try a one-digit code, then a two-digit code, starting at position i
    for j in range(i, min(i + 1, len(digits) - 1) + 1):
        total = (total * 10) + digits[j]
        if total == 0:  # a letter code cannot start with 0
            break
        if total <= 26:
            stringCombinations(digits, j + 1, s + alphabet[total - 1])

if __name__ == '__main__':
    # read a digit string such as 121 and convert it to a list of ints
    digits = [int(ch) for ch in input().strip()]
    print(digits)
    stringCombinations(digits)
