Reducer that groups by two values - hadoop

I have a case in which the Mapper emits data that belongs to a subgroup, and each subgroup belongs to a group.
For each group, I need to add up all the values within each subgroup and then find the minimal sum among that group's subgroups.
So the output from my Mapper looks like this:
Group 1
group,subgroupId,value
Group1,1,2
Group1,1,3
Group1,1,4
Group1,2,1
Group1,2,2
Group1,3,1
Group1,3,2
Group1,3,5
Group 2
group,subgroupId,value
Group2,4,2
Group2,4,3
Group2,4,4
Group2,5,1
Group2,5,2
Group2,6,1
Group2,6,2
And my output should be
Group1, 1, (2+3+4)
Group1, 2, (1+2)
Group1, 3, (1+2+5)
Group1 min = min((2+3+4),(1+2),(1+2+5))
Same for Group 2.
So I practically need to group twice: first by GROUP, and then, inside each group, by SUBGROUPID.
The reducer should emit the minimal sum for each group; in the given example it should emit (2, 3) for Group1, since the minimal sum is 3 and it comes from the subgroup with id 2.
So it seems this is best solved by reducing twice: the first reduce would get elements grouped by subgroup id, and its output would then be passed to a second reduce grouped by group id.
Does this make sense, and how would I implement it? I've looked at ChainMapper and ChainReducer, but they don't fit this purpose.
Thanks

If all the data can fit in the memory of one machine, you can do all of this in a single job with a single reducer (job.setNumReduceTasks(1);) and two temp variables. The output is emitted in the cleanup phase of the reducer. Here is the pseudocode, assuming the new Hadoop API (which supports the cleanup() method):
int tempKey;
int tempMin;

setup() {
    tempMin = Integer.MAX_VALUE;
}

reduce(key, values) {
    int sum = 0;
    while (values.hasNext()) {
        sum += values.next();
    }
    if (sum < tempMin) {
        tempMin = sum;
        tempKey = key;
    }
}

cleanup() { // only in the new API
    emit(tempKey, tempMin);
}
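For reference, a minimal Java sketch of this pseudocode, assuming the new API with Text keys (the subgroup id) and IntWritable values; the class name is just a placeholder:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MinSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final Text tempKey = new Text();
    private int tempMin;

    @Override
    protected void setup(Context context) {
        tempMin = Integer.MAX_VALUE;
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) {
        // Sum all values of the current key (subgroup).
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Remember the smallest per-key sum this (single) reducer has seen so far.
        if (sum < tempMin) {
            tempMin = sum;
            tempKey.set(key);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit exactly once, after all keys have been reduced.
        context.write(tempKey, new IntWritable(tempMin));
    }
}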

Your approach (summarized below) is how I would do it.
Job 1:
Mapper: Emits each value keyed by its subgroupid.
Combiner/Reducer (same class): Sums the values for each subgroupid.
Job 2:
Mapper: Re-keys each subgroupid's sum by its groupid.
Combiner/Reducer (same class): Finds the minimum sum for each groupid.
This is best implemented as two jobs for the following reasons:
It simplifies the mapper and reducer significantly (you don't need to worry about finding all the groupids the first time around). Finding the (groupid, subgroupid) pairs in a single mapper could be non-trivial, whereas writing the two separate mappers is trivial.
It follows the MapReduce programming guidelines given by Tom White in Hadoop: The Definitive Guide (Chapter 6).
An Oozie workflow can easily accommodate the dependent jobs.
The intermediate output (key: subgroupid, value: sum for that subgroupid) should be small, limiting the use of network resources.
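If you chain the two jobs directly instead of through Oozie, a driver along these lines could work. This is only a sketch: the mapper, combiner, and reducer class names are hypothetical placeholders for the classes described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GroupMinDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path subgroupSums = new Path(args[1] + "/subgroup-sums");  // intermediate output

        // Job 1: sum the values of each subgroup.
        Job sumJob = Job.getInstance(conf, "subgroup-sum");
        sumJob.setJarByClass(GroupMinDriver.class);
        sumJob.setMapperClass(SubgroupMapper.class);   // hypothetical: emits (subgroupid, value)
        sumJob.setCombinerClass(SumReducer.class);     // hypothetical: sums IntWritable values
        sumJob.setReducerClass(SumReducer.class);
        sumJob.setOutputKeyClass(Text.class);
        sumJob.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(sumJob, new Path(args[0]));
        FileOutputFormat.setOutputPath(sumJob, subgroupSums);
        if (!sumJob.waitForCompletion(true)) System.exit(1);

        // Job 2: re-key each subgroup sum by its group and take the minimum.
        Job minJob = Job.getInstance(conf, "group-min");
        minJob.setJarByClass(GroupMinDriver.class);
        minJob.setMapperClass(GroupMapper.class);      // hypothetical: emits (groupid, subgroup sum)
        minJob.setCombinerClass(MinReducer.class);     // hypothetical: keeps the minimum value
        minJob.setReducerClass(MinReducer.class);
        minJob.setOutputKeyClass(Text.class);
        minJob.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(minJob, subgroupSums);
        FileOutputFormat.setOutputPath(minJob, new Path(args[1] + "/group-min"));
        System.exit(minJob.waitForCompletion(true) ? 0 : 1);
    }
}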

Related

Custom Partitioner, without setting number of reducers

Is it mandatory to set the number of reducers in order to use a custom partitioner?
Example: in the word count problem, I want the counts of all stop words in one partition and the counts of the remaining words in a different partition. If I set the number of reducers to two, with stop words going to one partition and the other words to the next, it works, but then I am restricting the number of reducers to two (or N), which I don't want. What is the best approach here? Or do I have to calculate and set the number of reducers based on the size of the input to get the best performance?
Specifying a custom partitioner does not change anything since the number of partitions is provided to the partitioner:
int getPartition(KEY key, VALUE value, int numPartitions)
If you don't set a partitioner then the HashPartitioner is used. Its implementation is trivial:
public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
The design of a custom partitioner is up to you. The main goal of a partitioner is to avoid skew and to evenly distribute the load across the provided number of partitions. For some small job it could be fine to support only two reducers, but if you want your job to scale then you must design it to run with an arbitrary number of reducers.
Or do I have to calculate and set the number of reducers based on the size of the input to get the best performance?
That is something you always have to do, and it is unrelated to the use of a custom partitioner. You MUST set the number of reducers; the default value is 1, and Hadoop won't compute this value for you.
If you want to send stop words to one reducer and the other words to the remaining reducers, you can do something like this:
public int getPartition(K key, V value, int numReduceTasks) {
    if (isStopWord(key)) {
        return 0;
    } else {
        return ((key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1)) + 1;
    }
}
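For completeness, here is a hedged sketch of how such a partitioner might look as a full class and be wired into the job; the Text/IntWritable types and the STOP_WORDS set are assumptions for illustration only.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class StopWordPartitioner extends Partitioner<Text, IntWritable> {
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "of", "and"));

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 1) {
            return 0;  // everything goes to the single reducer
        }
        if (STOP_WORDS.contains(key.toString())) {
            return 0;  // stop words always land in the first partition
        }
        // Spread the remaining words over the other numReduceTasks - 1 partitions.
        return ((key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1)) + 1;
    }
}

It would be registered on the job together with whatever reducer count you choose:
job.setPartitionerClass(StopWordPartitioner.class);
job.setNumReduceTasks(4);  // any value >= 2 works with this partitioner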
However, this can easily lead to a large data skew: the first reducer will be overloaded and will take much longer than the other reducers to complete. In that case it makes no sense to use more than two reducers.
It could be an XY problem. I am not sure that what you are asking is the best way to solve your actual problem.

hadoop job with single mapper and two different reducers

I have a large document corpus as an input to a MapReduce job (old hadoop API). In the mapper, I can produce two kinds of output: one counting words and one producing minHash signatures. What I need to do is:
give the word counting output to one reducer class (a typical WordCount reducer) and
give the minHash signatures to another reducer class (performing some calculations on the size of the buckets).
The input is the same corpus of documents and there is no need to process it twice. I think that MultipleOutputs is not the solution, as I cannot find a way to give my Mapper output to two different Reduce classes.
In a nutshell, what I need is the following:
                   WordCounting Reducer --> WordCount output
                  /
Input --> Mapper
                  \
                   MinHash Buckets Reducer --> MinHash output
Is there any way to use the same Mapper (in the same job), or should I split it into two jobs?
You can do it, but it will involve some coding tricks (a Partitioner and a prefix convention). The idea is for the mapper to output each word prefixed with "W:" and each minhash prefixed with "M:", then use a Partitioner to decide which partition (i.e., reducer) it needs to go to.
Pseudo code:
MAIN method:
    Set the number of reducers to 2.
MAPPER:
    ... parse the word ...
    ... generate the minhash ...
    context.write("W:" + word, 1);
    context.write("M:" + minhash, 1);
PARTITIONER:
    IF key starts with "W:" { return 0; }  // reducer 1
    IF key starts with "M:" { return 1; }  // reducer 2
COMBINER:
    IF key starts with "W:" { iterate over the values and sum them; context.write(key, SUM); return; }
    Otherwise, iterate and context.write all of the values unchanged.
REDUCER:
    IF key starts with "W:" { iterate over the values and sum them; context.write(key, SUM); return; }
    IF key starts with "M:" { perform the minhash logic }
In the output, part-r-00000 will contain your word counts and part-r-00001 your minhash calculations.
Unfortunately it is not possible to provide two different Reducer classes in a single job, but with an IF and the prefix convention you can simulate it.
Also, having just 2 reducers might not be efficient from a performance point of view; in that case you could play with the Partitioner to allocate the first N partitions to the word counts.
If you do not like the prefix idea, you would need to implement a secondary sort with a custom WritableComparable class for the key, but that is worth the effort only in more sophisticated cases.
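A hedged Java sketch of the prefix-based Partitioner described above, assuming Text keys and IntWritable counts (the "W:"/"M:" convention comes from the pseudocode):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class PrefixPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Route word-count keys to reducer 0 and minhash keys to reducer 1.
        if (key.toString().startsWith("W:")) {
            return 0;
        }
        return 1;  // "M:" keys
    }
}

The job would then call job.setPartitionerClass(PrefixPartitioner.class) and job.setNumReduceTasks(2), and the single Reducer class branches on the prefix as in the pseudocode.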
AFAIK this is not possible in a single MapReduce job: only the default output (part-r-*) files will be fed to the reducer. So if you create two named outputs from the map phase, say WordCount-m-0 and MinHash-m-0,
you can then run two other MapReduce jobs with an identity mapper and the respective reducers, specifying hdfspath/WordCount-* and hdfspath/MinHash-* as the inputs to the respective jobs.

Merge sort for GPU

I'm trying to implement a merge sort using an OpenCL wrapper.
The problem is that each pass needs a different indexing algorithm for the threads' memory accesses.
Some info about this:
First pass (numbers indicate elements and arrows indicate sorting):
0<-->1   2<-->3   4<-->5   6<-->7
group0   group1   group2   group3        ===> 1 thread per group, N/2 groups total

Second pass (all parallel):
0<------>2    4<------>6
 1<------>3    5<------>7
group0        group1                     ===> 2 threads per group, N/4 groups

Next pass:
0<--------------->4    8<--------------->12
 1<--------------->5    9<--------------->13
  2<--------------->6    10<--------------->14
   3<--------------->7    11<--------------->15
group0                 group1            ===> 4 threads per group, N/8 groups
So, an element of one sub-group cannot be compared against another group's elements.
I cannot simply do
A[i]<---->A[i+1] or A[i]<---->A[i+4]
because these would also cover
A[1]<---->A[2] and A[4]<---->A[8]
which are wrong.
I need a more complex indexing algorithm, one that can use the same number of threads for all passes.
Pass n:   global id (i): 0,1, 2,3, 4,5, ...     should compare ids 0,1, 4,5, 8,9, ...
          looks like compareId1 = (i/2)*4 + i%2
Pass n+1: global id (i): 0,1,2,3, 4,5,6,7, ...  should compare ids 0,1,2,3, 8,9,10,11, 16,17, ...
          looks like compareId1 = (i/4)*8 + i%4
Pass n+2: global id (i): 0,1,2,3,... 8,9,10,... should compare ids 0,1,2,3,... 16,17,18,...
          looks like compareId1 = (i/8)*16 + i%8
In general:
compareId1 = (i / pow(2, passN)) * pow(2, passN+1) + i % pow(2, passN)
compareId2 = compareId1 + pow(2, passN)
so, in the kernel string, can it be
int i = get_global_id(0);
int compareId1 = (i / (pow(2, passN))) * pow(2, passN + 1) + i % (pow(2, passN));
int compareId2 = compareId1 + pow(2, passN);
if (compareId1 != compareId2)
{
    if (A[compareId1] > A[compareId2])
    {
        xorSwapIdiom(A, compareId1, compareId2, B);
    }
    else
    {
        streamThrough(A, compareId1, compareId2, B);
    }
}
else
{
    // this can happen only for the first pass
    // needs a different kernel structure for that
}
but I'm not sure.
Question: Can you give any directions about which memory access pattern would not leak while satisfying the "no comparisons between different groups" condition?
I have already needed to hard-reset my computer many times while trying different algorithms (memory leaks, black screen, crash, restart); this one is my last attempt and I fear it could crash the entire OS.
I already tried a much simpler version with a decreasing number of threads per pass, but it had poor performance.
Edit: I tried the code above; it sorts reverse-ordered arrays, but not randomized arrays.
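As a quick sanity check of the indexing formula derived above, here is a small host-side Java sketch (not the OpenCL kernel; it uses bit shifts for the powers of two) that prints the compare pairs each pass would produce, so they can be checked against the diagrams:

public class ComparePairs {
    public static void main(String[] args) {
        int n = 16;  // array length, assumed to be a power of two
        for (int pass = 0; pass < 3; pass++) {
            int threads = n / 2;  // same number of threads every pass
            System.out.print("pass " + pass + ":");
            for (int i = 0; i < threads; i++) {
                int span = 1 << pass;                               // pow(2, passN)
                int compareId1 = (i / span) * (span << 1) + i % span;
                int compareId2 = compareId1 + span;
                System.out.print(" " + compareId1 + "<->" + compareId2);
            }
            System.out.println();
        }
    }
}

For pass 0 this prints 0<->1, 2<->3, ...; for pass 1 it prints 0<->2, 1<->3, 4<->6, 5<->7, ..., matching the pairs in the question, with every pair staying inside its group.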

Need help in writing Map/Reduce job to find average

I'm fairly new to Hadoop Map/Reduce. I'm trying to write a Map/Reduce job to find the average time taken by n processes, given an input text file like the one below:
ProcessName Time
process1 10
process2 20
processn 30
I went through a few tutorials but I'm still not able to get a thorough understanding. What should my mapper and reducer classes do for this problem? Will my output always be a text file, or is it possible to directly store the average in some sort of a variable?
Thanks.
Your mappers read the text file and apply the following map function to every line:
map(key, value):
    time = value[2]
    emit("1", time)
All map calls emit the key "1", which will be processed by one single reduce function:
reduce(key, values):
    result = sum(values) / n
    emit("1", result)
Since you're using Hadoop, you have probably seen the use of StringTokenizer in the map function; you can use it to extract just the time from each line. You also need some way to compute n (the number of processes); you could, for example, use a Counter in another job that just counts lines.
Update
If you were to execute this job, for each line a tuple would have to be sent to the reducer, potentially clogging the network if you run a Hadoop cluster on multiple machines.
A more clever approach can compute the sum of the times closer to the inputs, e.g. by specifying a combiner:
combine(key, values):
    emit(key, sum(values))
This combiner is then executed on the results of all map functions of the same machine, i.e., without networking in between.
The reducer would then only get as many tuples as there are machines in the cluster, rather than as many as lines in your log files.
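A minimal Java sketch of such a combiner, assuming the mapper emits IntWritable times under a single constant Text key and that n is still obtained separately (e.g. from a counter), as described above:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the partial times on each mapper machine before they cross the network.
public class TimeSumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable sum = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();
        }
        sum.set(total);
        context.write(key, sum);  // same (key, value) types as the mapper output
    }
}

It would be registered with job.setCombinerClass(TimeSumCombiner.class).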
Your mapper maps your inputs to the value that you want to take the average of. So let's say that your input is a text file formatted like
ProcessName Time
process1 10
process2 20
.
.
.
Then you would need to take each line in your file, split it, grab the second column, and output the value of that column as an IntWritable (or some other Writable numeric type). Since you want to take the average of all times, not grouped by process name or anything, you will have a single fixed key. Thus, your mapper would look something like
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t");
    output.set(Integer.parseInt(fields[1]));
    context.write(one, output);
}
Your reducer takes these values, and simply computes the average. This would look something like
private DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    int count = 0;
    for (IntWritable value : values) {
        sum += value.get();
        count++;
    }
    average.set(sum / (double) count);
    context.write(key, average);
}
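A hedged driver sketch for wiring these together; the class names AverageMapper and AverageReducer are hypothetical containers for the map() and reduce() methods above. Note that the map output value type differs from the final output value type, so both must be declared.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AverageDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "process time average");
        job.setJarByClass(AverageDriver.class);
        job.setMapperClass(AverageMapper.class);    // hypothetical class holding the map() above
        job.setReducerClass(AverageReducer.class);  // hypothetical class holding the reduce() above
        job.setNumReduceTasks(1);                   // a single reducer computes one global average
        // No combiner: this reducer divides by the number of values it receives,
        // so pre-summing on the map side would skew the count.
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}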
I'm making a lot of assumptions here, about your input format and what not, but they are reasonable assumptions and you should be able to adapt this to suit your exact needs.
Will my output always be a text file or is it possible to directly store the average in some sort of a variable?
You have a couple of options here. You can post-process the output of the job (written to a single file), or, since you're computing a single value, you can store the result in a counter, for example.

How can I get an integer index for a key in hadoop?

Intuitively, Hadoop is doing something like the following to distribute keys to mappers (in Python-esque pseudocode).
# data is a dict with many key-value pairs
keys = data.keys()
key_set_size = len(keys) / num_mappers
index = 0
mapper_keys = []
for i in range(num_mappers):
    end_index = index + key_set_size
    send_to_mapper(keys[int(index):int(end_index)], i)
    index = end_index
# And something vaguely similar for the reducer (but not exactly).
It seems like somewhere hadoop knows the index of each key it is passing around, since it distributes them evenly among the mappers (or reducers). My question is: how can I access this index? I'm looking for a range of integers [0, n) mapping to all my n keys; this is what I mean by an "index".
I'm interested in the ability to get the index from within either the mapper or reducer.
After doing more research on this question, I don't believe it is possible to do exactly what I want. Hadoop does not seem to have such an index that is user-visible after all, although it does try to distribute work evenly among the mappers (so such an index is theoretically possible).
Actually, each individual reducer gets back the set of values that correspond to a reduce key. So do you want the offset of items within the reduce key in your reducer, or do you want the overall offset of the particular item in the global array of all lines being processed? To get an index in your mapper, you can simply prepend a line number to each line of the file before the file gets to the mapper. This will tell you the "global index". However, keep in mind that with 1,000,000 items, item 662,345 could be processed before item 10,000.
If you are using the new MR API then org.apache.hadoop.mapreduce.lib.partition.HashPartitioner is the default partitioner; otherwise it is org.apache.hadoop.mapred.lib.HashPartitioner. You can call getPartition() on either HashPartitioner to get the partition number for a key (which you referred to as an index).
Note that the HashPartitioner class is only used to distribute the keys to the Reducer. When it comes to a mapper, each input split is processed by a map task and the keys are not distributed.
Here is the code for getPartition() from HashPartitioner; you can write a simple Java program that calls it directly.
public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
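For example, a small standalone sketch (assuming Text keys and 10 reducers, both arbitrary choices) that computes the partition number for a key:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitionLookup {
    public static void main(String[] args) {
        HashPartitioner<Text, IntWritable> partitioner = new HashPartitioner<Text, IntWritable>();
        int numReduceTasks = 10;  // must match the number of reducers configured for the job
        Text key = new Text("someKey");
        int partition = partitioner.getPartition(key, new IntWritable(1), numReduceTasks);
        System.out.println(key + " goes to partition " + partition);
    }
}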
Edit: Including another way to get the index.
The following code should also work; it goes inside your Mapper or Reducer class (old API):
private int partition;

public void configure(JobConf job) {
    partition = job.getInt("mapred.task.partition", 0);
}
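If you are on the new API, a roughly equivalent sketch (my assumption, not part of the original answer) reads the task id from the context in setup():

// Inside a Mapper or Reducer subclass (new API):
private int partition;

@Override
protected void setup(Context context) {
    // The task's id within the job corresponds to its partition number for reduce tasks.
    partition = context.getTaskAttemptID().getTaskID().getId();
    // Alternatively: context.getConfiguration().getInt("mapred.task.partition", 0);
}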
