Hadoop job with single mapper and two different reducers

I have a large document corpus as an input to a MapReduce job (old hadoop API). In the mapper, I can produce two kinds of output: one counting words and one producing minHash signatures. What I need to do is:
give the word counting output to one reducer class (a typical WordCount reducer) and
give the minHash signatures to another reducer class (performing some calculations on the size of the buckets).
The input is the same corpus of documents and there is no need to process it twice. I think that MultipleOutputs is not the solution, as I cannot find a way to give my Mapper output to two different Reduce classes.
In a nutshell, what I need is the following:
                    WordCounting Reducer    --> WordCount output
                  /
Input --> Mapper
                  \
                    MinHash Buckets Reducer --> MinHash output
Is there any way to use the same Mapper (in the same job), or should I split that in two jobs?

You can do it, but it will involve some coding tricks (a Partitioner and a prefix convention). The idea is for the mapper to output each word prefixed with "W:" and each minhash prefixed with "M:". Then use a Partitioner to decide which partition (i.e. reducer) each key needs to go to.
Pseudo code:
MAIN method:
    set the number of reducers to 2
MAPPER:
    ... parse the word ...
    ... generate the minhash ...
    context.write("W:" + word, 1);
    context.write("M:" + minhash, 1);
Partitioner:
    if key starts with "W:" { return 0; }   // reducer 1
    if key starts with "M:" { return 1; }   // reducer 2
Combiner:
    if key starts with "W:" { iterate over the values and sum; context.write(key, SUM); return; }
    otherwise iterate and context.write all of the values unchanged
Reducer:
    if key starts with "W:" { iterate over the values and sum; context.write(key, SUM); return; }
    if key starts with "M:" { perform the minhash logic }
In the output, part-00000 will contain your word counts and part-00001 your minhash calculations.
Unfortunately it is not possible to provide two different Reducer classes in one job, but with the IF and the prefix convention you can simulate it.
Also, having just 2 reducers might not be efficient from a performance point of view; in that case you could play with the Partitioner to allocate, say, the first N partitions to the word count.
If you do not like the prefix idea, then you would need to implement a secondary sort with a custom WritableComparable class for the key, but that is worth the effort only in more sophisticated cases.
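To make the prefix idea concrete, here is a minimal sketch of such a Partitioner (new mapreduce API; the class name, the Text/IntWritable types and the driver lines are assumptions for illustration, not code from the answer above):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes "W:"-prefixed keys to reducer 0 and "M:"-prefixed keys to reducer 1.
public class PrefixPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (key.toString().startsWith("W:")) {
            return 0;   // word-count partition -> part-00000
        }
        return 1;       // minhash partition   -> part-00001
    }
}

and in the driver:

job.setPartitionerClass(PrefixPartitioner.class);
job.setNumReduceTasks(2);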

AFAIK this is not possible in a single MapReduce job; only the default output (the part-r-0000 files) is fed to the reducer. So if you create two named outputs from the mapper, say WordCount-m-0 and MinHash-m-0,
you can then create two other MapReduce jobs with an Identity Mapper and the respective Reducers, specifying hdfspath/WordCount-* and hdfspath/MinHash-* as the inputs to the respective jobs.
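A rough sketch of that first stage with MultipleOutputs (new mapreduce API; the mapper class, the named-output names and the types are assumptions for illustration, not code from this answer):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplittingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // ... parse words and compute minhash signatures from 'line', then:
        // mos.write("WordCount", new Text(word), ONE);    // lands in WordCount-m-*
        // mos.write("MinHash", new Text(minhash), ONE);   // lands in MinHash-m-*
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

// In the driver, register both named outputs
// (TextOutputFormat is org.apache.hadoop.mapreduce.lib.output.TextOutputFormat):
MultipleOutputs.addNamedOutput(job, "WordCount", TextOutputFormat.class, Text.class, IntWritable.class);
MultipleOutputs.addNamedOutput(job, "MinHash", TextOutputFormat.class, Text.class, IntWritable.class);

The identity-mapper jobs then simply read hdfspath/WordCount-* and hdfspath/MinHash-* as their inputs, as described above.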

Related

How can I put one input file's data in one reducer and another input file's data in another reducer

I have two text files: file1.txt, in which I have written only uppercase words, and file2.txt, in which I have written only lowercase words. How can I split the input so that all the uppercase words of file1.txt go to one reducer and all the lowercase words of file2.txt go to a different reducer?
Can anyone please help me out?
Create a custom partitioner.
The main purpose of the partitioner is to partition the key/value pairs of the mapper's intermediate output. The partitioner divides the data based on user-defined conditions and works like a hash function. The total number of partitions is equal to the total number of reducers in a job (job.setNumReduceTasks(n)). The partitioner phase takes place after the map phase and before the reduce phase in a MapReduce program. The default partitioning function is hash partitioning, where the hashing is done on the key; however, it can be useful to partition the data according to some other function of the key or the value.
// Set the number of reducer tasks in the driver program
job.setNumReduceTasks(2);
Then create a custom partitioner class and add the logic that partitions the map output based on whether the data value is upper or lower case.
// StringUtils here is org.apache.commons.lang.StringUtils
public static class customPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // Uppercase values go to reducer 0, everything else to reducer 1
        if (StringUtils.isAllUpperCase(value.toString()))
            return 0;
        else
            return 1;
    }
}
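The partitioner also has to be registered in the driver, alongside the reducer count set above (a one-line sketch, assuming the class name used here):

// Tell the job to use the custom partitioner
job.setPartitionerClass(customPartitioner.class);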
For an example of a custom partitioner, see http://www.hadooptpoint.org/hadoop-custom-partitioner-in-mapreduce-example/

Number of Reducers and output order

When I use the function job.setNumReduceTasks(1);, I get the output sorted by key. However, the output is not sorted by key when I remove this function.
So, should we expect to get sorted output from the reducer when we have more than one reducer task?
Thanks.
Output is sorted on the key within a single Reducer. However, the default Partitioner assigns keys with a hash function, so while each file will be sorted when using multiple Reducers, one file will not be a sorted continuation of the last. For example:
We have a word count job with three Reducers. The Mapper outputs:
(A,1)
(zebra,1)
(bat,1)
(zebra,1)
(frog,1)
(A,1)
The Partitioner looks like the following
public int getPartition(K key, V value, int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
and so it could allocate the keys in the following way:
REDUCER 1      REDUCER 2      REDUCER 3
(A,1)          (frog,1)       (bat,1)
(A,1)
(zebra,1)
Notice that Reducer 1 doesn't contain A-F, Reducer 2 doesn't contain G-M and Reducer 3 doesn't contain N-Z, i.e. it's not splitting alphabetically. And that's why the overall output won't be sorted, but data will be sorted within each Reducer's output.
This makes sense as otherwise we could end up with a big skew. Say for example you're running a MapReduce job on some customer services data where the ID always starts with C - you wouldn't want everything to go to the same Reducer.
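For contrast, a partitioner that did split alphabetically would make the concatenation of the part files globally sorted, at the cost of exactly the kind of skew described above. A simplified sketch of the idea (not code from this answer; it assumes exactly three reducers and Text keys):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Range partitioner: A-F -> reducer 0, G-M -> reducer 1, everything else -> reducer 2.
public class AlphabeticPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        char first = Character.toUpperCase(key.toString().charAt(0));
        if (first >= 'A' && first <= 'F') return 0;
        if (first >= 'G' && first <= 'M') return 1;
        return 2;
    }
}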

Hadoop reduce individual record counts

How do I get the individual output record count for each reducer output file when a MapReduce job has multiple reducers?
For now I can get the total reducer record count using the REDUCE_OUTPUT_RECORDS counter, but how do I get the individual reducer counts? I tried to increment an output record count in the reducer, but I could not get the output part file name to write to a custom counter.
I am looking for the count of output records of each reducer. Say, with total order partitioning, I want the count of records each reducer is emitting; for example, out of 7 total records, 2 come from reducer 1 and 5 from reducer 2. That kind of statistic.
I assume you are looking for the number of records each reducer is processing. The reducer is called once for each key, and the size of the value list is what you need, as far as I understand. Programmatically, you emit 1 as the map output for each record read, and then sum these values in the reducer and emit the result.
You can also use the LongSumReducer class provided in the Hadoop API. Hope this helps for further understanding.
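For instance, a minimal driver fragment for that approach might look like this (a sketch; it assumes the mapper emits Text keys with LongWritable 1s, and LongSumReducer is org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer):

// Sum the 1s emitted by the mapper, per key
job.setReducerClass(LongSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);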
To answer my own question, these are the steps:
Keep a static variable "count" in the reducer and increment it whenever a key/value pair is emitted from the reduce method.
In the cleanup method, create a custom counter and use the method below to find the reducer part number:
getConfiguration().getInt("mapreduce.task.partition", 0)
For example, for the reducer output file part-r-00000 the method above returns 0.
Using this, we can maintain a separate record count for each reduce part file.
Below is the code:
public static class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    private static int count = 0;

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // ... your reduce logic, calling context.write(...) ...
        count++;   // increment once for every record written
    }

    @Override
    protected void cleanup(Context context) {
        // Counter name carries the partition number, e.g. "Reducer-no-0" for part-r-00000
        context.getCounter("RecordCounter",
                "Reducer-no-" + context.getConfiguration().getInt("mapreduce.task.partition", 0))
                .increment(count);
    }
}
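After the job completes, the per-reducer counters can be read back in the driver, for example (a sketch, using the counter group and names defined above):

// Read the custom counter for the first reduce partition
// (Job.getCounters() may throw IOException)
long reducer0Records = job.getCounters()
        .findCounter("RecordCounter", "Reducer-no-0").getValue();
System.out.println("part-r-00000 records: " + reducer0Records);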

Custom Partitioner, without setting number of reducers

Is it a must to set the number of reducers in order to use a custom partitioner?
Example: in a word count problem, I want all the stop-word counts in one partition and the remaining word counts in a different partition. If I set the number of reducers to two and send stop words to one partition and the other words to the next partition, it will work, but I am restricting the number of reducers to two (or N), which I don't want. What is the best approach here? Or do I have to calculate and set the number of reducers based on the size of the input to get the best performance?
Specifying a custom partitioner does not change anything since the number of partitions is provided to the partitioner:
int getPartition(KEY key, VALUE value, int numPartitions)
If you don't set a partitioner then the HashPartitioner is used. Its implementation is trivial:
public int getPartition(K key, V value, int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
The design of a custom partitioner is up to you. The main goal of a partitioner is to avoid skew and to evenly distribute the load over the provided number of partitions. For some small job it could be OK to decide to only support two reducers, but if you want your job to scale then you must design your job to run with an arbitrary number of reducers.
Or do I have to calculate and set the number of reducers based on the size of the input to get the best performance?
That's always what you have to do, and it is unrelated to the usage of a custom partitioner: you MUST set the number of reducers; the default value is 1 and Hadoop won't compute this value for you.
If you want to send stop words to one reducer and other words to the remaining reducers, you can do something like this:
public int getPartition(K key, V value, int numReduceTasks) {
    if (isStopWord(key)) {                 // isStopWord: your own stop-word check
        return 0;                          // all stop words go to the first reducer
    } else {
        // spread the remaining words over the other numReduceTasks - 1 reducers
        return ((key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1)) + 1;
    }
}
However, this can easily lead to a large data skew: the first reducer will be overloaded and will take much longer than the other reducers to complete. In that case it makes no sense to use more than two reducers.
It could be an XY problem. I am not sure that what you are asking is the best way to solve your actual problem.

How can I get an integer index for a key in hadoop?

Intuitively, Hadoop is doing something like this to distribute keys to mappers, in Python-esque pseudocode:
# data is a dict with many key-value pairs
keys = data.keys()
key_set_size = len(keys) / num_mappers
index = 0
mapper_keys = []
for i in range(num_mappers):
    end_index = index + key_set_size
    send_to_mapper(keys[int(index):int(end_index)], i)
    index = end_index
# And something vaguely similar for the reducer (but not exactly).
It seems like somewhere hadoop knows the index of each key it is passing around, since it distributes them evenly among the mappers (or reducers). My question is: how can I access this index? I'm looking for a range of integers [0, n) mapping to all my n keys; this is what I mean by an "index".
I'm interested in the ability to get the index from within either the mapper or reducer.
After doing more research on this question, I don't believe it is possible to do exactly what I want. Hadoop does not seem to have such an index that is user-visible after all, although it does try to distribute work evenly among the mappers (so such an index is theoretically possible).
Actually, each individual reducer gets back the list of items that correspond to its reduce key. So do you want the offset of items within the reduce key in your reducer, or do you want the overall offset of a particular item in the global array of all lines being processed? To get an index in your mapper, you can simply prepend a line number to each line of the file before the file gets to the mapper. This will tell you the "global index". However, keep in mind that with 1,000,000 items, item 662,345 could be processed before item 10,000.
If you are using the new MR API then org.apache.hadoop.mapreduce.lib.partition.HashPartitioner is the default partitioner; otherwise org.apache.hadoop.mapred.lib.HashPartitioner is the default partitioner. You can call getPartition() on either HashPartitioner to get the partition number for a key (which you referred to as an index).
Note that the HashPartitioner class is only used to distribute the keys to the Reducer. When it comes to a mapper, each input split is processed by a map task and the keys are not distributed.
Here is the getPartition() code from HashPartitioner. You can write a simple Java program around it.
public int getPartition(K key, V value, int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
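Such a program might look like this (a sketch; the class name, the example key and the choice of three reducers are assumptions):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitionLookup {
    public static void main(String[] args) {
        HashPartitioner<Text, IntWritable> partitioner = new HashPartitioner<Text, IntWritable>();
        int numReduceTasks = 3;   // must match the job's reducer count
        Text key = new Text("zebra");
        // Prints the reducer (partition) number this key would be sent to
        System.out.println(partitioner.getPartition(key, new IntWritable(1), numReduceTasks));
    }
}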
Edit: Including another way to get the index.
The following code should also work; it is to be included in the map or the reduce class (old API):
public void configure(JobConf job) {
    partition = job.getInt("mapred.task.partition", 0);
}
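With the new (mapreduce) API, the equivalent is to read the same information in setup(). A sketch (class name and types are assumptions; the property name matches the one used earlier in this thread):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IndexAwareReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private int partition;

    @Override
    protected void setup(Context context) {
        // Partition number of this reduce task: 0 for part-r-00000, 1 for part-r-00001, ...
        partition = context.getConfiguration().getInt("mapreduce.task.partition", 0);
    }
}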
