When would "no mapper" be needed?

I have been using no-reducer jobs for a while in certain use cases, but I have never come across a "no mapper" job yet. Does "no mapper" mean that the MapReduce framework will still read the input files (based on the InputFormat?) and shuffle/sort them in some fashion, and that those records will be the input to my reducer?

"No mapper" is a euphemism for "identity mapper". The default mapper if you don't specify one is just that. At the very least, the identity mapper process directs the unchanged inputs to the right reducer partitions.

If you are using Hadoop Streaming, the identity mapper can be written as:
-mapper "/bin/sh -c \"cat\""

For some aggregation functions keyed on the input key, an identity mapper makes sense: the mapper emits the same key/value pairs it receives as input, and the reducer aggregates the values for each particular key.
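A minimal sketch of that pattern in Java (assuming the input is a SequenceFile of Text keys and IntWritable counts so the pass-through types line up; the class name and argument paths are illustrative): the driver sets a reducer but no mapper.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class IdentityMapSum {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "identity map, summing reduce");
        job.setJarByClass(IdentityMapSum.class);
        // no setMapperClass(...) call: the stock pass-through Mapper runs
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}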

Related

Job with just the reducer phase?

In Hadoop MapReduce the intermediate output (the map output) is saved to local disk. I would like to know whether it is possible to start a job with just the reduce phase, one that reads the map output from local disk, partitions the data, and executes the reduce tasks.
There is a basic implementation of Mapper called IdentityMapper, which essentially passes all the key-value pairs through to a Reducer.
The Reducer reads the outputs generated by the different mappers as key-value pairs and emits key-value pairs of its own; its job is to process the data that comes from the mappers.
If the MapReduce programmer does not set the mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default.
You can't run just reducers without any mappers.
MapReduce works on data that is in HDFS, so I don't think you can write a reducer-only MapReduce job that reads from local disk.
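As an aside on that default: in the old org.apache.hadoop.mapred API both the default mapper and the default reducer are identities, so a driver that sets neither class amounts to a distributed sort by key. A sketch (class name and input format are illustrative):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class SortByKey {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SortByKey.class);
        conf.setJobName("identity map + identity reduce = sort by key");
        // neither setMapperClass nor setReducerClass is called, so
        // IdentityMapper and IdentityReducer are used by default
        conf.setInputFormat(KeyValueTextInputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}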
If you use Hadoop Streaming, you can just add:
-mapper "/bin/sh -c \"cat\""

Hadoop mapper task detailed execution time

For a certain Hadoop MapReduce mapper task, I already have the task's complete execution time. In general, a mapper has three steps: (1) read input from HDFS or another source such as Amazon S3; (2) process the input data; (3) write the intermediate result to local disk. Now I am wondering whether it is possible to know the time spent in each step.
My purpose is to measure (1) how long it takes the mappers to read input from HDFS or S3, which indicates how fast a mapper can read and is essentially the mapper's I/O performance; and (2) how long it takes the mapper to process the data, which reflects the computing capability of the task.
Does anyone have any idea how to acquire these results?
Thanks.
Just implement a read-only mapper that does not emit anything. This will give an indication of how long it takes for each split to be read (but not processed).
As a further step, you can define a variable passed to the job at runtime (via the job properties) that selects exactly one of the following behaviors (e.g. by parsing the variable against an Enum and switching on its values; see the sketch after this list):
just read
just read and process (but not write/emit anything)
do it all
This of course assumes that you have access to the mapper code.
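A minimal sketch of that idea, assuming Text input; the class name, the enum values and the timing.mode property are all made up for illustration:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TimingMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private enum Mode { JUST_READ, READ_AND_PROCESS, DO_IT_ALL }
    private Mode mode;

    @Override
    protected void setup(Context context) {
        // "timing.mode" is a made-up job property, e.g. -D timing.mode=JUST_READ
        mode = Mode.valueOf(context.getConfiguration().get("timing.mode", "DO_IT_ALL"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (mode == Mode.JUST_READ) {
            return; // the record was read from the split; do nothing more
        }
        Text result = process(value); // stand-in for the real per-record work
        if (mode == Mode.DO_IT_ALL) {
            context.write(key, result); // only this branch pays the write/spill cost
        }
    }

    private Text process(Text value) {
        return value; // placeholder for the actual processing being timed
    }
}
Running the job once per mode and comparing the task times then gives rough figures for reading, reading plus processing, and the full cost including the write to local disk.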

What happens to the Reducer when the Map operation sends non-key-value output in MapReduce

A Map operation generally takes a key-value pair as input, and it returns a key-value pair as output. If the map returns output that is not a key-value pair, how will the Reducer process that output?
Any assistance on this would be appreciated.
I am not sure about Java MapReduce, but in Hadoop Streaming if the mappers do not produce any output the reducers will not be run.
You can test it by creating two small Python scripts:
A mapper that simply consumes the input without producing anything:
#!/usr/bin/python
import sys
sys.stdin.read()  # consume all input, emit nothing
A reducer that crashes as soon as it is started:
#!/usr/bin/python
import sys
sys.exit("some error message")  # fails immediately if this reducer ever runs
If you launch it, the MapReduce job will complete without any error, which shows that the reducer was never started.

hadoop: difference between 0 reducer and identity reducer?

I am just trying to confirm my understanding of the difference between 0 reducers and the identity reducer.
0 reducers means the reduce step will be skipped and the mapper output will be the final output.
With the identity reducer, shuffling/sorting will still take place?
Your understanding is correct. I would define it as follows:
If you do not need sorting of the map results, you set 0 reducers and the job is called map-only.
If you need to sort the map results but do not need any aggregation, you choose the identity reducer.
And to complete the picture there is a third case: we do need aggregation, and in that case we need a real reducer.
Another use-case for using the Identity Reducer is to combine all the results into <# of reducers> output files. This can be handy if you are using Amazon Web Services to write to S3 directly, especially if the mapper output is small (e.g. a grep/search for a record), and you have a lot of mappers (e.g. 1000's).
The main difference between "no reducer" (mapred.reduce.tasks=0) and the standard reducer, which is IdentityReducer (mapred.reduce.tasks=1 etc.), is that when you use "no reducer" there are no partitioning and shuffling processes after the map stage. Therefore, in this case you get the 'pure' output of your mappers without any further processing. This helps for development and debugging purposes, among other things.
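As a minimal sketch of the two configurations in the new API (illustrative only; input/output setup is omitted):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleDemo {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "shuffle demo");

        // "No reducer": map output is written directly to the output path;
        // there is no partitioning, shuffling or sorting at all.
        job.setNumReduceTasks(0);

        // "Identity reducer": instead of the line above, leave the reducer
        // class unset (the base Reducer passes every record through) and
        // keep at least one reduce task; the full shuffle and sort still run.
        // job.setNumReduceTasks(1);
    }
}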
It depends on your business requirements. If you are doing a word count, you should reduce your map output to get the total result. If you just want to change the words to upper case, you don't need a reduce.
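For instance, a sketch of that map-only upper-casing case (the class name is illustrative); with job.setNumReduceTasks(0) in the driver, each emitted record goes straight to the output files:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UpperCaseMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // transform each line and emit it; no reduce phase is needed
        context.write(NullWritable.get(), new Text(value.toString().toUpperCase()));
    }
}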

Why all the reduce tasks are ending up in a single machine?

I wrote a relatively simple MapReduce program on the Hadoop platform (Cloudera distribution). Each map and reduce task writes some diagnostic information to standard output besides doing its regular work.
However, when I look at these log files, I find that the map tasks are relatively evenly distributed among the nodes (I have 8 nodes), but the reduce tasks' standard output logs can only be found on one single machine.
I guess that means all the reduce tasks ended up executing on a single machine, which is problematic and confusing.
Does anybody have any idea what's happening here? Is it a configuration problem?
How can I make the reduce tasks also distribute evenly?
If the output records from your mappers all have the same key, they will be put into a single reducer.
If your job has multiple reducers, but they all queue up on a single machine, then you have a configuration issue.
Use the web interface (http://MACHINE_NAME:50030) to monitor the job and see the reducers it has as well as what machines are running them. There is other information that can be drilled into that will provide information that should be helpful in figuring out the issue.
A couple of questions about your configuration:
How many reducers are running for the job?
How many reducers are available on each node?
Does the node running the reducer have better hardware than the other nodes?
Hadoop decides which reducer will process which output keys through the use of a Partitioner.
If you are only outputting a few keys and want an even distribution across your reducers, you may be better off implementing a custom Partitioner for your output data, e.g.:
import org.apache.hadoop.mapreduce.Partitioner;

public class MyCustomPartitioner<KEY, VALUE> extends Partitioner<KEY, VALUE> {
    @Override
    public int getPartition(KEY key, VALUE value, int numPartitions) {
        // do something based on the key or value to determine which
        // partition we want the record to go to, for example:
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
You can then set this custom partitioner in the job configuration with
Job job = new Job(conf, "My Job Name");
job.setPartitionerClass(MyCustomPartitioner.class);
You can also implement the Configurable interface in your custom Partitioner if you want to do any further configuration based on job settings.
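A sketch of that pattern; the partition.offset property name is made up for illustration:
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ConfigurablePartitioner extends Partitioner<Text, Text> implements Configurable {
    private Configuration conf;
    private int offset;

    @Override
    public void setConf(Configuration conf) {
        // the framework calls this right after instantiating the partitioner
        this.conf = conf;
        offset = conf.getInt("partition.offset", 0); // hypothetical property
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        return ((key.hashCode() + offset) & Integer.MAX_VALUE) % numPartitions;
    }
}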
Also, check that you haven't set the number of reduce tasks to 1 anywhere in the configuration (look for "mapred.reduce.tasks") or in code, e.g.
job.setNumReduceTasks(1);
