hadoop: difference between 0 reducer and identity reducer?

I am just trying to confirm my understanding of the difference between 0 reducers and the identity reducer.
0 reducers means the reduce step will be skipped and the mapper output will be the final output.
Identity reducer means that shuffling/sorting will still take place?

Your understanding is correct. I would define it as follows:
If you do not need sorting of the map results, you set 0 reducers, and the job is called map-only.
If you need to sort the map results, but do not need any aggregation, you choose the identity reducer.
And to complete the picture, there is a third case: if we do need aggregation, we need a real reducer.
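A minimal driver sketch of the three cases (a sketch only; the class name is a placeholder, and the built-in Mapper/Reducer base classes are used just to keep the snippet self-contained):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReducerModes {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer modes");
        job.setJarByClass(ReducerModes.class);
        job.setMapperClass(Mapper.class);        // identity mapper, just for the sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Case 1: map-only -- no partitioning, shuffle or sort; mapper output is final.
        job.setNumReduceTasks(0);

        // Case 2: identity reducer -- shuffle/sort happens, values pass through unchanged:
        //   job.setReducerClass(Reducer.class);  // the base Reducer is the identity
        //   job.setNumReduceTasks(1);

        // Case 3: aggregation -- supply your own reducer class instead of Reducer.class.
    }
}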

Another use-case for using the Identity Reducer is to combine all the results into <# of reducers> output files. This can be handy if you are using Amazon Web Services to write to S3 directly, especially if the mapper output is small (e.g. a grep/search for a record), and you have a lot of mappers (e.g. 1000's).

The main difference between "no reducer" (mapred.reduce.tasks=0) and a standard reducer such as the IdentityReducer (mapred.reduce.tasks=1 etc.) is that with "no reducer" there is no partitioning and shuffling process after the map stage. Therefore, in this case you get the 'pure' output from your mappers without any further processing. It helps for development and debugging purposes, but not only.

It depends on your business requirements. If you are doing a wordcount you should reduce your map output to get a total result. If you just want to change the words to upper case, you don't need a reduce.
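For the upper-case example, a map-only mapper could look roughly like this (a sketch, assuming the usual TextInputFormat LongWritable/Text input and a job run with 0 reduce tasks):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only upper-casing: with 0 reduce tasks, whatever this mapper emits
// is written straight to the job output.
public class UpperCaseMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Text out = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        out.set(line.toString().toUpperCase());
        context.write(out, NullWritable.get());
    }
}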

Related

How to implement multiple reducers in a single MapReduce Job

I have a huge data set and I need to perform different functions on the same data.
I would like to have four output files. Since the four operations are different, can I use four partitioners and four reducers to implement this? Is that possible, or should I write four jobs to do it? Please help me!
First Approach
I think you should implement the code in a single reduce method, and emit n keys depending on the process performed. For example: you implement the A, B, C and D techniques; then, in your mapper, you could do something like this (pseudo-code):
dataA = processA(key, value);
context.write(new Text("A"), dataA);
dataB = processB(key, value);
context.write(new Text("B"), dataB);
dataC = processC(key, value);
context.write(new Text("C"), dataC);
dataD = processD(key, value);
context.write(new Text("D"), dataD);
You should be careful about data types of output. Also, the output key could be more complex.
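On the reduce side, a single reducer can then dispatch on that tag key. A rough sketch, assuming Text types and placeholder reduceA..reduceD methods:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// One reducer handling all four logical outputs by switching on the tag key
// that the mapper emitted ("A", "B", "C" or "D").
public class CombinedReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text tag, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        switch (tag.toString()) {
            case "A": reduceA(values, context); break;
            case "B": reduceB(values, context); break;
            case "C": reduceC(values, context); break;
            case "D": reduceD(values, context); break;
        }
    }

    // Placeholder implementations for the four different aggregations.
    private void reduceA(Iterable<Text> v, Context c) { /* ... */ }
    private void reduceB(Iterable<Text> v, Context c) { /* ... */ }
    private void reduceC(Iterable<Text> v, Context c) { /* ... */ }
    private void reduceD(Iterable<Text> v, Context c) { /* ... */ }
}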
Second Approach
You could create N MapReduce jobs in the same Java project, re-use the same Mapper, and develop N reducers.
In each main class you set the corresponding Reducer via job.setReducerClass. The Mapper stays the same.
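A brief sketch of that setup, assuming hypothetical SharedMapper, ReducerA, ReducerB and driver classes:

// Two jobs in the same project: the same mapper feeds two different reducers.
Configuration conf = new Configuration();

Job jobA = Job.getInstance(conf, "output A");
jobA.setJarByClass(DriverA.class);               // hypothetical driver class
jobA.setMapperClass(SharedMapper.class);
jobA.setReducerClass(ReducerA.class);

Job jobB = Job.getInstance(conf, "output B");
jobB.setJarByClass(DriverB.class);               // hypothetical driver class
jobB.setMapperClass(SharedMapper.class);
jobB.setReducerClass(ReducerB.class);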
You just need to specify the number of reducers in your MapReduce job config. The default partitioner will distribute data to reducers based on the hash of the key modulo the number of specified reducers.
To override behavior of default partitioner, you can implement your own custom partitioner specifying how your data should get across to the reducers.
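For reference, the default HashPartitioner does essentially the following, which is why identical keys always end up on the same reducer; a custom partitioner overrides getPartition in the same way:

import org.apache.hadoop.mapreduce.Partitioner;

// Roughly what Hadoop's default HashPartitioner does: hash the key, mask off
// the sign bit, and take it modulo the number of reduce tasks.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}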
---Edit to answer questions in the comments section---
How can I specify more than one reducer class in the MapReduce driver?
To set the number of reducers, you can set it in the job conf like below:
int numReducers = /*number of reducers you want*/;
job.setNumReduceTasks(numReducers);
Should I write four different jobs for this, or can I do it with a single job?
Hadoop MR jobs are I/O intensive; in your MR job design you should work on minimizing I/O and parallelizing processing as much as possible.
If your reducers need the same input for generating all 4 outputs, it will be better to keep a single job, but another consideration is the skewness of data for either output.
For example, output1 might have more processing time, and most of the incoming data is likely to be processed for output1.
If you have a scenario where the time taken to process output1 is much higher than the total time taken to process output2 + output3 + output4, then you should consider splitting the processing of output1 into multiple steps.
However, if all 4 outputs have more or less equal processing times and consume the same data throughout,
it will be better to have some conditional processing logic in the reducer and let your custom partitioner decide which data goes to which reducer.
Your custom partitioner can have a check like "this incoming data qualifies as contributing to 'GC content', so let it go to Reducer 3".
But if your incoming data needs to be processed for more than one output/distribution, use conditional processing, and to write multiple output files from the same reducer use MultipleOutputs.
You can google it and find usage examples; it lets you write output to multiple folders/files at the same time from within a Mapper or Reducer.
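A rough sketch of MultipleOutputs usage in a reducer; the named outputs "outA"/"outB", the Text types and the key check are illustrative assumptions, and each named output must be registered in the driver with MultipleOutputs.addNamedOutput(job, "outA", TextOutputFormat.class, Text.class, Text.class):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Reducer writing to two named outputs registered in the driver.
public class MultiOutReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            if (key.toString().startsWith("A")) {
                mos.write("outA", key, value, "outputA/part"); // goes under outputA/
            } else {
                mos.write("outB", key, value, "outputB/part"); // goes under outputB/
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}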
Hadoop lets you specify the number of reducer tasks from the job driver with job.setNumReduceTasks(num_reducers);. Since you want four outputs, you would specify int num_reducers = 4;. Here's an example driver class.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class run {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Run NB Count");
        job.setJarByClass(NB_train_hadoop.class);
        // set mappers, reducers, other stuff
        int num_reducers = 4;
        job.setNumReduceTasks(num_reducers);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
While this is handy, you have to understand that there is an optimal number of reducers, and it depends on the number of nodes in your cluster.
For example, running 4 Amazon m3.xlarge instances (1 master, 3 slaves, and 4 cores per instance), there is a clear relationship between wall time and the number of reducer tasks used in the MapReduce job. You can see that more isn't necessarily better, and if you use too many, well then you might as well crunch your data with your mother's hair curler because it would be faster that way.
Hope this is helpful!!

Hadoop mapper task detailed execution time

For a certain Hadoop MapReduce mapper task, I already have the mapper task's complete execution time. In general, a mapper has three steps: (1) read input from HDFS or another source like Amazon S3; (2) process the input data; (3) write the intermediate result to local disk. Now, I am wondering if it's possible to know the time spent by each step.
My purpose is to find out (1) how long it takes for mappers to read input from HDFS or S3. The result just indicates how fast a mapper can read; it's more like an I/O performance measure for a mapper; and (2) how long it takes for the mapper to process the data, which is more like the computing capability of the task.
Anyone has any idea for how to acquire these results?
Thanks.
Just implement a read-only mapper that does not emit anything. This will then give an indication of how long it takes for each split to be read (but not processed).
As a further step, you can define a variable that is passed to the job at runtime (via the job properties) which allows you to do just one of the following (e.g. by parsing the variable against an Enum and then switching on the values); a sketch follows below:
just read
just read and process (but not write/emit anything)
do it all
This of course assumes that you have access to the mapper code.
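A minimal sketch of that idea; the "profile.mode" property name, the Text types and the process() stand-in are assumptions, not standard Hadoop settings:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical "profiling" mapper: the mode is read from a custom job property.
public class ProfilingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private enum Mode { READ_ONLY, READ_AND_PROCESS, FULL }
    private Mode mode;

    @Override
    protected void setup(Context context) {
        mode = Mode.valueOf(context.getConfiguration().get("profile.mode", "FULL"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (mode == Mode.READ_ONLY) {
            return;                              // measures pure input/read time
        }
        Text processed = process(value);         // placeholder for the real work
        if (mode == Mode.READ_AND_PROCESS) {
            return;                              // read + compute, but no write
        }
        context.write(value, processed);         // FULL: read + process + emit
    }

    private Text process(Text value) {
        return new Text(value.toString().trim()); // stand-in processing step
    }
}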

Pass the maximum key encountered across all mappers as a parameter to the next job

I have a chain of Map/Reduce jobs:
Job1 takes data with a time stamp as a key and some data as value and transforms it.
For Job2 I need to pass the maximum time stamp that appears across all mappers in Job1 as a parameter. (I know how to pass parameters to Mappers/Reducers)
I can keep track of the maximum time stamp in each mapper of Job1, but how can I get the maximum across all mappers and pass it as a parameter to Job2?
I want to avoid running a Map/Reduce Job just to determine the maximum time stamp, since the size of my data set is in the terabyte+ scale.
Is there a way to accomplish this using Hadoop or maybe Zookeeper?
There is no way two maps can talk to each other, so a map-only job (job1) cannot get you the global maximum timestamp. However, I can think of two approaches, described below.
I assume your job1 is currently a map-only job and you are writing output from the map itself.
A. Change your mapper to write the main output using MultipleOutputs rather than Context or OutputCollector. Emit an additional (key, value) pair of the form (constant, timestamp) using context.write(). This way, you shuffle only the (constant, timestamp) pairs to the reducer. Add a reducer that calculates the maximum among the values it receives, and run the job with the number of reducers set to 1. The output written from the mapper gives you your original output, while the output written from the reducer gives you the global maximum timestamp.
B. In job1, write the maximum timestamp seen by each mapper as output. You can do this in cleanup(). Use MultipleOutputs to write it to a folder other than that of your original output.
Once job1 is done, you have 'x' part files in that folder, assuming you have 'x' mappers in job1. You can do a getmerge on the folder to pull all the part files into a single local file. This file will have 'x' lines, each containing a timestamp. You can read it with a stand-alone Java program, find the global maximum timestamp and save it to a local file. Share this file with job2 using the distributed cache, or pass the global maximum as a parameter.
I would suggest doing the following: create a directory where you can put the maximum of each Mapper inside a file named after the mapper name + id. The idea is to have a second output directory, and to avoid concurrency issues just make sure that each mapper writes to a unique file. Keep the maximum as a variable and write it to the file in each mapper's cleanup() method.
Once the job completes, it's trivial to iterate over the secondary output directory to find the maximum.
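A minimal sketch of the cleanup() idea described in both answers, using MultipleOutputs for the side output; the "maxts" named output (which must be registered via MultipleOutputs.addNamedOutput in the driver), the tab-separated input layout and the Text types are assumptions:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Tracks the largest timestamp seen by this mapper and writes it once,
// in cleanup(), to a side output separate from the main output.
public class MaxTimestampMapper extends Mapper<LongWritable, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;
    private long maxTimestamp = Long.MIN_VALUE;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t", 2);  // assumed "timestamp<TAB>data" layout
        maxTimestamp = Math.max(maxTimestamp, Long.parseLong(fields[0]));
        // ... normal transformation and context.write(...) for the main output ...
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.write("maxts", new Text(Long.toString(maxTimestamp)), new Text(""), "maxts/part");
        mos.close();
    }
}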

When would "no mapper" be needed?

I have been using no-reducer jobs for a while in certain use cases, but I have never come across a "no mapper" job yet. Does "no mapper" mean that the MapReduce framework will still read the input files and shuffle/sort them in some fashion (based on the InputFormat?), and that those will be the input to my reducer?
"No mapper" is effectively shorthand for "identity mapper"; the default mapper, if you don't specify one, is just that. At the very least, the identity mapper directs the unchanged inputs to the right reducer partitions.
In case you use Hadoop Streaming:
-mapper "/bin/sh -c \"cat\""
For some aggregation functions based on the input key, an identity mapper makes sense. The mapper emits the same key/value pairs it receives as input, and the reducer aggregates the values for a particular key.
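In the Java API the base Mapper class already behaves as the identity (its default map() just forwards each (key, value) pair), so a "no mapper" job can be sketched like this; the reducer here is just a placeholder you would swap for your aggregating one:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class IdentityMapperJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "identity mapper job");
        job.setJarByClass(IdentityMapperJob.class);
        // The base Mapper's default map() emits (key, value) unchanged, i.e. the identity.
        job.setMapperClass(Mapper.class);        // explicit identity (also the default)
        job.setReducerClass(Reducer.class);      // replace with your aggregating reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
    }
}

Note that the identity mapper simply forwards whatever (key, value) types the InputFormat produces, so the reducer's input types must match them.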

Why all the reduce tasks are ending up in a single machine?

I wrote a relatively simple MapReduce program on the Hadoop platform (Cloudera distribution). Each Map and Reduce task writes some diagnostic information to standard output besides doing its regular work.
However, when I look at these log files, I find that Map tasks are relatively evenly distributed among the nodes (I have 8 nodes), but the reduce task standard output logs can only be found on one single machine.
I guess that means all the reduce tasks ended up executing on a single machine, which is problematic and confusing.
Does anybody have any idea what's happening here? Is it a configuration problem?
How can I make the reduce tasks also distribute evenly?
If the output from your mappers all have the same key they will be put into a single reducer.
If your job has multiple reducers, but they all queue up on a single machine, then you have a configuration issue.
Use the web interface (http://MACHINE_NAME:50030) to monitor the job and see the reducers it has as well as what machines are running them. There is other information that can be drilled into that will provide information that should be helpful in figuring out the issue.
A couple of questions about your configuration:
How many reducers are running for the job?
How many reducers are available on each node?
Is the node running the reducer on better hardware than the other nodes?
Hadoop decides which Reducer will process which output keys through the use of a Partitioner.
If you are only outputting a few keys and want an even distribution across your reducers, you may be better off implementing a custom Partitioner for your output data, e.g.
public class MyCustomPartitioner<KEY, VALUE> extends Partitioner<KEY, VALUE>
{
    @Override
    public int getPartition(KEY key, VALUE value, int numPartitions) {
        // Do something based on the key or value to determine which
        // partition (reducer) we want it to go to; the result must be
        // in the range [0, numPartitions).
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
You can then set this custom partitioner in the job configuration with
Job job = new Job(conf, "My Job Name");
job.setPartitionerClass(MyCustomPartitioner.class);
You can also implement the Configurable interface in your custom Partitioner if you want to do any further configuration based on job settings.
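A brief sketch of a Configurable partitioner; the "my.partition.key.prefix" property name and the Text types are made up for illustration:

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hadoop calls setConf() with the job configuration when the partitioner
// implements Configurable, so job settings can influence partitioning.
public class ConfigurablePartitioner extends Partitioner<Text, Text> implements Configurable {
    private Configuration conf;
    private String prefix;

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        this.prefix = conf.get("my.partition.key.prefix", "");  // hypothetical setting
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Send keys matching the configured prefix to partition 0,
        // everything else to a hash-based partition.
        if (!prefix.isEmpty() && key.toString().startsWith(prefix)) {
            return 0;
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}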
Also, check that you haven't set the number of reduce tasks to 1 anywhere in the configuration (look for "mapred.reduce.tasks"), or in code, eg
job.setNumReduceTasks(1);
