setup and cleanup methods of Mapper/Reducer in Hadoop MapReduce - hadoop

Are the setup and cleanup methods called in each mapper and reducer task, respectively? Or are they called only once at the start of the overall map and reduce phases of the job?

They are called for each task, so if you have 20 mappers running, the setup / cleanup will be called for each one.
One gotcha is that the standard run method for both Mapper and Reducer does not catch exceptions around the map / reduce calls, so if an exception is thrown in those methods, the cleanup method will not be called.
2020 edit: as noted in the comments, this statement from 2012 (Hadoop 0.20) is no longer true; cleanup is now called as part of a finally block.

One clarification is helpful. The setup/cleanup methods are used for initialization and cleanup at the task level. Within a task, initialization happens first with a single call to the setup() method, then every call to the map() [or reduce()] function is made. After that, a single call is made to the cleanup() method before the task exits.
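For illustration, a minimal sketch of what that looks like in a Mapper (the class and field names here are just examples, not taken from the question):

public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text word = new Text();
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void setup(Context context) {
        // runs once per task, before the first map() call:
        // e.g. read side data from context.getConfiguration()
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // runs once per input record of this task's split
        word.set(value.toString());
        context.write(word, one);
    }

    @Override
    protected void cleanup(Context context) {
        // runs once per task, after the last map() call:
        // e.g. flush buffered output or close connections
    }
}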

They are called once per Mapper task or Reducer task. Here is the Hadoop code for Reducer.run():
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        while (context.nextKey()) {
            reduce(context.getCurrentKey(), context.getValues(), context);
        }
    } finally {
        cleanup(context);
    }
}
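The Mapper side follows the same pattern; its run() method in recent Hadoop versions looks like this:

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    } finally {
        cleanup(context);
    }
}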

According to the MapReduce documentation:
setup and cleanup are called for each Mapper and Reducer task.

On the reducer side, you can call job.setNumReduceTasks(1) on the job; that way the reducer's setup and cleanup will only run once.
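For example, in the driver (a minimal sketch; the job name is just a placeholder):

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "rollup");   // hypothetical job name
job.setNumReduceTasks(1);   // single reduce task: reducer setup()/cleanup() run once for the whole job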

Related

Limit the lifetime of a batch job

Is there a way to limit the lifetime of a running spring-batch job to e.g. 23 hours?
We start a batch job daily via a cron job, and the job takes about 9 hours. Under some circumstances the DB connection was so slow that the job took over 60 hours to complete. The problem is that the next job instance gets started by the cron job the next day, then another one the day after, and another one...
If this job is not finished within e.g. 23 hours, I want to terminate it and return an error. Is there a way to do that out-of-the-box with spring-batch?
Using a StepListener you can stop a job by calling StepExecution#setTerminateOnly.
stepBuilderFactory
    ...
    .writer(writer)
    .listener(timeoutListener)
    ...
    .build()
And the TimeoutListener could look like this
@Component
public class TimeoutListener implements StepListener {

    private StepExecution stepExecution;

    @BeforeStep
    public void beforeStep(StepExecution stepExecution) {
        this.stepExecution = stepExecution;
    }

    @BeforeRead
    public void beforeRead() {
        if (jobShouldStop()) {
            stepExecution.setTerminateOnly();
        }
    }

    private boolean jobShouldStop() {
        // ...
    }
}
This will gracefully stop the job, without forcefully terminate any running steps.
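The jobShouldStop() check is left open above; one possible implementation for the 23-hour limit from the question compares the step's start time with the current time (a sketch, assuming a Spring Batch version where StepExecution#getStartTime returns a java.util.Date):

private static final long MAX_RUNTIME_MILLIS = 23L * 60 * 60 * 1000;   // 23 hours

private boolean jobShouldStop() {
    // stop once the step has been running longer than the allowed window
    long elapsed = System.currentTimeMillis() - stepExecution.getStartTime().getTime();
    return elapsed > MAX_RUNTIME_MILLIS;
}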
Spring Batch specifically avoids the issue of job orchestration which this falls into. That being said, you could add a listener to your job that checks for other instances running and calls stop on them before beginning that one. Not knowing what each job does, I'm not sure how effective that would be, but it should be a start.
If you write your own job class to launch the process, you can make your class implement the StatefulJob interface, which prevents concurrent launches of the same job. Apart from that, you can write your own monitoring and stop the job programmatically after some period, but it will require some custom coding; I don't know if there is anything built-in for such a use case.

How to handle exception in MapReduceFlow of Cascading

I have written a Cascading flow which executes a MapReduce flow containing both a Mapper and a Reducer.
The reduce() method throws IllegalArgumentException. How do I handle this exception?
I have written a catch block in the class where I created the JobConf and passed it into the MapReduceFlow constructor.
In your Cascading job, you can use failure traps to do this. From the example on the link, it's something like:
flowDef.addTrap( "assertions", trap );
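The trap itself is just another Tap. A minimal sketch (Cascading 2.x style; the scheme and path here are assumptions, not from the question):

// tuples that cause an exception in the "assertions" pipe are written here
// instead of failing the whole flow
Tap trap = new Hfs( new TextLine(), "output/trap" );
flowDef.addTrap( "assertions", trap );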

Hadoop jobs using same reducer output to same file

I ran into an interesting situation, and now am looking for how to do it intentionally. On my local single-node setup, I ran 2 jobs simultaneously from the terminal. Both jobs use the same reducer and differ only in the map function (the aggregation key, i.e. the group-by). The output of both jobs was written to the output of the first job (the second job did create its own folder, but it was empty). What I am working on is providing rollup aggregations across various levels, and this behavior is fascinating to me: the aggregation output from two different levels is available in one single file (and perfectly sorted).
My question is how to achieve the same on a real Hadoop cluster with multiple data nodes, i.e. I programmatically initiate multiple jobs, all accessing the same input file, mapping the data differently but using the same reducer, with the output available in one single file and not in 5 different output files.
Please advise.
I was taking a look at merging output files after the reduce phase before I decided to ask my question.
Thanks and Kind regards,
Moiz Ahmed.
When different Mappers consume the same input file, in other words the same data structure, the source code for all these different mappers can be placed into separate methods of a single Mapper implementation, with a parameter from the context deciding which map functions to invoke. On the plus side, you need to start only one MapReduce job. Here is a pseudo-code example:
class ComplexMapper extends Mapper {

    protected BitSet mappingBitmap = new BitSet();

    protected void setup(Context context) ... {
        String params = context.getConfiguration().get("params");
        // analyze params and set bits in the mappingBitmap
    }

    protected void mapA(Object key, Object value, Context context) ... {
        .....
        context.write(keyA, value);
    }

    protected void mapB(Object key, Object value, Context context) ... {
        .....
        context.write(keyB, value);
    }

    protected void mapC(Object key, Object value, Context context) ... {
        .....
        context.write(keyC, value);
    }

    public void map(Object key, Object value, Context context) ..... {
        if (mappingBitmap.get(1)) {
            mapA(key, value, context);
        }
        if (mappingBitmap.get(2)) {
            mapB(key, value, context);
        }
        if (mappingBitmap.get(3)) {
            mapC(key, value, context);
        }
    }
}
Of course it can be implemented more elegantly with interfaces etc.
In the job setup just add:
Configuration conf = new Configuration();
conf.set("params", "AB");
Job job = new Job(conf);
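On the mapper side, setup() then only has to translate that "params" string into bits, for example (a sketch consistent with the pseudo-code above):

@Override
protected void setup(Context context) {
    String params = context.getConfiguration().get("params", "");
    // map each requested aggregation level to a bit in the bitmap
    if (params.contains("A")) mappingBitmap.set(1);
    if (params.contains("B")) mappingBitmap.set(2);
    if (params.contains("C")) mappingBitmap.set(3);
}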
As Praveen Sripati mentioned, having a single output file will force you into having just one Reducer, which might be bad for performance. You can always concatenate the part* files when you download them from HDFS. Example:
hadoop fs -text /output_dir/part* > wholefile.txt
Usually each reducer task produces a separate file in HDFS, so that the reduce tasks can operate in parallel. If the requirement is to have one output file from the reduce phase, then configure the job to have one reducer task. The number of reducers can be configured using the mapred.reduce.tasks property, which defaults to 1. The con of this approach is that there is only one reducer, which might be a bottleneck for the job to complete.
Another option is to use some other output format which allows multiple reducers to write to the same sink simultaneously, like DBOutputFormat. Once the job processing is complete, the results from the DB can be exported into a flat file. This approach will enable multiple reduce tasks to run in parallel.
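A rough sketch of the DBOutputFormat route (new API; the driver class, connection string and table/column names are placeholders):

DBConfiguration.configureDB(job.getConfiguration(),
        "com.mysql.jdbc.Driver",                      // JDBC driver (placeholder)
        "jdbc:mysql://dbhost/rollups", "user", "password");
// write each reduce output row into the "aggregates" table
DBOutputFormat.setOutput(job, "aggregates", "level", "grp_key", "value");
job.setOutputFormatClass(DBOutputFormat.class);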
Another option is to merge the output files as mentioned in the question. Based on the pros and cons of each approach and the volume of data to be processed, one of these approaches can be chosen.

How to update task tracker that my mapper is still running fine as opposed to generating timeout?

I forgot which API/method to call, but my problem is this:
My mapper will run for more than 10 minutes, and I don't want to increase the default timeout.
Rather, I want my mapper to send an update ping to the task tracker when it is in the particular code path that takes more than 10 minutes.
Please let me know what API/method to call.
You can simply increment a counter and call progress. This ensures that the task sends a heartbeat back to the TaskTracker so it knows the task is alive.
In the new API this is managed through the context, see here: http://hadoop.apache.org/common/docs/r1.0.0/api/index.html
e.g.
@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // increment counter
    context.getCounter(SOME_ENUM).increment(1);
    context.progress();
}
In the old API there is the reporter class:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html
You typically use the Reporter to let the framework know you're still alive.
Quote from the javadoc:
Mapper and Reducer can use the Reporter provided to report progress or
just indicate that they are alive. In scenarios where the application
takes an insignificant amount of time to process individual key/value
pairs, this is crucial since the framework might assume that the task
has timed-out and kill that task.
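In the old API that might look roughly like this (a sketch; the key/value types and counter names are illustrative):

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    // ... long-running work ...
    reporter.incrCounter("heartbeat", "records", 1);   // bump a counter
    reporter.progress();                               // tell the framework we're still alive
}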

How to use Hadoop for continuous input

I have a case where I use Hadoop to listen for and receive messages from a JMS queue. When the queue has a message, it triggers the map/reduce program; since we don't want the MapReduce job to die, we need to loop and execute the map/reduce code many times.
My problem is:
public boolean nextKeyValue() throws IOException — this method returns the key and value each time. If I return false, the MapReduce job runs to completion. If I return true, the map/reduce code waits for the next key/value rather than calling the reduce method. So is there any way to run the reduce method soon after the map method, while nextKeyValue() returns true to wait for the next message from the JMS queue?
Or does anybody have good ideas for having Hadoop continuously read from a data source and then do map/reduce in parallel, the same functionality as HStreaming?
