How to handle exception in MapReduceFlow of Cascading - hadoop

I have written a Cascading flow that runs a MapReduce job containing both a Mapper and a Reducer.
The reduce() method throws an IllegalArgumentException. How can I handle this exception?
I have written a catch block in the class where I create the JobConf and pass it to the MapReduceFlow constructor.

In your Cascading job, you can use failure traps to do this. From the example on the link, it's something like:
flowDef.addTrap( "assertions", trap );

Related

Nested transaction in SpringBatch tasklet not working

I'm using Spring Batch for my app. In one of the batch jobs, I need to process multiple data items. Each item requires several database updates, and I need one transaction per item. That is, if an exception is thrown while processing one item, the database updates for that item are rolled back and processing continues with the next item.
I've put all database updates in one method in the service layer. In my Spring Batch tasklet, I call that method for each item, like this:
for (RequestViewForBatch request : requestList) {
orderService.processEachRequest(request);
}
In the service class the method is like this;
@Transactional(propagation = Propagation.NESTED, timeout = 100, rollbackFor = Exception.class)
public void processEachRequest(RequestViewForBatch request) {
//update database
}
When executing the task, it gives me this error message
org.springframework.transaction.NestedTransactionNotSupportedException: Transaction manager does not allow nested transactions by default - specify 'nestedTransactionAllowed' property with value 'true'
but I don't know how to solve this error.
Any suggestion would be appreciated. Thanks in advance.
The tasklet step will be executed in a transaction driven by Spring Batch. You need to remove the @Transactional annotation from your processEachRequest method.
You would need a fault-tolerant chunk-oriented step configured with a skip policy. In this case, only faulty items will be skipped. Please refer to the Configuring Skip Logic section of the documentation. You can find an example here.

How to stop Spring batch job when error in processor

In my batch job, I have a single step that reads from a database, processes each record, and writes the same record back to the same table (i.e. it updates the record with the processed values, or with the error reason if processing failed).
I am using AsyncItemProcessor for multi-threaded processing. When I get an error in the ItemProcessor.process() method, I throw an exception and the batch job ends with FAILED status. This FAILED status is a requirement.
Because it is an AsyncItemProcessor, I am unable to use ItemProcessListener.onProcessError().
How do I write the errorMessage to Item Table when there is an error ?
This is a known limitation of using the AsyncItemProcessor which is mentioned in its Javadoc:
While not an exhaustive list, things like StepExecution.filterCount will not
reflect the number of filtered items and
ItemProcessListener.onProcessError(Object, Exception) will not be called.
There is an open issue to update the reference documentation as well.
How do I write the errorMessage to Item Table when there is an error ?
The AsyncItemProcessor submits a FutureTask to the task executor and the only way to know if an exception happened in the task is by unwrapping the future (the exception will be actually wrapped in a java.util.concurrent.ExecutionException when the FutureTask.get is called). Now since the future is unwrapped in the AsyncItemWriter, you can use an ItemWriteListener and react to processing errors. You can find a complete example here.
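The wrapping described above can be reproduced with the JDK alone. The following is a minimal sketch, not Spring Batch code: the class FutureUnwrapDemo and the methods process, unwrap and runOne are hypothetical names chosen for illustration. It only shows that an exception thrown inside a submitted task surfaces as the cause of an ExecutionException when Future.get is called, which is the point where an error message could be recorded.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-Java illustration; none of these names come from Spring Batch.
class FutureUnwrapDemo {

    // Hypothetical stand-in for ItemProcessor.process(): fails on negative input.
    static String process(int item) {
        if (item < 0) {
            throw new IllegalArgumentException("negative item: " + item);
        }
        return "processed-" + item;
    }

    // Unwraps the future the way a writer would, returning either the result
    // or an error message taken from the wrapped cause.
    static String unwrap(Future<String> future) {
        try {
            return future.get();
        } catch (ExecutionException e) {
            // The exception thrown inside the task is available as the cause.
            return "ERROR: " + e.getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "ERROR: interrupted";
        }
    }

    // Submits one item to an executor and unwraps its future.
    static String runOne(int item) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Callable<String> task = () -> process(item);
            return unwrap(executor.submit(task));
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runOne(7));
        System.out.println(runOne(-1));
    }
}
```

In a real job the unwrapping happens inside the AsyncItemWriter, so the equivalent of the catch branch would live in a listener around the write rather than in your own code.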

Hadoop jobs using same reducer output to same file

I ran into an interesting situation and am now looking for a way to reproduce it intentionally. On my local single-node setup, I ran two jobs simultaneously from the terminal. Both jobs use the same reducer and differ only in the map function (the aggregation key, i.e. the group-by). The output of both jobs was written to the output of the first job (the second job did create its own folder, but it was empty). I am working on providing rollup aggregations across various levels, and this behavior fascinates me: the aggregation output from two different levels is available in one single file (and perfectly sorted).
My question is how to achieve the same on a real Hadoop cluster with multiple data nodes. That is, I programmatically start multiple jobs that all read the same input file and map the data differently but use the same reducer, and the output ends up in one single file rather than in 5 different output files.
Please advise.
I was taking a look at merge output files after reduce phase before I decided to ask my question.
Thanks and Kind regards,
Moiz Ahmed.
When different Mappers consume the same input file (in other words, the same data structure), the source code of all these mappers can be placed into separate methods of a single Mapper implementation, with a parameter from the context deciding which map functions to invoke. On the plus side, you only need to start one MapReduce job. The example is pseudocode:
class ComplexMapper extends Mapper {
    // Bit n is set when the n-th map function (mapA, mapB, mapC) is enabled.
    protected BitSet mappingBitmap = new BitSet();

    protected void setup(Context context) ... {
        String params = context.getConfiguration().get("params");
        // analyze params and set the matching bits in mappingBitmap
    }

    protected void mapA(Object key, Object value, Context context) {
        .....
        context.write(keyA, value);
    }

    protected void mapB(Object key, Object value, Context context) {
        .....
        context.write(keyB, value);
    }

    protected void mapC(Object key, Object value, Context context) {
        .....
        context.write(keyC, value);
    }

    public void map(Object key, Object value, Context context) ..... {
        if (mappingBitmap.get(1)) {
            mapA(key, value, context);
        }
        if (mappingBitmap.get(2)) {
            mapB(key, value, context);
        }
        if (mappingBitmap.get(3)) {
            mapC(key, value, context);
        }
    }
}
Of course, it can be implemented more elegantly with interfaces etc.
In the job setup just add:
Configuration conf = new Configuration();
conf.set("params", "AB");
Job job = new Job(conf);
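The step that the pseudocode leaves open is how the "params" string becomes bits. Below is a minimal sketch in plain Java, without any Hadoop classes. It assumes an encoding in which the letters 'A', 'B' and 'C' enable bits 1, 2 and 3; that encoding, the class name MappingDispatchDemo, and the emitted "keyA=..." strings are illustrative assumptions, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Plain-Java sketch of the dispatch idea; no Hadoop classes are used.
class MappingDispatchDemo {

    // Assumed encoding: 'A' -> bit 1, 'B' -> bit 2, 'C' -> bit 3.
    // Any other character is ignored.
    static BitSet parseParams(String params) {
        BitSet bits = new BitSet();
        for (char c : params.toCharArray()) {
            if (c >= 'A' && c <= 'C') {
                bits.set(c - 'A' + 1);
            }
        }
        return bits;
    }

    // Returns what the enabled map functions would emit for one input value.
    // Each entry stands in for a context.write(key, value) call.
    static List<String> map(BitSet bits, String value) {
        List<String> emitted = new ArrayList<>();
        if (bits.get(1)) {
            emitted.add("keyA=" + value);
        }
        if (bits.get(2)) {
            emitted.add("keyB=" + value);
        }
        if (bits.get(3)) {
            emitted.add("keyC=" + value);
        }
        return emitted;
    }

    public static void main(String[] args) {
        // With "AB", only the first two map functions contribute output.
        System.out.println(map(parseParams("AB"), "record"));
    }
}
```

In a real Mapper, parseParams would be called once from setup() so that the string is not re-parsed for every record.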
As Praveen Sripati mentioned, having a single output file will force you into having just one Reducer, which might be bad for performance. You can always concatenate the part-* files when you download them from HDFS. Example:
hadoop fs -text /output_dir/part* > wholefile.txt
Usually each reducer task produces a separate file in HDFS, so that the reduce tasks can operate in parallel. If the requirement is to have one output file from the reduce phase, configure the job to have one reducer task. The number of reducers can be configured using the mapred.reduce.tasks property, which defaults to 1. The drawback of this approach is that the single reducer might become a bottleneck for the job.
Another option is to use an output format that allows multiple reducers to write to the same sink simultaneously, such as DBOutputFormat. Once the job is complete, the results can be exported from the DB into a flat file. This approach lets multiple reduce tasks run in parallel.
Another option is to merge the output files, as mentioned in the OP. Based on the pros and cons of each approach and the volume of data to be processed, one of the approaches can be chosen.

How to use Hadoop for continuous input

I have a case where I use Hadoop to listen for and receive messages from a JMS queue. When the queue has a message, it should trigger the map/reduce program. We don't want the map/reduce job to die, so we need to execute the map/reduce code in a loop many times.
My problem is:
public boolean nextKeyValue() throws IOException With this method we return the key and value each time. If I return false, the map/reduce code runs to completion. If I return true, the map/reduce code waits for the next key/value rather than calling the reduce method. Is there any way to run the reduce method soon after the map method while nextKeyValue returns true and waits for the next message from the JMS queue?
Or does anybody have a good idea for having Hadoop read a continuous data source and run map/reduce in parallel, similar to what Hstreaming does?

setup and cleanup methods of Mapper/Reducer in Hadoop MapReduce

Are setup and cleanup methods called in each mapper and reducer tasks respectively? Or are they called only once at the start of overall mapper and reducer jobs?
They are called for each task, so if you have 20 mappers running, the setup / cleanup will be called for each one.
One gotcha is the standard run method for both Mapper and Reducer does not catch exceptions around the map / reduce methods - so if an exception is thrown in these methods, the clean up method will not be called.
2020 Edit: As noted in the comments, this statement from 2012 (Hadoop 0.20) is no longer true, the cleanup is called as part of a finally block.
One clarification is helpful. The setup/cleanup methods are used for initialization and clean up at task level. Within a task, first initialization happens with a single call to setup() method and then all calls to map() [or reduce()] function will be done. After that another single call will be made to cleanup() method before exiting the task.
It's called per Mapper task or Reducer task.
Here is the hadoop code.
public void run(Context context) throws IOException, InterruptedException {
setup(context);
try {
while (context.nextKey()) {
reduce(context.getCurrentKey(), context.getValues(), context);
}
} finally {
cleanup(context);
}
}
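The effect of the finally block in the run() method quoted above can be checked with a small simulation. This is a sketch in plain Java, not Hadoop's Mapper or Reducer class: TaskLifecycleDemo and its events list are hypothetical names used only to record the order of calls, and the map method fails on an empty record to stand in for an exception thrown by user code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simulation of the setup/map/cleanup lifecycle; not Hadoop's Mapper class.
class TaskLifecycleDemo {

    // Records the order in which the lifecycle methods were called.
    final List<String> events = new ArrayList<>();

    void setup() {
        events.add("setup");
    }

    // Stand-in for user code: throws on an empty record.
    void map(String record) {
        if (record.isEmpty()) {
            throw new IllegalArgumentException("empty record");
        }
        events.add("map:" + record);
    }

    void cleanup() {
        events.add("cleanup");
    }

    // Same shape as the quoted run() method: cleanup sits in a finally block.
    void run(List<String> records) {
        setup();
        try {
            for (String record : records) {
                map(record);
            }
        } finally {
            cleanup();
        }
    }

    public static void main(String[] args) {
        TaskLifecycleDemo ok = new TaskLifecycleDemo();
        ok.run(Arrays.asList("a", "b"));
        System.out.println(ok.events);

        TaskLifecycleDemo failing = new TaskLifecycleDemo();
        try {
            failing.run(Arrays.asList("a", "", "c"));
        } catch (IllegalArgumentException e) {
            // cleanup has already run by the time the exception arrives here.
            System.out.println(failing.events);
        }
    }
}
```

In both runs setup and cleanup appear exactly once per task. In the failing run, cleanup is still recorded even though the exception stops the remaining records from being mapped.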
According to the mapreduce documentation
setup and cleanup are called for each Mapper and Reducer tasks.
On the reducer side, you can call job.setNumReduceTasks(1); on the job, so that the reducer's setup and cleanup run only once.
