Elastic Map Reduce: continue on error? - hadoop

We use Elastic Map Reduce quite extensively, and are processing more and more data with it. Sometimes our jobs fail because the data is malformed. We've constantly revised our map scripts to handle all sorts of exceptions but sometimes there is still some malformed data that manages to break our scripts.
Is it possible to configure Elastic Map Reduce to "continue on error" even when some of the map or reduce tasks fail?
At the very least, is it possible to raise the threshold of failed tasks at which the entire cluster fails? (Sometimes we have only 1 failed task out of 500 or so, and we would like to obtain at least those results and have the cluster continue running.)
In addition, while we can revise our map script to handle new exceptions, we use the default Hadoop "aggregate" reducer, and when that fails we have no way to catch an exception. Is there any special way to handle errors in the "aggregate" reducer, or do we have to rely on whatever is available to us in question #2 above (increasing the allowed number of failed tasks)?

You may catch Exception in both the mapper and the reducer, and inside the catch block increment a counter like the following:
catch (Exception ex) {
    context.getCounter("CUSTOM_COUNTER", ex.getMessage()).increment(1);
    // also log the payload which caused the exception
    System.err.println(GENERIC_INPUT_ERROR_MESSAGE + key + "," + value);
    ex.printStackTrace();
}
If the exception message is something you would have expected and the counter's value is acceptable, then you can very well go ahead with the results; otherwise, investigate the logs. I know catching Exception isn't advised, but if you want to "continue on error", then it's pretty much the same thing. Since the cost of the cluster is at stake here, I think we are better off catching Exception instead of only specific exceptions.
There may be side effects, though: your code might be run on entirely wrong input which, but for the catch, would have failed much earlier. But the chances of something like that happening are quite low.
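If you go down this route, it also helps to read those counters back in the driver once the job finishes, so you can decide whether the number of swallowed exceptions is acceptable before using the output. A minimal sketch, assuming the new org.apache.hadoop.mapreduce API; the job variable and the acceptableFailures threshold are placeholders, and "CUSTOM_COUNTER" matches the group name used in the catch block above:
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;

// "job" is the same Job instance submitted by your driver (setup omitted)
boolean completed = job.waitForCompletion(true);
long swallowed = 0;
for (Counter counter : job.getCounters().getGroup("CUSTOM_COUNTER")) {
    // one counter per distinct exception message, as incremented in the catch block
    System.out.println(counter.getName() + " = " + counter.getValue());
    swallowed += counter.getValue();
}
if (completed && swallowed <= acceptableFailures) { // acceptableFailures: your own threshold
    // go ahead and use the results
}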
EDIT:
For your point #2, you may set the maximum number of allowed failures per tracker by using the following:
conf.setMaxTaskFailuresPerTracker(noFailures);
OR
The configuration property you must set is mapred.max.tracker.failures. As you may know, the default is 4. For all other mapred configuration properties, see here.
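Putting that together, a rough sketch using the old org.apache.hadoop.mapred API (which is where setMaxTaskFailuresPerTracker lives). The percentage setters are, if I remember correctly, what actually lets a job finish successfully while a small fraction of its tasks fail permanently, so verify them against your Hadoop/EMR version; the class name and numbers are placeholders:
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyJobDriver.class);   // MyJobDriver is a placeholder driver class
// Same effect as setting mapred.max.tracker.failures (default 4):
conf.setMaxTaskFailuresPerTracker(10);
// These correspond to mapred.max.map.failures.percent / mapred.max.reduce.failures.percent
// and allow the job to complete even if up to N% of its tasks fail permanently:
conf.setMaxMapTaskFailuresPercent(1);
conf.setMaxReduceTaskFailuresPercent(1);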

If I am reading your question right, you can have your cluster continue to the next step on failure by setting the step action in the elastic-mapreduce call of the Ruby-based EMR command-line tool:
--jar s3://elasticmapreduce/libs/script-runner/script-runner.jar --args "s3://bucket/scripts/script.sh" --step-name "do something using bash" --step-action CONTINUE \
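If you are adding steps programmatically rather than through the Ruby CLI, the same "continue on failure" behaviour should be available as the step's action-on-failure setting. A sketch using, from memory, the AWS SDK for Java v1 classes; the cluster id and script path are placeholders, and the exact class names should be checked against your SDK version:
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.ActionOnFailure;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

StepConfig step = new StepConfig()
        .withName("do something using bash")
        .withActionOnFailure(ActionOnFailure.CONTINUE)   // keep the cluster and later steps running
        .withHadoopJarStep(new HadoopJarStepConfig()
                .withJar("s3://elasticmapreduce/libs/script-runner/script-runner.jar")
                .withArgs("s3://bucket/scripts/script.sh"));

emr.addJobFlowSteps(new AddJobFlowStepsRequest()
        .withJobFlowId("j-XXXXXXXXXXXX")                 // placeholder cluster id
        .withSteps(step));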

Related

How to stop Spring batch job when error in processor

In my batch job, I have a single step that reads from the database, processes the record, and writes the same record back to the same table (i.e., updating the record with the processed values, or with the error reason if processing failed).
I am using AsyncItemProcessor for multi-threaded processing. When I get an error in the ItemProcessor.process() method, I throw an exception and the batch job ends with FAILED status. This FAILED status is a requirement.
Because it is an AsyncItemProcessor, I am unable to access ItemProcessListener.onProcessError().
How do I write the errorMessage to the Item table when there is an error?
This is a known limitation of using the AsyncItemProcessor which is mentioned in its Javadoc:
While not an exhaustive list, things like StepExecution.filterCount will not reflect the number of filtered items and ItemProcessListener.onProcessError(Object, Exception) will not be called.
There is an open issue to update the reference documentation as well.
How do I write the errorMessage to the Item table when there is an error?
The AsyncItemProcessor submits a FutureTask to the task executor, and the only way to know whether an exception happened in the task is to unwrap the future (the exception will actually be wrapped in a java.util.concurrent.ExecutionException when FutureTask.get is called). Since the future is unwrapped in the AsyncItemWriter, you can use an ItemWriteListener and react to processing errors there. You can find a complete example here.
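For illustration, here is a rough sketch of that listener approach, assuming Spring Batch 4.x (the List-based ItemWriteListener signature) and a hypothetical ITEM table with ID and ERROR_MESSAGE columns; the SQL, the JdbcTemplate wiring, and the currentItemId() helper are placeholders to adapt to your own schema:
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

import org.springframework.batch.core.ItemWriteListener;
import org.springframework.jdbc.core.JdbcTemplate;

public class ProcessingErrorListener<T> implements ItemWriteListener<Future<T>> {

    private final JdbcTemplate jdbcTemplate;

    public ProcessingErrorListener(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void beforeWrite(List<? extends Future<T>> items) { }

    @Override
    public void afterWrite(List<? extends Future<T>> items) { }

    @Override
    public void onWriteError(Exception exception, List<? extends Future<T>> items) {
        // The processor's exception may still be wrapped in the ExecutionException
        // thrown by FutureTask.get(), so unwrap it before persisting the message.
        Throwable cause = (exception instanceof ExecutionException && exception.getCause() != null)
                ? exception.getCause() : exception;
        // Hypothetical table/columns; currentItemId() stands in for however you track
        // which item was being processed (e.g. stored by the delegate processor).
        jdbcTemplate.update("UPDATE ITEM SET ERROR_MESSAGE = ? WHERE ID = ?",
                cause.getMessage(), currentItemId());
    }

    private Long currentItemId() {
        return null; // placeholder -- see comment above
    }
}
Register it on the step via the step builder's listener(...) method so that onWriteError fires when the AsyncItemWriter's unwrapping of a Future fails, without changing the job's FAILED status.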

Laravel Jobs fail on Redis when attempting to use throttle

End Goal
The aim is for my application to fire off potentially a lot of emails to the Redis queue (this bit is working) and then have Redis throttle the processing of these so that only a set number of emails is processed every selected number of minutes.
For this example, I have a test job that appends the time to a file and I am attempting to throttle it to once every 60 seconds.
The story so far....
So far, I have the application successfully pushing a test batch of 50 jobs to the Redis queue. I can log in to Horizon and see these 50 jobs in the "processjob" queue. I can also log in to redis-cli and see 50 entries under the list key "queues:processjob".
My issue is that as soon as I attempt to put the throttle on, only 1 job runs and the rest fail with the following error:
Predis\Response\ServerException: ERR Error running script (call to f_29cc07bd431ccbf64637e5dcb60484560fdfa2da): #user_script:10: WRONGTYPE Operation against a key holding the wrong kind of value in /var/www/html/smhub/vendor/predis/predis/src/Client.php:370
If I remove the throttle, all works fine and 5 jobs are instantly run.
I thought maybe it was an incorrect key name, but if I change the following:
public function handle()
{
    Redis::throttle('queues:processjob')->allow(1)->every(60)->then(function () {
        Storage::disk('local')->append('testFile.txt', date("Y-m-d H:i:s"));
    }, function () {
        return $this->release(10);
    });
}
to this:
public function handle()
{
    Redis::funnel('queues:processjob')->limit(1)->then(function () {
        Storage::disk('local')->append('testFile.txt', date("Y-m-d H:i:s"));
    }, function () {
        return $this->release(10);
    });
}
then it all works fine.
My thoughts...
Something tells me that the issue is that the Redis key is of type "list" and that the jobs are all under a single list. That being said, if it didn't work this way, how would we throttle a queue, since the throttle requires a unique key?
For anybody else who is having trouble getting this to work and is hitting the same issue I was, this is what resolved it for me:
The Fault
I assumed that Redis::throttle('queues:processjob') was meant to refer to the queue that you wanted to throttle. However, after some re-reading of the documentation and testing of the code, I realized that this was not the case.
The Fix
Redis::throttle('queues:processjob') is meant to point to its own 'holding' queue and so must be a unique Redis key name. Therefore, changing it to Redis::throttle('throttle:queues:processjob') worked fine for me.
The workings
When I first looked into this, I assumed that Redis::throttle('this') throttled the queue that you specified. To some degree this is correct, but it will not work if the job was created via another means.
Redis::throttle('this') actually creates a new 'holding' queue where the jobs go until the condition(s) you specify are met. So jobs will go to the queue 'this' in this example and when the throttle trigger is released, they will be passed to the queue specified in their execution code. In this case, 'queues:processjob'.
I hope this helps!

How can I troubleshoot silently failing queued jobs?

I have a job that is dispatched with two arguments - path and filename to a file. The job parses the file using simplexml, then makes a notice of it in the database and moves the file to the appropriate folder for safekeeping. If anything goes wrong, it moves the file to another folder for failed files, as well as creates an event to give me a notification.
My problem is that sometimes the job will fail silently. The job is removed from the queue, but the file has not been parsed and it remains in the same directory. The failed_jobs table is empty (I'm using the database queue driver for development) and the failed() method has not been triggered. The Queue::failing() method I put in the app service provider has not been triggered either - I know, since both of those contain only a single log call to check whether they were hit. The Laravel log is empty (it's readable and Laravel does write to it for other errors - I double-checked) and so are relevant system log files such as e.g. php's.
At first I thought it was a timeout issue, but the queue listener has not failed or stopped, nor been restarted. I increased the timeout to 300 seconds anyway, and verified that all of the "[datetime] Processed: [job]" lines the listener generates were well within that timespan. Php execution times etc. are also far longer than required for this job.
So how on earth can I troubleshoot this when the logs are empty, nothing appears to fail, and I get no notification of what's wrong? If I queue up 200 files then maybe 180 will be processed and the remaining 20 fail silently. If I refresh the database + migrations and queue up the same 200 again, then maybe 182 will be processed and 18 will fail silently - but they won't necessarily be the same.
My handle method, simplified to show relevant bits, looks as follows:
public function handle()
{
    try {
        $xml = simplexml_load_file($this->path.$this->filename);
        $this->parse($xml);
        $parsedFilename = config('feeds.parsed path').$this->filename;
        File::move($this->path.$this->filename, $parsedFilename);
    } catch (Exception $e) {
        // if I put deliberate errors in the files, this works fine
        $errorFilename = config('feeds.error path').$this->filename;
        File::move($this->path.$this->filename, $errorFilename);
        event(new ParserThrewAnError($this->filename));
    }
}
Okay, so I still have absolutely no idea why, but... after restarting the VM I have tested eight times with various different files and options and had zero problems. If anyone can guess the reason, feel free to reply and I'll accept your answer if it sounds reasonable. For now, I'll mark my own answer as correct once I can, in case somebody else stumbles across this later.

How to use hadoop for continue input

I have a case where I use Hadoop to listen for and receive messages from a JMS queue. Whenever the queue has a message, it should trigger the map/reduce program; since we don't want the map/reduce job to die, we need to execute the map/reduce code in a loop many, many times.
My problem is:
public boolean nextKeyValue() throws IOException — this method returns the key and value each time. If I return false, the map/reduce job runs to completion. If I return true, the map/reduce job waits for the next key/value rather than calling the reduce method. So is there any way to run the reduce method soon after the map method while nextKeyValue returns true and waits for the next message from the JMS queue?
Or does anybody have good ideas for having Hadoop read from a continuous data source and then do map/reduce in parallel, i.e. the same functionality as HStreaming?

Ruby - Error Handling - Good Practices

This is more of an opinion-oriented question about handling exceptions in nested code, such as the following:
Assume you have a class that initializes another class to run a job. The job returns a value, which is then processed by the class that initially called it.
Where would you put the exception handling and error logging? Would you define it in the calling class, around the initialization of the job class, so it handles exceptions from the job execution, or in the job itself, or at both levels?
If the job handles exceptions internally, then you don't need to wrap the call to the job in a begin/rescue.
But the class that initializes and runs the job could itself raise exceptions, so you should handle exceptions at that level as well.
here is an example:
def some_job
  begin
    # a bunch of logic
  rescue
    # handle exception
    # log it
  end
end
it wouldn't make sense then to do this:
def some_manager
  begin
    some_job
  rescue
    # log
  end
end
but something like this makes more sense:
def some_manager
  begin
    # a bunch of logic
    some_job
    # some more logic
  rescue
    # handle exception
    # log
  end
end
And of course, you would want to rescue specific exceptions.
Probably the best answer, in general, for handling Exceptions in Ruby is reading Exceptional Ruby. It may change your perspective on error handling.
Having said that, on to your specific case. When I hear "job" I hear "background process", so I'll base my answer on that.
Your job will want to report status while it's doing its thing. This could be states like "in queue", "running", "finished", but it could also be more informative (user-facing) information: "processing first 100 out of 1000 records".
So, if an error happens in your background process, my suggestion is two-fold:
Make sure you catch exceptions before you exit the job. Your background job processor might not like a random exception coming from your code. I personally like the idea of catching the exception and saving it to the database for easy retrieval later. Then again, depending on your background job processor, it may handle error reporting for you (I think Resque does, for example).
On the front end, use AJAX (or something similar) to occasionally check in on how the job is doing, say every 10 seconds or so. In addition to getting the status of the job, make sure you return this additional information to the user (if appropriate).
