What happens when a mapper dies in EMR streaming job? - hadoop

In a elastic mapreduce streaming job, what is going to happen if a mapper suddenly dies? The data that were already processed will be replayed? If so, is there any option to disable that?
I am asking because I am using EMR to insert some data to third party database. Every mapper sends the data coming in through HTTP. In this case if a mapper crashes I don't want to replay the HTTP requests and I need to continue where I were left.

MR is a fault tolerant framework. When a Map task fails (streaming API or Java API) the behavior is the same.
Once the job tracker is notified that the task has failed it will try and reschedule the task. The temporary output generated by the failed task is deleted.
A more detailed discussion on how failures are handled in MR can be seen here
For your particular case I think you need to refer to the external source in your setup() method to find out the records which have been processed, then use this information in your mapper() methods to decide whether a particular record should be processed or not.

Related

How to know the the running status of a spring integration flow

I have a simple integration flow that poll data based on a cron job from database, publish on a DirectChannel, then do split and transformations, and publish on another executor service channel, do some operations and finally publish to an output channel, its written using dsl style.
Also, I have an endpoint where I might receive an http request to trigger this flow, at this point I send the messages one of the mentioned channels to trigger the flow.
I want to make sure that the manual trigger doesn’t happen if the flow is already running due to either the cron job or another request.
I have used the isRunning method of the StandardIntegrationFlow, but it seems that it’s not thread safe.
I also tried using .wireTap(myService) and .handle(myService) where this service has an atomicBoolean flag but it got set per every message, which is not a solution.
I want to know if the flow is running without much intervention from my side, and if this is not supported how can I apply the atomic boolean logic on the overall flow and not on every message.
How can I simulate the racing condition in a test in order to make sure my implementation prevent this?
The IntegrationFlow is just a logical container for configuration phase. It does have those lifecycle methods, but only for an internal framework logic. Even if they are there, they don't help because endpoints are always running if you want to do them something by some event or input message.
It is hard to control all of that since it is in an async state as you explain. Even if we can stop a SourcePollingChannelAdapter in the beginning of that flow to let your manual call do do something, it doesn't mean that messages in other threads are not in process any more. The AtomicBoolean cannot help here for the same reason: even if you set it to true in the MessageSourceMutator.beforeReceive() and reset back to false in its afterReceive() when message is null, it still doesn't mean that messages you pushed down in other thread are already processed.
You might consider to use an aggregator for AtomicBoolean resetting in the end of batch since you mention that you pull data from DB, so perhaps there is a number of records per poll you can track downstream. This way your manual call could be skipped until aggregator collects results for that batch.
You also need to think about stopping a SourcePollingChannelAdapter at the moment when manual action is permitted, so there won't be any further race conditions with the cron.

Pause/Resume Flink job during migrations

I'm using Apache Flink to propagate updates from a given set of Kafka topics into an Elasticsearch cluster.
The problem I'm facing is that sometimes the Elasticsearch cluster evolves and I have to (1) modify the mappings, (2) copy over the data...and by the time I have to point the Flink jobs to the new alias/index, there are plenty of updates that made it to the old index.
So I wonder what's the best way to approach this. I can have downtime, but I would like to avoid this if possible. I was trying to make the Flink jobs to slowdown or pause the (Kafka) input sources until the migration finishes, but I didn't find any endpoint for this.
The Flink jobs run in application mode.
If anyone can shed some light on how to accomplish this: pause/resume the jobs via an API or something similar, I will really appreciate the input. The only constraint I have is around stopping the applications (as in stopping/killing pods): it's possible, but too troublesome due to access constraints to the Kubernetes clusters.
I'd probably look into stopping the job with a save point using Flink REST API: https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/rest_api/#jobs-jobid-stop
If that Flink app is pretty big and has lots of state, you can also try to just stop sending data to the input Kafka topic if you don't want to stop it (assuming that it can properly write with the new mappings and indeces after you've made the required ES cluster changes without any change in the Flink Job). It is a bit of overhead, but you could have different topics for your producers and Flink sources, and have another simple Flink job mirror data from one topic (where producers produce to) to the other (where Flink consumes from). When you want to stop writing to ES, just stop that job using the REST API. To not write a new Flink job you could use MirrorMaker or similar, but to stop it you may have to kill its pod.
Or another option is architecting the Elasticsearch indexes so they can support your cluster evolution without having to stop the Flink app. It is hard to know what do you'd need to exactly change, but by writing into aliases and playing with the write index flag you may be able to achieve what you want. I've done this in the past, but it is true that if your mappings change a lot it may be hard to do,

spring batch -how to submit jobs without executing ? to be able to run them later by a scheduler

i'm starting a project in spring batch, my plan is like the following:
create spring boot app
expose an api to submit a job(without executing), that will return the job execution id to be able to track the progress of the job later by other clients
create a scheduler for running a job - i want to have a logic that will decide how many jobs i can run at any moment.
the issue is that my batch service may receive many requests for starting jobs, and i want to put the job exeuctuion in a pending status first, then a scheduler later on will check the jobs in pending status and depending on my logic will decide if it should run another set jobs.
is that possible to do in spring batch, or i need to implement it from scratch ?
The common way of addressing such a use case is to decouple job submission from job execution using a queue. This is described in details in the Launching Batch Jobs through Messages section. Your controller can accept job requests and put them in a queue. The scheduler can then control how many requests to read from the queue and launch jobs accordingly.
Spring Batch provides all building blocks (JobLaunchRequest, JobLaunchingMessageHandler, etc) to implement this pattern.

scheduling jobs using spring batch or just Quartz scheduler

I am looking for best solution to create a java web application to generate reports in excel/PDf format. some thing similar to Google Adwords, where user can create schedule reports and download it when the report is generated at a later time.
I am thinking to develop and java application where User logs, selects a pre defined report and provides the input parameters (like report date etc), This request will be queued up or saved as Quarts Job(prefer persistent Queue). A Job will be monitoring the queue/job and execute the job, generate the report(output excel /pdf) and stored in disk.
When the user refresh the screen or logs back at a later time, the report should be available for down load.
Using Spring batch and Quartz scheduler can I do this ? I also expecting like Spring admin , where I can see number of request in Queue(jobs queued up), and stop the queue processing etc.
You would use spring-batch if you wanted to process all report requests at the same time, perhaps at night when your servers are not otherwise occupied processing real-time user requests (or even during the day during slow periods).
You would use a quartz job if you wanted to check for new jobs every few seconds/minutes/hours/etc, and process one/many of them at that specified time interval.
So, quartz is a scheduler and batch is a process. You could use quartz to schedule batch jobs to run at specific times. They aren't competing technologies, they are complimentary.
About your question:
Given that you talk about queues and their persistence however it sounds a lot like your problem would fit into a simple jms model. You would need some messaging software. If you want to make it easy on yourself I'd recommend using spring-jms as a wrapper around the basic Java EE JMS api -- the spring wrappers are simply simpler than basic jms. For a messaging service I'd look at RabbitMQ, because again it's pretty simple.
With the jms architecture you'd post user requests to the queue, which you'd configured to be persistent. You'd have a custom listener on the queue, passing requests to a report generator whenever it runs. You can assign one or more threads to the listener, meaning that you should find it easy to tune the performance of the report generator.
There is a pretty useful DZone article about using rabbitmq via spring-integration (a set of prebuilt pattern implementations that help with connecting things to each other).

Spring Batch or JMS for long running jobs

I have the problem that I have to run very long running processes on my Webservice and now I'm looking for a good way to handle the result. The scenario : A user executes such a long running process via UI. Now he gets the message that his request was accepted and that he should return some time later. So there's no need to display him the status of his request or something like this. I'm just looking for a way to handle the result of the long running process properly. Since the processes are external programms, my application server is not aware of them. Therefore I have to wait for these programms to terminate. Of course I don't want to use EJBs for this because then they would block for the time no result is available. Instead I thought of using JMS or Spring Batch. Does anyone ever had the same problem or an advice which solution would be better?
It really depends on what forms of communication your external programs have available. JMS is a very good approach and immediately available in your app server but might not be the best option if your external program is a long running DB query which dumps the result in a text file...
The main advantage of Spring Batch over "just" using JMS as an aynchronous communcations channel is the transactional properties, allowing the infrastructure to retry failed jobs, group jobs together and such. Without knowing more about your specific setup, it is hard to give detailed advise.
Cheers,
I had a similar design requirement, users were sending XML files and I had to generate documents from them. Using JMS in this case is advantageous since you can always add new instances of these processes which can consume and execute the jobs in parallel.
You can use a timer task to check status or monitor these processes. Also, you can publish a message to a JMS queue once the processes are completed.

Resources