Spring Batch correctly restart uncompleted jobs in clustered environment - spring-boot

I used the following logic to restart uncompleted jobs in a single-node Spring Batch application:
public void restartUncompletedJobs() {
    try {
        jobRegistry.register(new ReferenceJobFactory(documetPipelineJob));
        List<String> jobs = jobExplorer.getJobNames();
        for (String job : jobs) {
            Set<JobExecution> runningJobs = jobExplorer.findRunningJobExecutions(job);
            for (JobExecution runningJob : runningJobs) {
                runningJob.setStatus(BatchStatus.FAILED);
                runningJob.setEndTime(new Date());
                jobRepository.update(runningJob);
                jobOperator.restart(runningJob.getId());
            }
        }
    } catch (Exception e) {
        LOGGER.error(e.getMessage(), e);
    }
}
Now I'm trying to make it work on a two-node cluster. The application on each node points to the same shared PostgreSQL database.
Consider the following example: I have 2 job instances - jobInstance1 is currently running on node1 and jobInstance2 is running on node2. Node1 is restarted for some reason during the execution of jobInstance1. After the restart, the Spring Batch application on node1 tries to restart the uncompleted jobs with the logic presented above - it sees two uncompleted job instances, jobInstance1 and jobInstance2 (which is still running correctly on node2), and tries to restart both of them. So instead of restarting only jobInstance1, it restarts both jobInstance1 and jobInstance2, even though jobInstance2 should not be restarted because it is executing correctly on node2 right now.
How can I restart, at application startup, the jobs that were not completed before the previous application termination, while preventing jobs like jobInstance2 from being restarted as well?
UPDATED
This is the solution provided in the answer below:
Get the job instances of your job with JobOperator#getJobInstances
For each instance, check if there is a running execution using JobOperator#getExecutions.
2.1 If there is a running execution, move to the next instance (in order to let the execution finish either successfully or with a failure)
2.2 If there is no currently running execution, check the status of the last execution and restart it if failed using JobOperator#restart.
I have a question regarding step 2.1 - will Spring Batch automatically restart uncompleted jobs that have a running execution after the application restarts, or do I need to take manual action to do so?

Your logic is not restarting uncompleted jobs. It is taking currently running job executions, setting their status to FAILED and restarting them. Your logic should not look for running executions; it should look for executions that are not currently running, especially failed ones, and restart those.
How to correctly restart the failed jobs and prevent jobs like jobInstance2 from also being restarted?
In pseudo code, what you need to do to achieve this is:
Get the job instances of your job with JobOperator#getJobInstances
For each instance, check if there is a running execution using JobOperator#getExecutions.
2.1 If there is a running execution, move to the next instance (in order to let the execution finish either successfully or with a failure)
2.2 If there is no currently running execution, check the status of the last execution and restart it if failed using JobOperator#restart.
In your scenario:
jobInstance1 should be restarted in step 2.2
jobInstance2 should be filtered in step 2.1 since there is a running execution for it on node 2.
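In code, a minimal sketch of those steps could look like this (assuming injected jobOperator and jobExplorer beans; the method name and the number of instances fetched are illustrative, not prescribed by Spring Batch):
public void restartFailedInstances(String jobName) throws Exception {
    // 1. Get the job instances of the job (here: the 100 most recent ones)
    for (Long instanceId : jobOperator.getJobInstances(jobName, 0, 100)) {
        // 2. Get the execution ids of this instance (most recent first)
        List<Long> executionIds = jobOperator.getExecutions(instanceId);
        if (executionIds.isEmpty()) {
            continue;
        }
        boolean hasRunningExecution = executionIds.stream()
                .map(jobExplorer::getJobExecution)
                .anyMatch(JobExecution::isRunning);
        if (hasRunningExecution) {
            continue; // 2.1: another node is still working on this instance, skip it
        }
        // 2.2: no running execution, restart the most recent one if it failed
        JobExecution lastExecution = jobExplorer.getJobExecution(executionIds.get(0));
        if (lastExecution != null && lastExecution.getStatus() == BatchStatus.FAILED) {
            jobOperator.restart(lastExecution.getId());
        }
    }
}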

Related

Spring batch master is waiting but worker startup failed in remote partitioning

I am stuck in a scenario with a Spring Batch remote partitioning job where the master started successfully but a worker failed to start. The job is deployed on AWS Batch, so the master waits indefinitely for the workers to finish since the worker cannot come up.
Can anyone suggest a way to handle this scenario? I don't want my master node to wait until a timeout has occurred.
The manager can be configured with a timeout so that it fails if the workers do not reply in time; it won't wait indefinitely (a configuration sketch follows the list below).
And if that happens, the job instance will fail and you can either:
restart it (only failed partitions will be restarted)
or abandon it and start a new instance.
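A hedged sketch of that timeout, assuming the manager aggregates worker results by polling the job repository via a MessageChannelPartitionHandler; the bean, step name and durations below are illustrative, not taken from the question:
@Bean
public MessageChannelPartitionHandler partitionHandler(MessagingTemplate messagingTemplate,
                                                       JobExplorer jobExplorer) {
    MessageChannelPartitionHandler handler = new MessageChannelPartitionHandler();
    handler.setStepName("workerStep");        // assumed worker step name
    handler.setGridSize(4);                   // number of partitions
    handler.setMessagingOperations(messagingTemplate);
    handler.setJobExplorer(jobExplorer);      // poll the job repository for worker results
    handler.setPollInterval(10_000);          // check for results every 10 seconds
    handler.setTimeout(600_000);              // give up and fail the step after 10 minutes
    return handler;
}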

How to update Spring Batch status on unexpected shutdown

I'm implementing a service that would reject job requests from being processed if an existing job is running. Unfortunately, I'm not sure if there is a way to tell the difference between a job that is actively running and a job that ended due to an unexpected shutdown like turning Tomcat off. The statuses in the tables are the same with status = STARTED and exit_code = UNKNOWN.
Set<JobExecution> jobExecutions = jobExplorer.findRunningJobExecutions("MY_JOB");
Is there a way to tell the two apart, or an implementation that would change such stale job statuses to something like ABANDONED?
There is indeed no way, by just looking at the database, to distinguish between a job that is effectively running and a job that has been abruptly killed (in both cases, the status is STARTED).
What you need to do is, in addition to checking status in the database, find a way to see if a job is effectively running. This really depends on how you run your jobs. For example, if you run your jobs in separate JVMs, you can write some code that checks if there is a JVM currently running your job. If you deploy your jobs to Kubernetes, you could ask Kubernetes if there is a pod currently running your job, etc.
However, if you can identify an execution that has been abruptly stopped and whose status is stuck at STARTED (because Spring Batch did not have a chance to update its status to FAILED as it would during a graceful shutdown), then you can manually update its status to ABANDONED and set its END_TIME to a non-null value. This way, JobExplorer#findRunningJobExecutions will no longer return it as a running execution.
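A minimal sketch of that manual update, assuming the stale execution id has already been identified by other means (the variable names are illustrative):
JobExecution staleExecution = jobExplorer.getJobExecution(staleExecutionId);
staleExecution.setStatus(BatchStatus.ABANDONED);
staleExecution.setEndTime(new Date()); // non-null end time
jobRepository.update(staleExecution);
// From now on, jobExplorer.findRunningJobExecutions("MY_JOB") no longer returns it.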

How can I start running server in one yml job and tests in another when run server job is still running

So I currently have 2 YAML pipelines... one starts the server, and after the server is up and running I start the other pipeline, which runs tests in one job and, once that's completed, starts a job that shuts down the server from the first pipeline.
I'm kinda new to YAML and wondering if there is a way to run all of this in a single pipeline...
The problem I came across is that if I run the server in the first job, I don't know how to condition the second job to kick off once the server is running. The first job never reaches a succeeded or failed state because it's still in progress - the server has to keep running for the tests to run.
I tried adding a variable that I set to true after the server is running, but it still never jumps to the next job.
I looked into templates too, but those are not very clear to me, so any suggestion, documentation or tutorial on how to achieve putting this in one pipeline would be very helpful...
I already googled a bunch and will keep googling but figured someone here might have an answer already.
Each agent can run only one job at a time. To run multiple jobs in parallel you must configure multiple agents. You also need sufficient parallel jobs.
You can specify the conditions under which each job runs. By default, a job runs if it does not depend on any other job, or if all of the jobs that it depends on have completed and succeeded. You can customize this behavior by forcing a job to run even if a previous job fails or by specifying a custom condition.
Since you have already added a variable that you set to true after the server is running, try enabling a custom condition so that the second job runs only when that variable has the expected value.
For more details, please check the official docs here:
Specify jobs in your pipeline
Specify conditions

Spring Scheduler code within an App with multiple instances with multiple JVMs

I have a Spring scheduler task configured with either fixedDelay or cron, and multiple instances of this app running on multiple JVMs.
The default behavior is that all the instances execute the scheduler task.
Is there a way to control this behavior so that only one instance executes the scheduler task and the others don't?
Please let me know if you know of any approaches.
Thank you
We had a similar problem. We fixed it like this:
Removed all @Scheduled beans from our Spring Boot services.
Created an AWS Lambda function scheduled with the desired schedule.
The Lambda function hits our top-level domain with a scheduling request.
The load balancer forwards this request to one of the service instances.
This way we are sure that the scheduled task is executed only once across the cluster of our services.
I have faced a similar problem where the same scheduled batch job was running on two servers even though it was intended to run on only one node at a time. Later I found a solution: do not launch the job if it is already running on the other server.
Job someJob = ...
Set<JobExecution> jobs = jobExplorer.findRunningJobExecutions("someJobName");
if (jobs == null || jobs.isEmpty()) {
    jobLauncher.run(someJob, jobParametersBuilder.toJobParameters());
}
So before launching the job, check whether it is already executing on another node.
Please note that this approach will only work with a DB-based job repository.
We had the same problem: our three instances were running the same job and doing the task three times every day. We solved it by making use of Spring Batch. Spring Batch allows only one running execution per job instance (identified by its job parameters), so if you start the job with an identifying parameter like the date, it rejects duplicate launches with the same parameter. In our case we used a date like '2020-1-1' (since the job runs only once a day). All three instances try to start the job with id '2020-1-1', but Spring rejects two of them, stating that job '2020-1-1' is already running.
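A minimal sketch of that approach, assuming an injected jobLauncher and a dailyJob bean (both names are illustrative):
public void launchDailyJob() throws Exception {
    // One identifying parameter per calendar day => at most one job instance per day
    JobParameters parameters = new JobParametersBuilder()
            .addString("runDate", LocalDate.now().toString())
            .toJobParameters();
    try {
        jobLauncher.run(dailyJob, parameters);
    } catch (JobExecutionAlreadyRunningException | JobInstanceAlreadyCompleteException e) {
        // Another instance already started (or finished) today's run, so skip it here.
    }
}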
If my understanding of your question is correct, that you want to run this scheduled job on a single instance, then I think you should look at ShedLock.
ShedLock makes sure that your scheduled tasks are executed at most once at the same time. If a task is being executed on one node, it acquires a lock which prevents execution of the same task from another node (or thread). Please note that if a task is already being executed on one node, execution on other nodes does not wait; it is simply skipped.
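A hedged sketch of how ShedLock is typically wired up with a JDBC-backed lock provider; the lock name, schedule and durations below are illustrative:
@Configuration
@EnableScheduling
@EnableSchedulerLock(defaultLockAtMostFor = "PT30M")
class SchedulingConfiguration {

    @Bean
    LockProvider lockProvider(DataSource dataSource) {
        // Uses a "shedlock" table in the shared database to coordinate the nodes
        return new JdbcTemplateLockProvider(dataSource);
    }
}

class NightlyTask {

    @Scheduled(cron = "0 0 1 * * *")
    @SchedulerLock(name = "nightlyTask", lockAtMostFor = "PT30M", lockAtLeastFor = "PT5M")
    public void run() {
        // Only the node that acquires the "nightlyTask" lock runs this; the others skip it.
    }
}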

How Container failure is handled for a YARN MapReduce job?

How are software/hardware failures handled in YARN? Specifically, what happens in case of container(s) failure/crash?
Container and task failures are handled by node-manager. When a container fails or dies, node-manager detects the failure event and launches a new container to replace the failing container and restart the task execution in the new container.
In the event of application-master failure, the resource-manager detects the failure and starts a new instance of the application-master in a new container.
Find the details here
The application master will re-attempt a task that completes with an exception or stops responding (4 times by default).
A job with too many failed tasks is considered a failed job.
