Restarting Partition Step in Spring Batch

Our application is a Spring Batch job running on OpenShift. The application calls another service via REST to fetch records from a database. Both use an nginx sidecar for handling traffic. Both sidecars restarted for some reason, and the Spring Batch job terminated suddenly. I had already implemented a retry mechanism using @Retryable, but execution never even reached the retry logic. The only log I found in the application is given below:
"Encountered an error executing step myPartitionStep in job myJob","level":"ERROR","thread":"main","logClass":"o.s.batch.core.step.AbstractStep","logMethod":"execute","stack_trace":"o.s.b.core.JobExecutionException: Partition handler returned an unsuccessful step
o.s.b.c.p.support.PartitionStep.doExecute(PartitionStep.java:112)
o.s.batch.core.step.AbstractStep.execute(AbstractStep.java:208)
o.s.b.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:152)
o.s.b.c.job.flow.JobFlowExecutor.executeStep(JobFlowExecutor.java:68)
o.s.b.c.j.f.s.state.StepState.handle(StepState.java:68)
o.s.b.c.j.f.support.SimpleFlow.resume(SimpleFlow.java:169)
o.s.b.c.j.f.support.SimpleFlow.start(SimpleFlow.java:144)
o.s.batch.core.job.flow.FlowJob.doExecute(FlowJob.java:137)
o.s.batch.core.job.AbstractJob.execute(AbstractJob.java:320)
o.s.b.c.l.s.SimpleJobLauncher$1.run(SimpleJobLauncher.java:149)
o.s.core.task.SyncTaskExecutor.execute(SyncTaskExecutor.java:50)
o.s.b.c.l.s.SimpleJobLauncher.run(SimpleJobLauncher.java:140)
j.i.r.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java)
j.i.r.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
j.i.r.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:566)
o.s.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
o.s.a.f.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
o.s.a.f.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
o.s.b.c.c.a.SimpleBatchConfiguration$PassthruAdvice.invoke(SimpleBatchConfiguration.java:128)
... 13 frames truncated"
I am not able to pinpoint the exact reason for this error. The job stopped at the partition step, which uses an ItemReader to call another service and fetch the records, and a FlatFileItemWriter to write them. We cannot afford to have duplicates in our file. Is it possible to restart the app exactly where it stopped, without producing duplicates?

The stack trace you shared is truncated, so it is not possible to see the root cause from what you posted.
Spring Batch supports restarting a failed partitioned step, as long as you use a persistent job repository. You need to restart the same job instance, i.e. use the same job parameters that you used in the first (failed) run. Only failed partitions will be rerun, and each failed partition will resume from where it left off.
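For illustration, a minimal sketch of such a restart, assuming a persistent job repository is configured and the job bean is at hand (the parameter name and value are hypothetical):

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public JobExecution restartFailedRun(JobLauncher jobLauncher, Job myJob) throws Exception {
    // Reuse the exact parameters of the failed run; different values would
    // start a brand-new job instance instead of restarting the failed one.
    JobParameters originalParameters = new JobParametersBuilder()
            .addString("runDate", "2023-01-15") // hypothetical parameter from the first run
            .toJobParameters();
    // Completed partitions are skipped; failed partitions resume from their
    // last committed state recorded in the shared job repository.
    return jobLauncher.run(myJob, originalParameters);
}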

Related

Spring batch processor crash in between reading file and causing duplicate key exception

I am currently working on a batch processor, and the issue we are facing is that while the batch processor is reading files, it can restart unexpectedly in the middle of a read. This breaks the whole flow, because when the BP resumes reading it may re-read a file whose records are already saved in the database, causing a duplicate key exception.
So, I have been told to implement a solution where, when the BP runs into a duplicate key exception, it should read the file from bottom to top, and when it runs into a duplicate key exception again, it should move on to the next file.
I am looking for advice/guidance on how to implement/code this solution.
A correctly configured Spring Batch job (persistent job repository + chunk-oriented step) would allow you to restart that kind of failed job without any issue, and without the bottom-to-top workaround.
In fact, the read count is saved in the job repository and used in a restart scenario. No data is written to the database in case of a chunk failure (the transaction is rolled back), so upon restart the job resumes reading from the last save point and saves only new data.
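As a concrete illustration, here is a minimal sketch of such a restartable chunk-oriented step (Spring Batch 4 style builders; the InputRecord type and bean names are hypothetical):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.context.annotation.Bean;

@Bean
public Step fileToDbStep(StepBuilderFactory steps,
                         FlatFileItemReader<InputRecord> reader,
                         JdbcBatchItemWriter<InputRecord> writer) {
    // The reader's position is committed to the step's ExecutionContext at
    // every chunk boundary (FlatFileItemReader has saveState=true by default).
    // A failed chunk's writes are rolled back, so on restart the step resumes
    // from the last committed line without re-inserting already-saved rows.
    return steps.get("fileToDbStep")
            .<InputRecord, InputRecord>chunk(100)
            .reader(reader)
            .writer(writer)
            .build();
}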

Spring batch limit job execution

My Spring Batch application is running on the PCF platform, connected to a MySQL database (single instance). It runs fine when only one application instance is up and running, but with more than one instance I get org.springframework.dao.DuplicateKeyException. This is probably happening because the same batch job fires at the same time in each instance and tries to update the batch instance table with the same job ID. Is there any way to prevent this kind of failure? Put another way, I want a solution where only one batch job runs at a time, even when multiple instances are running.
For me, it is a good sign that DuplicateKeyException is thrown, because it achieves exactly what you want: Spring Batch already makes sure that the same job execution is not executed in parallel (i.e. only one server instance executes the job successfully while the others fail to launch it).
So I see no harm in your case. If you don't like this exception, you can catch it and re-throw it as an application-level exception saying something like "The job is being executed by another server instance, so skipping execution here."
If you really want only one server instance to even attempt to trigger the job, with the others not trying in the meantime, that is not a Spring Batch problem but a question of how to ensure that only one server node fires the request in a distributed environment. If the batch job is fired as a scheduled task using @Scheduled, you can consider a distributed lock such as ShedLock to make sure it is executed at most once at the same time, on one node only.
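A minimal ShedLock sketch under those assumptions (the lock name, durations, and schedule are illustrative; the JDBC provider expects ShedLock's lock table to exist in the shared database):

import javax.sql.DataSource;
import net.javacrumbs.shedlock.core.LockProvider;
import net.javacrumbs.shedlock.provider.jdbctemplate.JdbcTemplateLockProvider;
import net.javacrumbs.shedlock.spring.annotation.EnableSchedulerLock;
import net.javacrumbs.shedlock.spring.annotation.SchedulerLock;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;

@Configuration
@EnableScheduling
@EnableSchedulerLock(defaultLockAtMostFor = "30m")
class SchedulerConfig {

    // Lock rows live in a table of the shared database, visible to all nodes.
    @Bean
    LockProvider lockProvider(DataSource dataSource) {
        return new JdbcTemplateLockProvider(dataSource);
    }
}

class BatchTrigger {

    @Scheduled(cron = "0 0 1 * * *")
    @SchedulerLock(name = "nightlyBatchJob", lockAtMostFor = "30m", lockAtLeastFor = "5m")
    public void launchJob() {
        // Only the node that acquires the lock runs this body;
        // the other node silently skips this execution.
    }
}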

Activate Batch on only one Server instance

I have an nginx load balancer in front of two Tomcat instances, each containing a Spring Boot application. Each Spring Boot application executes a batch that writes data to a database.
The batch executes every day at 1am.
The problem is that both instances execute the batch simultaneously, which I don't want.
Is there a way to keep the batches deployed on both instances and tell Tomcat or nginx to start the batch on the master server only (so the slave server doesn't run it)?
If one of the servers stops, the second server could start the batch on its behalf.
Is there a tool in nginx or Tomcat (or some other technology) to do that?
Thank you in advance.
Here is a simplistic design approach.
Since you have two scheduled methods in the two VMs triggered at the same time, add a random delay to both. This answer has many options on how to delay the trigger for a random duration: Spring @Scheduled annotation random delay
Inside the method, run the job only if it is NOT already started (by the other VM). This could be tracked with a new database table.
Here is the pseudo code for this design:
@Scheduled(cron = "0 0 1 * * *") // illustrative schedule expression
public void batchUpdateMethod() {
    // Check the shared database table for signs of the job running now.
    // (The check-and-mark should be atomic, e.g. a unique constraint or
    // SELECT ... FOR UPDATE, so the two VMs cannot both pass the check.)
    if (!isJobRunning()) {
        markJobRunning();      // update table to indicate job is running
        try {
            runBatchJob();     // run the batch job
        } finally {
            markJobFinished(); // update table to indicate job is finished
        }
    }
}
The database, or some common file location, should be used as a lock to synchronize the two runs, since the two VMs are independent of each other.
For a more robust design, consider Spring Batch
Spring Batch uses a database for its job repository (JobRepository). By default an in-memory datasource is used to keep track of running jobs and their status, so in your setup the two instances are (most likely) each using their own in-memory database.
Multiple instances of Spring Batch can coordinate with each other as a cluster, with one running jobs while the other acts as a backup, if the JobRepository database is shared.
For this you need to configure the two instances to use a common datasource; a sketch follows the docs links below.
Here are some docs:
https://docs.spring.io/spring-batch/docs/current/reference/html/index-single.html#jobrepository
https://docs.spring.io/spring-batch/docs/current/reference/html/job.html#configuringJobRepository
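For illustration, a minimal Spring Boot sketch of pointing both instances at one shared database (the connection details are placeholders):

import javax.sql.DataSource;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.boot.jdbc.DataSourceBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing // the auto-configured JobRepository picks up this DataSource
public class SharedJobRepositoryConfig {

    @Bean
    public DataSource dataSource() {
        // Both instances must point at the same external database so they
        // share one set of BATCH_* metadata tables.
        return DataSourceBuilder.create()
                .url("jdbc:mysql://shared-db:3306/batch") // placeholder connection details
                .username("batch_user")
                .password("change-me")
                .build();
    }
}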
If you design two app server instances to run the same job at the same time, then by design one will succeed in creating the job instance and the other will fail (and this failure can be ignored, as in the sketch below). See the Javadoc of JobRepository. This is one of the roles of the job repository: to act as a safeguard against duplicate job executions in a clustered environment.
If one of the servers stops, the second server could start the batch on his behalf. Is there a tool in nginx or tomcat (or some other technology) to do that ?
I believe there is no need for such tool or technology. If one of the servers is down at the time of the schedule, the other will be able to take things over and succeed in launching the job.
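For completeness, a minimal sketch of ignoring that expected failure on the losing node, using the checked exceptions JobLauncher.run declares (the method name is illustrative):

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersInvalidException;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.batch.core.repository.JobExecutionAlreadyRunningException;
import org.springframework.batch.core.repository.JobInstanceAlreadyCompleteException;
import org.springframework.batch.core.repository.JobRestartException;

public void launchIgnoringDuplicates(JobLauncher jobLauncher, Job job, JobParameters params)
        throws JobRestartException, JobParametersInvalidException {
    try {
        jobLauncher.run(job, params);
    } catch (JobExecutionAlreadyRunningException | JobInstanceAlreadyCompleteException e) {
        // The other instance won the race for this job instance; skipping is safe.
    }
}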
I implemented a simple BCM-server functionality, where all servers register (create a Server-table entry) with their unique IP. The servers need to re-register within a defined time (e.g. 10 seconds). If a server does not register in time (last update timestamp > 10 sec), it gets de-registered (its Server-table entry deleted) by the servers that do register.
In the end I have a table with ordered server entries and can assign the task uniquely to the registered servers.
The implementation is very simple and works perfectly.
I also considered the Spring Batch job-sharing functionality beforehand, but I wanted a more lightweight and more flexible solution.
I currently use it in all my projects where I need batch processing.

Step Failure not reported by Composed Task Runner or reflected in Spring Cloud Dataflow Tables

Currently we are using Spring Cloud Data Flow to run a sequence of apps we have created, based on a definition. Each of the apps is a Spring Batch job with individual steps. The issue we are having is that when one of the steps inside an app's batch job fails, it is reflected as expected in the step_execution, job_execution, and task_execution tables in the SCDF database. However, we are not able to rerun a failed SCDF job from the top SCDF level, because the step_execution row for SCDF's step representing the overall app never propagates to FAILED in the status column; it is always COMPLETED no matter what happens.
Below I have included a picture which shows what I mean: test-simple8-test-app is the app we have created, while check-step, sleep-step, and should-error-step are steps inside the job for that app. You can see that should-error-step has FAILED for both ExitCode and Status, while the entry for the app itself has COMPLETED for Status and FAILED for ExitCode.
[Screenshot: relevant table]
We have tried altering what we report in the task_execution table, since we saw that CTR looks for certain fields there, but it still does not affect the Status column in step_execution. If we manually change that entry in the database to FAILED, things proceed as we would expect and as is normal for Spring Batch: the job resumes from that app and re-executes it.
Is there a good way to resolve this problem, or is it a problem with the way we are approaching it?
Edit: Added Flow Diagram for better clarity

Spring Batch: Horizontal scaling of Job Repository

I have read a lot about how to enable parallel processing and chunking of an individual job using the master/slave paradigm. Consider an already implemented Spring Batch solution that was intended to run on a standalone server. With minimal refactoring, I would like to enable it to scale horizontally and be more resilient in production operation. Speed and efficiency are not goals.
http://www.mkyong.com/spring-batch/spring-batch-hello-world-example/
In the following example, a Job Repository is used that connects to and initializes a database schema for the Job Repository. Job initiation requests are fed to a message queue on which a single server, running a single Java process, listens via Spring JMS. When a message arrives, that process spawns a new Java process which is the Spring Batch job. If the job has not been started according to the Job Repository, it will begin. If the job had failed, it will pick up where it left off. If the job is in progress, the request is ignored.
The single point of failure is the single server and its single listening process for job initiation. I would like to increase resiliency by horizontally scaling identical server instances, all competing to be the first to grab the job initiation message when it appears in the queue. That instance will then attempt to run the job.
I was envisioning that all instances of the JobRepository would share the same schema, so they can all query whether a job's status is currently in progress and decide what to do. I am unsure, though, whether this schema or the JobRepository implementation is meant to be used by multiple instances.
Is there a risk that this approach could deadlock the database? There are other constraints that prevent the partitioning features of Spring Batch from working for my application.
I decided to build a prototype to test whether the Spring Batch Job Repository schema and SimpleJobRepository can be used in a load-balanced way, with multiple Spring Batch Java processes running concurrently. I was afraid that deadlock scenarios might occur at the database, leaving all running job processes stuck.
My Test
I started with the mkyong Spring Batch HelloWorld example and made some changes so that it could be packaged into a jar executable from the command line. I also removed the initialize-database step defined in the database.config file and manually established a local MySQL server with the proper schema elements. I added a job parameter holding the current time in millis so that each job instance would be unique, roughly as sketched below.
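A hedged sketch of that launch code (the job and launcher names are hypothetical, and jobLauncher.run throws checked exceptions that a real caller must handle):

// The 'time' parameter differs on every launch, so Spring Batch treats
// each run as a new job instance rather than a restart of an old one.
JobParameters params = new JobParametersBuilder()
        .addLong("time", System.currentTimeMillis())
        .toJobParameters();
jobLauncher.run(helloWorldJob, params);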
Next, I wrote a separate Java main class that used the Apache Commons Exec framework to create 50 subprocesses with no wait between them. Each of these processes also has a Thread.sleep of 1 second inside its Processor object, so that a number of processes kick off at the same time and all attempt to access the database simultaneously.
Results
After running this test a number of times in a row, I saw that all 50 Spring Batch processes consistently completed successfully and updated the same database schema correctly. I see no indication that multiple Spring Batch job processes running on multiple servers against the same database would interfere with each other at the schema level, nor any indication that a deadlock could happen.
So it sounds as if load balancing Spring Batch jobs without the advanced master/slave and step-partitioning approaches is a valid use case.
If anybody would like to comment on my test or suggest ways to improve it, I would appreciate it.
Here is an excerpt from the Spring Batch docs on how Spring Batch handles database updates for its repository:
Spring Batch employs an optimistic locking strategy when dealing with updates to the database. This means that each time a record is 'touched' (updated) the value in the version column is incremented by one. When the repository goes back to save the value, if the version number has changed it throws an OptimisticLockingFailureException, indicating there has been an error with concurrent access. This check is necessary, since, even though different batch jobs may be running in different machines, they all use the same database tables.
