Multiple Spring Batch instances for robustness and scalability

My batch use case looks like a common pattern, yet I'm not sure if Spring Batch is designed to work as I expect. Many thanks in advance for clarifications and suggestions.
There can be a number of Spring Batch based applications responsible for task processing. Processing requests come in via a REST resource, hence a ThreadPoolTaskExecutor is used (as discussed here). The JobRepository is based on JDBC, and all instances share exactly the same configuration.
What I want to achieve is a situation where each node can process any job that has been submitted. This way I can scale my solution out as load grows: I would simply add new instances that process queued requests. The solution would also be robust: if any node dies, its tasks are automatically handled by a different node.
But apparently it does not work this way with Spring Batch. Each node seems to handle only the tasks that were submitted to that very node. Even if node A has 1000 items in its queue and node B is idle, B will not take any of A's load.
That's because of how SimpleJobLauncher works: it simply queues a task in its taskExecutor after creating the corresponding JobExecution:
JobExecution jobExecution = this.jobRepository.createJobExecution(job.getName(), jobParameters);
this.taskExecutor.execute(new Runnable(job, jobParameters, jobExecution) ....
I don't think I need job partitioning, as it does not seem to fit this use case. So how do I achieve robustness and scalability with Spring Batch?


Is Spring Batch's default behavior to process the next item only after the first item has finished?

After reading this article about the possibilities of scaling and parallel processing in Spring Batch, we were wondering what the out-of-the-box behavior of Spring Batch is.
Let's say our job has a reader, 5 steps and a writer.
Will Spring Batch read one item, pass it through all 5 steps, write it, and only then move on to the next item? Something like a giant for loop?
Or is there some parallelism, so that while item A has moved on to step 2, item B is read and handed to step 1?
I think you are misunderstanding how Spring Batch works. Let me start with that, then go into parallelism.
A chunk-based step in Spring Batch consists of an ItemReader, an optional ItemProcessor, and an ItemWriter. Each of these supports composition (Spring Batch provides some components for using composition in both the ItemProcessor and ItemWriter phases). Within that step, Spring Batch reads items until a given condition is met (typically the chunk size). Then that list is iterated over, passing each item to the ItemProcessor. Finally, a list of all of the results from the ItemProcessor calls is passed in a single call to the ItemWriter. The concept of reading once, then doing multiple steps, then writing really isn't how Spring Batch works. The closest we get would be a single ItemReader, then using composition to create a chain of ItemProcessor calls, then a single call to an ItemWriter.
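To make the read/process/write cycle described above concrete, here is a minimal sketch in plain Java. It is a simplified model, not Spring Batch's actual implementation: the class ChunkLoopSketch and the method runStep are hypothetical names, and an Iterator, a Function and a Consumer stand in for the ItemReader, ItemProcessor and ItemWriter. Transactions, skips and restarts are left out.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Simplified model of a chunk-oriented step; not Spring Batch's actual code.
final class ChunkLoopSketch {

    // Reads up to chunkSize items, processes each one, then writes the whole
    // chunk in a single call. Returns the number of chunks that were written.
    static <I, O> int runStep(Iterator<I> reader,
                              Function<I, O> processor,
                              Consumer<List<O>> writer,
                              int chunkSize) {
        int chunks = 0;
        while (reader.hasNext()) {
            // 1. Read until the chunk is full or the input is exhausted.
            List<I> items = new ArrayList<>();
            while (items.size() < chunkSize && reader.hasNext()) {
                items.add(reader.next());
            }
            // 2. Pass each item of the chunk to the processor, one by one.
            List<O> processed = new ArrayList<>();
            for (I item : items) {
                processed.add(processor.apply(item));
            }
            // 3. Hand the complete list of results to the writer in one call.
            writer.accept(processed);
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<List<String>> written = new ArrayList<>();
        Function<Integer, String> processor = i -> "item-" + i;
        Consumer<List<String>> writer = written::add;
        int chunks = runStep(List.of(1, 2, 3, 4, 5).iterator(), processor, writer, 2);
        System.out.println(chunks + " chunks: " + written);
    }
}
```

With five input items and a chunk size of 2, the writer is called three times: twice with two items and once with the single remaining item.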
With that being said, Spring Batch provides a number of parallelism options. There are five different options for scaling Spring Batch jobs. I won't go into details about each because that's beyond the scope of this answer and is clearly discussed in other StackOverflow questions as well as the documentation. However, the list is as follows:
Multithreaded steps - Here each chunk (block of items processed within a transaction) is executed within a different thread using Spring's TaskExecutor abstraction.
Parallel steps - Here a batch job executes multiple, independent steps in parallel, again using Spring's TaskExecutor abstraction to control the threads used.
AsyncItemProcessor/AsyncItemWriter - Here each call to the ItemProcessor is executed in its own thread. The resulting Future is passed to the AsyncItemWriter, which unwraps the Future and persists the results.
Partitioning - Spring Batch allows you to partition a data set into multiple partitions that are then executed in parallel either via local threading mechanisms or remotely.
Remote chunking - The last option is to have a master reading the data, then sending it to a pool of workers for processing and writing.
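As a rough illustration of one of these options, the sketch below models local partitioning with plain JDK threads: a data set is split into contiguous ranges and each range is handled by its own worker. It does not use Spring Batch's partitioning API; PartitionSketch, split and sumInParallel are hypothetical names, and summing integers stands in for the real per-partition work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-JDK illustration of local partitioning; no Spring Batch classes are used.
final class PartitionSketch {

    // Splits [0, total) into `partitions` contiguous ranges of near-equal size.
    // Each range is returned as {startInclusive, endExclusive}.
    static List<int[]> split(int total, int partitions) {
        List<int[]> ranges = new ArrayList<>();
        int base = total / partitions;
        int remainder = total % partitions;
        int start = 0;
        for (int p = 0; p < partitions; p++) {
            int size = base + (p < remainder ? 1 : 0);
            ranges.add(new int[] {start, start + size});
            start += size;
        }
        return ranges;
    }

    // Processes every partition on its own worker thread and combines the results.
    static long sumInParallel(int[] data, int partitions) {
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int[] range : split(data.length, partitions)) {
                futures.add(pool.submit(() -> {
                    long sum = 0;
                    for (int i = range[0]; i < range[1]; i++) {
                        sum += data[i];
                    }
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> future : futures) {
                total += future.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The key property is that partitions do not share state, which is what lets them run independently, whether on local threads as here or on remote workers.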

Spring Scheduling Quartz and thousands of jobs

According to the business logic of my Spring Boot application, which uses Quartz scheduling with MongoDB as persistent job storage, every user of the system can create a postponed job that must be executed at some point in time. The user chooses the time at which it must be executed.
Right now I'm thinking about the approach where every user will create a dedicated JobDetail for every postponed job, something like this:
schedulerFactoryBean.getScheduler().addJob(jobDetail(), true, true);
The potential issue I see is that with this approach I can quickly end up with thousands of jobs in the Quartz scheduler. I have never scheduled that many jobs in Spring with Quartz before and don't know how the system will handle it. Is it a good idea to implement the system this way, and will Quartz handle that many jobs without problems?
Yes, Quartz itself can handle thousands of jobs and triggers without any issues.
If you are going to have many jobs executing concurrently, just make sure that you configure Quartz with a sufficient number of worker threads. The number of worker threads should typically equal the maximum number of jobs that can be running concurrently, plus a small buffer (10% or so) just in case.
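As a small sketch of that sizing rule: the helper below computes the peak concurrency plus a 10% buffer and stores it under Quartz's org.quartz.threadPool.threadCount property. The class and method names are hypothetical, and the 10% figure is just the rule of thumb from above, so treat it as a starting point rather than a tuned value.

```java
import java.util.Properties;

// Hypothetical helper; only the property key is taken from Quartz's configuration.
final class QuartzThreadPoolSketch {

    // Peak concurrent jobs plus a 10% buffer, rounded up, using integer math
    // to avoid floating-point rounding surprises.
    static int workerThreads(int maxConcurrentJobs) {
        return maxConcurrentJobs + (maxConcurrentJobs + 9) / 10;
    }

    // Builds the Quartz properties that carry the computed thread count.
    static Properties quartzProperties(int maxConcurrentJobs) {
        Properties props = new Properties();
        props.setProperty("org.quartz.threadPool.threadCount",
                Integer.toString(workerThreads(maxConcurrentJobs)));
        return props;
    }

    public static void main(String[] args) {
        System.out.println(quartzProperties(20));
    }
}
```

In a Spring setup, properties like these would presumably be handed to the scheduler factory (for example through SchedulerFactoryBean's Quartz properties) instead of being printed.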
From what you write I assume that your jobs will be one-off jobs, i.e. each job will be executed only once. If that is the case, Quartz can automatically discard your jobs as soon as they finish executing unless your jobs are marked as durable. Quartz automatically removes non-durable jobs if they are not scheduled to run in the future. This feature may help you reduce the total number of registered jobs.
I hope this helps. If not, please ask.

Spring Boot, Cron job synchronization

In my Spring Boot application, I need to process 2000 products in my database from a cron job that runs every 5 minutes.
Right now, processing these 2000 products takes more than 5 minutes. I ran into the issue where the second cron run starts while the first one has not completed yet.
Is there out-of-the-box functionality in Spring/cron that synchronizes these runs, so that the next one waits for the previous one to complete before starting?
Please advise how to properly design and implement such a system, using Spring/cron on its own or together with the other technologies that are available to me: Neo4j, MongoDB, Kafka.
1) You may try to use @Scheduled(fixedDelay = 5*60*1000). It guarantees that the next invocation happens exactly 5 minutes after the previous one has finished. But this may break your scheduling requirements.
2) You can limit the underlying task executor's pool size to 1 thread, so that the next invocation has to wait until the previous one has finished. But this, again, can break the logic, since it affects all periodic tasks invoked by @Scheduled.
3) You can use Quartz instead of Spring's native @Scheduled. It's more complicated to configure, but it lets you achieve the desired behaviour via the @DisallowConcurrentExecution annotation or via setting JobDetail::isConcurrentExectionDisallowed in your job details.
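To show what the fixed-delay semantics of option 1 buy you, here is a sketch using the JDK's ScheduledExecutorService.scheduleWithFixedDelay, which counts the delay from the end of one run to the start of the next, as fixedDelay does. It deliberately does not use Spring's @Scheduled; FixedDelaySketch and maxOverlap are hypothetical names, and the sleep stands in for the product processing.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// JDK analogue of a fixed-delay schedule; Spring's @Scheduled is not used here.
final class FixedDelaySketch {

    // Runs a slow task at least `runs` times with a fixed delay between runs and
    // returns the highest number of invocations that were active at the same time.
    static int maxOverlap(int runs, long taskMillis, long delayMillis) {
        // Several threads are available, so any overlap would be possible in principle.
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(4);
        AtomicInteger active = new AtomicInteger();
        AtomicInteger maxActive = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(runs);
        try {
            scheduler.scheduleWithFixedDelay(() -> {
                int now = active.incrementAndGet();
                maxActive.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(taskMillis); // stands in for processing the products
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    active.decrementAndGet();
                    done.countDown();
                }
            }, 0, delayMillis, TimeUnit.MILLISECONDS);
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            scheduler.shutdownNow();
        }
        return maxActive.get();
    }

    public static void main(String[] args) {
        // The task (50 ms) takes longer than the delay (10 ms), yet runs never overlap.
        System.out.println("max concurrent runs: " + maxOverlap(3, 50, 10));
    }
}
```

The trade-off is the one mentioned in option 1: runs no longer start on a fixed 5-minute grid, because each start time depends on how long the previous run took.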

Spring XD: pass data between jobs in a composed job

What is the best way to pass parameters between Spring XD jobs within a composed job? Can I get parent's (composed job's) execution context to set job parameters to make them available in the next jobs?
In short...don't. There is no common component between the jobs within a composed job. That feature is really intended for job orchestration. You're better off handling any shared state between the jobs on your own.

Parallel step execution of ItemStreamReader in SpringBatch

I have an ItemStreamReader (it extends AbstractItemCountingItemStreamItemReader). The reader on its own is quite fast, but the subsequent processing takes quite some time.
From a business point of view I can process as many items in parallel as I want.
As my ItemStreamReader reads a large JSON file with a JsonParser, it ends up being stateful. So just adding a TaskExecutor to the Step does not work: it throws parsing exceptions and Spring Batch logs the following output:
16:51:41.023 [main] WARN o.s.b.c.s.b.FaultTolerantStepBuilder - Asynchronous TaskExecutor detected with ItemStream reader. This is probably an error, and may lead to incorrect restart data being stored.
16:52:29.790 [jobLauncherTaskExecutor-1] WARN o.s.b.core.step.item.ChunkMonitor - No ItemReader set (must be concurrent step), so ignoring offset data.
16:52:31.908 [feed-import-1] WARN o.s.b.core.step.item.ChunkMonitor - ItemStream was opened in a different thread. Restart data could be compromised.
How can I have the processing in my Step executed in parallel by multiple threads?
Spring Batch provides a number of ways to parallelize processing. In your case, since processing seems to be the bottleneck, I'd recommend looking at two options:
AsyncItemProcessor/AsyncItemWriter
The AsyncItemProcessor and AsyncItemWriter work in tandem to parallelize the processing of items within a chunk. You can think of them as a kind of fork/join concept. The items within the chunk are read by a single thread as normal. The AsyncItemProcessor wraps your normal ItemProcessor and executes that logic on a different thread, returning a Future instead of the actual item. The AsyncItemWriter then waits for the Future to return the processed item before writing it. These classes are found in the Spring Batch Integration module. You can read more about them in the documentation here:
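The fork/join idea can be sketched with a plain ExecutorService. This is a model of the concept only, not the Spring Batch Integration classes: AsyncChunkSketch, processAsync and unwrap are hypothetical names, with processAsync playing the role of the AsyncItemProcessor and unwrap the role of the AsyncItemWriter's wait-then-write step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Plain-JDK model of the fork/join idea; not the Spring Batch Integration classes.
final class AsyncChunkSketch {

    // Fork: submit each item of the chunk to the pool and keep the Futures in read order.
    static <I, O> List<Future<O>> processAsync(List<I> chunk,
                                               Function<I, O> processor,
                                               ExecutorService pool) {
        List<Future<O>> futures = new ArrayList<>();
        for (I item : chunk) {
            futures.add(pool.submit(() -> processor.apply(item)));
        }
        return futures;
    }

    // Join: wait for every Future and return the unwrapped results, ready to be written.
    static <O> List<O> unwrap(List<Future<O>> futures) {
        List<O> results = new ArrayList<>();
        for (Future<O> future : futures) {
            try {
                results.add(future.get());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            }
        }
        return results;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            // The chunk is read on one thread; only the processing is spread over the pool.
            List<Integer> chunk = List.of(1, 2, 3, 4);
            Function<Integer, Integer> square = i -> i * i;
            System.out.println(unwrap(processAsync(chunk, square, pool)));
        } finally {
            pool.shutdown();
        }
    }
}
```

Because reading stays on a single thread, a stateful reader like the JSON one in the question is never touched concurrently; only the processor logic needs to be thread-safe.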
Remote Chunking
The AsyncItemProcessor/AsyncItemWriter paradigm works well in a single JVM, but if you need to scale your processing further, you may want to take a look at remote chunking. Remote chunking is designed to scale the processor piece of a step beyond a single JVM. Using a master/slave configuration, the master reads the input using a regular ItemReader. Then the items are sent via Spring Integration channels to the slaves for processing. The results can either be written in the slave or returned to the master for writing. It's important to note that in this approach each item read by the master goes over the wire, so it can be very IO intensive and should only be considered if the processing bottleneck is worse than the potential cost of sending the messages. You can read more about remote chunking in the documentation here:
