Chunk-oriented processing multithreaded? - Spring Batch

If I define like this for spring batch:
<chunk reader="chunkReader"
       writer="chunkWriter"
       processor="chunkProcessor"
       commit-interval="#{jobParameters['commitSize']}" />
In this spring batch's chunk-oriented processing, are the chunks processed in parallel? And are the individual items in the chunks processed in parallel?
I am asking mainly to see if I need to worry about multithreading and race conditions.

Unless you instruct Spring otherwise, everything runs sequentially.
If you decide to use multithreading, a batch job can use Spring's TaskExecutor abstraction to execute each chunk in its own thread. A step in a job can be configured to execute within a thread pool, processing each chunk independently. As chunks are processed, Spring Batch keeps track of what has been done accordingly. If an error occurs in any one of the threads, the job's processing is rolled back or terminated per the regular Spring Batch functionality.
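As a rough illustration in Java config (a sketch only; the pool size, chunk size, String item type, and bean names are placeholders standing in for the XML configuration above):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.TaskExecutor;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class MultiThreadedStepConfig {

    @Bean
    public TaskExecutor stepTaskExecutor() {
        // Pool of worker threads; each thread executes complete chunks
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(4);
        executor.setMaxPoolSize(4);
        executor.setThreadNamePrefix("chunk-");
        return executor;
    }

    @Bean
    public Step multiThreadedStep(StepBuilderFactory steps,
                                  ItemReader<String> chunkReader,
                                  ItemProcessor<String, String> chunkProcessor,
                                  ItemWriter<String> chunkWriter) {
        // Same reader/processor/writer as before, but each chunk is executed
        // on one of the pool's threads instead of the launching thread.
        return steps.get("multiThreadedStep")
                .<String, String>chunk(100)    // commit interval
                .reader(chunkReader)
                .processor(chunkProcessor)
                .writer(chunkWriter)
                .taskExecutor(stepTaskExecutor())
                .build();
    }
}

Note that once a task executor is in place, the reader and writer are called concurrently from multiple threads, so they need to be thread-safe; this is exactly where the race conditions you are asking about come into play.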
See: https://docs.spring.io/spring-batch/trunk/reference/html/scalability.html
and
How to set up multi-threading in Spring Batch?

Run multiple batch jobs in parallel Mule 4

I have a batch job that reads hundreds of images from an SFTP location and then encodes them into base64 and uploads them via API using HTTP connector.
I would like to make the process run quicker, so I am trying to split the payload into two via scatter-gather, sending payload1 to one batch job in a subflow and payload2 to another batch job in another subflow.
Is this the right approach?
Or is it possible to split the load in just one batch process, i.e. have one half of the payload processed by batch step 1 and the second half processed by batch step 2 at the same time?
Thank you
No, it is not a good approach. Batch jobs are always executed asynchronously (i.e. using different threads), so there is no benefit in using scatter-gather, and it has the downside of increasing resource usage.
Splitting the payload across different batch steps doesn't make sense either. You should not try to scale by adding steps.
Batch jobs are naturally meant to work in parallel by iterating over an input. The job may be able to handle the splitting itself, or you can split the input payload manually beforehand; then let it handle the concurrency automatically. There are some configurations you can use to tune it, like block sizing.

Is Spring Batch's default behavior to process the next item only after the first item is finished?

After reading this article about the possibilities of scaling and parallel processing in Spring Batch, we were wondering: what is the out-of-the-box behavior of Spring Batch?
Let's say our job has a reader, 5 steps, and a writer.
Will Spring Batch read one item, pass it through all 5 steps, write it, and only then move on to the next item? Something like a giant for loop?
Or is there some parallelism, so that while item A moves on to step 2, item B is read and handed to step 1?
I think you are misunderstanding how Spring Batch works. Let me start with that, then go into parallelism.
A chunk based step in Spring Batch consists of an ItemReader, an optional ItemProcessor, then an ItemWriter. Each of these obviously supports composition (Spring Batch provides some components for using composition in both the ItemProcessor and ItemWriter phases). Within that step, Spring Batch reads items until a given condition is met (typically chunk size). Then that list is iterated over, passing each item to the ItemProcessor. Finally, a list of all of the results from the ItemProcessor calls is passed in a single call to the ItemWriter. The concept of reading once, then doing multiple steps, then writing really isn't how Spring Batch works. The closest we get would be a single ItemReader, then using composition to create a chain of ItemProcessor calls, then a single call to an ItemWriter.
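For example, a "chain of steps per item" is typically expressed as a single composed processor. A minimal sketch (the lambdas are just placeholder transformations):

import java.util.Arrays;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.support.CompositeItemProcessor;

public class CompositionExample {

    // One reader, one writer, and a chain of ItemProcessors composed into one.
    public static ItemProcessor<String, String> chainedProcessor() {
        ItemProcessor<String, String> validate = item -> item.isEmpty() ? null : item; // returning null filters the item out
        ItemProcessor<String, String> enrich = item -> item + "-enriched";
        ItemProcessor<String, String> normalize = item -> item.toLowerCase();

        CompositeItemProcessor<String, String> composite = new CompositeItemProcessor<>();
        composite.setDelegates(Arrays.asList(validate, enrich, normalize));
        return composite;
    }
}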
With that being said, Spring Batch provides a number of parallelism options. There are five different options for scaling Spring Batch jobs. I won't go into details about each because that's beyond the scope of this and clearly discussed in other StackOverflow questions as well as the documentation. However, the list is as follows:
Multithreaded steps - Here each chunk (block of items processed within a transaction) is executed within a different thread using Spring's TaskExecutor abstraction.
Parallel steps - Here a batch job executes multiple, independent steps in parallel, again using Spring's TaskExecutor abstraction to control the threads used (see the sketch after this list).
AsyncItemProcessor/AsyncItemWriter - Here each call to the ItemProcessor is executed in its own thread. The resulting Future is passed to the AsyncItemWriter, which unwraps the Future and persists the results.
Partitioning - Spring Batch allows you to partition a data set into multiple partitions that are then executed in parallel either via local threading mechanisms or remotely.
Remote chunking - The last option is to have a master reading the data, then sending it to a pool of workers for processing and writing.
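To make the second option concrete, here is a minimal sketch of parallel steps (stepA and stepB stand in for whatever independent steps you have already defined):

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.job.builder.FlowBuilder;
import org.springframework.batch.core.job.flow.Flow;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

public class ParallelStepsConfig {

    // stepA and stepB are assumed to be independent, already-defined steps
    public Job parallelStepsJob(JobBuilderFactory jobs, Step stepA, Step stepB) {
        Flow flowA = new FlowBuilder<Flow>("flowA").start(stepA).build();
        Flow flowB = new FlowBuilder<Flow>("flowB").start(stepB).build();

        // split() hands each flow to the TaskExecutor, so the two steps run concurrently
        Flow splitFlow = new FlowBuilder<Flow>("split")
                .split(new SimpleAsyncTaskExecutor())
                .add(flowA, flowB)
                .build();

        return jobs.get("parallelStepsJob")
                .start(splitFlow)
                .end()
                .build();
    }
}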

Throttle calls to ItemProcessor in Spring Batch

I've created a Spring Batch job which reads from a flat file and processes the data using an ItemProcessor before writing to the DB using an ItemWriter; everything so far works fine.
The problem: I now need to control the number of times the process method is called. My ItemProcessor calls some API with the details, and the API will take some time to respond (not sure about the timeout), hence I should not overload the API with new messages. I need to control the calls to the API, e.g. X number of calls in Y seconds; if that limit is reached, I need to wait Z seconds before resuming the activity.
I am not sure how to achieve this in Spring Batch. I am looking at implementing a ChunkListener in the processor to track the calls; however, I am looking for a better approach.
You do not need a listener to do this.
If you do not define an asynchronous TaskExecutor in your step, then the whole processing is completely sequential.
It will read an item, process it, read the next item, process it, and so on until it has read and processed as many items as you defined in your commit interval (i.e. the size of your chunks). After that, it will put those items into a list and forward this list to the writer. This process is repeated until all elements have been read, processed, and finally written.
If you would like to process your chunks in parallel, then you can define an asynchronous TaskExecutor.
If you define an AsyncTaskExecutor in your step, you are able to configure the number of threads this TaskExecutor manages/creates. Moreover, you can also define the throttle limit of your step, which defines the number of chunks that can be processed in parallel.
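As a sketch of those two settings in Java config (the MyItem type, the reader/processor/writer references, the thread name prefix, and the numbers are all placeholders; the corresponding XML attributes are task-executor and throttle-limit on the tasklet):

// Fragment of a step definition: chunks are handed to a thread pool, and
// throttleLimit caps how many chunks may be in flight at the same time.
Step step = stepBuilderFactory.get("apiCallStep")
        .<MyItem, MyItem>chunk(50)
        .reader(reader)
        .processor(apiCallingProcessor)    // the processor that calls the remote API
        .writer(writer)
        .taskExecutor(new SimpleAsyncTaskExecutor("api-"))
        .throttleLimit(2)                  // at most 2 chunks processed concurrently
        .build();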

Spring Batch Singleton Reader and Multi-Threaded Processor

I have to configure a job using Spring Batch. Is it possible to have a single-threaded ItemReader but a multi-threaded processor?
In this case the ItemReader will create the work items to be processed by reading them from the database (by executing a predefined query), and each processor will process an item/chunk in parallel.
Take a look at the AsyncItemProcessor and AsyncItemWriter from the spring-batch-integration module. The AsyncItemProcessor executes your processor logic in a different thread, returning a Future. The AsyncItemWriter then unwraps the Future and writes the result. You can read more about this in the documentation here: https://docs.spring.io/spring-batch/apidocs/org/springframework/batch/integration/async/AsyncItemProcessor.html
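A minimal sketch of wiring them up (the delegate beans, the thread-pool choice, and the String item type are placeholders):

import org.springframework.batch.integration.async.AsyncItemProcessor;
import org.springframework.batch.integration.async.AsyncItemWriter;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemWriter;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

public class AsyncProcessingConfig {

    // Wraps the existing (single-threaded) processor so each item is processed on its own thread
    public AsyncItemProcessor<String, String> asyncProcessor(ItemProcessor<String, String> delegate) {
        AsyncItemProcessor<String, String> asyncProcessor = new AsyncItemProcessor<>();
        asyncProcessor.setDelegate(delegate);
        asyncProcessor.setTaskExecutor(new SimpleAsyncTaskExecutor("proc-"));
        return asyncProcessor;    // produces Future items for the writer
    }

    // Unwraps the Futures and hands the results to the real writer
    public AsyncItemWriter<String> asyncWriter(ItemWriter<String> delegate) {
        AsyncItemWriter<String> asyncWriter = new AsyncItemWriter<>();
        asyncWriter.setDelegate(delegate);
        return asyncWriter;
    }
}

In the step definition the reader stays single-threaded: you register the AsyncItemProcessor as the step's processor and the AsyncItemWriter as its writer, and the delegates do the actual work.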

Parallel step execution of ItemStreamReader in Spring Batch

I have an ItemStreamReader (extends AbstractItemCountingItemStreamItemReader); the reader on its own is quite fast, but the following processing takes quite some time.
From a business point of view I can process as many items in parallel as I want.
As my ItemStreamReader is reading a large JSON file with a JsonParser, it ends up being stateful. So just adding a TaskExecutor to the Step does not work; it throws parsing exceptions and produces the following log output from Spring Batch:
16:51:41.023 [main] WARN o.s.b.c.s.b.FaultTolerantStepBuilder - Asynchronous TaskExecutor detected with ItemStream reader. This is probably an error, and may lead to incorrect restart data being stored.
16:52:29.790 [jobLauncherTaskExecutor-1] WARN o.s.b.core.step.item.ChunkMonitor - No ItemReader set (must be concurrent step), so ignoring offset data.
16:52:31.908 [feed-import-1] WARN o.s.b.core.step.item.ChunkMonitor - ItemStream was opened in a different thread. Restart data could be compromised.
How can I make the processing in my Step execute in parallel across multiple threads?
Spring Batch provides a number of ways to parallelize processing. In your case, since processing seems to be the bottleneck, I'd recommend looking at two options:
AsyncItemProcessor/AsyncItemWriter
The AsyncItemProcessor and AsyncItemWriter work in tandem to parallelize the processing of items within a chunk. You can think of them as a kind of fork/join concept. The items within the chunk are read by a single thread as normal. The AsyncItemProcessor wraps your normal ItemProcessor and executes that logic on a different thread, returning a Future instead of the actual item. The AsyncItemWriter then waits for the Future to return the processed item before writing it. These classes are found in the Spring Batch Integration module. You can read more about them in the documentation here: http://docs.spring.io/spring-batch/trunk/reference/html/springBatchIntegration.html#asynchronous-processors
Remote Chunking
The AsyncItemProcessor/AsyncItemWriter paradigm works well in a single JVM, but if you need to scale your processing further, you may want to take a look at remote chunking. Remote chunking is designed to scale the processor piece of a step beyond a single JVM. Using a master/slave configuration, the master reads the input using a regular ItemReader. Then the items are sent via Spring Integration channels to the slaves for processing. The results can either be written in the slave or returned to the master for writing. It's important to note that in this approach, each item read by the master goes over the wire, so it can be very IO intensive and should only be considered if the processing bottleneck is worse than the potential impact of sending the messages. You can read more about remote chunking in the documentation here: http://docs.spring.io/spring-batch/trunk/reference/html/springBatchIntegration.html#externalizing-batch-process-execution
