I have to configure a job using Spring Batch. Is it possible to have a single-threaded ItemReader but a multi-threaded processor?
In this case the ItemReader will create the work items to be processed by reading them from a database (by executing a predefined query), and each processor will process an item/chunk in parallel.
Take a look at the AsyncItemProcessor and AsyncItemWriter from the spring-batch-integration module. The AsyncItemProcessor executes your processing logic on a different thread and returns a Future; the AsyncItemWriter then unwraps the Future and writes the result. You can read more about this in the documentation here: https://docs.spring.io/spring-batch/apidocs/org/springframework/batch/integration/async/AsyncItemProcessor.html
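For reference, here is a minimal Java-config sketch of how the two are typically wired together. It assumes Spring Batch 4.x style builders (StepBuilderFactory) and spring-batch-integration on the classpath; the step name, chunk size and thread-name prefix are just placeholders.

import java.util.concurrent.Future;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.integration.async.AsyncItemProcessor;
import org.springframework.batch.integration.async.AsyncItemWriter;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

public class AsyncStepConfig {

    // The reader stays single-threaded; only the delegate processor runs on worker threads.
    public <T> Step asyncStep(StepBuilderFactory steps,
                              ItemReader<T> reader,
                              ItemProcessor<T, T> delegateProcessor,
                              ItemWriter<T> delegateWriter) {

        // Each call to the delegate processor is submitted to this executor and
        // a Future<T> is returned to the step immediately.
        AsyncItemProcessor<T, T> asyncProcessor = new AsyncItemProcessor<>();
        asyncProcessor.setDelegate(delegateProcessor);
        asyncProcessor.setTaskExecutor(new SimpleAsyncTaskExecutor("proc-"));

        // Unwraps each Future and hands the completed item to the delegate writer.
        AsyncItemWriter<T> asyncWriter = new AsyncItemWriter<>();
        asyncWriter.setDelegate(delegateWriter);

        return steps.get("asyncStep")
                .<T, Future<T>>chunk(100)
                .reader(reader)
                .processor(asyncProcessor)
                .writer(asyncWriter)
                .build();
    }
}

Note that the step's output type is Future<T> rather than T, which is why the AsyncItemProcessor and AsyncItemWriter have to be used together.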
Related
I am trying to implement multithreading in Spring Batch. In my implementation I want:
A single-threaded reader (as we are reading only one file)
A multi-threaded processor and writer
I can't use a plain multi-threaded chunk step, as the reader would then be multithreaded as well, which I don't want.
So the next option is AsyncItemProcessor/AsyncItemWriter. But from what I have seen, with AsyncItemProcessor/AsyncItemWriter the reader and writer stay single-threaded and only the AsyncItemProcessor runs in multiple threads.
Is there any way I can run the reader single-threaded and the processor and writer multi-threaded?
You can use a regular non-multi-threaded chunk-oriented step with your reader and an AsyncItemProcessor + AsyncItemWriter (those should be used in conjunction) to achieve that.
After reading this article about the possibilities of scaling and parallel processing in Spring Batch, we were wondering what the out-of-the-box behavior of Spring Batch is.
Let's say our job has a reader, 5 steps, and a writer.
Will Spring-batch read one item, pass it through all the 5 steps, write it and only then move on to the next item? Something like a giant for loop?
Or is there some parallelism, so that while item A moves on to step 2, item B is read and handed to step 1?
I think you are misunderstanding how Spring Batch works. Let me start with that, then go into parallelism.
A chunk based step in Spring Batch consists of an ItemReader, an optional ItemProcessor, then an ItemWriter. Each of these obviously supports composition (Spring Batch provides some components for using composition in both the ItemProcessor and ItemWriter phases). Within that step, Spring Batch reads items until a given condition is met (typically chunk size). Then that list is iterated over, passing each item to the ItemProcessor. Finally, a list of all of the results from the ItemProcessor calls is passed in a single call to the ItemWriter. The concept of reading once, then doing multiple steps, then writing really isn't how Spring Batch works. The closest we get would be a single ItemReader, then using composition to create a chain of ItemProcessor calls, then a single call to an ItemWriter.
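As an illustration of that last point, here is a minimal sketch of chaining processors with Spring Batch's CompositeItemProcessor; the two delegate processors (a validating one and an enriching one) are hypothetical placeholders for your own logic.

import java.util.Arrays;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.support.CompositeItemProcessor;

public class CompositionExample {

    // Each item read by the single ItemReader passes through the delegates in
    // order; the final result is what the ItemWriter receives.
    public <T> ItemProcessor<T, T> chainedProcessor(ItemProcessor<T, T> validate,
                                                    ItemProcessor<T, T> enrich) throws Exception {
        CompositeItemProcessor<T, T> composite = new CompositeItemProcessor<>();
        composite.setDelegates(Arrays.asList(validate, enrich));
        composite.afterPropertiesSet(); // verifies the delegate list is set
        return composite;
    }
}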
With that being said, Spring Batch provides a number of parallelism options. There are five different options for scaling Spring Batch jobs. I won't go into details about each because that's beyond the scope of this and clearly discussed in other StackOverflow questions as well as the documentation. However, the list is as follows:
Multithreaded steps - Here each chunk (block of items processed within a transaction) is executed within a different thread using Spring's TaskExecutor abstraction.
Parallel steps - Here a batch job executes multiple, independent steps in parallel, again using Spring's TaskExecutor abstraction to control the threads used (see the sketch after this list).
AsyncItemProcessor/AsyncItemWriter - Here each call to the ItemProcessor is executed in its own thread. The resulting Future is passed to the AsyncItemWriter, which unwraps the Future, and the results are persisted.
Partitioning - Spring Batch allows you to partition a data set into multiple partitions that are then executed in parallel either via local threading mechanisms or remotely.
Remote chunking - The last option is to have a master reading the data, then sending it to a pool of workers for processing and writing.
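To make the "parallel steps" option concrete, the following is a minimal Java-config sketch of a split flow, assuming Spring Batch 4.x style builders and two independent steps (stepA, stepB) followed by a third; all names here are placeholders.

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.job.builder.FlowBuilder;
import org.springframework.batch.core.job.flow.Flow;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

public class ParallelStepsConfig {

    public Job parallelJob(JobBuilderFactory jobs, Step stepA, Step stepB, Step stepC) {
        Flow flowA = new FlowBuilder<Flow>("flowA").start(stepA).build();
        Flow flowB = new FlowBuilder<Flow>("flowB").start(stepB).build();

        // stepA and stepB run on separate threads; stepC starts once both flows finish.
        Flow split = new FlowBuilder<Flow>("split")
                .split(new SimpleAsyncTaskExecutor("split-"))
                .add(flowA, flowB)
                .build();

        return jobs.get("parallelJob")
                .start(split)
                .next(stepC)
                .build()   // builds the FlowJobBuilder
                .build();  // builds the Job
    }
}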
If I define like this for spring batch:
<chunk reader="chunkReader"
writer="chunkWriter"
processor="chunkProcessor"
commit-interval="#{jobParameters['commitSize']}" />
With this chunk-oriented processing in Spring Batch, are the chunks processed in parallel? And are the individual items within a chunk processed in parallel?
I am asking mainly to see if I need to worry about multithreading and race conditions.
Unless you instruct Spring otherwise, it will all be sequential.
If you decide to use multithreading, a batch job can use Spring's TaskExecutor abstraction to execute each chunk in its own thread. A step in a job can be configured to run within a thread pool, processing each chunk independently. As chunks are processed, Spring Batch keeps track of what has been done accordingly. If an error occurs in any one of the threads, the job's processing is rolled back or terminated per the regular Spring Batch functionality.
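For illustration, a minimal Java-config sketch of such a multi-threaded step follows; the step name, pool size and chunk size are placeholders. In the XML configuration shown in the question, the equivalent is adding a task-executor attribute (and optionally throttle-limit) to the enclosing tasklet element.

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

public class MultiThreadedStepConfig {

    public <T> Step multiThreadedStep(StepBuilderFactory steps,
                                      ItemReader<T> reader,
                                      ItemProcessor<T, T> processor,
                                      ItemWriter<T> writer) {

        ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
        taskExecutor.setCorePoolSize(4);
        taskExecutor.setMaxPoolSize(4);
        taskExecutor.afterPropertiesSet();

        // Each chunk (read + process + write cycle) runs on one of the pool's
        // threads, so the reader, processor and writer must all be thread-safe.
        return steps.get("multiThreadedStep")
                .<T, T>chunk(10)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .taskExecutor(taskExecutor)
                .throttleLimit(4)
                .build();
    }
}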
See: https://docs.spring.io/spring-batch/trunk/reference/html/scalability.html
and
How to set up multi-threading in Spring Batch?
I have an ItemStreamReader (extending AbstractItemCountingItemStreamItemReader); the reader on its own is quite fast, but the subsequent processing takes quite some time.
From a business point of view I can process as many items in parallel as I want.
As my ItemStreamReader reads a large JSON file with a JsonParser, it ends up being stateful. So just adding a TaskExecutor to the step does not work; it throws parsing exceptions and Spring Batch produces the following log output:
16:51:41.023 [main] WARN o.s.b.c.s.b.FaultTolerantStepBuilder - Asynchronous TaskExecutor detected with ItemStream reader. This is probably an error, and may lead to incorrect restart data being stored.
16:52:29.790 [jobLauncherTaskExecutor-1] WARN o.s.b.core.step.item.ChunkMonitor - No ItemReader set (must be concurrent step), so ignoring offset data.
16:52:31.908 [feed-import-1] WARN o.s.b.core.step.item.ChunkMonitor - ItemStream was opened in a different thread. Restart data could be compromised.
How can I get the processing in my step executed in parallel by multiple threads?
Spring Batch provides a number of ways to parallelize processing. In your case, since processing seems to be the bottleneck, I'd recommend looking at two options:
AsyncItemProcessor/AsyncItemWriter
The AsyncItemProcessor and AsyncItemWriter work in tandem to parallelize the processing of items within a chunk. You can think of them as a kind of fork/join concept. The items within the chunk are read by a single thread as normal. The AsyncItemProcessor wraps your normal ItemProcessor and executes that logic on a different thread, returning a Future instead of the actual item. The AsyncItemWriter then waits for the Future to return the processed item before writing it. These classes are found in the Spring Batch Integration module. You can read more about them in the documentation here: http://docs.spring.io/spring-batch/trunk/reference/html/springBatchIntegration.html#asynchronous-processors
Remote Chunking
The AsyncItemProcessor/AsyncItemWriter paradigm works well in a single JVM, but if you need to scale your processing further, you may want to take a look at remote chunking. Remote chunking is designed to scale the processor piece of a step beyond a single JVM. Using a master/slave configuration, the master reads the input using a regular ItemReader. The items are then sent via Spring Integration channels to the slaves for processing. The results can either be written on the slave or returned to the master for writing. It's important to note that in this approach each item read by the master goes over the wire, so it can be very I/O intensive and should only be considered if the processing bottleneck is worse than the potential impact of sending the messages. You can read more about remote chunking in the documentation here: http://docs.spring.io/spring-batch/trunk/reference/html/springBatchIntegration.html#externalizing-batch-process-execution
Is it possible in Spring Batch to have one reader read the data and have that data split across multiple writers that process it in parallel?
Steps:
Reader : JdbcCursorItemReader reads 100 records
10 Parallel Writers: Each ItemWriter gets 10 records to process.
I've looked at:
CompositeItemWriter: it seems to pass all the read items to all the writers, whereas I need to split the items evenly across the writers.
BackToBackPatternClassifier: I don't really need a classifier because I'm splitting items evenly.
Is there another way of having just one reader and multiple writers?
Or can I just manually create threads in my writer?
What do you mean by "multiple writers"?
What you are trying to achieve is NOT multiple writers, but a single writer with multiple threads.
To be clear, when we talk about "multiple writers", we mean that a reader reads a chunk and different kinds of "writing" need to be done for that chunk. For example, you may have a PlayerRecordReader which reads Player records from somewhere, plus a PlayerDbWriter and a PlayerFileWriter which write to a DB and a file respectively. Multiple writers are not for distributing the load.
For the case where you want the writing done in parallel, what you need is a single writer (which you of course need to make thread-safe) and a task executor in your step definition. This page in the Spring Batch documentation gives clear instructions on how to do it: http://static.springsource.org/spring-batch/reference/html/scalability.html#multithreadedStep
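As a rough illustration of the "make it thread-safe" part, here is a sketch of a writer that simply serializes concurrent write() calls (using the pre-5.0 ItemWriter signature that takes a List). Whether you need the synchronization at all depends on the underlying resource; stock writers such as JdbcBatchItemWriter are generally safe to share between threads.

import java.util.List;

import org.springframework.batch.item.ItemWriter;

public class ThreadSafeWriter<T> implements ItemWriter<T> {

    @Override
    public synchronized void write(List<? extends T> items) throws Exception {
        // synchronized serializes calls arriving from the step's worker threads;
        // drop it if the target resource already handles concurrent writes.
        for (T item : items) {
            // ... write the item to the target resource (placeholder) ...
        }
    }
}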
I moved my writer logic to a Runnable class called MyWriterRunnable, and in the MyWriter class I'm manually splitting the items list into 10 batches and calling a MyWriterRunnable for each batch.
If you are trying to process data in parallel, you would need to partition your data and assign each partition to a worker step. Your partitioner can be as simple as determining what each thread should read, or it could take data already read in a previous step, break it up into evenly distributed chunks, and assign each thread's reader a chunk to process.
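For example, a minimal sketch of a Partitioner that splits an ID range across worker step executions might look like the following; the "minId"/"maxId" key names and the range arithmetic are placeholders you would adapt to your own data.

import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class IdRangePartitioner implements Partitioner {

    private final long minId;
    private final long maxId;

    public IdRangePartitioner(long minId, long maxId) {
        this.minId = minId;
        this.maxId = maxId;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        long rangeSize = (maxId - minId) / gridSize + 1;
        long start = minId;

        // Each worker step execution gets its own ID range in its execution
        // context; the worker's reader then queries only that range.
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            context.putLong("minId", start);
            context.putLong("maxId", Math.min(start + rangeSize - 1, maxId));
            partitions.put("partition" + i, context);
            start += rangeSize;
        }
        return partitions;
    }
}

The manager step would then be configured with this partitioner, a grid size, and (for local parallelism) a TaskExecutor, while the worker step's reader reads its minId/maxId bounds from the step execution context.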