Throttle calls to ItemProcessor in Spring Batch

I've created a Spring Batch job which reads from a flat file and processes the data using an ItemProcessor before writing it to the DB using an ItemWriter; everything so far works fine.
The problem: I now need to control the number of times the process method is called. My ItemProcessor calls an API with the item's details, and the API may take some time to respond (the timeout is unknown), so I must not overload it with new messages. I need to throttle the calls to the API, e.g. once X calls have been made within Y seconds, wait Z seconds before resuming.
I am not sure how to achieve this in Spring Batch. I am looking at implementing a ChunkListener in the processor to track the calls, but I am hoping for a better approach.

You do not need a listener to do this.
If you do not define an asynchronous TaskExecutor in your step, the whole processing is completely sequential.
It will read an item, process it, read the next item, and process it, until it has read and processed as many items as your commit interval defines (i.e. the size of your chunks). It then puts those items into a list and forwards the list to the writer. This cycle repeats until all elements have been read, processed, and finally written.
If you would like to process your chunks in parallel, you can define an asynchronous TaskExecutor.
If you define an AsyncTaskExecutor in your step, you can configure the number of threads that TaskExecutor manages/creates. You can also define the throttle limit of your step, which defines the number of chunks that can be processed in parallel.
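Given that single-threaded behavior, the simplest place to enforce the rate limit is inside the ItemProcessor itself. Below is a minimal sketch using Guava's RateLimiter; InputRecord, OutputRecord, and ApiClient are hypothetical stand-ins for your own types:

```java
import org.springframework.batch.item.ItemProcessor;
import com.google.common.util.concurrent.RateLimiter;

// Sketch: throttles calls to the downstream API to an average rate.
// The X-calls-in-Y-seconds budget from the question maps to X / Y permits per second.
public class ThrottledApiProcessor implements ItemProcessor<InputRecord, OutputRecord> {

    private final RateLimiter rateLimiter = RateLimiter.create(10.0); // e.g. 10 calls/sec
    private final ApiClient apiClient; // hypothetical client for the external API

    public ThrottledApiProcessor(ApiClient apiClient) {
        this.apiClient = apiClient;
    }

    @Override
    public OutputRecord process(InputRecord item) throws Exception {
        rateLimiter.acquire(); // blocks until the rate budget allows another call
        return apiClient.call(item); // hypothetical API call
    }
}
```

Because acquire() blocks the calling thread, the same idea carries over to a multithreaded step, as long as all threads share one RateLimiter instance.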

Related

Run multiple batch jobs in parallel Mule 4

I have a batch job that reads hundreds of images from an SFTP location and then encodes them into base64 and uploads them via API using HTTP connector.
I would like to make the process run quicker, so I am trying to split the payload in two via scatter-gather and then send payload1 to one batch job in a subflow and payload2 to another batch job in another subflow.
Is this the right approach?
Or is it possible to split the load within just one batch process, i.e. for one half of the payload to be processed by batch step 1 while the second half is processed by batch step 2 at the same time?
Thank you
No, it is not a good approach. Batch jobs are always executed asynchronously (i.e. using different threads), so there is no benefit in using scatter-gather, and it has the downside of increasing resource usage.
Splitting the payload in different batch steps doesn't make sense either. You should not try to scale by adding steps.
Batch jobs are naturally meant to work in parallel by iterating over an input. The job may be able to handle the splitting itself, or you can split the input payload manually beforehand; then let it handle the concurrency automatically. There are some configurations you can use to tune it, like block sizing.

How are threads of Processors invoked in Nifi flow?

I'm trying to learn to write a custom NiFi processor, and from the documentation, the processor should be thread-safe. What I want to understand is: if, say, I have 100 flow files queued as input to my custom processor, will my processor's onTrigger method (assume I haven't enabled @TriggerSerially on it) be triggered 100 times, in up to 100 separate threads (whether concurrently or not), or is there a possibility that one flow file is used as input to more than one thread's invocation of onTrigger on my processor?
I apologize if I didn't articulate the question correctly, but essentially: is it possible that the number of times my processor's onTrigger method is triggered exceeds the number of flow files queued as input to the processor?
The number of threads executing a processor is based on the number of concurrent tasks on the scheduling tab, which defaults to 1. If you increase this to 2, then 2 threads are concurrently executing the onTrigger method. A single flow file will only be processed by one of these threads.
The @TriggerSerially annotation prevents you from increasing the concurrent tasks, so it forces there to never be concurrent execution. A common use case for this would be a source processor that is pulling data from somewhere; typically you wouldn't want to pull the same data twice concurrently.
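To make the threading model concrete, here is a minimal sketch of a custom processor; each concurrent task invokes onTrigger independently, and session.get() hands each invocation at most one distinct flow file (the class and relationship names are illustrative):

```java
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class MyCustomProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .build();

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        // Each concurrent task claims its own flow file; two threads never get the same one.
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return; // queue was empty, or another concurrent task took the file
        }
        // ... per-flow-file work goes here ...
        session.transfer(flowFile, REL_SUCCESS);
    }
}
```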

Is Spring Batch's default behavior to process the next item only after the first item has finished?

After reading this article about the possibilities of scaling and parallel processing in Spring Batch, we were wondering: what is the out-of-the-box behavior of Spring Batch?
Let's say our job has a reader, 5 processing steps, and a writer.
Will Spring Batch read one item, pass it through all 5 steps, write it, and only then move on to the next item, like a giant for loop?
Or is there some parallelism, so that while item A moves on to step 2, item B is read and handed to step 1?
I think you are misunderstanding how Spring Batch works. Let me start with that, then go into parallelism.
A chunk based step in Spring Batch consists of an ItemReader, an optional ItemProcessor, then an ItemWriter. Each of these obviously supports composition (Spring Batch provides some components for using composition in both the ItemProcessor and ItemWriter phases). Within that step, Spring Batch reads items until a given condition is met (typically chunk size). Then that list is iterated over, passing each item to the ItemProcessor. Finally, a list of all of the results from the ItemProcessor calls is passed in a single call to the ItemWriter. The concept of reading once, then doing multiple steps, then writing really isn't how Spring Batch works. The closest we get would be a single ItemReader, then using composition to create a chain of ItemProcessor calls, then a single call to an ItemWriter.
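To illustrate that closest equivalent, here is a minimal sketch using Spring Batch's CompositeItemProcessor; ValidateProcessor, EnrichProcessor, and the Order type are hypothetical stand-ins for your own "steps":

```java
import java.util.Arrays;
import org.springframework.batch.item.support.CompositeItemProcessor;
import org.springframework.context.annotation.Bean;

// Sketch: two hypothetical processors chained so each item flows through
// both within the single processing phase of the step.
@Bean
public CompositeItemProcessor<Order, Order> compositeProcessor() throws Exception {
    CompositeItemProcessor<Order, Order> composite = new CompositeItemProcessor<>();
    composite.setDelegates(Arrays.asList(new ValidateProcessor(), new EnrichProcessor()));
    composite.afterPropertiesSet(); // validates the delegate list
    return composite;
}
```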
With that being said, Spring Batch provides a number of parallelism options. There are five different options for scaling Spring Batch jobs. I won't go into details about each because that's beyond the scope of this and clearly discussed in other StackOverflow questions as well as the documentation. However, the list is as follows:
Multithreaded steps - Here each chunk (block of items processed within a transaction) is executed within a different thread using Spring's TaskExecutor abstraction (a configuration sketch follows this list).
Parallel steps - Here a batch job executes multiple, independent steps in parallel, again using Spring's TaskExecutor abstraction to control the threads used.
AsyncItemProcessor/AsyncItemWriter - Here each call to the ItemProcessor is executed in its own thread. The resulting Future is passed to the AsyncItemWriter, which unwraps the Future, and the results are persisted.
Partitioning - Spring Batch allows you to partition a data set into multiple partitions that are then executed in parallel either via local threading mechanisms or remotely.
Remote chunking - The last option is to have a master reading the data, then sending it to a pool of workers for processing and writing.
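Here is the configuration sketch promised above for the multithreaded-step option, assuming existing reader, processor, and writer beans and the pre-5.0 StepBuilderFactory API (Input and Output are placeholder types):

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

// Sketch: each chunk is handed to the task executor, with at most
// four chunks in flight at once (the throttle limit).
@Bean
public Step multithreadedStep(StepBuilderFactory steps) {
    return steps.get("multithreadedStep")
            .<Input, Output>chunk(100)
            .reader(reader())       // your existing (hypothetical) beans
            .processor(processor())
            .writer(writer())
            .taskExecutor(new SimpleAsyncTaskExecutor("batch-"))
            .throttleLimit(4)
            .build();
}
```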

Parallel step execution of ItemStreamReader in SpringBatch

I have an ItemStreamReader (extending AbstractItemCountingItemStreamItemReader); the reader on its own is quite fast, but the subsequent processing takes quite some time.
From a business point of view I can process as many items in parallel as I want.
As my ItemStreamReader reads a large JSON file with a JsonParser, it ends up being stateful. So just adding a TaskExecutor to the step does not work; it throws parsing exceptions, and Spring Batch produces the following log output:
16:51:41.023 [main] WARN o.s.b.c.s.b.FaultTolerantStepBuilder - Asynchronous TaskExecutor detected with ItemStream reader. This is probably an error, and may lead to incorrect restart data being stored.
16:52:29.790 [jobLauncherTaskExecutor-1] WARN o.s.b.core.step.item.ChunkMonitor - No ItemReader set (must be concurrent step), so ignoring offset data.
16:52:31.908 [feed-import-1] WARN o.s.b.core.step.item.ChunkMonitor - ItemStream was opened in a different thread. Restart data could be compromised.
How can I make the processing in my step execute in parallel on multiple threads?
Spring Batch provides a number of ways to parallelize processing. In your case, since processing seems to be the bottleneck, I'd recommend looking at two options:
AsyncItemProcessor/AsyncItemWriter
The AsyncItemProcessor and AsyncItemWriter work in tandem to parallelize the processing of items within a chunk. You can think of them as a kind of fork/join concept. The items within the chunk are read by a single thread as normal. The AsyncItemProcessor wraps your normal ItemProcessor and executes that logic on a different thread, returning a Future instead of the actual item. The AsyncItemWriter then waits for the Future to return the processed item before writing it. These classes are found in the Spring Batch Integration module. You can read more about them in the documentation here: http://docs.spring.io/spring-batch/trunk/reference/html/springBatchIntegration.html#asynchronous-processors
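As a rough wiring sketch, under the assumption that you already have delegate ItemProcessor and ItemWriter beans (all names here are illustrative):

```java
import org.springframework.batch.integration.async.AsyncItemProcessor;
import org.springframework.batch.integration.async.AsyncItemWriter;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Bean
public AsyncItemProcessor<Input, Output> asyncItemProcessor(ItemProcessor<Input, Output> delegate) {
    AsyncItemProcessor<Input, Output> asyncProcessor = new AsyncItemProcessor<>();
    asyncProcessor.setDelegate(delegate); // your slow, single-item processor
    asyncProcessor.setTaskExecutor(new SimpleAsyncTaskExecutor("proc-"));
    return asyncProcessor;
}

@Bean
public AsyncItemWriter<Output> asyncItemWriter(ItemWriter<Output> delegate) {
    AsyncItemWriter<Output> asyncWriter = new AsyncItemWriter<>();
    asyncWriter.setDelegate(delegate); // unwraps each Future before delegating
    return asyncWriter;
}
```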
Remote Chunking
The AsyncItemProcessor/AsyncItemWriter paradigm works well in a single JVM, but if you need to scale your processing further, you may want to take a look at remote chunking. Remote chunking is designed to scale the processor piece of a step beyond a single JVM. Using a master/slave configuration, the master reads the input using a regular ItemReader. The items are then sent via Spring Integration channels to the slaves for processing. The results can either be written on the slave or returned to the master for writing. It's important to note that in this approach, each item read by the master goes over the wire, so it can be very IO intensive and should only be considered if the processing bottleneck is worse than the potential impact of sending the messages. You can read more about remote chunking in the documentation here: http://docs.spring.io/spring-batch/trunk/reference/html/springBatchIntegration.html#externalizing-batch-process-execution

Spring batch one reader multiple writers

Is it possible in Spring Batch to have one reader read the data and have that data split across multiple writers that process it in parallel?
Steps:
Reader : JdbcCursorItemReader reads 100 records
10 Parallel Writers: Each ItemWriter gets 10 records to process.
I've looked at:
CompositeItemWriter: it seems to pass all the read items to all the writers, whereas I need to split the items evenly among the writers.
BackToBackPatternClassifier: I don't really need a classifier because I'm splitting items evenly.
Is there another way of having just one reader and multiple writers?
Or can I just manually create threads in my writer?
What do you mean by "multiple writers"?
What you are trying to achieve is NOT multiple writers, but a single writer with multiple threads.
To be clear, when we talk about "multiple writers", we mean that a reader reads a chunk and different kinds of "writing" need to be done for that chunk. E.g. you may have a PlayerRecordReader which reads Player objects from somewhere, and a PlayerDbWriter and a PlayerFileWriter which write to the DB and to a file. Multiple writers are not for distributing the load.
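To make that concrete, here is a minimal sketch using Spring Batch's CompositeItemWriter with the hypothetical PlayerDbWriter and PlayerFileWriter from above; note that every delegate receives the whole chunk:

```java
import java.util.Arrays;
import org.springframework.batch.item.support.CompositeItemWriter;
import org.springframework.context.annotation.Bean;

@Bean
public CompositeItemWriter<Player> playerWriter() throws Exception {
    CompositeItemWriter<Player> writer = new CompositeItemWriter<>();
    // Each delegate sees the full chunk; this duplicates the writing, it does not split it.
    writer.setDelegates(Arrays.asList(new PlayerDbWriter(), new PlayerFileWriter()));
    writer.afterPropertiesSet(); // validates that delegates are set
    return writer;
}
```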
For the case where you want the writing to be done in parallel, what you need is a single writer (which you must, of course, make thread-safe) and a task executor in your step definition. This page in the Spring Batch reference gives clear instructions on how to do it: http://static.springsource.org/spring-batch/reference/html/scalability.html#multithreadedStep
I moved my writer logic to a Runnable class (a thread class) called MyWriterRunnable, and in the MyWriter class I manually split the items list into 10 batches and call MyWriterRunnable for each batch.
If you are trying to process data in parallel, you need to partition your data and assign each partition to a linked step. Your partitioner can be as simple as determining what each thread should read, or it could take already-read data from a previous step, break it up into evenly distributed chunks, and assign each thread's reader a chunk to process.
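As an illustration, here is a minimal sketch of a Partitioner that splits a hypothetical id range into evenly sized ranges, one per worker; the minId/maxId keys and the bounds are assumptions for the example:

```java
import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Sketch: splits [1, 1000] into gridSize contiguous ranges. Each worker
// step reads its own range using the minId/maxId values from its context.
public class RangePartitioner implements Partitioner {

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        long min = 1, max = 1000; // hypothetical id range
        long size = (max - min + 1) / gridSize;
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext ctx = new ExecutionContext();
            ctx.putLong("minId", min + i * size);
            ctx.putLong("maxId", (i == gridSize - 1) ? max : min + (i + 1) * size - 1);
            partitions.put("partition" + i, ctx);
        }
        return partitions;
    }
}
```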
