I am quite new to Spring Batch and am stuck with a problem for which I could not find a solution.
I have created a job which has a step and two flows:
Step 1:
Retrieves a list of contract numbers (for simplification, a unique number that will be used to search further records). Using an ItemReader with a single chunk, it passes a single contract number to the next step.
Flow 1:
This flow has a Step (Reader, Processor, Writer) whose Reader picks up this contract number and retrieves a list of member ids. These ids are passed in chunks (of 10) to the processor.
The processor then performs several query calls to build a participant-details list for the writer. The writer writes this data in chunks to the Workbook object.
Flow 2: Once all the data is written to the workbook, the object is sent as a file to a remote location. This is done using a tasklet that has all the details needed to send the file to the proposed destination.
Now, once this entire process is completed (Step 1 -> Flow 1 -> Flow 2), the job checks whether any more contract details need to be written to the remote location.
If yes, another contract number is retrieved from the list, which is then passed to the flows (Flow 1 and Flow 2).
Once all the contract numbers are processed, the code completes with RepeatStatus.FINISHED.
It looks something like this:
Job
-> Step 1 (retrieve Id number list but send a single contract number)
-> Flow 1
-> Reader
-> Processor
-> Writer
-> Flow 2
-> Tasklet (Send file to remote location)
(If all contract-numbers are not processed go to Step 1 and iterate to the next contract-number else finish the job)
My problems start here:
How do I jump back from Flow 2 to Step 1 based on a condition? I did find several suggestions where people add a decider, but that only lets you loop back to the previous step (in this case, if the condition is not satisfied in Flow 2, Flow 2 itself would be re-triggered). So how do I jump back from Flow 2 to Step 1?
How do I pass data between all the steps and flows throughout the job (without using the execution context)?
If you think there is a better way to do this please do suggest.
How do I jump back from flow 2 back to Step 1 based on a condition?
Use a JobExecutionDecider for that. You need to make sure that Step 1 is allowed to restart even if it is complete (parameter allowStartIfComplete=true).
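For illustration, here is a minimal sketch of that wiring, assuming Java config with JobBuilderFactory/StepBuilderFactory; the tasklet, the flow beans, the "CONTINUE" status and the execution-context key the decider checks are all hypothetical placeholders for your own components:
<pre>
// Sketch only: the tasklet, flows and the execution-context key are hypothetical placeholders.
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.job.flow.Flow;
import org.springframework.batch.core.job.flow.FlowExecutionStatus;
import org.springframework.batch.core.job.flow.JobExecutionDecider;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.context.annotation.Bean;

@Bean
public Step step1(StepBuilderFactory steps, Tasklet nextContractTasklet) {
    return steps.get("step1")
            .tasklet(nextContractTasklet)      // picks the next contract number
            .allowStartIfComplete(true)        // lets the loop re-enter this step although it is COMPLETED
            .build();
}

@Bean
public JobExecutionDecider moreContractsDecider() {
    // Illustrative signal: a key that step 1 sets while contract numbers remain;
    // the decider could just as well consult a shared bean instead of the execution context.
    return (jobExecution, stepExecution) ->
            jobExecution.getExecutionContext().containsKey("nextContractNumber")
                    ? new FlowExecutionStatus("CONTINUE")
                    : new FlowExecutionStatus("COMPLETED");
}

@Bean
public Job contractJob(JobBuilderFactory jobs, Step step1, Flow flow1, Flow flow2,
                       JobExecutionDecider moreContractsDecider) {
    return jobs.get("contractJob")
            .start(step1)
            .on("*").to(flow1)
            .from(flow1).on("*").to(flow2)
            .from(flow2).on("*").to(moreContractsDecider)
            .from(moreContractsDecider).on("CONTINUE").to(step1)   // jump back to Step 1
            .from(moreContractsDecider).on("*").end()
            .end()
            .build();
}
</pre>
The key points are the transition from the decider back to step1 and the allowStartIfComplete(true) flag; without the flag, Spring Batch will not re-run the already-COMPLETED step1 on the next pass of the loop.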
How do I pass data between all the steps and flows throughout the job (without using the execution context)?
If you don't want to share data through the execution context, you can use a shared object between steps.
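For illustration, a minimal sketch of such a shared object, assuming a simple singleton holder bean (the class and method names are made up for this example):
<pre>
// Sketch only: the class, field and method names are made up for this example.
import java.util.Collection;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import org.springframework.stereotype.Component;

@Component
public class ContractDataHolder {

    private final Queue<String> contractNumbers = new ConcurrentLinkedQueue<>();
    private volatile String currentContractNumber;

    public void addContractNumbers(Collection<String> numbers) {
        contractNumbers.addAll(numbers);
    }

    public String nextContractNumber() {
        currentContractNumber = contractNumbers.poll();
        return currentContractNumber;
    }

    public String getCurrentContractNumber() {
        return currentContractNumber;
    }

    public boolean hasMoreContracts() {
        return !contractNumbers.isEmpty();
    }
}
</pre>
Step 1 would fill the holder, the Flow 1 reader would call getCurrentContractNumber(), and a decider could call hasMoreContracts(). Note that, unlike the execution context, such a bean is not persisted in the job repository, so its contents are lost if the job crashes and is restarted.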
Related
After reading this article about the possibilities of scaling and parallel processing in Spring-Batch we were wondering, what is the out-of-the-box behavior of Spring-batch?
Let's say our job has reader, 5 steps and a writer.
Will Spring-batch read one item, pass it through all the 5 steps, write it and only then move on to the next item? Something like a giant for loop?
Or is there some parallelism, so while item A is moved on to step 2, item B is read and handled to step 1?
I think you are misunderstanding how Spring Batch works. Let me start with that, then go into parallelism.
A chunk based step in Spring Batch consists of an ItemReader, an optional ItemProcessor, then an ItemWriter. Each of these obviously supports composition (Spring Batch provides some components for using composition in both the ItemProcessor and ItemWriter phases). Within that step, Spring Batch reads items until a given condition is met (typically chunk size). Then that list is iterated over, passing each item to the ItemProcessor. Finally, a list of all of the results from the ItemProcessor calls is passed in a single call to the ItemWriter. The concept of reading once, then doing multiple steps, then writing really isn't how Spring Batch works. The closest we get would be a single ItemReader, then using composition to create a chain of ItemProcessor calls, then a single call to an ItemWriter.
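As an illustration of that last shape, here is a minimal sketch using CompositeItemProcessor; the Order item type and the injected reader/processor/writer beans are assumed to exist elsewhere:
<pre>
// Sketch only: the Order item type and the injected beans are assumed to be defined elsewhere.
import java.util.Arrays;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.support.CompositeItemProcessor;
import org.springframework.context.annotation.Bean;

@Bean
public Step chainedStep(StepBuilderFactory steps,
                        ItemReader<Order> reader,
                        ItemProcessor<Order, Order> validate,
                        ItemProcessor<Order, Order> enrich,
                        ItemWriter<Order> writer) {
    // the "multiple steps" become a chain of processors applied to each item
    CompositeItemProcessor<Order, Order> chain = new CompositeItemProcessor<>();
    chain.setDelegates(Arrays.asList(validate, enrich));

    return steps.get("chainedStep")
            .<Order, Order>chunk(10)   // read 10 items, process each, then one write call per chunk
            .reader(reader)
            .processor(chain)
            .writer(writer)
            .build();
}
</pre>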
With that being said, Spring Batch provides a number of parallelism options. There are five different options for scaling Spring Batch jobs. I won't go into details about each because that's beyond the scope of this and clearly discussed in other StackOverflow questions as well as the documentation. However, the list is as follows:
Multithreaded steps - Here each chunk (block of items processed within a transaction) is executed within a different thread using Spring's TaskExecutor abstraction (a sketch of this option follows the list).
Parallel steps - Here a batch job executes multiple, independent steps in parallel, again using Spring's TaskExecutor abstraction to control the threads used.
AsyncItemProcessor/AsyncItemWriter - Here each call to the ItemProcessor is executed in its own thread. The resulting Future is passed to the AsyncItemWriter, which unwraps the Future, and the results are persisted.
Partitioning - Spring Batch allows you to partition a data set into multiple partitions that are then executed in parallel either via local threading mechanisms or remotely.
Remote chunking - The last option is to have a master reading the data, then sending it to a pool of workers for processing and writing.
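As a small illustration of the first option, here is a minimal sketch of a multi-threaded step; the Customer type, chunk size and concurrency limit are illustrative, and the reader and writer you plug in must be thread-safe for this to be valid:
<pre>
// Sketch only: the Customer type, chunk size and concurrency limit are illustrative.
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Bean
public Step multiThreadedStep(StepBuilderFactory steps,
                              ItemReader<Customer> reader,
                              ItemProcessor<Customer, Customer> processor,
                              ItemWriter<Customer> writer) {
    SimpleAsyncTaskExecutor taskExecutor = new SimpleAsyncTaskExecutor("batch-");
    taskExecutor.setConcurrencyLimit(4);       // at most 4 chunks in flight at a time

    return steps.get("multiThreadedStep")
            .<Customer, Customer>chunk(100)
            .reader(reader)                    // the reader and writer must be thread-safe here
            .processor(processor)
            .writer(writer)
            .taskExecutor(taskExecutor)        // each chunk runs on a thread from this executor
            .build();
}
</pre>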
Need help understanding what constitutes a step and a job and how to configure them as a Spring Batch program. The scenario is:
<pre>
<datasource name="xyz">
<searchcriteria name="ab" parafield1="content_i" parafield2="supplier-1"/>
<searchcriteria name="ab" parafield1="content_i" parafield2="supplier-1"/>
<searchcriteria .../>
</datasource>
</pre>
1. Read a set of search parameters from an external XML file as above.
2. For every search criterion, steps 3 and 4 have to be performed.
3. The system has to hit a SOAP service, which might return 10K odd records.
4. For every 250 records (endpoint capacity constraint) in the 10K odd results, I have to hit another SOAP service, and the results should be written to 3 CSV files: 2 of them consolidated, plus 1 file for every record (250 files). Writing of the 2 files and the 1 CSV file can be done in parallel.
Design decisions
I cannot have one job launcher for every search, as there are capacity constraints at the source, so no parallel searches.
No DB involvement, hence no need for a metadata DB.
No restart ability required. It is always from the beginning.
Question (edited)
I would like to have an XML reader in the first step (no processor, no writer). For every search criterion read in the first step, how should I repeat step 2, where I again read (call the services) and generate CSV files (split between 2 writers)?
Using Spring Boot 2.0.4.
Thanks in advance
I've created a Spring Batch job which reads from a flat file and processes the data using an ItemProcessor before writing it to the DB using an ItemWriter; everything so far works fine.
The problem: I now need to control the number of times the process method is called. My ItemProcessor calls an API with the details, and the API takes some time to respond (I am not sure about the timeout), so I should not overload it with new messages. I need to throttle the calls to the API, e.g. if X calls are made within Y seconds, I need to wait for Z seconds before resuming.
I am not sure how to achieve this in Spring Batch. I am looking at implementing a ChunkListener in the processor to track the calls, but I am looking for a better approach.
You do not need a listener to do this.
If you do not define an asynchronous TaskExecutor in your step, then the whole processing is completely sequential.
It will read an item, process it, read the next item, process it, and so on, until it has read and processed as many items as you defined in your commit size (the size of your chunks). After that, it puts those items into a list and forwards this list to the writer. This process is repeated until all elements have been read, processed, and finally written.
If you would like to process your chunks in parallel, then you can define an asynchronous TaskExecutor.
If you define an AsyncTaskExecutor in your step, you are able to configure the number of threads this TaskExecutor manages/creates. Moreover, you can also define the throttle limit of your step, which defines the number of chunks that can be processed in parallel.
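For illustration, a minimal sketch of the parallel variant described above, with a thread pool on the step and a throttle limit bounding the number of concurrent chunks (the Message/Result types, pool size and commit size are illustrative):
<pre>
// Sketch only: the Message/Result types, pool size, commit size and throttle limit are illustrative.
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Bean
public Step apiCallStep(StepBuilderFactory steps,
                        ItemReader<Message> reader,
                        ItemProcessor<Message, Result> apiCallingProcessor,
                        ItemWriter<Result> writer) {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(4);
    executor.setMaxPoolSize(4);
    executor.initialize();

    return steps.get("apiCallStep")
            .<Message, Result>chunk(10)        // commit size: items per chunk
            .reader(reader)
            .processor(apiCallingProcessor)    // the processor that calls the external API
            .writer(writer)
            .taskExecutor(executor)
            .throttleLimit(4)                  // at most 4 chunks processed in parallel
            .build();
}
</pre>
Note that this bounds concurrency rather than enforcing an "X calls per Y seconds" rate; for a true rate limit, the processor itself would have to delay its API calls.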
After reading this answer (by Michael Minella)
Spring batch chunk processing, how does the reader work if the result set changes?
I assume that with JdbcPagingItemReader, the query is run again for each page. In this case, when reading a new page, it is possible that a new record has been inserted at a position before this page starts, causing the last record of the previous page to be processed again.
Does this mean that, in order to prevent a record from being reprocessed, I must always set a "processed already" flag manually in the input data and check it before writing?
Is this a feasible approach?
The same question applies to a JdbcCursorItemReader when the process is interrupted (power outage) and restarted. What happens if a new record has been inserted before the current index that is saved in the ExecutionContext?
Your assumptions are right.
In case of the JdbcPagingItemReader this will also depend on the transaction isolation level of your transaction (READ_COMMITTED, READ_UNCOMMITTED, ...).
In case of the JdbcCursorItemReader you have to ensure that the query returns the exact same result set (including order) in the case of a restart. Otherwise the results are unpredictable.
In the batches I write, I often save the result of the selection into a CSV file in the first step and configure the reader with saveState=false if I cannot guarantee that the selection will produce the same results in case of a crash. So if the first step fails, a restart will produce a completely new CSV file. After the first step, all the entries that need to be processed are in a file. And of course, this file cannot change, and therefore, in case of a restart, continuing from the last successful chunk is possible from the second step onward.
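For illustration, a minimal sketch of such a first-step reader with state saving turned off (the query, the ContractRow type and the reader name are illustrative):
<pre>
// Sketch only: the query, the ContractRow type and the reader name are illustrative.
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.jdbc.core.BeanPropertyRowMapper;

@Bean
public JdbcCursorItemReader<ContractRow> selectionReader(DataSource dataSource) {
    return new JdbcCursorItemReaderBuilder<ContractRow>()
            .name("selectionReader")
            .dataSource(dataSource)
            .sql("SELECT id, contract_no FROM contract ORDER BY id")
            .rowMapper(new BeanPropertyRowMapper<>(ContractRow.class))
            .saveState(false)   // no read position is stored; a restart re-runs the selection from scratch
            .build();
}
</pre>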
Edited:
Using a "state column" works well if you have a single step that does the reading (with the state column in its where clause), the processing, and the writing/updating of the state (setting the state column to 'processed'). If such a job fails, you just have to start it again as a new launch.
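A minimal sketch of that single-step state-column approach, assuming an illustrative Record bean with id, payload and processed fields (the table, SQL and names are placeholders, and a real writer would also produce the actual business output, not just flip the flag):
<pre>
// Sketch only: the table, columns and the Record bean (id, payload, processed) are placeholders.
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.jdbc.core.BeanPropertyRowMapper;

@Bean
public JdbcCursorItemReader<Record> unprocessedReader(DataSource dataSource) {
    return new JdbcCursorItemReaderBuilder<Record>()
            .name("unprocessedReader")
            .dataSource(dataSource)
            .sql("SELECT id, payload FROM record WHERE processed = false ORDER BY id")
            .rowMapper(new BeanPropertyRowMapper<>(Record.class))
            .build();
}

@Bean
public JdbcBatchItemWriter<Record> markProcessedWriter(DataSource dataSource) {
    return new JdbcBatchItemWriterBuilder<Record>()
            .dataSource(dataSource)
            .sql("UPDATE record SET processed = true WHERE id = :id")
            .beanMapped()   // binds :id from the Record bean
            .build();
}
</pre>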
I am using the Spring Batch module to read a complex file with multi-line records. The first 3 lines in the file always contain a header with a few common fields.
These common fields will be used in the processing of subsequent records in the file. The job is restartable.
Suppose the input file has 10 records (please note the number of records may not be the same as the number of lines, since records can span multiple lines).
Suppose the job runs for the first time, starts reading the file from line 1, processes the first 5 records, and fails while processing the 6th record.
During this first run, since the job has also parsed the header part (the first 3 lines of the file), the application can successfully process the first 5 records.
Now, when the failed job is restarted, it will start from the 6th record and hence will not read the header part this time. Since the application requires certain values contained in the header record, the job fails. I would like suggestions so that the restarted job always reads the header part and then starts from where it left off (the 6th record in the above scenario).
Thanks in advance.
I guess the file in question does not change between runs? Then it's not necessary to re-read it; my solution builds on this assumption.
If you use one step, you can (see the sketch after this list):
implement a LineCallbackHandler
give it access to the StepExecutionContext (it's easy with annotations, but works with interfaces too; just extend StepExecutionListenerSupport)
save the header values into the ExecutionContext
extract them from the context and use them where you want to
It should work for restart as well, because Spring Batch saves the values from the first run and provides the complete ExecutionContext for subsequent runs.
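For illustration, a minimal sketch of such a handler, assuming 3 header lines and an illustrative context key prefix; it must also be registered as a listener on the step so beforeStep() is called:
<pre>
// Sketch only: the context key prefix, resource path and line mapper are illustrative.
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.listener.StepExecutionListenerSupport;
import org.springframework.batch.item.file.LineCallbackHandler;

public class HeaderCallbackHandler extends StepExecutionListenerSupport implements LineCallbackHandler {

    private StepExecution stepExecution;
    private int headerLine = 0;

    @Override
    public void beforeStep(StepExecution stepExecution) {
        this.stepExecution = stepExecution;
    }

    @Override
    public void handleLine(String line) {
        // called once for each skipped header line; keep whatever later processing needs
        stepExecution.getExecutionContext().putString("header." + headerLine++, line);
    }
}

// Wiring (inside your reader/step configuration):
//   reader.setLinesToSkip(3);                         // the 3 header lines
//   reader.setSkippedLinesCallback(headerCallbackHandler);
//   stepBuilder.listener(headerCallbackHandler);      // so beforeStep() is invoked
</pre>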
You can make a 2-step job where:
The first step reads the first 3 lines as header information and puts everything you need into the job context (which is therefore saved in the DB for future executions if the job fails). If this step fails, the header info will be read again, and if it passes you can be sure the header info will always be in the job context.
The second step can use the same file for input, but this time you can tell it to skip the first 3 lines and read the rest as is. This way you get restartability on that step, and each time the job fails it will resume where it left off. A minimal sketch of this layout follows.
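For illustration, a minimal sketch of that two-step layout, with an assumed file path ("input.dat"), context key, Record type and multiLineRecordMapper() standing in for your own:
<pre>
// Sketch only: the file path, context key, Record type and multiLineRecordMapper() stand in for your own.
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.context.annotation.Bean;
import org.springframework.core.io.FileSystemResource;

// Step 1: read the 3 header lines and put them into the job execution context.
@Bean
public Step headerStep(StepBuilderFactory steps) {
    return steps.get("headerStep")
            .tasklet((contribution, chunkContext) -> {
                List<String> header = new ArrayList<>(
                        Files.readAllLines(Paths.get("input.dat")).subList(0, 3));
                chunkContext.getStepContext().getStepExecution()
                        .getJobExecution().getExecutionContext()
                        .put("headerLines", header);
                return RepeatStatus.FINISHED;
            })
            .build();
}

// Step 2: read the same file, but skip the header lines that step 1 already handled.
@Bean
public FlatFileItemReader<Record> recordReader() {
    FlatFileItemReader<Record> reader = new FlatFileItemReader<>();
    reader.setResource(new FileSystemResource("input.dat"));
    reader.setLinesToSkip(3);                        // header handled by headerStep
    reader.setLineMapper(multiLineRecordMapper());   // your existing multi-line mapper (assumed)
    return reader;
}
</pre>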