I have a job that processes items in chunks (of 1000). The items are marshalled into a single JSON payload and posted to a remote service as a batch (all 1000 in one HTTP POST). Sometimes the remote service bogs down and the connection times out. I set up skip for this:
return steps.get("sendData")
        .<DataRecord, DataRecord>chunk(1000)
        .reader(reader())
        .processor(processor())
        .writer(writer())
        .faultTolerant()
        .skipLimit(10)
        .skip(IOException.class)
        .build();
If a chunk fails, Spring Batch retries the chunk, but one item at a time (in order to find out which item caused the failure). In my case, though, no single item causes the failure; the entire chunk succeeds or fails as a unit and should be retried as a chunk. In fact, dropping to single-item mode makes the remote service very angry and it refuses to accept the data, and we do not control the remote service.
What's my best way out of this? I was trying to see if I could disable single-item retry mode, but I don't even fully understand where this happens. Is there a custom SkipPolicy or something that I can implement? (the methods there didn't look that helpful)
Or is there some way to have the item reader read the 1000 records but pass them to the writer as a single List (1000 input items => one output item)?
Let me walk through this in two parts. First I'll explain why it works the way it does, then I'll propose an option for addressing your issue.
Why Retry Is Item By Item
In your configuration, you've specified that the step be fault tolerant. With that, when an exception is thrown in the ItemWriter, the framework doesn't know which item caused it, so it has no way to skip/retry just that item. That's why, once the skip/retry logic begins, it goes item by item.
How To Handle Retry By The Chunk
What this comes down to is that you need to get to a chunk size of 1 for this to work. That means that instead of relying on Spring Batch to iterate over the items within a chunk for the ItemProcessor, you'll have to do it yourself. So your ItemReader would return a List<DataRecord>, your ItemProcessor would loop over that list, and your ItemWriter would take a List<List<DataRecord>>. I'd recommend creating a decorator for an ItemWriter that unwraps the outer list before passing it to the main ItemWriter, as sketched below.
This does remove the ability to do true skipping of a single item within that list but it sounds like that's ok for your use case.
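A rough sketch of that wiring, assuming Spring Batch 4.x signatures; the delegate reader/writer and the DataRecord type are the ones from your existing step:

import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;

// Reader that assembles up to 1000 records into a single "item" (a List).
public class ListItemReader implements ItemReader<List<DataRecord>> {

    private final ItemReader<DataRecord> delegate;
    private final int groupSize;

    public ListItemReader(ItemReader<DataRecord> delegate, int groupSize) {
        this.delegate = delegate;
        this.groupSize = groupSize;
    }

    @Override
    public List<DataRecord> read() throws Exception {
        List<DataRecord> group = new ArrayList<>();
        DataRecord record;
        while (group.size() < groupSize && (record = delegate.read()) != null) {
            group.add(record);
        }
        return group.isEmpty() ? null : group; // null signals end of input
    }
}

// Writer decorator that unwraps the outer list before delegating.
// With chunk size 1, "items" contains a single List<DataRecord>.
public class UnwrappingItemWriter implements ItemWriter<List<DataRecord>> {

    private final ItemWriter<DataRecord> delegate;

    public UnwrappingItemWriter(ItemWriter<DataRecord> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(List<? extends List<DataRecord>> items) throws Exception {
        List<DataRecord> flattened = new ArrayList<>();
        for (List<DataRecord> group : items) {
            flattened.addAll(group);
        }
        delegate.write(flattened); // still one HTTP POST per 1000 records
    }
}

The step would then be declared with .<List<DataRecord>, List<DataRecord>>chunk(1) so that each 1000-record list is written, skipped, or retried as a whole.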
Related
I have a Spring Batch application which is working well. It just reads from a text file and writes to an Oracle table, performing the loading in chunks. Currently I have configured a chunk size of 2000. The issue is, when I implement the skip listener for this job, Spring ignores the chunk size I have given and inserts just one record at a time into the database. The skip listener just writes the invalid record to a text file. Is this how Spring Batch works?
In a chunk, the ItemWriter will always first attempt to write the entire list of items in the chunk. However, if a skippable exception is thrown, the framework needs to figure out which item(s) caused the error.
To do this, the transaction is rolled back and then the items are retried one-by-one. This allows any item(s) that may have caused the issue to be passed to your skip listener. Unfortunately, it also removes the batch-iness of the chunk.
In general, it is preferable (and will perform better) to do upfront validation in the processor, so you can "filter" the items out rather than throwing an exception and retrying the items individually.
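As a rough illustration (the LineRecord type and the validation rule are placeholders), a processor can filter an item by returning null, which keeps the chunk intact because no exception is thrown:

import org.springframework.batch.item.ItemProcessor;

public class FilteringProcessor implements ItemProcessor<LineRecord, LineRecord> {

    @Override
    public LineRecord process(LineRecord item) {
        // Returning null "filters" the item: it is dropped from the chunk
        // without triggering the skip machinery, so the writer still gets
        // the rest of the batch in one transaction.
        if (!isValid(item)) {
            return null;
        }
        return item;
    }

    private boolean isValid(LineRecord item) {
        // placeholder check; put your real upfront validation here
        return item.getValue() != null && !item.getValue().isEmpty();
    }
}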
I have a requirement like this: I read items from a DB, if possible in a paging way, where the page size corresponds to the later "batch size". I do some processing steps, like filtering, and then I want to accumulate the items and send them to a REST service in batches, e.g. n of them at once instead of one by one.
Parallelising it on the step level is what I am doing, but I am not sure how to get the batching to work. Do I need to implement a reader that returns a list and a processor that receives a list? If so, I read that you will not have a proper count of the number of items processed.
I am trying to find the most appropriate Spring Batch way to do this instead of hacking a fix. I also assume that I need to keep state in the reader, and wondered if there is a better way not to.
You cannot have something like an aggregating processor. Every item that is read is processed as a single item.
However, you can implement a reader that groups items and forwards them as a whole group. To get an idea of how this could be done, have a look at my answer to the question Spring Batch Processor, or Dean Clark's answer here: Spring Batch - How to process multiple records at the same time in the processor?
Both use Spring Batch's SingleItemPeekableItemReader.
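For a rough idea of the shape such a grouping reader can take (the Item type and its getGroupId() key are placeholders, and if the delegate is an ItemStream it still needs to be registered as a stream on the step):

import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.support.SingleItemPeekableItemReader;

public class GroupingItemReader implements ItemReader<List<Item>> {

    private final SingleItemPeekableItemReader<Item> reader;

    public GroupingItemReader(ItemReader<Item> delegate) {
        this.reader = new SingleItemPeekableItemReader<>();
        this.reader.setDelegate(delegate);
    }

    @Override
    public List<Item> read() throws Exception {
        Item first = reader.read();
        if (first == null) {
            return null; // end of input
        }
        List<Item> group = new ArrayList<>();
        group.add(first);
        // peek() looks at the next item without consuming it, so the group
        // can be closed as soon as the key changes
        Item next;
        while ((next = reader.peek()) != null
                && next.getGroupId().equals(first.getGroupId())) {
            group.add(reader.read());
        }
        return group;
    }
}

Each List<Item> returned is then one "item" from the framework's point of view, so the writer receives whole groups and can post them to the REST service in one call.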
I am currently working with Spring Batch for the first time. I've set the commit level to 1000, which gave me better performance, but now I have trouble identifying the corrupt or exception-causing item. We need to send a mail update with the record line or item number along with the exception data.
I tried an item listener, chunk listener, step listener and job listener, but I am not able to figure out how to get that information from the execution listener context while generating the mail in the job listener. I am able to get information about the exception, but not able to track which record has the issue or the item count within the chunk.
For example, if I have 1000 lines in a file or DB and a commit level of 100, and item 165 has an issue, I need to get the line number 165 in some listener so I can attach it to the context and populate the logging info, giving a quick turnaround to fix the issue before reprocessing.
I searched but couldn't find a suggestion or idea. I believe this is a common problem whenever the chunk commit is greater than 1. Please suggest a better way to handle it.
Thanks in advance
You'll want to perform the checks that can cause an issue in the processor, and create an error item out of them which will get persisted to its own table/file. Some errors are unavoidable, and unfortunately you'll need to do manual debugging within that chunk.
Edit:
To find the commit range, you would need to preserve order. If you are using a FlatFileItemReader, it will store the line number for you if your POJO implements ItemCountAware. If running against a DB, you'll want to make sure the query preserves order with an ORDER BY on a unique index. Then you'll be able to track the chunk down by checking the read_count in the batch_step_execution table.
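As a small sketch of the flat-file case (the payload field is just a placeholder), the POJO only has to implement ItemCountAware and the FlatFileItemReader will inject the item's position:

import org.springframework.batch.item.ItemCountAware;

public class InputRecord implements ItemCountAware {

    private int itemNumber; // set by the reader; for one record per line this is the line number
    private String payload;

    @Override
    public void setItemCount(int count) {
        this.itemNumber = count;
    }

    public int getItemNumber() {
        return itemNumber;
    }

    public String getPayload() {
        return payload;
    }

    public void setPayload(String payload) {
        this.payload = payload;
    }
}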
You can enable skipping. After a chunk fails due to a skippable exception, Spring Batch processes each item of the chunk again in a separate transaction; that is how it detects which item caused the exception.
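With skipping enabled, a SkipListener hands you the failed item (and, combined with something like the ItemCountAware sketch above, its position) for your report. The listener below is only a rough outline:

import org.springframework.batch.core.SkipListener;

public class FailureReportingSkipListener implements SkipListener<InputRecord, InputRecord> {

    @Override
    public void onSkipInRead(Throwable t) {
        // the item could not even be read, so only the exception is available
    }

    @Override
    public void onSkipInProcess(InputRecord item, Throwable t) {
        record(item, t);
    }

    @Override
    public void onSkipInWrite(InputRecord item, Throwable t) {
        // called after the failed chunk has been retried item by item
        record(item, t);
    }

    private void record(InputRecord item, Throwable t) {
        // placeholder: collect this for the notification mail
        System.err.println("Skipped item " + item.getItemNumber() + ": " + t.getMessage());
    }
}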
I was trying to batch a simple file. I understand that I couldn't multi-thread it, so at least I tried to perform better by increasing the chunk parameter:
@Bean
public Step processFileStep() {
    return stepBuilderFactory.get("processSnidFileStep")
            .<MyItem, MyItem>chunk(10)
            .reader(reader())
            ....
My logic needs the processor to "filter" out invalid records,
but then I found out that the processor cannot get chunks, only one item at a time:
public interface ItemProcessor<I, O> {
O process(I item) throws Exception;
}
In my case I need to access the database and validate my record there, so for each item I have to query the DB (instead of doing it with a bunch of items together).
Can't I multi-thread or make my process perform better? What am I missing here? It will take too long to process each record one by one from a file.
thanks.
From past discussions, the CSV reader may have serious performance issues. You might be better served by writing a reader using another CSV parser.
Depending on your validation data, you might create a job-scoped filter bean that wraps a Map which can be either preloaded very quickly or lazily loaded. This way you would limit the hits on the database to either initialization or first reference (respectively), and reduce the filter time to a hash map lookup.
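A sketch of that idea, assuming the set of valid keys fits in memory and can be fetched in one query (the table, column and JdbcTemplate wiring are placeholders):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.jdbc.core.JdbcTemplate;

// Caches validation data so each item is checked against an in-memory map
// instead of hitting the database once per record.
public class ValidationCache {

    private final JdbcTemplate jdbcTemplate;
    private final Map<String, Boolean> cache = new ConcurrentHashMap<>();
    private volatile boolean loaded = false;

    public ValidationCache(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public boolean isValid(String key) {
        if (!loaded) {
            load(); // lazy load on first reference
        }
        return cache.containsKey(key);
    }

    private synchronized void load() {
        if (loaded) {
            return;
        }
        // placeholder query: preload all valid keys in one round trip
        jdbcTemplate.queryForList("SELECT id FROM valid_records", String.class)
                .forEach(id -> cache.put(id, Boolean.TRUE));
        loaded = true;
    }
}

If you declare it with @JobScope (or as a plain singleton if the data rarely changes), the processor can call isValid(...) per item and only the first call touches the database.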
In the Spring Batch chunk-oriented processing architecture, the only component where you get access to the complete chunk of records is the ItemWriter.
So if you want to do any kind of bulk processing, this is where you would typically do it: either with an ItemWriteListener#beforeWrite or by implementing your own custom ItemWriter.
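For instance, a sketch of a custom ItemWriter (using Spring Batch 4.x's List-based write signature; MyItem, getId(), the table name and the delegate writer are placeholders) that validates the whole chunk with a single IN query before delegating:

import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import org.springframework.batch.item.ItemWriter;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

public class BulkValidatingItemWriter implements ItemWriter<MyItem> {

    private final ItemWriter<MyItem> delegate;
    private final NamedParameterJdbcTemplate jdbcTemplate;

    public BulkValidatingItemWriter(ItemWriter<MyItem> delegate,
                                    NamedParameterJdbcTemplate jdbcTemplate) {
        this.delegate = delegate;
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void write(List<? extends MyItem> items) throws Exception {
        List<String> ids = items.stream().map(MyItem::getId).collect(Collectors.toList());
        // one round trip for the whole chunk instead of one query per item
        List<String> known = jdbcTemplate.queryForList(
                "SELECT id FROM reference_table WHERE id IN (:ids)",
                Collections.singletonMap("ids", ids),
                String.class);
        List<MyItem> valid = items.stream()
                .filter(item -> known.contains(item.getId()))
                .collect(Collectors.toList());
        delegate.write(valid);
    }
}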
I am writing a Spring Batch application to do the following: There is an input table (PostgreSQL DB) to which someone continually adds rows - that is basically work items being added. For each of these rows, I need to fetch more data from another DB, do some processing, and then do an output transaction which can be multiple SQL queries touching multiple tables (this needs to be one transaction for consistency reasons).
Now, the part between the input and output should be modular - it already has 3-4 logically separated things, and in the future there will be more. This flow need not be linear - what processing is done next can depend on the result of the previous step. In short, this is basically like the flow you can set up using steps inside a job.
My main problem is this: Normally a single chunk processing step has both ItemReader and ItemWriter, i.e., input to output in a single step. So, should I include all the processing steps as part of a single ItemProcessor? How would I make a single ItemProcessor a stateful workflow in itself?
The other option is to make each step a Tasklet implementation, and write two tasklets myself to behave as ItemReader and ItemWriter.
Any suggestions?
Found an answer - yes you are effectively limited to a single step. But:
1) For linear workflows, you can "chain" ItemProcessors - that is, create a composite ItemProcessor to which you provide, through applicationContext.xml, all the ItemProcessors that do the actual work. The composite ItemProcessor just runs them one by one (see the sketch after this list). This is what I'm doing right now.
2) You can always create the internal subflow as a separate Spring Batch workflow and call it through code in an ItemProcessor, similar to the composite ItemProcessor above. I might move to this in the future.
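Spring Batch already ships a CompositeItemProcessor for that chaining. A minimal Java-config sketch (the two delegate processors and the WorkItem type stand in for your own; the same bean can equally be declared in applicationContext.xml):

import java.util.Arrays;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.support.CompositeItemProcessor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ProcessorChainConfig {

    // Runs the delegates in order: each processor's output feeds the next,
    // and a null return from any delegate filters the item out entirely.
    @Bean
    public CompositeItemProcessor<WorkItem, WorkItem> compositeProcessor(
            ItemProcessor<WorkItem, WorkItem> enrichProcessor,
            ItemProcessor<WorkItem, WorkItem> transformProcessor) {
        CompositeItemProcessor<WorkItem, WorkItem> composite = new CompositeItemProcessor<>();
        composite.setDelegates(Arrays.asList(enrichProcessor, transformProcessor));
        return composite;
    }
}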