Chunk processing in Spring Batch effectively limits you to one step?

I am writing a Spring Batch application to do the following: There is an input table (PostgreSQL DB) to which someone continually adds rows - that is basically work items being added. For each of these rows, I need to fetch more data from another DB, do some processing, and then do an output transaction which can be multiple SQL queries touching multiple tables (this needs to be one transaction for consistency reasons).
Now, the part between the input and output should be modular - it already has 3-4 logically separated things, and in the future there will be more. This flow need not be linear - what processing is done next can depend on the result of the previous step. In short, this is basically like the flow you can set up using steps inside a job.
My main problem is this: Normally a single chunk processing step has both ItemReader and ItemWriter, i.e., input to output in a single step. So, should I include all the processing steps as part of a single ItemProcessor? How would I make a single ItemProcessor a stateful workflow in itself?
The other option is to make each step a Tasklet implementation, and write two tasklets myself to behave as ItemReader and ItemWriter.
Any suggestions?

Found an answer - yes, you are effectively limited to a single step. But:
1) For linear workflows, you can "chain" ItemProcessors - that is, create a composite ItemProcessor to which you provide, through applicationContext.xml, all the ItemProcessors that do the actual work. The composite ItemProcessor just runs them one by one. This is what I'm doing right now (see the sketch below).
2) You can always create the internal subflow as a separate Spring Batch job and call it through code in an ItemProcessor, similar to the composite ItemProcessor above. I might move to this in the future.
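For reference, here is a minimal Java-config sketch of option 1, equivalent to the applicationContext.xml wiring (the WorkItem type and the three delegate processor beans are placeholders for your own):

import java.util.Arrays;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.support.CompositeItemProcessor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ProcessingConfig {

    // WorkItem is a placeholder for the item type read from the input table.
    @Bean
    public ItemProcessor<WorkItem, WorkItem> compositeProcessor(
            ItemProcessor<WorkItem, WorkItem> enrichFromOtherDb,
            ItemProcessor<WorkItem, WorkItem> validate,
            ItemProcessor<WorkItem, WorkItem> transform) {

        // Runs the delegates one by one; each processor's output is the next one's input.
        CompositeItemProcessor<WorkItem, WorkItem> composite = new CompositeItemProcessor<>();
        composite.setDelegates(Arrays.asList(enrichFromOtherDb, validate, transform));
        return composite;
    }
}

Note that returning null from any delegate filters the item out of the rest of the chain, which gives you a crude form of conditional flow.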

Related

Spring JPA transactional to avoid concurrency read / update

Using Spring Boot and JPA/Hibernate, I'm looking for a solution to avoid a table record being read by another process while I'm reading and then updating an entity. The isolation levels dirty read, nonrepeatable read and phantom read are not so clear to me. I mean, if process #1 starts a read/update, I don't want process #2 to be able to read the old value (before it is updated by #1) and then update the structure with wrong values.
Isolation levels all prevent reading changes, at different levels of strictness:
Dirty read -> reading not-yet-committed changes
Nonrepeatable read -> querying the same row a second time finds changed data
Phantom read -> like the previous, but instead of changed data it finds additional rows (more here)
The Serializable level, being the strictest, prevents reading any concurrent changes at all, essentially resulting in sequential processing in the DB (no concurrency), and would probably solve your problem.
What you are looking for, if I understood correctly, is to block the second process from doing any work until the row update is complete - that is called row locking, and it can be controlled directly as well (without setting serializable isolation).
See more about row locking with Spring JPA here: https://www.baeldung.com/java-jpa-transaction-locks
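As a rough illustration (the Account entity and repository are hypothetical, and on older stacks the import is javax.persistence.LockModeType instead of jakarta), a pessimistic row lock can be declared on a Spring Data JPA repository method like this:

import java.util.Optional;

import jakarta.persistence.LockModeType;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Lock;

public interface AccountRepository extends JpaRepository<Account, Long> {

    // Issues a locking read (e.g. SELECT ... FOR UPDATE), so another transaction doing
    // the same locking read on this row blocks until the first transaction finishes.
    @Lock(LockModeType.PESSIMISTIC_WRITE)
    Optional<Account> findById(Long id);
}

Call it from a @Transactional service method in both processes; the lock is held until that transaction commits or rolls back, so process #2 only ever sees the already-updated value.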
If it weren't a different process (a different program) but just a different thread within the same Java program, a simple synchronized would do the trick as well.

Batching stores transparently

We are using the following frameworks and versions:
jOOQ 3.11.1
Spring Boot 2.3.1.RELEASE
Spring 5.2.7.RELEASE
I have an issue where some of our business logic is divided into logical units that look as follows:
Request containing a user transaction is received
This request contains various information, such as the type of transaction, which products are part of this transaction, what kind of payments were done, etc.
These attributes are then stored individually in the database.
In code, this looks approximately as follows:
TransactionRecord transaction = transactionRepository.create();
transaction.create(creationCommand);
In Transaction#create (which runs transactionally), something like the following occurs:
storeTransaction();
storePayments();
storeProducts();
// ... other relevant information
A given transaction can have many different types of products and attributes, all of which are stored. Many of these attributes result in UPDATE statements, while some may result in INSERT statements - it is difficult to fully know in advance.
For example, the storeProducts method looks approximately as follows:
products.forEach(product -> {
    ProductRecord record = productRepository.findProductByX(...);
    if (record == null) {
        record = productRepository.create();
        record.setX(...);
        record.store();
    } else {
        // do something else
    }
});
If the products are new, they are INSERTed. Otherwise, other calculations may take place. Depending on the size of the transaction, this single user transaction could obviously result in up to O(n) database calls/roundtrips, and even more depending on what other attributes are present. In transactions where a large number of attributes are present, this may result in upwards of hundreds of database calls for a single request (!). I would like to bring this down as close as possible to O(1) so as to have more predictable load on our database.
Naturally, batch and bulk inserts/updates come to mind here. What I would like to do is to batch all of these statements into a single batch using jOOQ, and execute after successful method invocation prior to commit. I have found several (SO Post, jOOQ API, jOOQ GitHub Feature Request) posts where this topic is implicitly mentioned, and one user groups post that seemed explicitly related to my issue.
Since I am using Spring together with jOOQ, I believe my ideal solution (preferably declarative) would look something like the following:
@Batched(100) // batch size as parameter, potentially
@Transactional
public void createTransaction(CreationCommand creationCommand) {
    // all inserts/updates above are added to a batch and executed on successful invocation
}
For this to work, I imagine I'd need to manage a scoped (ThreadLocal/transaction/session scoped) resource which keeps track of the current batch, such that:
1) Prior to entering the method, an empty batch is created if the method is @Batched,
2) A custom DSLContext (perhaps extending DefaultDSLContext) that is made available via DI has a ThreadLocal flag which keeps track of whether any current statements should be batched or not, and if so,
3) The calls are intercepted and added to the current batch instead of being executed immediately.
However, step 3 would necessitate rewriting a large portion of our code from the (IMO) relatively readable:
records.forEach(record -> {
    record.setX(...);
    // ...
    record.store();
});
to:
userObjects.forEach(userObject -> {
    dslContext.insertInto(...).values(userObject.getX(), ...).execute();
});
which would defeat the purpose of having this abstraction in the first place, since the second form can also be rewritten using DSLContext#batchStore or DSLContext#batchInsert. In my opinion, however, batching and bulk insertion should not be up to the individual developer and should be handled transparently at a higher level (e.g. by the framework).
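(For completeness, the explicit alternative mentioned above would look roughly like this, assuming the records are UpdatableRecords attached to a configuration:)

import java.util.Collection;

import org.jooq.DSLContext;
import org.jooq.UpdatableRecord;

class ExplicitBatching {

    // One batch of statements for all pending record stores, instead of one execution per record.
    static void storeAll(DSLContext dslContext, Collection<? extends UpdatableRecord<?>> records) {
        dslContext.batchStore(records).execute();
    }
}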
I find the readability of the jOOQ API to be an amazing benefit of using it; however, it does not seem (as far as I can tell) to lend itself very well to interception/extension for cases such as these. Is it possible, with the jOOQ 3.11.1 (or even the current) API, to get behaviour similar to the former with transparent batch/bulk handling? What would this entail?
EDIT:
One possible but extremely hacky solution that comes to mind for enabling transparent batching of stores would be something like the following:
Create a RecordListener and add it as a default to the Configuration whenever batching is enabled.
In RecordListener#storeStart, add the query to the current Transaction's batch (e.g. in a ThreadLocal<List>)
The AbstractRecord has a changed flag which is checked (org.jooq.impl.UpdatableRecordImpl#store0, org.jooq.impl.TableRecordImpl#addChangedValues) prior to storing. Resetting this (and saving it for later use) makes the store operation a no-op.
Lastly, upon successful method invocation but prior to commit:
Reset the changes flags of the respective records to the correct values
Invoke org.jooq.UpdatableRecord#store, this time without the RecordListener or while skipping the storeStart method (perhaps using another ThreadLocal flag to check whether batching has already been performed).
As far as I can tell, this approach should work, in theory. Obviously, it's extremely hacky and prone to breaking as the library internals may change at any time if the code depends on Reflection to work.
Does anyone know of a better way, using only the public jOOQ API?
jOOQ 3.14 solution
You've already discovered the relevant feature request #3419, which will solve this on the JDBC level starting from jOOQ 3.14. You can either use the BatchedConnection directly, wrapping your own connection to implement the below, or use this API:
ctx.batched(c -> {
    // Make sure all records are attached to c, not ctx, e.g. by fetching from c.dsl()
    records.forEach(record -> {
        record.setX(...);
        // ...
        record.store();
    });
});
jOOQ 3.13 and before solution
For the time being, until #3419 is implemented (it will be, in jOOQ 3.14), you can implement this yourself as a workaround. You'd have to proxy a JDBC Connection and PreparedStatement (a rough sketch follows after the list) and ...
... intercept all:
Calls to Connection.prepareStatement(String), returning a cached proxy statement if the SQL string is the same as for the last prepared statement, or batch executing the last prepared statement and creating a new one.
Calls to PreparedStatement.executeUpdate() and execute(), replacing those with calls to PreparedStatement.addBatch()
... delegate all:
Calls to other API, such as e.g. Connection.createStatement(), which should flush the above buffered batches, and then call the delegate API instead.
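For orientation only, here is a very rough, plain-delegation sketch of that interception (class name and structure are illustrative; it deliberately does not implement the full java.sql.Connection interface, and it only looks at the SQL text, not at bound parameters):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchingConnection {

    private final Connection delegate;
    private String lastSql;
    private PreparedStatement lastStatement;

    public BatchingConnection(Connection delegate) {
        this.delegate = delegate;
    }

    // Reuse the statement while the SQL text stays the same; otherwise flush the pending
    // batch and prepare a new statement. A full implementation would also wrap the returned
    // statement so that execute()/executeUpdate() are replaced by addBatch().
    public PreparedStatement prepareStatement(String sql) throws SQLException {
        if (!sql.equals(lastSql)) {
            flush();
            lastSql = sql;
            lastStatement = delegate.prepareStatement(sql);
        }
        return lastStatement;
    }

    // Any other Connection call (createStatement(), commit(), close(), ...) should first
    // flush the buffered batch and then delegate to the wrapped connection.
    public void flush() throws SQLException {
        if (lastStatement != null) {
            lastStatement.executeBatch();
            lastStatement.close();
            lastStatement = null;
            lastSql = null;
        }
    }
}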
I wouldn't recommend hacking your way around jOOQ's RecordListener and other SPIs, I think that's the wrong abstraction level to buffer database interactions. Also, you will want to batch other statement types as well.
Do note that by default, jOOQ's UpdatableRecord tries to fetch generated identity values (see Settings.returnIdentityOnUpdatableRecord), which is something that prevents batching. Such store() calls must be executed immediately, because you might expect the identity value to be available.
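If the generated identity is not actually needed after store(), that fetching can presumably be turned off via the Settings flag mentioned above so that stores stay batchable (the with-method name is derived from that setting; verify it against your jOOQ version):

import java.sql.Connection;

import org.jooq.DSLContext;
import org.jooq.SQLDialect;
import org.jooq.conf.Settings;
import org.jooq.impl.DSL;

public class BatchFriendlyContext {

    // Builds a DSLContext whose UpdatableRecord.store() calls do not try to fetch generated
    // identities, which would otherwise force each store to execute immediately.
    public static DSLContext create(Connection connection) {
        Settings settings = new Settings().withReturnIdentityOnUpdatableRecord(false);
        return DSL.using(connection, SQLDialect.POSTGRES, settings);
    }
}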

Avoid processing same file twice in Spring Batch

I have to write a Spring Batch job as follows:
Step 1: Load an XML file from the file system and write its contents to a database staging table
Step 2: Call Oracle PL/SQL procedure to process the staging table.
(Comments on that job structure are welcome, but not the question).
In Step 1, I want to move the XML file to another directory after I have loaded it. I want this, as much as possible, to be "transactional" with the write to the staging table. That is, either both the writes to staging and the file move succeed, or neither does.
I feel this is necessary because (A) if the staging writes happen but the file does not move, the next run will pick up the file again and process it again, and (B) if the file gets moved but the staging writes do not happen, then we will have missed that file's processing.
This interface's requirements are all about robustness. I know I could just put a step execution listener to move all the files at the end, but I want the approach that is going to guarantee that we never miss processing data and never process the same file twice.
Part of the difficulty is that I am using a MultiResourceItemReader. I read that ChunkListener.beforeChunk() happens as part of the chunk transaction, so I tried to make a custom chunk CompletionPolicy to force chunks to complete after each change of resource (file) name, but I could not get it to work. In any case, I would have needed an afterChunk() listener, which is not part of the transaction anyway.
I'll take any guidance on my specific questions or an expert explanation of how to robustly process files in Spring Batch (which I am only just learning). Thanks!
I have a pretty similar Spring Batch process right now.
Spring Batch fits your requirement well.
I would recommend starting with Spring Integration here.
In Spring Integration you can configure monitoring of your folder and have it trigger the batch job. There is a good example in the official documentation.
Then you should use a powerful concept of Spring Batch - identifying parameters. A Spring Batch job runs with unique parameters, and if you mark a parameter as identifying, then no other job instance can be spawned with the same parameters (though you can restart your original job).
/**
 * Add a new String parameter for the given key.
 *
 * @param key - parameter accessor.
 * @param parameter - runtime parameter
 * @param identifying - indicates if the parameter is used as part of identifying a job instance
 * @return a reference to this object.
 */
public JobParametersBuilder addString(String key, String parameter, boolean identifying) {
    parameterMap.put(key, new JobParameter(parameter, identifying));
    return this;
}
So here you need to ask yourself: what is the uniquely identifying constraint for your batch job? I would suggest the full file path. But then you need to be sure that nobody provides different files with the same filename.
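For illustration (bean and parameter names are made up), launching the job with the full file path as an identifying parameter could look roughly like this:

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class StagingJobLauncher {

    private final JobLauncher jobLauncher;
    private final Job stagingJob;

    public StagingJobLauncher(JobLauncher jobLauncher, Job stagingJob) {
        this.jobLauncher = jobLauncher;
        this.stagingJob = stagingJob;
    }

    public void launchFor(String absoluteFilePath) throws Exception {
        JobParameters params = new JobParametersBuilder()
                .addString("input.file.path", absoluteFilePath, true) // identifying = true
                .toJobParameters();
        // A second launch with the same path is rejected as a duplicate JobInstance,
        // unless the first execution failed and is being restarted.
        jobLauncher.run(stagingJob, params);
    }
}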
Spring Integration can also detect whether a file has already been seen by the application and ignore it. Please check the documentation on AcceptOnceFileListFilter.
If you want guaranteed 'transactional-like' logic in the batch, then don't put it into listeners; create a specific step which will move the file. Listeners are good for supplemental logic.
That way, if this step fails for any reason, you will still be able to fix the issue and retry the job.
This kind of process can easily be done with a job with 2 steps and 1 listener:
A standard (read from XML -> process? -> write to DB) step; you don't need to care about restartability because SB is smart enough to avoid re-reading data
a listener attached to step 1 to move the file after successful step execution (example 1, example 2 or example 3)
A second step with data processing
#3 may be inserted as step 1's processing phase

Aggregating processor or aggregating reader

I have a requirement like this: I read items from a DB, if possible in a paging way where the items represent the later "batch size"; I do some processing steps, like filtering, etc.; then I want to accumulate the items and send them to a REST service in batches, e.g. n of them at once instead of one by one.
Parallelising it at the step level is what I am doing, but I am not sure how to get the batching to work. Do I need to implement a reader that returns a list and a processor that receives a list? If so, I have read that you will not get a proper count of the number of items processed.
I am trying to find a way to do it in the most appropriate Spring Batch way instead of hacking a fix. I also assume that I need to keep state in the reader, and I wonder if there is a better way to avoid that.
You cannot have something like an aggregating processor. Every single item that is read is processed as a single item.
However, you can implement a reader that groups items and forwards them as a whole group (a rough sketch follows below). To get an idea of how this could be done, have a look at my answer to this question, Spring Batch Processor, or Dean Clark's answer here, Spring Batch - How to process multiple records at the same time in the processor?
Both use Spring Batch's SingleItemPeekableItemReader.
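To give an idea of the shape of such a reader, here is a simplified fixed-size variant (class name and grouping policy are illustrative; the linked answers use peek() to close a group on a key change instead, and restart state is not handled here, so a stateful delegate still needs to be registered as a stream on the step):

import java.util.ArrayList;
import java.util.List;

import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.support.SingleItemPeekableItemReader;

public class AggregatingItemReader<T> implements ItemReader<List<T>> {

    private final SingleItemPeekableItemReader<T> delegate;
    private final int groupSize;

    public AggregatingItemReader(ItemReader<T> reader, int groupSize) {
        this.delegate = new SingleItemPeekableItemReader<>();
        this.delegate.setDelegate(reader);
        this.groupSize = groupSize;
    }

    @Override
    public List<T> read() throws Exception {
        // Each "item" handed to the processor/writer is now a whole group of source items.
        List<T> group = new ArrayList<>(groupSize);
        while (group.size() < groupSize && delegate.peek() != null) {
            group.add(delegate.read());
        }
        return group.isEmpty() ? null : group; // null signals the end of input
    }
}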

My concerns about Spring Batch: you can't actually multi-thread or read in chunks while reading items

I was trying to batch a simple file. I understand that I couldn't multi-thread it, so at least I tried to perform better by increasing the chunk parameter:
@Bean
public Step processFileStep() {
    return stepBuilderFactory.get("processSnidFileStep")
            .<MyItem, MyItem>chunk(10)
            .reader(reader())
            ....
My logic needs the processor to 'filter' out invalid records, but then I found out that the processor cannot get chunks - only one item at a time:
public interface ItemProcessor<I, O> {
    O process(I item) throws Exception;
}
In my case I need to access the database and validate my record there, so for each item I have to query the DB (instead of doing it with a bunch of items together).
Can't I multi-thread or make my process perform better? What am I missing here? It will take too long to process each record from the file one by one.
Thanks.
From past discussions, the CSV reader may have serious performance issues. You might be better served by writing a reader using another CSV parser.
Depending on your validation data, you might create a job-scoped filter bean that wraps a Map that can be either preloaded very quickly or lazy loaded. This way you would limit the hits on the database to either initialization or first reference (respectively), and reduce the filter time to an in-memory map lookup (a sketch follows below).
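A possible sketch of that idea (table, column and bean names are made up): a job-scoped cache loaded once per job execution, which the processor then consults instead of querying the DB per item.

import java.util.HashSet;
import java.util.Set;

import javax.sql.DataSource;

import org.springframework.batch.core.configuration.annotation.JobScope;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.JdbcTemplate;

@Configuration
public class ValidationCacheConfig {

    // Simple holder so the cache can be injected into the ItemProcessor as its own type.
    public static class ValidKeyCache {
        private final Set<String> keys;

        public ValidKeyCache(Set<String> keys) {
            this.keys = keys;
        }

        public boolean contains(String key) {
            return keys.contains(key);
        }
    }

    // Loaded once per job execution; the processor then filters with an in-memory lookup.
    @Bean
    @JobScope
    public ValidKeyCache validKeyCache(DataSource dataSource) {
        JdbcTemplate jdbc = new JdbcTemplate(dataSource);
        Set<String> keys = new HashSet<>(
                jdbc.queryForList("SELECT record_key FROM valid_records", String.class));
        return new ValidKeyCache(keys);
    }
}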
In the Spring Batch chunk-oriented processing architecture, the only component where you get access to the complete chunk of records is the ItemWriter.
So if you want to do any kind of bulk processing, this is where you would typically do it - either with ItemWriteListener#beforeWrite or by implementing your own custom ItemWriter (a sketch follows below).
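For example, a custom ItemWriter receives the whole chunk and can validate it with a single query (MyItemDao and its bulk methods are hypothetical; also note that in Spring Batch 5 the write method takes a Chunk<? extends T> rather than a List):

import java.util.List;

import org.springframework.batch.item.ItemWriter;

public class ValidatingBulkWriter implements ItemWriter<MyItem> {

    // Hypothetical DAO offering bulk operations (e.g. one IN-clause query per chunk).
    private final MyItemDao dao;

    public ValidatingBulkWriter(MyItemDao dao) {
        this.dao = dao;
    }

    @Override
    public void write(List<? extends MyItem> items) throws Exception {
        // One validation query and one bulk insert per chunk, instead of one call per item.
        List<MyItem> valid = dao.filterValid(items);
        dao.insertAll(valid);
    }
}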
