Avoid processing the same file twice in Spring Batch

I have to write a Spring Batch job as follows:
Step 1: Load an XML file from the file system and write its contents to a database staging table
Step 2: Call Oracle PL/SQL procedure to process the staging table.
(Comments on that job structure are welcome, but not the question).
In Step 1, I want to move the XML file to another directory after I have loaded it. I want this, as much as possible, to be "transactional" with the write to the staging table. That is, either both the writes to staging and the file move succeed, or neither does.
I feel this necessary because if (A) the staging writes happen but the file does not move, the next run will pick up the file again and process it again and (B) if the file gets moved but the staging writes do not happen, then we will have missed that file's processing.
This interface's requirements are all about robustness. I know I could just put a step execution listener to move all the files at the end, but I want the approach that is going to guarantee that we never miss processing data and never process the same file twice.
Part of the difficulty is that I am using a MultiResourceItemReader. I read that ChunkListener.beforeChunk() happens as part of the chunk transaction, so I tried to make a custom chunk CompletionPolicy to force chunks to complete after each change of resource (file) name, but I could not get it to work. In any case, I would have needed an afterChunk() listener, which is not part of the transaction anyway.
I'll take any guidance on my specific questions or an expert explanation of how to robustly process files in Spring Batch (which I am only just learning). Thanks!

I have a pretty similar Spring Batch process right now, and Spring Batch fits your requirement well.
I would recommend using Spring Integration here.
With Spring Integration you can monitor your folder and have it trigger the batch job. There is a good example of this in the official documentation.
Then you should use a powerful Spring Batch concept: identifying parameters. A Spring Batch job runs with unique parameters, and if you mark a parameter as identifying, then no other job instance can be spawned with the same parameter (though you can restart the original job).
/**
 * Add a new String parameter for the given key.
 *
 * @param key - parameter accessor.
 * @param parameter - runtime parameter
 * @param identifying - indicates if the parameter is used as part of identifying a job instance
 * @return a reference to this object.
 */
public JobParametersBuilder addString(String key, String parameter, boolean identifying) {
    parameterMap.put(key, new JobParameter(parameter, identifying));
    return this;
}
So here you need to ask yourself: what uniquely identifies your batch job? I would suggest it's the full file path. But then you need to be sure that nobody provides different files with the same filename.
Spring Integration can also detect whether a file has already been seen by the application and ignore it. Please check the documentation on AcceptOnceFileListFilter.
If you want guaranteed 'transactional-like' logic in the batch, then don't put it into listeners; create a specific step that moves the file. Listeners are good for supplemental logic.
That way, if this step fails for any reason, you will still be able to fix the issue and retry the job.
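For a dedicated move step, the core logic is small. Below is a minimal sketch of just that logic, assuming it would be called from inside a Tasklet (framework wiring omitted); it uses only java.nio.file, and the class and directory names are placeholders:

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class FileMover {

    // Moves a processed file into an archive directory. If the move fails,
    // the step fails, and the job can be restarted to retry it.
    public static Path moveToArchive(Path source, Path archiveDir) throws IOException {
        Files.createDirectories(archiveDir);
        Path target = archiveDir.resolve(source.getFileName());
        try {
            // Atomic where the file system supports it, so a crash cannot
            // leave a half-copied file behind.
            return Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            return Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

Because the move is a step of its own, a failure here leaves the staging data committed and the file in place, and a job restart simply retries the move.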

This kind of process can easily be done with a job of two steps and one listener:
1. A standard (read from XML -> process? -> write to DB) step; you don't need to care about restartability, because Spring Batch is smart enough to avoid re-reading data.
2. A listener attached to step 1 that moves the file after successful step execution (example 1, example 2 or example 3).
3. A second step that does the data processing.
Item 3 may also be folded into step 1 as its process phase.

Related

Performing two read operations in spring batch

I need to read from a DB, and based on that result I need to fetch data from another DB on another server, and afterwards write the result to a file. The solution that came to mind is to use a Spring Batch reader to read from the first DB, and to read from the second DB in the processor.
But my feeling is that reading in the processor is not a good idea, because the processor handles a single item at a time. (Please correct me if I am wrong.)
Is there any other way to do this so that we can perform this task efficiently?
Thanks in advance.
Please suggest what my options could be.

Batching stores transparently

We are using the following frameworks and versions:
jOOQ 3.11.1
Spring Boot 2.3.1.RELEASE
Spring 5.2.7.RELEASE
I have an issue where some of our business logic is divided into logical units that look as follows:
Request containing a user transaction is received
This request contains various information, such as the type of transaction, which products are part of this transaction, what kind of payments were done, etc.
These attributes are then stored individually in the database.
In code, this looks approximately as follows:
TransactionRecord transaction = transactionRepository.create();
transaction.create(creationCommand);
In Transaction#create (which runs transactionally), something like the following occurs:
storeTransaction();
storePayments();
storeProducts();
// ... other relevant information
A given transaction can have many different types of products and attributes, all of which are stored. Many of these attributes result in UPDATE statements, while some may result in INSERT statements - it is difficult to fully know in advance.
For example, the storeProducts method looks approximately as follows:
products.forEach(product -> {
ProductRecord record = productRepository.findProductByX(...);
if (record == null) {
record = productRepository.create();
record.setX(...);
record.store();
} else {
// do something else
}
});
If the products are new, they are INSERTed. Otherwise, other calculations may take place. Depending on the size of the transaction, this single user transaction could obviously result in up to O(n) database calls/roundtrips, and even more depending on what other attributes are present. In transactions where a large number of attributes are present, this may result in upwards of hundreds of database calls for a single request (!). I would like to bring this down as close as possible to O(1) so as to have more predictable load on our database.
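To make the roundtrip arithmetic concrete, here is a toy sketch (plain Java, no real database; the "repository" below is a stand-in that merely counts calls, not actual repository code) contrasting per-item stores with a single batched flush:

```java
import java.util.ArrayList;
import java.util.List;

public class RoundtripDemo {

    // Stand-in for a repository: each execute() models one database roundtrip.
    static class CountingRepo {
        int roundtrips = 0;

        void execute(List<String> statements) {
            roundtrips++;
        }
    }

    // Naive approach: one roundtrip per product -> O(n) calls.
    static int storeEach(CountingRepo repo, List<String> products) {
        for (String p : products) {
            repo.execute(List.of("INSERT " + p));
        }
        return repo.roundtrips;
    }

    // Batched approach: buffer all statements, flush once -> O(1) calls.
    static int storeBatched(CountingRepo repo, List<String> products) {
        List<String> batch = new ArrayList<>();
        for (String p : products) {
            batch.add("INSERT " + p);
        }
        repo.execute(batch);
        return repo.roundtrips;
    }
}
```

The goal described below is exactly this collapse from n roundtrips to one, but applied transparently rather than by hand-rolling batches.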
Naturally, batch and bulk inserts/updates come to mind here. What I would like to do is to batch all of these statements into a single batch using jOOQ, and execute after successful method invocation prior to commit. I have found several (SO Post, jOOQ API, jOOQ GitHub Feature Request) posts where this topic is implicitly mentioned, and one user groups post that seemed explicitly related to my issue.
Since I am using Spring together with jOOQ, I believe my ideal solution (preferably declarative) would look something like the following:
@Batched(100) // batch size as parameter, potentially
@Transactional
public void createTransaction(CreationCommand creationCommand) {
    // all inserts/updates above are added to a batch and executed on successful invocation
}
For this to work, I imagine I'd need to manage a scoped (ThreadLocal/transactional/session scope) resource which can keep track of the current batch such that:
1. Prior to entering the method, an empty batch is created if the method is @Batched,
2. A custom DSLContext (perhaps extending DefaultDSLContext) that is made available via DI has a ThreadLocal flag which keeps track of whether any current statements should be batched or not, and if so,
3. The calls are intercepted and added to the current batch instead of being executed immediately.
However, step 3 would necessitate having to rewrite a large portion of our code from the (IMO) relatively readable:
records.forEach(record -> {
    record.setX(...);
    // ...
    record.store();
});
to:
userObjects.forEach(userObject -> {
    dslContext.insertInto(...).values(userObject.getX(), ...).execute();
});
which would defeat the purpose of having this abstraction in the first place, since the second form can just as well be rewritten using DSLContext#batchStore or DSLContext#batchInsert. In my opinion, however, batching and bulk insertion should not be up to the individual developer and should be handled transparently at a higher level (e.g. by the framework).
I find the readability of the jOOQ API to be an amazing benefit of using it; however, it seems that it does not lend itself (as far as I can tell) to interception/extension very well for cases such as these. Is it possible, with the jOOQ 3.11.1 (or even current) API, to get behaviour similar to the former with transparent batch/bulk handling? What would this entail?
EDIT:
One possible but extremely hacky solution that comes to mind for enabling transparent batching of stores would be something like the following:
Create a RecordListener and add it as a default to the Configuration whenever batching is enabled.
In RecordListener#storeStart, add the query to the current Transaction's batch (e.g. in a ThreadLocal<List>)
The AbstractRecord has a changed flag which is checked (org.jooq.impl.UpdatableRecordImpl#store0, org.jooq.impl.TableRecordImpl#addChangedValues) prior to storing. Resetting this (and saving it for later use) makes the store operation a no-op.
Lastly, upon successful method invocation but prior to commit:
Reset the changes flags of the respective records to the correct values
Invoke org.jooq.UpdatableRecord#store, this time without the RecordListener or while skipping the storeStart method (perhaps using another ThreadLocal flag to check whether batching has already been performed).
As far as I can tell, this approach should work, in theory. Obviously, it's extremely hacky and prone to breaking as the library internals may change at any time if the code depends on Reflection to work.
Does anyone know of a better way, using only the public jOOQ API?
jOOQ 3.14 solution
You've already discovered the relevant feature request #3419, which will solve this on the JDBC level starting from jOOQ 3.14. You can either use the BatchedConnection directly, wrapping your own connection to implement the below, or use this API:
ctx.batched(c -> {
    // Make sure all records are attached to c, not ctx, e.g. by fetching from c.dsl()
    records.forEach(record -> {
        record.setX(...);
        // ...
        record.store();
    });
});
jOOQ 3.13 and before solution
For the time being, until #3419 is implemented (it will be, in jOOQ 3.14), you can implement this yourself as a workaround. You'd have to proxy a JDBC Connection and PreparedStatement and ...
... intercept all:
Calls to Connection.prepareStatement(String), returning a cached proxy statement if the SQL string is the same as for the last prepared statement, or batch execute the last prepared statement and create a new one.
Calls to PreparedStatement.executeUpdate() and execute(), and replace those by calls to PreparedStatement.addBatch()
... delegate all:
Calls to other API, such as e.g. Connection.createStatement(), which should flush the above buffered batches, and then call the delegate API instead.
I wouldn't recommend hacking your way around jOOQ's RecordListener and other SPIs, I think that's the wrong abstraction level to buffer database interactions. Also, you will want to batch other statement types as well.
Do note that by default, jOOQ's UpdatableRecord tries to fetch generated identity values (see Settings.returnIdentityOnUpdatableRecord), which is something that prevents batching. Such store() calls must be executed immediately, because you might expect the identity value to be available.
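If you control the Configuration, that identity fetching can be switched off via the setting named in the previous paragraph, so that store() calls remain batchable. A configuration fragment (not a complete program; requires jOOQ on the classpath, and the connection/dialect values are placeholders):

```java
Settings settings = new Settings()
    .withReturnIdentityOnUpdatableRecord(false); // skip fetching generated IDs after store()

DSLContext ctx = DSL.using(connection, SQLDialect.POSTGRES, settings);
```

The trade-off is that generated identity values are no longer available on the record right after store(), which is precisely what makes deferred, batched execution possible.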

Need explanation on internal working of read_table method in pyarrow.parquet

I stored all the required parquet tables in a Hadoop Filesystem, and all these files have a unique path for identification. These paths are pushed into a RabbitMQ queue as JSON and consumed by the consumer (in CherryPy) for processing. After successful consumption, the first path is sent for reading, and the following paths will be read once the preceding read processes are done. Now, to read a specific table I am using the following line of code,
data_table = parquet.read_table(path_to_the_file)
Let's say I have five read tasks in the message. The first read process is carried out successfully, and before the other read tasks are performed, I manually stop my server. This stop does not send a successful-execution acknowledgement to the queue, as four read processes remain. Once I restart the server, the whole consumption and reading process starts again from the initial stage. And now, when the read_table method is called on the first path, it gets completely stuck.
Digging into the workflow of the read_table method, I found out where it actually gets stuck, but a further explanation of how this method reads a file inside a Hadoop filesystem is required.
path = 'hdfs://173.21.3.116:9000/tempDir/test_dataset.parquet'
data_table = parquet.read_table(path)
Can somebody please give me a picture of the internal implementation that happens after calling this method, so that I can find where the issue actually occurs and a solution to it?

read and validate in spring batch

I want to read multiple XML files in Spring Batch, and before reading I should validate the name of each file and put it in the context. How can I proceed?
Is it possible to have this scenario using a tasklet and a reader/processor/writer? For example:
folder : file1.xml file2.xml file3.xml
validate filename (file1.xml) -- read file1 -- process -- write
then
validate filename (file2.xml) -- read file2 -- write
validate filename (file3.xml) -- read file3 -- write
......
Or is there any other way?
There are three approaches you can take with this. Each has its benefits and weaknesses.
Use a step to validate
You can set your job up with a validateFileName step that precedes the step that processes the files (say processFiles). The validateFileName step would do any validations on the file name needed, then provide the files to process to the next step. How it communicates this could be as simple as moving the valid files to a new directory or as complex as using the job's ExecutionContext to store the names of the files to process.
The advantage of this is that it decouples validation from processing. The disadvantage is that it makes the job slightly more complex given that you'd have an extra step.
Use a StepExecutionListener to do the validation
You could use a StepExecutionListener#beforeStep() call to do the validation. The same concepts apply as before with regard to how to communicate what validates and what doesn't.
This may be a less complex option, but it more tightly couples (albeit marginally) the processing and validation.
Use an ItemReader that validates before it reads
This last option is to write an ItemReader implementation that is similar to the MultiResourceItemReader but provides a hook into validating the file before reading it. If the file doesn't validate, you would skip it.
This option again couples validation with the processing, but may provide a nice reusable abstraction for this particular use case.
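Whichever approach you pick, the validation itself can be a small, reusable check. A sketch in plain Java (the naming pattern file1.xml, file2.xml, ... is an assumption taken from the example file names in the question; adjust the regex to your real convention):

```java
import java.util.regex.Pattern;

public class FileNameValidator {

    // Matches names like file1.xml, file42.xml. The pattern is an assumption
    // based on the example file names in the question.
    private static final Pattern VALID = Pattern.compile("file\\d+\\.xml");

    public static boolean isValid(String fileName) {
        return fileName != null && VALID.matcher(fileName).matches();
    }
}
```

A check like this can be called from a validation step, a StepExecutionListener, or a validating ItemReader without duplicating the rule in each place.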
I hope this helps!

Chunk processing in Spring Batch effectively limits you to one step?

I am writing a Spring Batch application to do the following: There is an input table (PostgreSQL DB) to which someone continually adds rows - that is basically work items being added. For each of these rows, I need to fetch more data from another DB, do some processing, and then do an output transaction which can be multiple SQL queries touching multiple tables (this needs to be one transaction for consistency reasons).
Now, the part between the input and output should be modular - it already has 3-4 logically separated things, and in the future there would be more. This flow need not be linear - what processing is done next can depend on the result of the previous step. In short, this is basically like the flow you can set up using steps inside a job.
My main problem is this: Normally a single chunk processing step has both ItemReader and ItemWriter, i.e., input to output in a single step. So, should I include all the processing steps as part of a single ItemProcessor? How would I make a single ItemProcessor a stateful workflow in itself?
The other option is to make each step a Tasklet implementation, and write two tasklets myself to behave as ItemReader and ItemWriter.
Any suggestions?
Found an answer - yes you are effectively limited to a single step. But:
1) For linear workflows, you can "chain" ItemProcessors - that is, create a composite ItemProcessor to which you provide, through applicationContext.xml, all the ItemProcessors that do the actual work. The composite ItemProcessor just runs them one by one. This is what I'm doing right now.
2) You can always create the internal subflow as a separate Spring Batch workflow and call it through code in an ItemProcessor, similar to the composite ItemProcessor above. I might move to this in the future.
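The chaining idea in 1) is easy to sketch without any framework classes: a composite processor simply applies a list of delegates in order (Spring Batch's CompositeItemProcessor does essentially this; the plain-Java version below uses UnaryOperator as a stand-in for ItemProcessor):

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class CompositeProcessor<T> implements UnaryOperator<T> {

    private final List<UnaryOperator<T>> delegates;

    public CompositeProcessor(List<UnaryOperator<T>> delegates) {
        this.delegates = delegates;
    }

    @Override
    public T apply(T item) {
        T current = item;
        for (UnaryOperator<T> delegate : delegates) {
            // A null result short-circuits the chain, mirroring how a Spring
            // Batch ItemProcessor returning null filters the item out.
            if (current == null) {
                return null;
            }
            current = delegate.apply(current);
        }
        return current;
    }
}
```

Each delegate stays small and testable on its own, and the overall order of processing is declared in one place.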
