We have a Spring Batch job which reads a bunch of data in the reader, processes it, and writes it. It all happens as a batch.
I noticed that the processor and writer are going over the same data twice, once as a batch and once as individual records.
For example: the reader reads 1000 records, sends 1000 records to the processor, which sends 1000 records to the writer.
After this, the records get processed again, individually, but only the processor and writer are being called.
We have log statements in the reader, processor, and writer, and I can see the logs.
Is there any condition which can make the records be processed individually after they have been processed as a list?
<batch:job id="feeder-job">
<batch:step id="regular">
<tasklet>
<chunk reader="feeder-reader" processor="feeder-processor"
writer="feeder-composite-writer" commit-interval="#{stepExecutionContext['query-fetch-size']}"
skip-limit="1000000">
<skippable-exception-classes>
<include class="java.lang.Exception" />
<exclude class="org.apache.solr.client.solrj.SolrServerException"/>
<exclude class="org.apache.solr.common.SolrException"/>
<exclude class="com.batch.feeder.record.RecordFinderException"/>
</skippable-exception-classes>
</chunk>
<listeners>
<listener ref="feeder-reader" />
</listeners>
</tasklet>
</batch:step>
</batch:job>
You should read up on a feature before using it. You are correct that the processing happens twice, but only after an error occurs.
Basically, you have defined a chunk/step which is fault tolerant to certain specified exceptions (see Configuring Skip Logic).
Your step will not fail as long as the total exception count stays below the skip-limit, but after an error the chunk's items are processed a second time, one by one, with the bad records being skipped in that second pass.
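If you want to see exactly which records are dropped during that second, one-by-one pass, a SkipListener can be registered on the step. Below is a minimal sketch, assuming the item type is simply Object and an SLF4J logger; the class name is hypothetical:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.batch.core.SkipListener;

public class LoggingSkipListener implements SkipListener<Object, Object> {

    private static final Logger log = LoggerFactory.getLogger(LoggingSkipListener.class);

    @Override
    public void onSkipInRead(Throwable t) {
        // called when the reader throws a skippable exception
        log.warn("Record skipped while reading", t);
    }

    @Override
    public void onSkipInProcess(Object item, Throwable t) {
        // called when the processor fails for this item during the one-by-one re-scan
        log.warn("Record skipped while processing: {}", item, t);
    }

    @Override
    public void onSkipInWrite(Object item, Throwable t) {
        // called when the writer fails for this item during the one-by-one re-scan
        log.warn("Record skipped while writing: {}", item, t);
    }
}

The listener bean would then be added next to the existing feeder-reader entry in the <listeners> element of the step.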
I encountered a problem when an exception occurs in the writer phase.
One item caused a rollback due to an integrity problem in the database, and no retry is executed, so the processor is never replayed.
When an item causes a rollback it should be skipped, and the other items retried with a commit-interval of one.
But in my case, no retry is done for the other items with a commit-interval of one.
Would you know why no retry is executed?
Thanks in advance.
I hope you added the retry-limit; the exception class also needs to be listed in the retryable-exception-classes list. Check out the sample syntax below:
<job id="flowJob">
<step id="retryStep">
<tasklet>
<chunk reader="itemReader" writer="itemWriter" processor="itemProcessor" commit-interval="1" retry-limit="3">
<retryable-exception-classes>
<include class="org.springframework.remoting.RemoteAccessException"/>
</retryable-exception-classes>
</chunk>
</tasklet>
</step>
</job>
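For comparison, if you are on Java configuration, roughly the same fault-tolerant step can be expressed with Spring Batch's step builder API. This is only a sketch: the StepBuilderFactory and the reader/processor/writer arguments stand in for beans that would be defined elsewhere.

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.remoting.RemoteAccessException;

public class RetryStepConfig {

    public Step retryStep(StepBuilderFactory steps,
                          ItemReader<Object> reader,
                          ItemProcessor<Object, Object> processor,
                          ItemWriter<Object> writer) {
        return steps.get("retryStep")
                .<Object, Object>chunk(1)               // commit-interval="1"
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .faultTolerant()
                .retryLimit(3)                          // retry-limit="3"
                .retry(RemoteAccessException.class)     // retryable-exception-classes
                .build();
    }
}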
The purpose of this batch job is to fetch a few documents from the DB, encrypt them, and SFTP them to a server. For this I am using item readers and writers. For the encryption, I have to use a tasklet which is in a jar (I don't own the source code). There are millions of records to be processed, so I am using a chunk interval for reading and writing. My problem is the encryption (calling the tasklet after every chunk of writing is complete).
Is there a way to call the tasklet after the writer (in batch:chunk)?
As of now, I am doing the following:
<batch:job id="batchJob">
<batch:step id="prepareStep" next="encryptStep">
<batch:tasklet task-executor="executor">
<batch:chunk reader="reader" processor="processor"
writer="writer" commit-interval="100" >
</batch:chunk>
</batch:tasklet>
</batch:step>
<batch:step id="encryptStep" next="uploadStep">
<batch:tasklet ref="encryptTasklet" />
</batch:step>
The problem with the above approach is that encryptStep is called only after all of the million records have been read, processed, and written. I want it to work in chunks, that is, execute encryptTasklet after every chunk write is executed. Is there a way to achieve this?
Please help.
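One direction sometimes used for this kind of per-chunk hook (only a sketch; it assumes Spring Batch 3.x or later, where ChunkListener callbacks receive a ChunkContext, and a hypothetical encryptionService wrapping the code in the third-party jar) is to register a ChunkListener on the step:

import org.springframework.batch.core.ChunkListener;
import org.springframework.batch.core.scope.context.ChunkContext;

public class EncryptAfterChunkListener implements ChunkListener {

    // hypothetical wrapper around the encryption logic shipped in the jar
    public interface EncryptionService {
        void encryptPendingFiles();
    }

    private final EncryptionService encryptionService;

    public EncryptAfterChunkListener(EncryptionService encryptionService) {
        this.encryptionService = encryptionService;
    }

    @Override
    public void beforeChunk(ChunkContext context) {
        // nothing to do before the chunk
    }

    @Override
    public void afterChunk(ChunkContext context) {
        // runs after every successful chunk commit, i.e. after each batch of 100 writes
        encryptionService.encryptPendingFiles();
    }

    @Override
    public void afterChunkError(ChunkContext context) {
        // a failed chunk is rolled back, so nothing is encrypted here
    }
}

The listener would be referenced from a <batch:listeners> element inside the tasklet, the same way feeder-reader is registered in the first configuration above.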
We have data streaming in on an irregular basis and in quantities that I cannot predict. I currently have the commit-interval set to 1 because we want data to be written as soon as we receive it. We sometimes get large numbers of items at a time (~1000-50000 items in a second) which I would like to commit in larger chunks, as it takes a while to write these individually. Is there a way to set a timeout on the commit-interval?
Goal: we set the commit-interval to 10000, we get 9900 items, and after 1 second it commits the 9900 items rather than waiting until it receives 100 more.
Currently, when we set the commit-interval greater than 1, we just see data waiting to be written until it hits the amount specified by the commit-interval.
How is your data streaming in? Is it being loaded into a work table? Added to a queue? Typically you'd just drain the work table or queue with whatever commit interval performs best, then re-run the job periodically to check whether a new batch of inbound records has been received.
Either way, I would typically leverage flow control to have your job loop and just process as many records as are ready to be processed for a given time interval:
<job id="job">
<decision id="decision" decider="decider">
<next on="PROCESS" to="processStep" />
<next on="DECIDE" to="decision" />
<end on="COMPLETED" />
<fail on="*" />
</decision>
<step id="processStep">
<!-- your step here -->
</step>
</job>
<beans:bean id="decider" class="com.package.MyDecider"/>
Then your decider would do something like this:
if (maxTimeReached) {
    return END;
}
if (hasRecords) {
    return PROCESS;
} else {
    wait X seconds;
    return DECIDE;
}
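A minimal Java version of that decider might look like the sketch below. The maximum runtime, the wait period, and the hasRecords() check against the inbound work table or queue are all assumptions (as is getStartTime() returning a java.util.Date, i.e. Spring Batch 4.x or earlier); the custom PROCESS and DECIDE statuses match the on="..." values in the flow above.

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.job.flow.FlowExecutionStatus;
import org.springframework.batch.core.job.flow.JobExecutionDecider;

public class MyDecider implements JobExecutionDecider {

    private static final long MAX_RUNTIME_MILLIS = 60_000; // assumed cut-off for one run of the job
    private static final long WAIT_MILLIS = 5_000;         // assumed "wait X seconds"

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        long elapsed = System.currentTimeMillis() - jobExecution.getStartTime().getTime();
        if (elapsed > MAX_RUNTIME_MILLIS) {
            return FlowExecutionStatus.COMPLETED;      // matches <end on="COMPLETED" />
        }
        if (hasRecords()) {
            return new FlowExecutionStatus("PROCESS"); // matches <next on="PROCESS" ... />
        }
        try {
            Thread.sleep(WAIT_MILLIS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return FlowExecutionStatus.FAILED;         // matches <fail on="*" />
        }
        return new FlowExecutionStatus("DECIDE");      // loop back to the decision
    }

    // hypothetical check against the inbound work table or queue
    private boolean hasRecords() {
        return false;
    }
}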
My setup (simplified for clarity) is the following:
<int:inbound-channel-adapter channel="in" expression="0">
    <int:poller cron="0 0 * * * *"/>
    <int:header name="snapshot_date" expression="new java.util.Date()"/>
    <int:header name="correlationId" expression="T(java.util.UUID).randomUUID()"/>
    <!-- more here -->
</int:inbound-channel-adapter>

<int:recipient-list-router input-channel="in" apply-sequence="true">
    <int:recipient channel="data.source.1"/>
    <int:recipient channel="data.source.2"/>
    <!-- more here -->
</int:recipient-list-router>

<int:chain input-channel="data.source.1" output-channel="save">
    <int-jdbc:outbound-gateway data-source="db1" max-rows-per-poll="0">
        <int-jdbc:query>
            select * from large_dataset
        </int-jdbc:query>
    </int-jdbc:outbound-gateway>
    <int:header-enricher>
        <int:header name="source" value="data.source.1"/>
    </int:header-enricher>
</int:chain>

<int:chain input-channel="data.source.2" output-channel="save">
    <int-jdbc:outbound-gateway data-source="db1" max-rows-per-poll="0">
        <int-jdbc:query>
            select * from another_large_dataset
        </int-jdbc:query>
    </int-jdbc:outbound-gateway>
    <int:header-enricher>
        <int:header name="source" value="data.source.2"/>
    </int:header-enricher>
</int:chain>

<int:chain input-channel="save" output-channel="process">
    <int:splitter expression="T(com.google.common.collect.Lists).partition(payload, 1000)"/>
    <int:transformer>
        <int-groovy:script location="transform.groovy"/>
    </int:transformer>
    <int:service-activator expression="#db2.insertData(payload, headers)"/>
    <int:aggregator/>
</int:chain>

<int:chain input-channel="process" output-channel="nullChannel">
    <int:aggregator/>
    <int:service-activator expression="#finalProcessing.doSomething()"/>
</int:chain>
Let me explain the steps a little bit:
The poller is triggered by cron, and the message is enriched with some information about this run.
The message is sent to multiple data-source chains.
Each chain extracts data from a large dataset (100k+ rows). The result set message is marked with a source header.
The result set is split into smaller chunks, transformed, and inserted into db2.
After all data sources have been polled, some complex processing is initiated, using the information about the run.
This configuration does the job so far, but it is not scalable. The main problem is that I have to load the full dataset into memory first and pass it along the pipeline, which might cause memory issues.
My question is: what is the simplest way to have the result set extracted from db1, pushed through the pipeline, and inserted into db2 in small batches?
First of all, since version 4.0.4 Spring Integration's <splitter> supports an Iterator as payload, to avoid memory overhead.
We have a test case for JDBC which shows that behaviour. But as you see, it is based on the Spring Integration Java DSL and Java 8 lambdas. (Yes, it can be done even for older Java versions, without lambdas.) Even if this case is appropriate for you, your <aggregator> should not be in-memory, because it collects all messages in the MessageStore.
That's the first case.
Another option is based on a paging algorithm, where your SELECT accepts a pair of WHERE params in your DB dialect. For Oracle it can be like: Paging with Oracle.
Here the pageNumber is some message header - :headers[pageNumber].
After that you do some trick with <recipient-list-router> to send the SELECT result to the save channel and to some other channel which increments the pageNumber header value and sends the message back to the data.source.1 channel, and so on. When the pageNumber goes beyond the data's range, the <int-jdbc:outbound-gateway> stops producing results.
Something like that.
I don't say that it is easy, but it should be a starting point for you, at least.
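Just to make the paging idea concrete, here is a minimal plain-JDBC sketch of the ROWNUM-bounded query. The table name large_dataset comes from the question, but the id ordering column, the page size, and the use of JdbcTemplate in a simple loop are assumptions; in the integration flow the lower/upper bounds would be derived from the pageNumber header instead.

import java.util.List;
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class PagedExtract {

    // Oracle-style ROWNUM paging: fetch rows (lower, upper] of the ordered result set
    private static final String PAGED_QUERY =
            "SELECT * FROM ("
          + "  SELECT t.*, ROWNUM rn FROM ("
          + "    SELECT * FROM large_dataset ORDER BY id"
          + "  ) t WHERE ROWNUM <= ?"
          + ") WHERE rn > ?";

    public void copyInPages(DataSource db1, int pageSize) {
        JdbcTemplate jdbc = new JdbcTemplate(db1);
        int pageNumber = 0;
        while (true) {
            int upper = (pageNumber + 1) * pageSize;
            int lower = pageNumber * pageSize;
            List<Map<String, Object>> page = jdbc.queryForList(PAGED_QUERY, upper, lower);
            if (page.isEmpty()) {
                break;               // pageNumber is beyond the data's range: stop producing results
            }
            insertIntoDb2(page);     // placeholder for the transform + db2.insertData(...) part of the flow
            pageNumber++;
        }
    }

    // placeholder for the 'save' chain (transform.groovy + db2 insert)
    private void insertIntoDb2(List<Map<String, Object>> rows) {
    }
}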
When defining a Spring Batch job and using the retry-limit parameter in the XML description, is it the total number of runs, or the number of retries?
i.e. when retry-limit=1, will my job run once or twice (in case of an error on the first run)?
This seems like a silly question, but I didn't find a clear answer in any documentation I've seen...
The retry-limit attribute is really "item-based" and not "job-based". By "item-based" I mean that for every item (record/line) that is read/processed/written, if that item fails, it will be retried up to the retry-limit. If that limit is reached, the step will fail.
For example
<step id="someStep">
<tasklet>
<chunk reader="itemReader" writer="itemWriter"
processor="itemProcessor" commit-interval="20"
retry-limit="3">
<retryable-exception-classes>
<include class="org.springframework.exception.SomeException"/>
</retryable-exception-classes>
</chunk>
</tasklet>
</step>
In the above basic step configuration, when a SomeException is thrown by any of the components in the step (itemReader, itemWriter, or itemProcessor), the item is retried up to three times before the step fails.
Here's the Spring doc's explanation:
In most cases you want an exception to cause either a skip or Step failure. However, not all exceptions are deterministic. If a FlatFileParseException is encountered while reading, it will always be thrown for that record; resetting the ItemReader will not help. However, for other exceptions, such as a DeadlockLoserDataAccessException, which indicates that the current process has attempted to update a record that another process holds a lock on, waiting and trying again might result in success. In this case, retry should be configured:
<step id="step1">
<tasklet>
<chunk reader="itemReader" writer="itemWriter"
commit-interval="2" retry-limit="3">
<retryable-exception-classes>
<include class="org.springframework.dao.DeadlockLoserDataAccessException"/>
</retryable-exception-classes>
</chunk>
</tasklet>
</step>
The Step allows a limit for the number of times an individual item can be retried, and a list of exceptions that are 'retryable'. More details on how retry works can be found in Chapter 9, Retry.
A Spring Batch job is marked as failed if any step fails to complete its execution without an error or exception.
If any error or exception occurs in a step, that step is marked as failed, and with that the job is also marked as failed.
First of all, if you want to restart a job you need to make sure that the job is defined as restartable; otherwise you cannot run the same job again. Moreover, a job is restartable only if it failed in the previous attempt. Once it has finished successfully you cannot restart it, even if it is declared restartable. Well, you can, but only with different job parameters.
The retry-limit attribute defines how many times a failed item of a failed step can be retried.
To use retry-limit you also need to define on which exceptions or errors it should retry.
The retry-limit attribute is really "item-based" and not "job-based". By "item-based" I mean that for every item (record/line) that is read/processed/written, if that item fails, it will be retried up to the retry-limit. If that limit is reached, the step will fail.
For example, if retry-limit is set to 2, the item will be attempted at most twice.
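To illustrate the restart rule above, here is a small sketch; the JobLauncher and Job arguments are assumed to be configured elsewhere, and the run.time parameter name is just an example. A completed job instance can only be launched again with different job parameters, whereas re-using the exact same parameters is only allowed for a failed, restartable instance.

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class JobRunner {

    public JobExecution launch(JobLauncher jobLauncher, Job job) throws Exception {
        // A new value for 'run.time' creates a new job instance, so a previously
        // completed run does not block this launch.
        JobParameters params = new JobParametersBuilder()
                .addLong("run.time", System.currentTimeMillis())
                .toJobParameters();
        return jobLauncher.run(job, params);
    }
}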