Batch processing in JDBC gateway

My setup (simplified for clarity) is the following:
<int:inbound-channel-adapter channel="enrich" expression="0">
    <int:poller cron="0 0 * * * *"/>
</int:inbound-channel-adapter>
<int:header-enricher input-channel="enrich" output-channel="in">
    <int:header name="snapshot_date" expression="new java.util.Date()"/>
    <int:header name="correlationId" expression="T(java.util.UUID).randomUUID()"/>
    <!-- more here -->
</int:header-enricher>
<int:recipient-list-router input-channel="in" apply-sequence="true">
    <int:recipient channel="data.source.1"/>
    <int:recipient channel="data.source.2"/>
    <!-- more here -->
</int:recipient-list-router>
<int:chain input-channel="data.source.1" output-channel="save">
    <int-jdbc:outbound-gateway data-source="db1" max-rows-per-poll="0">
        <int-jdbc:query>
            select * from large_dataset
        </int-jdbc:query>
    </int-jdbc:outbound-gateway>
    <int:header-enricher>
        <int:header name="source" value="data.source.1"/>
    </int:header-enricher>
</int:chain>
<int:chain input-channel="data.source.2" output-channel="save">
    <int-jdbc:outbound-gateway data-source="db1" max-rows-per-poll="0">
        <int-jdbc:query>
            select * from another_large_dataset
        </int-jdbc:query>
    </int-jdbc:outbound-gateway>
    <int:header-enricher>
        <int:header name="source" value="data.source.2"/>
    </int:header-enricher>
</int:chain>
<int:chain input-channel="save" output-channel="process">
    <int:splitter expression="T(com.google.common.collect.Lists).partition(payload, 1000)"/>
    <int:transformer>
        <int-groovy:script location="transform.groovy"/>
    </int:transformer>
    <int:service-activator expression="@db2.insertData(payload, headers)"/>
    <int:aggregator/>
</int:chain>
<int:chain input-channel="process" output-channel="nullChannel">
    <int:aggregator/>
    <int:service-activator expression="@finalProcessing.doSomething()"/>
</int:chain>
Let me explain the steps a little bit:
The poller is triggered by cron. The message is enriched with some information about this run.
The message is sent to multiple data-source chains.
Each chain extracts data from a large dataset (100k+ rows). The result-set message is marked with a source header.
The result set is split into smaller chunks, transformed and inserted into db2.
After all data sources have been polled, some complex processing is initiated, using the information about the run.
This configuration does the job so far, but it is not scalable. The main problem is that I have to load the full dataset into memory first and pass it along the pipeline, which might cause memory issues.
My question is: what is the simplest way to have the result set extracted from db1, pushed through the pipeline and inserted into db2 in small batches?

First of all, since version 4.0.4, Spring Integration's <splitter> supports an Iterator as payload, to avoid memory overhead.
We have a test case for JDBC which shows that behaviour. But as you see, it is based on the Spring Integration Java DSL and Java 8 lambdas. (Yes, it can be done even for older Java versions, without lambdas.) Even if this case is appropriate for you, your <aggregator> should not be in-memory, because it collects all messages in its MessageStore.
That's the first case.
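A minimal XML sketch of that Iterator variant, assuming a hypothetical largeDatasetRepository bean whose iterate() method streams rows from db1 (for example via a cursored JDBC query) instead of returning a full List:
<!-- hypothetical bean: iterate() returns java.util.Iterator, so the rows are never all in memory -->
<int:service-activator input-channel="data.source.1"
        expression="@largeDatasetRepository.iterate()"
        output-channel="rows"/>
<!-- since 4.0.4 the splitter walks an Iterator payload lazily, one message per row -->
<int:splitter input-channel="rows" output-channel="save"/>
Downstream you would then re-batch the single-row messages, for example with an <aggregator> using a size-based release strategy, instead of the Lists.partition() splitter.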
Another option is based on a paging algorithm, where your SELECT accepts a pair of WHERE params in your DB dialect. For Oracle it can be something like: Paging with Oracle.
Here pageNumber is some message header: :headers[pageNumber].
After that you do a trick with a <recipient-list-router> to send the SELECT result to the save channel and to some other channel which increments the pageNumber header value and sends the message back to the data.source.1 channel, and so on. When the pageNumber goes beyond the data range, the <int-jdbc:outbound-gateway> stops producing results.
Something like that.
I don't say that it's easy, but it should be a starting point for you, at least.
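Roughly, that loop could look like the following for data.source.1 (a sketch only: the page size of 1000, the Oracle-style ROWNUM query, the nextPage channel and requires-reply="false" are assumptions; the pageNumber header is assumed to be seeded to 1 together with the other run headers, and the source header-enricher is omitted for brevity):
<int:chain input-channel="data.source.1">
    <!-- requires-reply="false" lets the loop end quietly once a page comes back empty -->
    <int-jdbc:outbound-gateway data-source="db1" requires-reply="false">
        <int-jdbc:query>
            select * from (select t.*, ROWNUM rn from large_dataset t)
            where rn between ((:headers[pageNumber] - 1) * 1000) + 1
                         and (:headers[pageNumber] * 1000)
        </int-jdbc:query>
    </int-jdbc:outbound-gateway>
    <!-- send the current page downstream and, in addition, request the next page -->
    <int:recipient-list-router>
        <int:recipient channel="save"/>
        <int:recipient channel="nextPage"/>
    </int:recipient-list-router>
</int:chain>
<int:header-enricher input-channel="nextPage" output-channel="data.source.1">
    <int:header name="pageNumber" overwrite="true"
            expression="headers['pageNumber'] + 1"/>
</int:header-enricher>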

Related

Wait for jdbc outbound channel adapter to complete before further processing

I'm new to Spring Integration and I'm experimenting with the various components in a small project.
In the task at hand, I need to process a text file and store its contents to database. The file holds lines that can be grouped together, so it will be natural to divide each file into several independent messages.
This is the whole process (please see the config at the end):
1. do an initial file analysis (done by transformers.outcomeTransf);
2. store some data to the database, i.e. file name, file date, etc. (?);
3. split the file contents into several distinct messages (done by splitters.outcomeSplit);
4. further analyze each message (done by transformers.SingleoutcomeToMap);
5. store single-message data to the database, referencing the data stored at step 1 (done by the stored-proc-outbound-channel-adapter).
The database holds just two tables:
T1 for file metadata (file name, file date, file source, ...);
T2 for file content details, rows here reference rows in T1.
I'm missing the component for step 2. As I understand it, a channel outbound adapter "swallows" the message it handles, so that no other endpoint can receive it.
I thought about a publish-subscribe channel (without a TaskExecutor) after step 1, with a JDBC outbound adapter as the first subscriber and the splitter from step 3 as the second one: each subscribed handler should then receive a copy of the message, but it's not clear to me whether any processing in the splitter would wait until the outbound adapter has finished.
Is this the right approach to the task? And what if the transformer at step 4 is called asynchronously? Each split message is self-contained, and that would call for concurrency.
Spring configuration:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:int="http://www.springframework.org/schema/integration"
xmlns:int-file="http://www.springframework.org/schema/integration/file"
xmlns:int-jdbc="http://www.springframework.org/schema/integration/jdbc"
xmlns:beans="http://www.springframework.org/schema/beans"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans.xsd
http://www.springframework.org/schema/integration
http://www.springframework.org/schema/integration/spring-integration.xsd
http://www.springframework.org/schema/integration/file
http://www.springframework.org/schema/integration/file/spring-integration-file.xsd
http://www.springframework.org/schema/integration/jdbc
http://www.springframework.org/schema/integration/jdbc/spring-integration-jdbc.xsd
http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans.xsd">
<!-- input-files-from-folder -->
<int-file:inbound-channel-adapter id="outcomeIn"
directory="file:/in-outcome">
<int:poller id="poller" fixed-delay="2500" />
</int-file:inbound-channel-adapter>
<int:transformer input-channel="outcomeIn" output-channel="outcomesChannel" method="transform">
<beans:bean class="transformers.outcomeTransf" />
</int:transformer>
<!-- save source to db! -->
<int:splitter input-channel="outcomesChannel" output-channel="singleoutcomeChannel" method="splitMessage">
<beans:bean class="splitters.outcomeSplit" />
</int:splitter>
<int:transformer input-channel="singleoutcomeChannel" output-channel="jdbcChannel" method="transform">
<beans:bean class="transformers.SingleoutcomeToMap" />
</int:transformer>
<int-jdbc:stored-proc-outbound-channel-adapter
data-source="dataSource" channel="jdbcChannel" stored-procedure-name="insert_outcome"
ignore-column-meta-data="true">
<int-jdbc:sql-parameter-definitions ... />
<int-jdbc:parameter ... />
</int-jdbc:stored-proc-outbound-channel-adapter>
<bean id="dataSource" class="org.apache.commons.dbcp2.BasicDataSource" destroy-method="close" >
<property name="driverClassName" value="org.postgresql.Driver"/>
<property ... />
</bean>
</beans>
You are thinking the right way. When you have a PublishSubscribeChannel without an Executor, each next subscriber is going to wait until the previous one finishes its work. Therefore your splitter is not going to be called until everything is done in the DB. Moreover, by default, when the first subscriber fails to handle a message (no DB connection?), all the others won't be called.
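A sketch of that ordering, reusing the channels from your configuration (the insert query and column names are assumptions about what transformers.outcomeTransf produces):
<int:publish-subscribe-channel id="outcomesChannel"/>
<!-- subscriber 1: store the file metadata; without a task-executor it runs before the splitter -->
<int-jdbc:outbound-channel-adapter channel="outcomesChannel" data-source="dataSource" order="1"
        query="insert into T1 (file_name, file_date) values (:payload[fileName], :payload[fileDate])"/>
<!-- subscriber 2: the existing splitter on outcomesChannel, with order="2" added -->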
Similar behavior can also be configured with a <request-handler-advice-chain> and an ExpressionEvaluatingRequestHandlerAdvice: https://docs.spring.io/spring-integration/docs/5.0.4.RELEASE/reference/html/messaging-endpoints-chapter.html#expression-advice
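A rough sketch of that alternative (the channel name, insert query and property values are assumptions): on success the advice sends a message to its success channel, and the splitter would then use that channel as its input instead of subscribing to outcomesChannel.
<int-jdbc:outbound-channel-adapter channel="outcomesChannel" data-source="dataSource"
        query="insert into T1 (file_name, file_date) values (:payload[fileName], :payload[fileDate])">
    <int-jdbc:request-handler-advice-chain>
        <beans:bean class="org.springframework.integration.handler.advice.ExpressionEvaluatingRequestHandlerAdvice">
            <!-- after a successful insert, an AdviceMessage with this expression's result
                 is sent to the success channel, where the splitter can pick it up -->
            <beans:property name="onSuccessExpressionString" value="payload"/>
            <beans:property name="successChannelName" value="afterInsertChannel"/>
        </beans:bean>
    </int-jdbc:request-handler-advice-chain>
</int-jdbc:outbound-channel-adapter>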
All the concurrency and multi-threading in the splitter's downstream flow is unrelated to the DB logic anyway. No parallelism is going to happen until the DB has performed its insert properly.

Determine the end of a cyclic workflow in spring integration (inbound-channel => service-activator)

We have the following simple int-jpa based workflow:
[inbound-channel-adapter] -> [service-activator]
The config is like this:
<int:channel id="inChannel"> <int:queue/> </int:channel>
<int:channel id="outChannel"> <int:queue/> </int:channel>
<int-jpa:inbound-channel-adapter id="inChannelAdapter" channel="inChannel"
jpa-query="SOME_COMPLEX_POLLING_QUERY"
max-results="2">
<int:poller max-messages-per-poll="2" fixed-rate="20" >
<int:advice-chain synchronization-factory="txSyncFactory" >
<tx:advice transaction-manager="transactionManager" >
<tx:attributes>
<tx:method name="*" timeout="30000" />
</tx:attributes>
</tx:advice>
<int:ref bean="pollerAdvice"/>
</int:advice-chain>
</int-jpa:inbound-channel-adapter>
<int:service-activator input-channel="inChannel" ref="myActivator"
method="pollEntry" output-channel="outChannel" />
<bean id="myActivator" class="com.company.myActivator" />
<bean id="pollerAdvice" class="com.company.myPollerAdvice" />
The entry point for processing is a constantly growing table against which the SOME_COMPLEX_POLLING_QUERY is run. The current flow is:
[Thread-1] The SOME_COMPLEX_POLLING_QUERY will only return entries that have busy set to false (we set busy to true as soon as polling is done, using txSyncFactory).
[Thread-2] These entries will pass through myActivator, where processing might take anywhere from 1 min to 30 mins.
[Thread-2] Once the processing is done, we set busy back from true to false.
Problem: We need to trigger a notification event when the processing of all the entries that were present in the table is done.
Approach tried: We used the afterReturning of pollerAdvice to find out whether the SOME_COMPLEX_POLLING_QUERY returned any results or not. However, this method will start returning "No Entries" way before Thread-2 is done processing all the entries.
Note:
The same entries will be processed again after 24 hrs, but by then the table will have more entries.
We are not using an outbound-channel-adapter, since we don't have any requirement for it. However, we are open to using one if it is part of the proposed solution.
Not sure if this will work for you, but since you still need to hold the notification back until Thread-2 is done, I would suggest having some AtomicBoolean bean. In the mentioned afterReturning(), when no data has been polled from the DB, you just change the state of the AtomicBoolean to true. When Thread-2 finishes its work, it can call a <filter> to check the state of the AtomicBoolean and only then really perform an <int-event:outbound-channel-adapter> to emit a notification event.
So the final decision whether to emit the event or not is definitely made from Thread-2, not from the polling channel adapter.
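A loose sketch of that idea (the bean and channel names are made up; the afterReturning() advice would call noMoreEntries.set(true) when the poll comes back empty):
<bean id="noMoreEntries" class="java.util.concurrent.atomic.AtomicBoolean"/>
<!-- Thread-2 sends a message here after it has finished processing an entry -->
<int:filter input-channel="entryProcessedChannel" output-channel="notificationChannel"
        expression="@noMoreEntries.get()"/>
<!-- publish the notification as a Spring ApplicationEvent -->
<int-event:outbound-channel-adapter channel="notificationChannel"/>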

Spring Integration delayer - JDBC Message Store: message is not getting deleted

Below is the delayer code which I am using in my application. The output channel checkMessageInProgress is a database stored procedure call which will check whether the message needs to be processed or delayed.
If the message needs to be delayed again, the retry count is incremented. After 3 delays, a custom application exception is raised. I am using a JDBC message store for the delayer messages. In the scenario where the message has been delayed 3 times and the exception is raised, the messages are not deleted from the database tables, and the server picks those messages up again on restart. How do I make sure that the message is deleted from the table in cases where the delay has happened 3 times?
<int:chain input-channel="delayerChannel"
output-channel="checkMessageInProgress">
<int:header-enricher>
<!-- Exception/ERROR handling for flows originating from Delayer -->
<int:header name="errorChannel" value="exceptionChannel"
overwrite="true" />
<int:header name="retryCount" overwrite="true" type="int"
expression="headers['retryCount'] == null ? 0 : headers['retryCount'] + 1" />
</int:header-enricher>
<!-- If retryCount maxed out -discard message and log it in error table -->
<int:filter expression="(headers['retryCount'] lt 3)"
discard-channel="raiseExceptionChannel">
</int:filter>
<!-- Configurable delay - fetch from property file -->
<int:delayer id="Delayer" default-delay="${timeout}"
message-store="mymessageStore">
<!-- Transaction management for flows originating from the Delayer -->
<int:transactional transaction-manager="myAppTransactionManager"/>
</int:delayer>
</int:chain>
That is no surprise. Since you use a transactional resource (the database), any exception downstream causes a transaction rollback, and therefore no deletion of the data.
Consider shifting the message to a separate thread before throwing the exception. That way the transaction will be committed.
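A minimal sketch of that suggestion: make the filter's discard channel an executor channel, so the exception is thrown on a different thread while the poller's transaction (and the message-store cleanup) commits. The executor bean name is an assumption:
<int:channel id="raiseExceptionChannel">
    <!-- hand the discarded message over to another thread; the exception raised there
         no longer rolls back the delayer's transaction -->
    <int:dispatcher task-executor="raiseExceptionExecutor"/>
</int:channel>
<task:executor id="raiseExceptionExecutor" pool-size="1"/>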

(Spring Batch) Pollable channel with replies contains ChunkResponses even if the job has completed successfully

I have the following chunk writer configuration for getting the replies from Spring Batch remote chunking:
<bean id="chunkWriter" class="org.springframework.batch.integration.chunk.ChunkMessageChannelItemWriter" scope="step">
<property name="messagingOperations" ref="messagingGateway" />
<property name="replyChannel" ref="masterChunkReplies" />
<property name="throttleLimit" value="5" />
<property name="maxWaitTimeouts" value="30000" />
</bean>
<bean id="messagingGateway" class="org.springframework.integration.core.MessagingTemplate">
<property name="defaultChannel" ref="masterChunkRequests" />
<property name="receiveTimeout" value="2000" />
</bean>
<!-- Remote Chunking Replies From Slave -->
<jms:inbound-channel-adapter id="masterJMSReplies"
destination="remoteChunkingRepliesQueue"
connection-factory="remoteChunkingConnectionFactory"
channel="masterChunkReplies">
<int:poller fixed-delay="10" />
</jms:inbound-channel-adapter>
<int:channel id="masterChunkReplies">
<int:queue />
<int:interceptors>
<int:wire-tap channel="loggingChannel"/>
</int:interceptors>
</int:channel>
My remotely chunked step is running perfectly: all data are processed with very good performance and all steps end in the COMPLETED state. But the problem is that the masterChunkReplies queue channel still contains ChunkResponses after the end of the job. The documentation doesn't say anything about it; is that a normal state?
The problem is that I can't run a new job then, because it crashes with:
Message contained wrong job instance id [" + jobInstanceId + "] should have been [" + localState.getJobId() + "]
There is a simple workaround, cleaning the masterChunkReplies queue channel at the start of the job, but I'm not sure if it is correct...
Can you please clarify this?
Gary, I found the root cause.
On the slaves, if I change the following chunk consumer JMS adapter:
<jms:message-driven-channel-adapter id="slaveRequests"
connection-factory="remoteChunkingConnectionFactory"
destination="remoteChunkingRequestsQueue"
channel="chunkRequests"
concurrent-consumers="10"
max-concurrent-consumers="50"
acknowledge="transacted"
receive-timeout="5000"
idle-task-execution-limit="10"
idle-consumer-limit="5"
/>
for this one:
<jms:inbound-channel-adapter id="jmsRequests" connection-factory="remoteChunkingConnectionFactory"
destination="remoteChunkingRequestsQueue"
channel="chunkRequests"
acknowledge="transacted"
receive-timeout="5000"
>
<int:poller fixed-delay="100"/>
</jms:inbound-channel-adapter>
then it works: the masterChunkReplies queue is consumed completely at the end of the job. However, any attempt to consume chunkRequests on the slaves in parallel doesn't work; the masterChunkReplies queue then contains unconsumed ChunkResponses, so starting new jobs ends in:
Message contained wrong job instance id [" + jobInstanceId + "] should have been [" + localState.getJobId() + "]
Gary, does it mean that slaves cannot consume ChunkRequests in parallel?
Gary, after a few days of struggling, I finally made it work, even with parallel ChunkRequests consumption on the slaves and with an empty masterChunkReplies pollable channel at the end of the job. Changes:
On the master, I changed the polled inbound channel adapter consuming ChunkResponses (taken from the GitHub examples) for a message-driven adapter with the same level of multi-threading as the slaves use for consuming ChunkRequests. I had a feeling that the master was consuming ChunkResponses slowly; that's why there were leftover ChunkResponses at the end of the job.
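Roughly, that change on the master side could look like this (the concurrency numbers simply mirror the slave adapter above and are not taken from the actual configuration):
<jms:message-driven-channel-adapter id="masterJMSReplies"
        connection-factory="remoteChunkingConnectionFactory"
        destination="remoteChunkingRepliesQueue"
        channel="masterChunkReplies"
        concurrent-consumers="10"
        max-concurrent-consumers="50"
        acknowledge="transacted"/>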
Also, I had misconfigured the remotely chunked step. My fault.
I haven't tested it on more than one node yet, but now I think it works as it should.
Thank you very much for help.
regards
Tomas

Spring Integration - parallel file processing by group

I am experimenting with Spring Integration on a simple task. I have a folder where I get incoming files. The files are named after a group ID.
I want all the files in the same groupId to be processed in sequence but files with different groupIds can be processed in parallel.
I started putting together a configuration like this:
<int:service-activator input-channel="filesInChannel"
output-channel="outputChannelAdapter">
<bean class="com.ingestion.FileProcessor" />
</int:service-activator>
<int:channel id="filesInChannel" />
<int-file:inbound-channel-adapter id="inputChannelAdapter"
channel="filesInChannel" directory="${in.file.path}" prevent-duplicates="true"
filename-pattern="${file.pattern}">
<int:poller id="poller" fixed-rate="1" task-executor="executor"/>
</int-file:inbound-channel-adapter>
<int-file:outbound-channel-adapter id="outputChannelAdapter" directory="${ok.file.path}" delete-source-files="true"/>
<task:executor id="executor" pool-size="10"/>
This is processing all the incoming files with 10 threads. What steps do I need to take to split the files by groupId and have them processed with one thread per groupId?
Thanks.
Assuming a finite number of group ids, you could use a different adapter for each group (each with a single thread, all feeding into the same channel), each with a different filename pattern.
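A sketch of that variant, assuming the group id is a file-name prefix; each adapter gets its own single-threaded executor, so files of one group are processed strictly in sequence while different groups run in parallel:
<int-file:inbound-channel-adapter id="group1Adapter" channel="filesInChannel"
        directory="${in.file.path}" prevent-duplicates="true"
        filename-pattern="group1_*">
    <int:poller fixed-rate="1" task-executor="group1Executor"/>
</int-file:inbound-channel-adapter>
<task:executor id="group1Executor" pool-size="1"/>
<!-- repeat for group2_*, group3_*, ... -->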
Or you could create a custom FileListFilter and use some kind of thread affinity to assign files from each group to a specific thread, with the filter only returning this thread's file(s).
