Spring Integration - parallel file processing by group

I am trying to experiment with Spring Integration on a simple task. I have a folder where incoming files arrive. The files are named after a group ID.
I want all the files in the same groupId to be processed in sequence but files with different groupIds can be processed in parallel.
I started putting together a configuration like this:
<int:service-activator input-channel="filesInChannel"
output-channel="outputChannelAdapter">
<bean class="com.ingestion.FileProcessor" />
</int:service-activator>
<int:channel id="filesInChannel" />
<int-file:inbound-channel-adapter id="inputChannelAdapter"
channel="filesInChannel" directory="${in.file.path}" prevent-duplicates="true"
filename-pattern="${file.pattern}">
<int:poller id="poller" fixed-rate="1" task-executor="executor"/>
</int-file:inbound-channel-adapter>
<int-file:outbound-channel-adapter id="outputChannelAdapter" directory="${ok.file.path}" delete-source-files="true"/>
<task:executor id="executor" pool-size="10"/>
This is processing all the incoming files with 10 threads. What steps do I need to take to split the files by groupId and have them processed with one thread per groupId?
Thanks.

Assuming a finite number of group ids, you could use a different adapter for each group (each with a single thread, all feeding into the same channel), each with a different filename pattern.
Or you could create a custom FileListFilter and use some kind of thread affinity to assign files from each group to a specific thread, with the filter only returning this thread's file(s).
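For the first option, here is a minimal sketch. It assumes two known group ids ("groupA" and "groupB") and that the group id appears as a filename prefix; adjust the patterns to your real naming convention. Each adapter gets its own single-threaded executor, so files within a group stay sequential while different groups run in parallel:
<int-file:inbound-channel-adapter id="groupAInputAdapter"
    channel="filesInChannel" directory="${in.file.path}"
    prevent-duplicates="true" filename-pattern="groupA_*">
    <int:poller fixed-rate="1" task-executor="groupAExecutor"/>
</int-file:inbound-channel-adapter>
<int-file:inbound-channel-adapter id="groupBInputAdapter"
    channel="filesInChannel" directory="${in.file.path}"
    prevent-duplicates="true" filename-pattern="groupB_*">
    <int:poller fixed-rate="1" task-executor="groupBExecutor"/>
</int-file:inbound-channel-adapter>
<!-- one single-threaded executor per group: each group's files are handled one at a time,
     because filesInChannel is a DirectChannel and the service activator runs on the poller thread -->
<task:executor id="groupAExecutor" pool-size="1"/>
<task:executor id="groupBExecutor" pool-size="1"/>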

Related

Spring Integration - Wait till finishes processing file

The MyHandler class takes about 10-20 seconds to process a huge 200MB csv/txt file. If I drop a file in the 'my.test.dir' directory, MyHandler keeps picking up the same file multiple times. To avoid this, I set prevent-duplicates to false. But I might get a file with the same file name after some time, and then it doesn't pick up files with the same name later. Please suggest how to handle this scenario: MyHandler has to wait until it finishes processing the file.
<bean id="test-file-bean" class="com.test.MyHandler"/>
<int-file:inbound-channel-adapter
id="test-adapter-inbound"
directory="${my.test.dir}"
channel="test-file-channel"
filter="test-file-filter"
prevent-duplicates="false" auto-startup="true"
auto-create-directory="true">
<int:poller fixed-delay="5"/>
</int-file:inbound-channel-adapter>
<int:service-activator
input-channel="test-file-channel" ref="test-file-bean" method="handleFlow"/>
Thanks.
Consider using a FileSystemPersistentAcceptOnceFileListFilter to prevent duplicates while still passing files whose timestamp has changed.
See the docs for more info: https://docs.spring.io/spring-integration/docs/current/reference/html/file.html#file-reading.
There you can also find ChainFileListFilter if you need to combine it with your own filter.
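A minimal sketch of such a filter, wired in via the filter="test-file-filter" attribute you already have. It assumes the in-memory SimpleMetadataStore; in production you would normally plug in a persistent MetadataStore implementation, and you can wrap this filter together with your own one in a ChainFileListFilter:
<bean id="metadataStore" class="org.springframework.integration.metadata.SimpleMetadataStore"/>
<bean id="test-file-filter"
    class="org.springframework.integration.file.filters.FileSystemPersistentAcceptOnceFileListFilter">
    <!-- accepts each file only once, but accepts it again if its lastModified changes -->
    <constructor-arg ref="metadataStore"/>
    <constructor-arg value="test-"/> <!-- key prefix used in the metadata store -->
</bean>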

Spring Integration AWS s3-inbound-streaming-channel-adapter stream from multiple s3 buckets

I am using XML-based Spring Integration and use the s3-inbound-streaming-channel-adapter to stream from a single S3 bucket.
We now have a requirement to stream from two s3 buckets.
So is it possible for s3-inbound-streaming-channel-adapter to stream from multiple buckets?
Or would I need to create a separate s3-inbound-streaming-channel-adapter for each s3 bucket?
This is my current set up for a single s3 bucket and it does work.
<int-aws:s3-inbound-streaming-channel-adapter
channel="s3Channel"
session-factory="s3SessionFactory"
filter="acceptOnceFilter"
remote-directory-expression="'bucket-1'">
<int:poller fixed-rate="1000"/>
</int-aws:s3-inbound-streaming-channel-adapter>
Thanks in advance.
UPDATE:
I ended up having two s3-inbound-streaming-channel-adapters, as suggested by Artem Bilan below.
However, for each inbound adapter I had to declare separate acceptOnceFilter and metadataStore instances.
This is because when there was only one instance of acceptOnceFilter and metadataStore shared between the two inbound adapters, some weird looping started happening.
E.g. when file_1.csv arrived on bucket-1 and got processed, and then the same file_1.csv was put on bucket-2, the weird looping started. Don't know why! So I ended up creating an acceptOnceFilter and metadataStore for each inbound adapter.
<!-- ===================================================== -->
<!-- Region 1 s3-inbound-streaming-channel-adapter setting -->
<!-- ===================================================== -->
<bean id="metadataStore" class="org.springframework.integration.metadata.SimpleMetadataStore"/>
<bean id="acceptOnceFilter"
class="org.springframework.integration.aws.support.filters.S3PersistentAcceptOnceFileListFilter">
<constructor-arg index="0" ref="metadataStore"/>
<constructor-arg index="1" value="streaming"/>
</bean>
<int-aws:s3-inbound-streaming-channel-adapter id="s3Region1"
channel="s3Channel"
session-factory="s3SessionFactory"
filter="acceptOnceFilter"
remote-directory-expression="'${s3.bucketOne.name}'">
<int:poller fixed-rate="1000"/>
</int-aws:s3-inbound-streaming-channel-adapter>
<int:channel id="s3Channel">
<int:queue capacity="50"/>
</int:channel>
<!-- ===================================================== -->
<!-- Region 2 s3-inbound-streaming-channel-adapter setting -->
<!-- ===================================================== -->
<bean id="metadataStoreRegion2" class="org.springframework.integration.metadata.SimpleMetadataStore"/>
<bean id="acceptOnceFilterRegion2"
class="org.springframework.integration.aws.support.filters.S3PersistentAcceptOnceFileListFilter">
<constructor-arg index="0" ref="metadataStoreRegion2"/>
<constructor-arg index="1" value="streaming"/>
</bean>
<int-aws:s3-inbound-streaming-channel-adapter id="s3Region2"
channel="s3ChannelRegion2"
session-factory="s3SessionFactoryRegion2"
filter="acceptOnceFilterRegion2"
remote-directory-expression="'${s3.bucketTwo.name}'">
<int:poller fixed-rate="1000"/>
</int-aws:s3-inbound-streaming-channel-adapter>
<int:channel id="s3ChannelRegion2">
<int:queue capacity="50"/>
</int:channel>
That's correct, the current implementation supports only a single remote directory to poll periodically. We really are working at this very moment to formalize such a solution as an out-of-the-box feature. A similar request has been reported for the (S)FTP support, especially when the target directory is not known in advance at configuration time.
If it is not a big deal for you to configure a separate channel adapter for each directory, that would be great. You can always send messages from all of them to the same channel for processing.
Otherwise you can consider looping over the list of buckets via:
<xsd:attribute name="remote-directory-expression" type="xsd:string">
<xsd:annotation>
<xsd:documentation>
Specify a SpEL expression which will be used to evaluate the directory
path to where the files will be transferred
(e.g., "headers.['remote_dir'] + '/myTransfers'" for outbound endpoints)
There is no root object (message) for inbound endpoints
(e.g., "#someBean.fetchDirectory");
</xsd:documentation>
</xsd:annotation>
</xsd:attribute>
implemented in some bean of your own.
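For example, a rough sketch where the expression delegates to a hypothetical bean of your own (com.example.BucketRotator and its nextBucket() method are assumptions, not an existing API) that returns a different bucket name on each poll:
<bean id="bucketRotator" class="com.example.BucketRotator"/> <!-- hypothetical: cycles through your bucket names -->
<int-aws:s3-inbound-streaming-channel-adapter id="s3MultiBucket"
    channel="s3Channel"
    session-factory="s3SessionFactory"
    filter="acceptOnceFilter"
    remote-directory-expression="@bucketRotator.nextBucket()">
    <int:poller fixed-rate="1000"/>
</int-aws:s3-inbound-streaming-channel-adapter>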

Wait for jdbc outbound channel adapter to complete before further processing

I'm new to Spring Integration and I'm experimenting with the various components in a small project.
In the task at hand, I need to process a text file and store its contents to database. The file holds lines that can be grouped together, so it will be natural to divide each file into several independent messages.
This is the whole process (please see the config at the end):
1. Do an initial file analysis; done by transformers.outcomeTransf.
2. Store some data to database (i.e. file name, file date, etc.); done by ? (this is the missing piece).
3. Split the file contents into several distinct messages; done by splitters.outcomeSplit.
4. Further analyze each message; done by transformers.SingleoutcomeToMap.
5. Store single-message data to database, referencing the data stored at step 2; done by the stored-proc-outbound-channel-adapter.
The database holds just two tables:
T1 for file metadata (file name, file date, file source, ...);
T2 for file content details, rows here reference rows in T1.
I'm missing the component for step 2. As I understand it, an outbound channel adapter "swallows" the message it handles, so that no other endpoint can receive it.
I thought about a publish-subscribe channel (without a TaskExecutor) after step 1, with a jdbc outbound adapter as the first subscriber and the splitter from step 3 as the second one: each subscribed handler should then receive a copy of the message, but it's not clear to me whether any processing in the splitter would wait until the outbound adapter has finished.
Is this the right approach to the task? And what if the transformer at step 4 is called asynchronously? Each split message is self-contained, and that would call for concurrency.
Spring configuration:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:int="http://www.springframework.org/schema/integration"
xmlns:int-file="http://www.springframework.org/schema/integration/file"
xmlns:int-jdbc="http://www.springframework.org/schema/integration/jdbc"
xmlns:beans="http://www.springframework.org/schema/beans"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans.xsd
http://www.springframework.org/schema/integration
http://www.springframework.org/schema/integration/spring-integration.xsd
http://www.springframework.org/schema/integration/file
http://www.springframework.org/schema/integration/file/spring-integration-file.xsd
http://www.springframework.org/schema/integration/jdbc
http://www.springframework.org/schema/integration/jdbc/spring-integration-jdbc.xsd
http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans.xsd">
<!-- input-files-from-folder -->
<int-file:inbound-channel-adapter id="outcomeIn"
directory="file:/in-outcome">
<int:poller id="poller" fixed-delay="2500" />
</int-file:inbound-channel-adapter>
<int:transformer input-channel="outcomeIn" output-channel="outcomesChannel" method="transform">
<beans:bean class="transformers.outcomeTransf" />
</int:transformer>
<!-- save source to db! -->
<int:splitter input-channel="outcomesChannel" output-channel="singleoutcomeChannel" method="splitMessage">
<beans:bean class="splitters.outcomeSplit" />
</int:splitter>
<int:transformer input-channel="singleoutcomeChannel" output-channel="jdbcChannel" method="transform">
<beans:bean class="transformers.SingleoutcomeToMap" />
</int:transformer>
<int-jdbc:stored-proc-outbound-channel-adapter
data-source="dataSource" channel="jdbcChannel" stored-procedure-name="insert_outcome"
ignore-column-meta-data="true">
<int-jdbc:sql-parameter-definitions ... />
<int-jdbc:parameter ... />
</int-jdbc:stored-proc-outbound-channel-adapter>
<bean id="dataSource" class="org.apache.commons.dbcp2.BasicDataSource" destroy-method="close" >
<property name="driverClassName" value="org.postgresql.Driver"/>
<property ... />
</bean>
</beans>
You are thinking the right way. When you have a PublishSubscribeChannel without an Executor, each subscriber waits until the previous one has finished its work. Therefore your splitter is not going to be called until everything is done on the DB side. Moreover, by default, when the first subscriber fails to handle a message (no DB connection?), all the others won't be called.
Another way to achieve similar behavior is the <request-handler-advice-chain> with an ExpressionEvaluatingRequestHandlerAdvice: https://docs.spring.io/spring-integration/docs/5.0.4.RELEASE/reference/html/messaging-endpoints-chapter.html#expression-advice
Any concurrency and multi-threading in the splitter's downstream flow is unrelated to the DB logic: no parallelism is going to happen until the DB performs its request properly.
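A minimal sketch of that publish-subscribe approach against your configuration: outcomesChannel is re-declared as a publish-subscribe channel, and the insert query, the T1 column names and the file_name/file_date headers are illustrative assumptions rather than something your flow already provides. Without a task-executor the subscribers run in order, so the splitter fires only after the JDBC adapter returns:
<int:publish-subscribe-channel id="outcomesChannel"/>
<!-- subscriber 1: store the file metadata (step 2) -->
<int-jdbc:outbound-channel-adapter channel="outcomesChannel" data-source="dataSource" order="1"
    query="insert into T1 (file_name, file_date) values (:headers[file_name], :headers[file_date])"/>
<!-- subscriber 2: split the file contents (step 3); called only after subscriber 1 has finished -->
<int:splitter input-channel="outcomesChannel" output-channel="singleoutcomeChannel"
    method="splitMessage" order="2">
    <beans:bean class="splitters.outcomeSplit"/>
</int:splitter>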

Batch processing in jdbc gateway

My setup (simplified for clarity) is the following:
<int:inbound-channel-adapter channel="in" expression="0">
<int:poller cron="0 0 * * * *"/>
<int:header name="snapshot_date" expression="new java.util.Date()"/>
<int:header name="correlationId" expression="T(java.util.UUID).randomUUID()"/>
<!-- more here -->
</int:inbound-channel-adapter>
<int:recipient-list-router input-channel="in" apply-sequence="true">
<int:recipient channel="data.source.1"/>
<int:recipient channel="data.source.2"/>
<!-- more here -->
</int:recipient-list-router>
<int:chain input-channel="data.source.1" output-channel="save">
<int-jdbc:outbound-gateway data-source="db1" max-rows-per-poll="0">
<int-jdbc:query>
select * from large_dataset
</int-jdbc:query>
</int-jdbc:outbound-gateway>
<int:header-enricher>
<int:header name="source" value="data.source.1"/>
</int:header-enricher>
</int:chain>
<int:chain input-channel="data.source.2" output-channel="save">
<int-jdbc:outbound-gateway data-source="db1" max-rows-per-poll="0">
<int-jdbc:query>
select * from another_large_dataset
</int-jdbc:query>
</int-jdbc:outbound-gateway>
<int:header-enricher>
<int:header name="source" value="data.source.2"/>
</int:header-enricher>
</int:chain>
<int:chain input-channel="save" output-channel="process">
<int:splitter expression="T(com.google.common.collect.Lists).partition(payload, 1000)"/>
<int:transformer>
<int-groovy:script location="transform.groovy"/>
</int:transformer>
<int:service-activator expression="#db2.insertData(payload, headers)"/>
<int:aggregator/>
</int:chain>
<int:chain input-channel="process" output-channel="nullChannel">
<int:aggregator/>
<int:service-activator expression="#finalProcessing.doSomething()"/>
</int:chain>
Let me explain the steps a little bit:
1. The poller is triggered by cron; the message is enriched with some information about this run.
2. The message is sent to multiple data-source chains.
3. Each chain extracts data from a large dataset (100k+ rows); the resultset message is marked with a source header.
4. The resultset is split into smaller chunks, transformed and inserted into db2.
5. After all data sources have been polled, some complex processing is initiated, using the information about the run.
This configuration does the job so far, but it is not scalable. The main problem is that I have to load the full dataset into memory first and pass it along the pipeline, which might cause memory issues.
My question is: what is the simplest way to have the resultset extracted from db1, pushed through the pipeline and inserted into db2 in small batches?
First of all, since version 4.0.4 Spring Integration's <splitter> supports an Iterator as payload to avoid memory overhead.
We have a test case for JDBC which shows that behaviour. But as you can see, it is based on the Spring Integration Java DSL and Java 8 lambdas. (Yes, it can be done even for older Java versions without lambdas.) Even if this case is appropriate for you, your <aggregator> should not be in-memory, because it collects all messages in the MessageStore.
That's the first case.
Another option is based on a paging algorithm, where your SELECT accepts a pair of WHERE params in your DB dialect. For Oracle it can be like: Paging with Oracle.
Here the pageNumber is some message header - :headers[pageNumber].
After that you do some trick with the <recipient-list-router> to send the SELECT result to the save channel and also to some other channel which increments the pageNumber header value and sends the message back to the data.source.1 channel, and so on. When the pageNumber goes beyond the data range, the <int-jdbc:outbound-gateway> stops producing results.
Something like that.
I don't say it is that easy, but it should be a starting point for you, at least.
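A rough, Oracle-flavored sketch of a paged variant of your data.source.1 chain. The page size of 1000, the id ordering column and the pageNumber header are assumptions, and the router/header-enricher loop that increments pageNumber and the stop condition are not shown:
<int:chain input-channel="data.source.1" output-channel="save">
    <int-jdbc:outbound-gateway data-source="db1" max-rows-per-poll="0">
        <int-jdbc:query>
            select * from (
                select t.*, rownum rn from (
                    select * from large_dataset order by id
                ) t
                where rownum &lt;= (:headers[pageNumber] + 1) * 1000
            )
            where rn &gt; :headers[pageNumber] * 1000
        </int-jdbc:query>
    </int-jdbc:outbound-gateway>
    <int:header-enricher>
        <int:header name="source" value="data.source.1"/>
    </int:header-enricher>
</int:chain>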

Spring batch admin remote partition steps running maximum 8 threads even though concurrency is 10?

I am using Spring Batch remote partitioning for my batch process. I am launching jobs using Spring Batch Admin.
I have set the inbound gateway consumer concurrency to 10, but the maximum number of partitions running in parallel is 8.
I want to increase the consumer concurrency to 15 later on.
Below is my configuration,
<task:executor id="taskExecutor" pool-size="50" />
<rabbit:template id="computeAmqpTemplate"
connection-factory="rabbitConnectionFactory" routing-key="computeQueue"
reply-timeout="${compute.partition.timeout}">
</rabbit:template>
<int:channel id="computeOutboundChannel">
<int:dispatcher task-executor="taskExecutor" />
</int:channel>
<int:channel id="computeInboundStagingChannel" />
<amqp:outbound-gateway request-channel="computeOutboundChannel"
reply-channel="computeInboundStagingChannel" amqp-template="computeAmqpTemplate"
mapped-request-headers="correlationId, sequenceNumber, sequenceSize, STANDARD_REQUEST_HEADERS"
mapped-reply-headers="correlationId, sequenceNumber, sequenceSize, STANDARD_REQUEST_HEADERS" />
<beans:bean id="computeMessagingTemplate"
class="org.springframework.integration.core.MessagingTemplate"
p:defaultChannel-ref="computeOutboundChannel"
p:receiveTimeout="${compute.partition.timeout}" />
<beans:bean id="computePartitionHandler"
class="org.springframework.batch.integration.partition.MessageChannelPartitionHandler"
p:stepName="computeStep" p:gridSize="${compute.grid.size}"
p:messagingOperations-ref="computeMessagingTemplate" />
<int:aggregator ref="computePartitionHandler"
send-partial-result-on-expiry="true" send-timeout="${compute.step.timeout}"
input-channel="computeInboundStagingChannel" />
<amqp:inbound-gateway concurrent-consumers="${compute.consumer.concurrency}"
request-channel="computeInboundChannel"
reply-channel="computeOutboundStagingChannel" queue-names="computeQueue"
connection-factory="rabbitConnectionFactory"
mapped-request-headers="correlationId, sequenceNumber, sequenceSize, STANDARD_REQUEST_HEADERS"
mapped-reply-headers="correlationId, sequenceNumber, sequenceSize, STANDARD_REQUEST_HEADERS" />
<int:channel id="computeInboundChannel" />
<int:service-activator ref="stepExecutionRequestHandler"
input-channel="computeInboundChannel" output-channel="computeOutboundStagingChannel" />
<int:channel id="computeOutboundStagingChannel" />
<beans:bean id="computePartitioner"
class="org.springframework.batch.core.partition.support.MultiResourcePartitioner"
p:resources="file:${spring.tmp.batch.dir}/#{jobParameters[batch_id]}/shares_rics/shares_rics_*.txt"
scope="step" />
<beans:bean id="computeFileItemReader"
class="org.springframework.batch.item.file.FlatFileItemReader"
p:resource="#{stepExecutionContext[fileName]}" p:lineMapper-ref="stLineMapper"
scope="step" />
<beans:bean id="computeItemWriter"
class="com.st.batch.foundation.writers.ComputeItemWriter"
p:symfony-ref="symfonyStepScoped" p:timeout="${compute.item.timeout}"
p:batchId="#{jobParameters[batch_id]}" scope="step" />
<step id="computeStep">
<tasklet transaction-manager="transactionManager">
<chunk reader="computeFileItemReader" writer="computeItemWriter"
commit-interval="${compute.commit.interval}" />
</tasklet>
</step>
<flow id="computeFlow">
<step id="computeStep.master">
<partition partitioner="computePartitioner"
handler="computePartitionHandler" />
</step>
</flow>
<job id="computeJob" restartable="true">
<flow id="computeJob.computeFlow" parent="computeFlow" />
</job>
compute.grid.size = 112
compute.consumer.concurrency = 10
Input files are split into 112 equal parts = compute.grid.size = total number of partitions.
Number of servers = 4.
There are 2 problems:
i) Even though I have set the concurrency to 10, the maximum number of threads running is 8.
ii) Some servers are slower because other processes run on them and some are faster, so I want to make sure step executions are distributed fairly, i.e. if the faster servers are done with their executions, the remaining executions in the queue should go to them. The work should not be distributed in round-robin fashion.
I know in RabbitMQ there are a prefetch count setting and an ack mode to distribute work fairly. For Spring Integration, the prefetch count is 1 and the ack mode is AUTO by default. But still some servers keep running more partitions even though other servers have been done for a long time. Ideally no server should be sitting idle.
Update:
One more thing I have now observed: some steps that run in parallel using a split (not distributed via remote partitioning) also run at most 8 in parallel. It looks like a thread-pool limit issue, but as you can see the taskExecutor has pool-size set to 50.
Is there anything in spring-batch/spring-batch-admin which limits the number of concurrently running steps?
2nd Update:
And, if there are 8 or more threads running in parallel processing items, Spring Batch Admin doesn't load. It just hangs. If I reduce the concurrency, Spring Batch Admin loads. I even tested it with concurrency 4 on one server and 8 on the other: Spring Batch Admin doesn't load if I use the URL of the server where 8 threads are running, but it works on the server where 4 threads are running.
Spring batch admin manager has below jobLauncher configuration,
<bean id="jobLauncher" class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
<property name="jobRepository" ref="jobRepository" />
<property name="taskExecutor" ref="jobLauncherTaskExecutor" />
</bean>
<task:executor id="jobLauncherTaskExecutor" pool-size="6" rejection-policy="ABORT" />
The pool size is 6 there; does it have anything to do with the above problem?
Or is there anything in Tomcat 7 which restricts the number of running threads to 8?
Are you using a database for the JobRepository?
During the execution, the batch framework persists step executions, and the number of connections to the JobRepository database can interfere with parallel step executions.
A concurrency of 8 makes me think you might be using BasicDataSource? If so, switch to something like DriverManagerDataSource and see.
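A minimal sketch of that suggestion; the JDBC driver/URL/credential property names are placeholders. DriverManagerDataSource opens a new connection per request instead of handing out a small fixed pool, so it won't cap parallel step executions the way an undersized BasicDataSource pool can:
<bean id="dataSource" class="org.springframework.jdbc.datasource.DriverManagerDataSource">
    <property name="driverClassName" value="${batch.jdbc.driver}"/>
    <property name="url" value="${batch.jdbc.url}"/>
    <property name="username" value="${batch.jdbc.user}"/>
    <property name="password" value="${batch.jdbc.password}"/>
</bean>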
Confused - you said "I have set the concurrency to 10" but then show compute.consumer.concurrency = 8. So it is working as configured. It is impossible to have only 8 consumer threads if the property is set to 10.
From Rabbit's perspective, all consumers are equal: if there are 10 consumers on a slow box and 10 consumers on a fast box, and you only have 10 partitions, it is possible that all 10 partitions will end up on the slow box.
RabbitMQ does not distribute work across servers; it distributes work across consumers only.
You might get better distribution by reducing the concurrency. You should also set the concurrency lower on the slower boxes.
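One way to do that, sketched under the assumption that each server loads its own property file (the file location and the context namespace import are assumptions): keep concurrent-consumers externalized, as your inbound gateway already does with ${compute.consumer.concurrency}, and give the slower boxes a smaller value (e.g. 4 there, 8 on the faster ones).
<!-- each server has its own compute.properties with its own concurrency value;
     the existing inbound gateway already reads ${compute.consumer.concurrency} -->
<context:property-placeholder location="file:${server.config.dir}/compute.properties"/>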
