Spring Integration reading many files

We have a requirement to parse lots of incoming files (arriving into a directory), process them, and put the outcome for each file onto AWS Kinesis.
The volume can be 60,000 files per day, and files can arrive every 15 seconds. Each file may contain about 1,000 entries.
Can Spring Integration handle this load?
Would there be any issues processing this kind of volume?
As the files come in on an inbound-channel-adapter, can we execute a service-activator for each file?
I believe we need to use task executors on the channels with a poller? Any examples?
Would task-executors call the service-activators in a multi-threaded manner?
Any pointers would be helpful. Links to any code examples would be nice.

This is not the kind of question one asks here on SO - too broad and too many questions in a single thread. I assume even if I answer all of them, you are going to ask more, and SO is not good for Q&A chat. Anyway:
Yes, Spring Integration can handle this. You can use a simple FileReadingMessageSource to poll the directory periodically.
Each file (the message payload) can be fed to the FileSplitter to parse it line by line.
After the splitter you can indeed use an ExecutorChannel to process those lines in parallel.
The Service Activator can be called in a multi-threaded environment as long as it is thread-safe.
In the end you can use KinesisMessageHandler to send records to AWS Kinesis. And yes, this one can be used from different threads as well.
You can find all of this in the Spring Integration Reference Manual. Some samples may help you as well. And the Spring Integration AWS extension is there for you too.
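Putting those pieces together, a minimal Java DSL sketch of the whole flow might look like this; the directory, poller interval, pool size, stream name, partition key, and parsing step are all illustrative assumptions, not a definitive implementation:

    import java.io.File;
    import java.util.concurrent.Executors;

    import com.amazonaws.services.kinesis.AmazonKinesisAsync;

    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.integration.aws.outbound.KinesisMessageHandler;
    import org.springframework.integration.config.EnableIntegration;
    import org.springframework.integration.dsl.IntegrationFlow;
    import org.springframework.integration.dsl.IntegrationFlows;
    import org.springframework.integration.dsl.Pollers;
    import org.springframework.integration.file.dsl.Files;

    @Configuration
    @EnableIntegration
    public class FileToKinesisFlow {

        @Bean
        public IntegrationFlow filesToKinesis(AmazonKinesisAsync amazonKinesis) {
            KinesisMessageHandler kinesis = new KinesisMessageHandler(amazonKinesis);
            kinesis.setStream("incoming-records");   // assumed stream name
            kinesis.setPartitionKey("file-records"); // assumed fixed partition key

            return IntegrationFlows
                    // poll the inbound directory for new files
                    .from(Files.inboundAdapter(new File("/data/inbound")),
                            e -> e.poller(Pollers.fixedDelay(1000).maxMessagesPerPoll(10)))
                    // emit one message per line of each file
                    .split(Files.splitter())
                    // hand each line off to a thread pool: the downstream steps
                    // now run in parallel on the executor's threads
                    .channel(c -> c.executor(Executors.newFixedThreadPool(10)))
                    // stand-in for the real (thread-safe) per-line parsing logic
                    .<String, String>transform(String::trim)
                    // publish each parsed record to Kinesis
                    .handle(kinesis)
                    .get();
        }
    }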

Related

Notifying golongpoll.SubscriptionManager of an event from kafka-go

I was writing a POC on long-polling using go.
I see the general package to be used is https://github.com/jcuga/golongpoll .
But suppose I want to publish an event to the golongpoll.SubscriptionManager from a general context, especially when there is a possibility that the long-poll API request is being served by one machine while the Kafka event for that particular consumer group is consumed by another instance in the cluster.
The examples given in the documentation do not cover such a scenario at all, even though it seems like a common one. One way I can think of is to have a distributed cache like Redis in between and have all the services poll it for changes? But that sounds a bit dumb to me.

Exchange files (up to many GB)

For my project, I have to create a file manager which aims at storing many files (from many locations) and exposing URLs to download them.
In a micro-service ecosystem (I am used to using Spring Boot), I wonder what the best way is to exchange such files, i.e. to send files to the file manager.
On the one hand, I have always thought it is better to exchange them asynchronously, so HTTP does not seem like a good choice. But maybe I am wrong.
Is it a good choice to split files into fragments (in order to reduce the number of bytes per part) and send each of them through something like RabbitMQ or Kafka? Or should I rather transfer entire files to a NAS or through FTP and let the file manager handle them? Or something else, like storing the bytes in a temporary database (probably not a good choice)...
The problem with fragmentation is that I have to implement logic to keep the fragments in order, which complicates the processing of the queues or topics.
IMO, never send actual files through a message broker.
First, set up an object storage system, for example S3 (with AWS, or locally with Ceph); then have the producer send the path to the file as a string, and have the consumer read that path and download the file.
If you want to collect files off of a NAS or FTP server, Apache NiFi is one tool that has connectors to systems like that.
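A rough sketch of that pattern (upload to S3, then publish only a pointer through the broker); the bucket, topic, and file path are illustrative assumptions, using the AWS SDK v2 and the plain Kafka client:

    import java.nio.file.Path;
    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class FilePointerProducer {

        public static void main(String[] args) {
            Path file = Path.of("/data/export/report.bin"); // assumed local file
            String key = "uploads/" + file.getFileName();

            // 1. Upload the (possibly multi-GB) file to object storage.
            try (S3Client s3 = S3Client.create()) {
                s3.putObject(PutObjectRequest.builder()
                                .bucket("file-manager-bucket") // assumed bucket
                                .key(key)
                                .build(),
                        RequestBody.fromFile(file));
            }

            // 2. Publish only the object key through the broker; the consumer
            //    reads the key and downloads the file from S3 itself.
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("incoming-files", key));
            }
        }
    }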
Based on my professional experience working with distributed systems (JMS based), to transfer huge content between participants:
a fragment approach should be used for the request-reply model, plus control signals (has-next, fragment counter);
a delta approach for updates.
To avoid corrupted data, a hash of the content can also be transmitted and checked in both scenarios.
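For the integrity check, a small sketch (assuming SHA-256; the actual hash algorithm is a design choice):

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HexFormat;

    public final class FragmentHashes {

        // Sender side: hash the fragment and transmit the hex digest with it.
        public static String sha256Hex(byte[] fragment) throws NoSuchAlgorithmException {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(fragment));
        }

        // Receiver side: recompute and compare before accepting the fragment.
        public static boolean verify(byte[] fragment, String expectedHex) throws NoSuchAlgorithmException {
            return sha256Hex(fragment).equalsIgnoreCase(expectedHex);
        }
    }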
But as mentioned in this e-mail thread, a better approach is to use FTP for these kinds of scenarios:
RabbitMQ should actually not be used for big file transfers, or only with great care, fragmenting the files into smaller separate messages.
When running a single broker instance you'd still be safe, but in a clustered setup very big messages will break the cluster.
Clustered nodes are connected via one TCP connection, which must also transport an (Erlang) heartbeat. If your big message takes more time to transfer between nodes than the heartbeat timeout (anywhere between ~20-45 seconds if I'm correct), the cluster will break and your message is lost.
The preferred architecture for file transfer over AMQP is to just send a message with a link to a downloadable resource and let the file transfer be handled by a specialized protocol like FTP :-)
Hope it helps.

Add Batch capabilities to Integration flow

I currently have a Spring Integration flow that works fine (see link for diagram). I would like to add Batch on top of my current configuration to allow for retry with exponential back-off, circuit breaker pattern, and persisting jobs to the database for restart.
The Integration flow consists of a Gateway that takes a Message<MyObj>, which is eventually routed to a Transformer that converts the Message<MyObj> to a Message<String>. An Aggregator then takes the Message<String> and eventually releases a concatenated Message<String> (using both a size release-strategy and a MessageGroupStoreReaper with a timeout). The concatenated String is then the payload of the file uploaded via the SFTP outbound-channel-adapter.
I have searched, read through the docs, looked at tons of examples, and I can't figure out how to encapsulate the last step of the process in a Batch Job. I need the ability to retry uploading the String (as the payload of a file) if there is an SFTP connection issue or another Exception is thrown during the upload. I also want to be able to restart (using a database-backed JobRepository) in case of some failure, so I don't think using Retry Advice alone is sufficient.
Please explain and help me understand how to wire the pieces together and which to use (job-launching-gateway, MessageToJobRequest Transformer, ItemReader, ItemWriter??). I'm also unsure how to access each Message<String> and send it to the SFTP channel-adapter inside a Job, Step, or Tasklet.
Current flow: [diagram]
First of all, let's take a look at how we can meet your requirements without Batch.
The <int-sftp:outbound-channel-adapter> has a <request-handler-advice-chain>, where you can configure a RequestHandlerRetryAdvice and a RequestHandlerCircuitBreakerAdvice.
To achieve the restartable option, you can make the input channel of that adapter a persistent queue with a message-store.
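As a sketch, the retry advice with exponential back-off could be configured like this in Java config (intervals and limits are illustrative, not recommendations); the resulting bean is then referenced from the adapter's <request-handler-advice-chain>:

    import org.springframework.context.annotation.Bean;
    import org.springframework.integration.handler.advice.RequestHandlerRetryAdvice;
    import org.springframework.retry.backoff.ExponentialBackOffPolicy;
    import org.springframework.retry.policy.SimpleRetryPolicy;
    import org.springframework.retry.support.RetryTemplate;

    public class RetryAdviceConfig {

        @Bean
        public RequestHandlerRetryAdvice retryAdvice() {
            ExponentialBackOffPolicy backOff = new ExponentialBackOffPolicy();
            backOff.setInitialInterval(1000); // first retry after 1s
            backOff.setMultiplier(2.0);       // then 2s, 4s, 8s, ...
            backOff.setMaxInterval(30000);

            RetryTemplate retryTemplate = new RetryTemplate();
            retryTemplate.setBackOffPolicy(backOff);
            retryTemplate.setRetryPolicy(new SimpleRetryPolicy(5)); // give up after 5 attempts

            RequestHandlerRetryAdvice advice = new RequestHandlerRetryAdvice();
            advice.setRetryTemplate(retryTemplate);
            return advice;
        }
    }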
Now about Batch.
To start a Job from the Integration flow you should write a MessageToJobRequest transformer and use <batch-int:job-launching-gateway> after that. Here, of course, you can put your payload into the jobParameters.
To send a message from Job to some channel (e.g. to sftp adapter) you can use org.springframework.batch.integration.chunk.ChunkMessageChannelItemWriter.
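A sketch of such a MessageToJobRequest transformer (the job wiring and parameter names are illustrative assumptions):

    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.JobParametersBuilder;
    import org.springframework.batch.integration.launch.JobLaunchRequest;
    import org.springframework.integration.annotation.Transformer;
    import org.springframework.messaging.Message;

    public class PayloadToJobRequest {

        private final Job uploadJob;

        public PayloadToJobRequest(Job uploadJob) {
            this.uploadJob = uploadJob;
        }

        @Transformer
        public JobLaunchRequest toRequest(Message<String> message) {
            JobParametersBuilder params = new JobParametersBuilder();
            // carry the aggregated String into the job as a parameter
            params.addString("payload", message.getPayload());
            // unique parameter so every launch creates a new JobInstance
            params.addLong("launch.time", System.currentTimeMillis());
            return new JobLaunchRequest(uploadJob, params.toJobParameters());
        }
    }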
Read more here: http://docs.spring.io/spring-batch/reference/html/springBatchIntegration.html

Spring Integration JMS Threadsafe

I'm pretty new to Spring Integration and still trying to get my head around it. Right now I'm just trying to understand if the example I've found here is actually safe across multiple threads:
https://github.com/spring-projects/spring-integration-samples/blob/master/basic/jms/src/test/java/org/springframework/integration/samples/jms/ChannelAdapterDemoTest.java
My use case is as follows:
Send request to queue with JMS Reply-to as a temporary queue
Wait for response to be received on the temporary queue
Need this to happen synchronously within a method -- I don't want to split it up and make it asynchronous across several methods
Will the above example work for this? If not, am I barking up the wrong tree?
Thanks in advance.
That sample is pretty simple; it just sends the message to stdout, so yes, it's perfectly thread-safe.
For the request/reply scenario you are talking about, you need to use a <gateway/> - see the other example in that sample project. In that case, you can see that the message is handled by 'demoBean', which, again, is perfectly thread-safe.
For a real application, the thread-safety depends on the code in the services invoked by the flow receiving the message.
If you wish, you can use Spring Integration on the client side too (with an outbound gateway).
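A sketch of that client-side arrangement in Java config (channel and destination names are assumptions); with no reply destination configured, the JmsOutboundGateway uses a TemporaryQueue for the reply, which matches the temporary-queue requirement above:

    import javax.jms.ConnectionFactory;

    import org.springframework.context.annotation.Bean;
    import org.springframework.integration.annotation.Gateway;
    import org.springframework.integration.annotation.MessagingGateway;
    import org.springframework.integration.annotation.ServiceActivator;
    import org.springframework.integration.jms.JmsOutboundGateway;

    @MessagingGateway
    public interface RequestReplyGateway {

        // blocks the calling thread until the reply arrives (or the timeout
        // elapses), keeping request/reply synchronous within one method call
        @Gateway(requestChannel = "toJms", replyTimeout = 5000)
        String sendAndReceive(String request);
    }

    class JmsGatewayConfig {

        @Bean
        @ServiceActivator(inputChannel = "toJms")
        public JmsOutboundGateway jmsOutboundGateway(ConnectionFactory connectionFactory) {
            JmsOutboundGateway gateway = new JmsOutboundGateway();
            gateway.setConnectionFactory(connectionFactory);
            gateway.setRequestDestinationName("requests"); // assumed request queue
            return gateway;
        }
    }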

Spring Batch or JMS for long running jobs

I have the problem that I have to run very long-running processes from my web service, and now I'm looking for a good way to handle the results. The scenario: a user starts such a long-running process via the UI. He then gets the message that his request was accepted and that he should come back some time later. So there's no need to display the status of his request or anything like that. I'm just looking for a way to handle the result of the long-running process properly. Since the processes are external programs, my application server is not aware of them; therefore I have to wait for these programs to terminate. Of course I don't want to use EJBs for this, because they would block for as long as no result is available. Instead I thought of using JMS or Spring Batch. Has anyone had the same problem, or advice on which solution would be better?
It really depends on what forms of communication your external programs have available. JMS is a very good approach and immediately available in your app server, but it might not be the best option if your external program is a long-running DB query which dumps the result in a text file...
The main advantage of Spring Batch over "just" using JMS as an asynchronous communications channel is the transactional properties, allowing the infrastructure to retry failed jobs, group jobs together, and so on. Without knowing more about your specific setup, it is hard to give detailed advice.
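To make the retry point concrete, a sketch of a fault-tolerant Spring Batch step (reader, writer, chunk size, and limits are illustrative assumptions):

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.context.annotation.Bean;

    public class RetryStepConfig {

        @Bean
        public Step externalResultStep(StepBuilderFactory steps,
                                       ItemReader<String> resultReader,
                                       ItemWriter<String> resultWriter) {
            return steps.get("externalResultStep")
                    .<String, String>chunk(10)
                    .reader(resultReader)
                    .writer(resultWriter)
                    .faultTolerant()
                    .retry(Exception.class) // retry transient failures transactionally
                    .retryLimit(3)
                    .build();
        }
    }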
Cheers,
I had a similar design requirement: users were sending XML files and I had to generate documents from them. Using JMS in this case is advantageous, since you can always add new instances of these processes, which can consume and execute the jobs in parallel.
You can use a timer task to check the status of, or monitor, these processes. Also, you can publish a message to a JMS queue once the processes are completed.
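A sketch of that completion notification (queue name and message format are illustrative assumptions):

    import org.springframework.jms.core.JmsTemplate;

    public class ExternalProcessWatcher {

        private final JmsTemplate jmsTemplate;

        public ExternalProcessWatcher(JmsTemplate jmsTemplate) {
            this.jmsTemplate = jmsTemplate;
        }

        public void runAndNotify(String jobId, ProcessBuilder processBuilder) throws Exception {
            Process process = processBuilder.start();
            int exitCode = process.waitFor(); // block on a worker thread, not in the web tier
            // let interested consumers (UI back end, monitors) know the result
            jmsTemplate.convertAndSend("job.results", jobId + ":" + exitCode);
        }
    }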
