Spring Batch: Remote Chunking & Partitioning without using JMS

I am new to Spring Batch. I want to run Spring Batch jobs on multiple servers using remote chunking and remote partitioning, but without JMS.
I would like to use HTTP Invoker or RMI instead.
However, all the examples of remote chunking and partitioning that I have seen use JMS.
I can't find any examples that use HTTP Invoker or RMI.
Is this possible at all?
English is not my native language, so please excuse any errors on my part.

You can use any form of communication you want for remote partitioning. However, remote chunking does require persistent communication which is why JMS is typically used.
The reason you see JMS for remote partitioning is that it is easier to configure a clustered environment with JMS than with HTTP: everyone (the master and all the slaves) only needs to know where the queue is. Using HTTP as the communication mechanism requires the master and slaves to know a lot more. The master needs to know how to evenly distribute the partitions over all the slaves and where to send the requests for each slave. All the slaves also need to know where the master is. JMS's centralized distribution model also allows you to add new slaves dynamically during processing, whereas HTTP would require some way to register a new slave with the master.
The reason persistent communication is required for remote chunking is that there is nothing in the remote chunking model to prevent an item from being processed twice, since the item itself is sent over the wire. Remote partitioning, by contrast, only sends descriptions of the data across, and the job repository prevents data from being processed twice.
You can read more about the difference between the two in my answer here: Difference between spring batch remote chunking and remote partitioning
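As a rough illustration of the extra bookkeeping an HTTP-based master takes on, the sketch below splits an ID range into partition descriptions and assigns them round-robin to a fixed list of worker URLs. It is plain Java, not Spring Batch's own interfaces; the class name, the `minId`/`maxId` keys and the worker URLs are all made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not Spring Batch API, only the bookkeeping an
// HTTP-based master would have to do itself.
class HttpPartitionPlanner {

    // A partition is only a description of the data (an ID range), not the data itself.
    static List<Map<String, Long>> partition(long minId, long maxId, int gridSize) {
        List<Map<String, Long>> partitions = new ArrayList<>();
        long total = maxId - minId + 1;
        long size = (total + gridSize - 1) / gridSize; // ceiling division
        for (long start = minId; start <= maxId; start += size) {
            Map<String, Long> range = new LinkedHashMap<>();
            range.put("minId", start);
            range.put("maxId", Math.min(start + size - 1, maxId));
            partitions.add(range);
        }
        return partitions;
    }

    // Round-robin assignment: the master must know every worker's address up front.
    static Map<String, List<Map<String, Long>>> assign(
            List<Map<String, Long>> partitions, List<String> workerUrls) {
        Map<String, List<Map<String, Long>>> plan = new LinkedHashMap<>();
        for (String url : workerUrls) {
            plan.put(url, new ArrayList<>());
        }
        for (int i = 0; i < partitions.size(); i++) {
            plan.get(workerUrls.get(i % workerUrls.size())).add(partitions.get(i));
        }
        return plan;
    }

    public static void main(String[] args) {
        // Assumed worker endpoints, for illustration only.
        List<String> workers = List.of(
                "http://worker-1:8080/partition",
                "http://worker-2:8080/partition");
        assign(partition(1, 100, 4), workers)
                .forEach((url, ranges) -> System.out.println(url + " -> " + ranges));
    }
}
```

With a queue, the `workerUrls` list disappears: the master only publishes the partition descriptions and whichever workers are listening pick them up.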

Related

Microservice failure Scenario

I am working on a microservice architecture. One of my services is exposed to a source system, which uses it to post data. This microservice publishes the data to Redis using Redis pub/sub, and the data is then consumed by a couple of other microservices.
Now, if one of the consuming microservices is down and cannot process the data from Redis pub/sub, I have to retry with the published data once that microservice comes back up. The source cannot push the data again and manual intervention is not possible, so I thought of 3 approaches:
1) Additionally use Redis as a data store, for storing and retrieving the published data.
2) Use a database to store the data before publishing. I have many source and target microservices that use Redis pub/sub. With this approach, every time I have to insert the request into the DB first and then its response status. I would also have to use a shared database, which adds a couple more exception-handling cases and does not look very efficient to me.
3) Use Kafka in place of Redis pub/sub. Since traffic is low I chose Redis pub/sub, and it is not feasible to change now.
In both of the above cases, I have to use a scheduler, and there is a time limit within which I have to retry, otherwise subsequent requests will fail.
Is there any other way to handle the above cases?
For point 2:
- Store the data in the DB.
- Create a daemon process that processes the data from the table.
- This daemon process can be configured to suit your needs.
- The daemon process polls the DB and publishes any data it finds, then deletes the data once it has been published.
I have not used this in a microservice architecture, but I have seen this approach work efficiently when communicating with third-party services.
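The store, poll, publish and delete steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions: an in-memory queue stands in for the DB table, a `Predicate` stands in for the publisher (returning true on success), and the `OutboxPoller` name is invented for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: the Deque plays the role of the DB table and the
// Predicate plays the role of the publish call (true means it succeeded).
class OutboxPoller {
    private final Deque<String> outbox = new ArrayDeque<>();
    private final Predicate<String> publisher;

    OutboxPoller(Predicate<String> publisher) {
        this.publisher = publisher;
    }

    // Store the data before any attempt to publish it.
    void store(String payload) {
        outbox.addLast(payload);
    }

    // One polling pass. A row is deleted only after a successful publish,
    // so anything that failed stays in the table for the next pass.
    int pollOnce() {
        int published = 0;
        while (!outbox.isEmpty()) {
            String next = outbox.peekFirst();
            if (!publisher.test(next)) {
                break; // target is down: stop and retry on the next scheduled pass
            }
            outbox.removeFirst();
            published++;
        }
        return published;
    }

    int pending() {
        return outbox.size();
    }

    public static void main(String[] args) {
        List<String> delivered = new ArrayList<>();
        boolean[] targetUp = {false};
        OutboxPoller poller = new OutboxPoller(p -> targetUp[0] && delivered.add(p));
        poller.store("order-1");
        poller.store("order-2");
        System.out.println(poller.pollOnce() + " published, " + poller.pending() + " pending");
        targetUp[0] = true; // the consuming service has come back up
        System.out.println(poller.pollOnce() + " published, " + poller.pending() + " pending");
    }
}
```

In a real daemon, `pollOnce` would be called from a scheduler, and the polling interval would have to be shorter than the retry deadline mentioned in the question.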
At the outset, as you mentioned, there do indeed seem to be only three possibilities.
This is one of those situations where you want a handshake from the service both after pushing and after processing. To accomplish that, a middleware queuing system would be the right choice.
Although it is a bit more complex to set up, you can use Kafka to stream the data. Configuring the producer and the consumer groups properly will help you do the job smoothly.
Using a DB to store the data would be overkill, considering your situation ("this data is to be processed and to be persisted").
Alternatively, storing the data in Redis and reading it in a cron job or scheduled job would make things much simpler. Once the job has run successfully, you can remove the data from the cache and thus save Redis memory.
If you can comment further on the architecture and the implementation, I can update my answer accordingly. :)

Exchange files (up to many GB)

For my project, I have to create a file manager that stores many files (from many locations) and exposes URLs to download them.
In a microservice ecosystem (I usually work with Spring Boot), I wonder what the best way is to exchange such files, that is, to send files to the file manager.
On the one hand, I have always thought it is better to exchange them asynchronously, so HTTP does not seem like a good choice. But maybe I am wrong.
Is it a good choice to split files into fragments (to reduce the number of bytes in each part) and send each of them through something like RabbitMQ or Kafka? Or should I rather transfer entire files to a NAS or over FTP and let the file manager handle them? Or something else, for example storing the bytes in a temporary database (maybe not a good choice)?
The problem with fragmentation is that I have to implement logic to keep track of the order of the fragments, which complicates the processing of queues or topics.
IMO, never send actual files through a message broker.
First, set up an object storage system, for example S3 (on AWS, or locally with Ceph). Then have the producer send the path to the file as a string, and have the consumer read that path and download the file.
If you want to collect files from a NAS or FTP, Apache NiFi is one tool that has connectors for systems like that.
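To make the "send the path, not the file" idea concrete, here is a minimal sketch. It assumes nothing about any real client library: a `Map` stands in for the object store, a `Queue` stands in for the broker topic, and the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: the Map plays the role of the object store (S3, Ceph, ...)
// and the Queue plays the role of the broker topic.
class FileReferenceExchange {
    private final Map<String, byte[]> objectStore = new HashMap<>();
    private final Queue<String> topic = new ArrayDeque<>();

    // Producer side: upload the bytes to storage, then publish only the path.
    void send(String path, byte[] content) {
        objectStore.put(path, content);
        topic.add(path);
    }

    // Consumer side: read the path from the broker, then download from storage.
    byte[] receive() {
        String path = topic.poll();
        return path == null ? null : objectStore.get(path);
    }

    // Size in bytes of the next message waiting on the broker, or -1 if there is none.
    int nextMessageSize() {
        String path = topic.peek();
        return path == null ? -1 : path.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        FileReferenceExchange exchange = new FileReferenceExchange();
        byte[] bigFile = new byte[50 * 1024 * 1024]; // 50 MB of zeros as a stand-in file
        exchange.send("uploads/report.bin", bigFile);
        System.out.println("message size on broker: " + exchange.nextMessageSize() + " bytes");
        System.out.println("downloaded: " + exchange.receive().length + " bytes");
    }
}
```

The broker only ever carries a message of a few bytes, whatever the size of the file.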
Based on my professional experience working with distributed systems (JMS based), to transfer huge content between participants:
- a fragment approach should be used for the request-reply model, together with control signals (has next, fragment counter);
- a delta approach should be used for updates.
To avoid corrupt data, the result of a hash function can also be transmitted and checked in both scenarios.
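A minimal sketch of the fragment approach with its control signals and hash check might look like this. The `Fragment` layout, the choice of SHA-256 and all names are assumptions made for the illustration, not part of any JMS API.

```java
import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of fragmenting content with a counter, a has-next flag
// and a hash over the whole content.
class FragmentTransfer {

    // One message on the wire: a slice of the payload plus the control signals.
    static final class Fragment {
        final int counter;
        final boolean hasNext;
        final byte[] payload;

        Fragment(int counter, boolean hasNext, byte[] payload) {
            this.counter = counter;
            this.hasNext = hasNext;
            this.payload = payload;
        }
    }

    static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every Java platform
        }
    }

    // Sender side: cut the content into numbered fragments.
    static List<Fragment> split(byte[] content, int fragmentSize) {
        List<Fragment> fragments = new ArrayList<>();
        int counter = 0;
        for (int offset = 0; offset < content.length; offset += fragmentSize) {
            int end = Math.min(offset + fragmentSize, content.length);
            fragments.add(new Fragment(counter++, end < content.length,
                    Arrays.copyOfRange(content, offset, end)));
        }
        return fragments;
    }

    // Receiver side: check order and completeness, then compare the hash.
    static byte[] reassemble(List<Fragment> fragments, byte[] expectedHash) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int expectedCounter = 0;
        boolean complete = false;
        for (Fragment f : fragments) {
            if (f.counter != expectedCounter++) {
                throw new IllegalStateException("fragment missing or out of order: " + f.counter);
            }
            out.write(f.payload, 0, f.payload.length);
            complete = !f.hasNext;
        }
        if (!complete) {
            throw new IllegalStateException("last fragment not received");
        }
        byte[] content = out.toByteArray();
        if (!MessageDigest.isEqual(sha256(content), expectedHash)) {
            throw new IllegalStateException("hash mismatch: content is corrupt");
        }
        return content;
    }

    public static void main(String[] args) {
        byte[] content = "a fairly large payload, in spirit".getBytes();
        List<Fragment> fragments = split(content, 8);
        byte[] restored = reassemble(fragments, sha256(content));
        System.out.println(fragments.size() + " fragments, restored " + restored.length + " bytes");
    }
}
```

The counter and has-next flag are exactly the ordering logic the question worries about; the hash catches corruption that the counters alone would miss.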
But as mentioned in this e-mail thread, a better approach is to use FTP for this kind of scenario:
RabbitMQ should actually not be used for big file transfers or only
with great care and fragmenting the files into smaller separate
messages.
When running a single broker instance, you'd still be safe, but in a
clustered setup, very big messages will break the cluster.
Clustered nodes are connected via 1 tcp connection, which must also
transport a (erlang) heartbeat. If your big message takes more time to
transfer between nodes than the heartbeat timeout (anywhere between
~20-45 seconds if I'm correct), the cluster will break and your
message is lost.
The preferred architecture for file transfer over amqp is to just send
a message with a link to a downloadable resource and let the file
transfer be handle by specialized protocol like ftp :-)
Hope it helps.

Use Hazelcast Executor Service to be executed on clients

In all the documentation and all the Google search results I have seen, the Hazelcast executor service is described as executing on "members".
I wonder whether it is also possible to have things executed on Hazelcast clients?
The distributed executor service is intended to run processing where the data is hosted, on the servers. This is a similar idea to a stored procedure: run the processing where the data lives and save on data transfer.
In general, you can't run a Java Runnable or Callable on the clients as the clients may not be Java.
Also, the clients don't host any data, so they'd have to fetch what data they need from the servers potentially.
If you want something to run on all or some connected clients, you could implement this yourself using the publish/subscribe mechanism. A payload could be sent to an ITopic with the necessary execution parameters, and clients listening can act on the message.
You can also create a Near Cache on the client side and use the JDK's ExecutorService, which runs in your local JVM application.

Spring boot applications high availability

We have a microservice developed with Spring Boot. A couple of the functionalities it implements are:
1) A scheduler that, at a specified time, triggers a file download using WebHDFS and processes the file. Once the data is processed, it sends an email to users with a summary of the processing.
2) Reading messages from Kafka and, once the data is read, sending an email to users.
We are now planning to make this application highly available, in either an active-active or an active-passive setup. The problem we are facing is that if both instances of the application are running, both of them will try to download the file or read the data from Kafka, process it and send emails. How can this be avoided? That is, how can we ensure that only one instance triggers the download and processes it?
Please let me know if there is a known solution for this kind of scenario, as it seems to be common in most projects. Is a master-slave or leader-election approach the correct solution?
Thanks
Let the service download the file, extract the information and publish it via Kafka.
Check beforehand whether the information has already been processed by querying Kafka or a local DB.
You could also publish a DataProcessed event that triggers the EmailService, which sends the corresponding email.
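The "check beforehand" step could be sketched as an atomic claim on a key, so that only the first instance to claim a file goes on to process it. This is only an illustration: a `ConcurrentHashMap` stands in for a store that both instances can see (for example a table with a unique key), and the `ProcessedRegistry` name and the key format are made up.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the map plays the role of a store shared by all instances;
// putIfAbsent plays the role of an insert that fails on a duplicate key.
class ProcessedRegistry {
    private final Map<String, String> processedBy = new ConcurrentHashMap<>();

    // Returns true only for the first instance that claims the key.
    boolean tryClaim(String key, String instanceId) {
        return processedBy.putIfAbsent(key, instanceId) == null;
    }

    // Which instance claimed the key, or null if nobody has yet.
    String owner(String key) {
        return processedBy.get(key);
    }

    public static void main(String[] args) {
        ProcessedRegistry registry = new ProcessedRegistry();
        String key = "daily-file/2024-01-01"; // assumed key format
        for (String instance : new String[] {"instance-A", "instance-B"}) {
            if (registry.tryClaim(key, instance)) {
                System.out.println(instance + " downloads, processes and sends the email");
            } else {
                System.out.println(instance + " skips: already claimed by " + registry.owner(key));
            }
        }
    }
}
```

Checking and claiming in one atomic step matters here: a separate "query, then insert" would let both instances pass the check at the same moment.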

How to make RocketMQ and RocketMQ communicate directly?

I have two network environments (NETWORK-A and NETWORK-B). I have deployed rocketmq-a in NETWORK-A and rocketmq-b in NETWORK-B. How can rocketmq-a and rocketmq-b communicate directly?
According to your comment, you have two RocketMQ clusters, and a message sent to one cluster should be replicated to the other.
So this is message replication.
You have two choices:
- Implement a send-message hook.
- Use a message store plugin that extends AbstractPluginMessageStore and load it through the broker configuration.
With both of them, you need to implement the replication yourself.
However, if you put them in the same broker group, it is very easy.
Just make rocketmq-b a slave of rocketmq-a and deploy them in different machine rooms.
rocketmq-b will then serve only read operations and will always replicate the data from the master.
