Clustering the Batch Job & distributing the data load - cluster-computing

I have a batch processing project that I want to cluster across 5 machines.
Suppose the input source is a database with 1000 records.
I want to split these records equally, i.e. 200 records per instance of the batch job.
How can we distribute the workload?

Given below is a workflow you may want to follow.
Assumptions:
You have the necessary Domain Objects corresponding to the DB table.
You have a batch flow configured with a
reader/writer/tasklet mechanism.
You have a messaging system (messaging queues are a great way to
make distributed applications talk to each other).
The input object is a message on the queue containing a set of
input records, split to the required size.
The result object is a message on the queue containing the processed
records, or the result value (if scalar).
The chunkSize is configured in a property file; here it is 200.
Design:
In the application,
Configure a queueReader to read from a queue
Configure a queueWriter to write to a queue
If using the task/tasklet mechanism, configure different queues to carry the input/result objects.
Configure a DB reader which reads from a DB
Logic in the DBReader
Read records from the DB one by one, maintaining a count. Whenever
count % chunkSize == 0, write the accumulated records to an
inputMessage object and write that object to the queue.
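The counting logic above can be sketched in plain Java (an illustrative sketch only: DbChunkReader and the in-memory BlockingQueue stand in for your actual DB reader and messaging system):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: read records one by one, buffer them, and emit
// a chunk every time the buffer reaches chunkSize (here 200).
public class DbChunkReader {

    // Splits the records into chunks of chunkSize; a trailing partial
    // chunk (count % chunkSize != 0) is flushed at the end.
    public static <T> List<List<T>> split(List<T> records, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        List<T> buffer = new ArrayList<>(chunkSize);
        for (T record : records) {
            buffer.add(record);
            if (buffer.size() == chunkSize) {   // count % chunkSize == 0
                chunks.add(buffer);
                buffer = new ArrayList<>(chunkSize);
            }
        }
        if (!buffer.isEmpty()) {
            chunks.add(buffer);                 // flush the remainder
        }
        return chunks;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<List<Integer>> inputQueue = new LinkedBlockingQueue<>();
        List<Integer> records = new ArrayList<>();
        for (int i = 1; i <= 1000; i++) records.add(i); // stand-in for the DB rows
        for (List<Integer> chunk : split(records, 200)) {
            inputQueue.put(chunk);              // stand-in for the messaging system
        }
        System.out.println(inputQueue.size()); // 5 chunks of 200 records each
    }
}
```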
Logic in queueReader
Read the messages one by one
For each message, do the necessary processing.
Create a resultObject
Logic in the queueWriter
Read the resultObject (usually batch frameworks provide a way to
ensure that writers are able to read the output from readers)
If any applicable processing or downstream interaction is needed,
add it here.
Write the result object to the outputQueue.
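Putting the queueReader/queueWriter side together, one deployed instance behaves roughly like the worker loop below (a sketch with in-memory queues and placeholder processing; a real instance would block on the messaging system instead of polling with a timeout):

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical worker: the queueReader/queueWriter pair of one deployed
// instance. Several instances can run this loop against the same queues.
public class ChunkWorker {

    // Drains chunks from the input queue, processes each record, and
    // writes one result object per chunk to the output queue.
    public static int run(BlockingQueue<List<Integer>> inputQueue,
                          BlockingQueue<List<Integer>> outputQueue)
            throws InterruptedException {
        int processedChunks = 0;
        List<Integer> chunk;
        // Stop when the queue stays empty briefly (a real system would
        // block indefinitely or use a shutdown signal instead).
        while ((chunk = inputQueue.poll(100, TimeUnit.MILLISECONDS)) != null) {
            List<Integer> result = chunk.stream().map(r -> r * 2).toList(); // placeholder processing
            outputQueue.put(result);   // the "resultObject" for this chunk
            processedChunks++;
        }
        return processedChunks;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<List<Integer>> in = new LinkedBlockingQueue<>();
        BlockingQueue<List<Integer>> out = new LinkedBlockingQueue<>();
        in.put(List.of(1, 2, 3));
        in.put(List.of(4, 5, 6));
        System.out.println(run(in, out) + " chunks -> " + out.size() + " results");
    }
}
```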
Deployment
Package once, deploy multiple instances. For better performance, keep the chunkSize small enough to enable fast processing. The queues are managed by the messaging system (the systems available on the market provide ways to monitor the queues), where you will be able to see the message flow.

Related

Spring Batch Remote Chunking Chunk Response

I have implemented Spring Batch Remote Chunking with Kafka, with both the manager and worker configurations. I want to send some DTO or object in the ChunkResponse from the worker side to the manager and do some processing once I receive the response. Is there any way to achieve this? I also want to know the count of records processed after each chunk is processed on the worker side, as I have to update the database frequently with the count.
I want to send some DTO or object in the ChunkResponse from the worker side to the manager and do some processing once I receive the response. Is there any way to achieve this?
I'm not sure the remote chunking feature was designed to send items from the manager to workers and back again. The ChunkResponse is what the manager is expecting from workers, and I see no way you can send processed items in it (except perhaps by serializing the item in the ChunkResponse#message field, or storing it in the execution context, both of which are not good ideas).
I want to know the count of records processed after each chunk is processed from worker side and I have to update the database frequently with count.
The StepContribution is what you are looking for here. It holds all the counts (read count, write count, etc.). You can get the step contribution from the ChunkResponse on the manager side and do whatever is required with the result.
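The manager-side pattern looks roughly like this. Since StepContribution and ChunkResponse are Spring Batch classes, the sketch below uses a stand-in ChunkCounts record carrying the same read/write counts, and onChunkResponse is a hypothetical hook where you would update the database:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class ManagerSideAggregator {

    // Stand-in for the counts a Spring Batch StepContribution carries
    // (read count, write count, ...), extracted from each ChunkResponse.
    public record ChunkCounts(long readCount, long writeCount) {}

    private final AtomicLong totalWritten = new AtomicLong();

    // Called once per ChunkResponse received from a worker; in the real
    // application this is where you would update the database with the count.
    public long onChunkResponse(ChunkCounts counts) {
        return totalWritten.addAndGet(counts.writeCount());
    }

    public static void main(String[] args) {
        ManagerSideAggregator aggregator = new ManagerSideAggregator();
        long total = 0;
        for (ChunkCounts c : List.of(new ChunkCounts(200, 200),
                                     new ChunkCounts(200, 198))) { // say 2 items were filtered
            total = aggregator.onChunkResponse(c);
        }
        System.out.println("records written so far: " + total);
    }
}
```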

Best way to track/trace a JSON Object (a time series data) as it flows through a system of microservices on a IOT platform

We are working on an IoT platform which ingests many device parameter
values (time series) every second from many devices. Once ingested,
each JSON payload (a batch of multiple parameter values captured at a
particular instant) flows through many downstream microservices in an
event-driven way. What is the best way to track the JSON as it flows
through them?
We use Spring Boot technology predominantly, and all the services are
containerised.
E.g. Option 1 - Is associating a UUID with each object and then
idempotently updating its state in Redis as each microservice
processes it ideal? The problem is that each microservice is now tied
to Redis, and we have seen Redis performance degrade as the number of
API calls to Redis increases, since it is single-threaded (we can
scale it out, though).
Option 2 - Zipkin?
Note: We use Kafka/RabbitMQ to process the messages in a distributed
way, as mentioned here. My question is about a strategy to track each
of these messages and their status (to enable replay if needed, to
attain exactly-once delivery). Let's say message1 is being processed
by Service A, Service B, and Service C. We are now having trouble
tracking whether the message failed to get processed at Service B or
Service C, as we get a lot of messages.
A better approach would be to use Kafka instead of Redis.
Create a topic for every microservice and keep moving the packet from
one topic to another after processing.
topic(raw-data) - |MS One| - topic(processed-data-1) - |MS Two| - topic(processed-data-2) ... etc
Keep appending the results to the same object and keep moving it down the line, until every microservice has processed it.
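The append-as-you-go idea can be sketched like this (illustrative only; the map stands in for the packet travelling between topics, and the stage names are made up):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the topic-per-stage idea: each "microservice"
// appends its result to the same packet before forwarding it to the
// next topic, so the packet itself records how far it has travelled.
public class TopicChain {

    // The packet: the original payload plus one entry per completed stage.
    public static Map<String, String> process(Map<String, String> packet,
                                              String stage, String result) {
        Map<String, String> enriched = new LinkedHashMap<>(packet);
        enriched.put(stage, result); // idempotent if the stage re-processes
        return enriched;
    }

    public static void main(String[] args) {
        Map<String, String> packet = new LinkedHashMap<>(Map.of("payload", "raw-data"));
        // Stand-ins for |MS One| and |MS Two| reading/writing their topics.
        packet = process(packet, "ms-one", "processed-data-1");
        packet = process(packet, "ms-two", "processed-data-2");
        System.out.println(packet.keySet()); // shows which stages completed
    }
}
```

A failed packet is then easy to diagnose: the last key present tells you which stage completed before the failure.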

Access ProcessContext::forward from multiple user threads

Given: a DSL topology with KStream::transform. As part of the Transformer::transform execution, multiple messages are generated from the input one (there could be thousands of output messages from a single input message).
New messages are generated based on data retrieved from the database. To speed up the process, I would like to create multiple user threads to access the data in the DB in parallel. Upon generating a new message, the thread would call ProcessContext::forward to send the message downstream.
Is it safe to call ProcessContext::forward from the different threads?
It is not safe and not allowed to call ProcessorContext#forward() from a different thread. If you try it, an exception will be thrown.
As a workaround, you could let all threads "buffer" their result data, and collect all data in the next call to process(). As an alternative, you could also schedule a punctuation that collects and forwards the data from the different threads.
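The buffering workaround can be sketched without the Kafka Streams API (the ConcurrentLinkedQueue below stands in for the shared buffer; drain() represents what transform()/process(), or a punctuation, would do on the single stream thread before calling ProcessorContext#forward()):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the "buffer in user threads, forward from the stream thread"
// workaround: user threads only ever touch the thread-safe buffer, and
// the stream thread alone drains it and forwards downstream.
public class ForwardBuffer {

    private final ConcurrentLinkedQueue<String> buffer = new ConcurrentLinkedQueue<>();

    // Called by the user threads: never touches the ProcessorContext.
    public void offer(String message) {
        buffer.add(message);
    }

    // Called only on the stream thread: the drained messages are the
    // ones that would be passed to ProcessorContext#forward().
    public List<String> drain() {
        List<String> out = new ArrayList<>();
        String m;
        while ((m = buffer.poll()) != null) out.add(m);
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        ForwardBuffer fb = new ForwardBuffer();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 100; i++) {
            int n = i;
            pool.submit(() -> fb.offer("msg-" + n)); // parallel DB lookups
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(fb.drain().size() + " messages ready to forward");
    }
}
```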

Spring Cloud Dataflow - Retaining Order of Messages

Let's say I have a stream with 3 applications - a source, processor, and sink.
I need to retain the order of the messages I receive from my source. When I receive messages A,B,C,D, I have to send them to the sink as A,B,C,D (I can't send them as B,A,C,D).
If I have just 1 instance of each application, everything will run sequentially and the order will be retained.
If I have 10 instances of each application, the messages A,B,C,D might get processed at the same time in different instances. I don't know what order these messages will wind up in.
So is there any way I can ensure that I retain the order of my messages when using multiple instances?
No; when you scale out (either by concurrency in the binder or by deploying multiple instances), you lose order. This is true for any multi-threaded application, not just spring-cloud-stream.
You can use partitioning so that each instance gets a partition of the data, but ordering is only retained within each partition.
If you have sequence information in your messages, you can add a custom module using a Spring Integration Resequencer to reassemble your messages back into the same sequence - but you'll need a single instance of the resequencer before a single sink instance.
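The core of what a resequencer does can be sketched as follows (a minimal illustration; Spring Integration's resequencer additionally handles timeouts, release strategies, and message groups):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Minimal resequencer sketch: buffer out-of-order messages and release
// them only once the next expected sequence number has arrived.
public class Resequencer {

    private final TreeMap<Integer, String> pending = new TreeMap<>();
    private int nextExpected = 1;

    // Returns the (possibly empty) run of messages now releasable in order.
    public List<String> accept(int sequence, String message) {
        pending.put(sequence, message);
        List<String> released = new ArrayList<>();
        while (pending.containsKey(nextExpected)) {
            released.add(pending.remove(nextExpected));
            nextExpected++;
        }
        return released;
    }

    public static void main(String[] args) {
        Resequencer r = new Resequencer();
        System.out.println(r.accept(2, "B")); // []      - held back, waiting for 1
        System.out.println(r.accept(1, "A")); // [A, B]
        System.out.println(r.accept(4, "D")); // []
        System.out.println(r.accept(3, "C")); // [C, D]
    }
}
```

This is also why the resequencer itself must be a single instance: splitting the pending buffer across instances would defeat its purpose.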

How to ensure data is eventually written to two Azure blobs?

I'm designing a multi-tenant Azure Service Fabric application in which we'll be storing event data in Azure Append-Only blobs.
There'll be two kinds of blobs: merge blobs (one per tenant) and instance blobs (one for each "object" owned by a tenant; there'll be 100K+ of these per tenant).
There'll be a single writer per instance blob. This writer keeps track of the last written blob position and can thereby ensure (using conditional writes) that no other writer has written to the blob since the last successful write. This is an important aspect that we'll use to provide strong consistency per instance.
However, all writes to an instance blob must also eventually (but as soon as possible) reach the single (per tenant) merge blob.
Under normal operation I'd like these merge writes to take place within ~100 ms.
My question is about how we best should implement this guaranteed double-write feature:
The implementation must guarantee that data written to an instance blob will eventually also be written to the corresponding merge blob exactly once.
The following inconsistencies must be avoided:
Data is successfully written to an instance blob but never written to the corresponding merge blob.
Data is written more than once to the merge blob.
The easiest way, in my opinion, is to use events: Service Bus, Event Hubs, or any other provider, to guarantee that an event will be stored and reachable at least somewhere. Plus, it gives you the possibility to write events to Blob Storage in batches. I also think it will significantly reduce the pressure on Service Fabric and allow events to be processed at the desired pace.
So you could have a lot of stateless services, or just web workers, that pick up new messages from a queue and send them in batches to a stateful service.
Let's say that is a Merge service. You would need to partition these services, and the best way to send a batch of events grouped by one partition is through such a stateless service or web worker.
Then you could have a separate stateful actor for each object. But in your place I would try to create 100K actors, or some other realistic workload, and see how expensive it would be. If it is too expensive and you cannot afford such machines, then everything could be handled in another partitioned stateless service.
Okay, so now we have the following scheme: something puts logs into the ESB; something picks these events from the ESB in batches, or very frequently, handling transactions and processing errors. After that, something picks a bunch of events from a queue and sends it to a particular Merge service, which stores the data in its state and calls the particular actor to do the same thing.
Once the actor has written its data to its state and the service has done the same, the event in the ESB can be marked as processed and removed from the queue. Then you just need to write the stored data from the Merge service and the actors to Blob Storage once in a while.
If an actor is unable to store an event, the operation is not complete, and the Merge service should not store the data either. If Blob Storage is unreachable for the actors or Merge services, it will become reachable in the future, and the logs will be stored, as they are saved in state, or at least they could be retrieved from the actors/service manually.
If the Merge service is unreachable, I would store such an event in a poison-message queue for later processing, or try to write the logs directly to Blob Storage, though that is a little dangerous; still, the chance of being able to write, at that moment, to only one kind of storage is pretty low.
You could use a stateful actor for this. You won't need to worry about concurrency, because there is none. In the actor's state you can keep track of which operations were successfully completed (write 1, write 2).
Still, writing "exactly once" in a distributed system (without a DTC) is never 100% watertight.
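Tracking completed operations so that retried deliveries don't double-write can be sketched like this (illustrative only: the set and the StringBuilder stand in for the actor's reliable state and the append blob, and the event ids are made up):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the "track completed operations in actor state" idea: each
// event carries a unique id, and the merge write is skipped when that id
// has already been applied, making the double-write idempotent. A real
// implementation would persist appliedIds in the actor's reliable state
// and use a conditional append against the blob.
public class MergeWriter {

    private final Set<String> appliedIds = new HashSet<>();
    private final StringBuilder mergeBlob = new StringBuilder(); // stand-in for the append blob

    // Returns true if the event was written, false if it was a duplicate.
    public boolean apply(String eventId, String data) {
        if (!appliedIds.add(eventId)) {
            return false;          // already merged: write exactly once
        }
        mergeBlob.append(data);    // conditional append in the real system
        return true;
    }

    public static void main(String[] args) {
        MergeWriter writer = new MergeWriter();
        System.out.println(writer.apply("evt-1", "a")); // true
        System.out.println(writer.apply("evt-1", "a")); // false - retried delivery, not re-merged
        System.out.println(writer.mergeBlob);           // a
    }
}
```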
