We are implementing a system where thousands of requests arrive on our queue in order, and we process the data in parallel. When we do that we lose the order in which the requests were received. How do we maintain the order if we process in parallel?
For example:
An Uber driver sends his location (lat/long) to the queue every 10 seconds, so the messages arrive in order, but when we process the data in parallel we lose that order. How do we prevent this?
Your overall requirements aren't clear to me, but it seems simple enough to add an incrementing ID to each inbound request. That would let you (re)order processed data so it matches your inbound order.
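For illustration, a minimal sketch of that idea in plain Java - the Update record, field names, and sequence counter below are made up for the example, not taken from the question:

```java
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class OrderingExample {

    /** Hypothetical message type: one location update stamped with a sequence number. */
    record Update(long sequence, double lat, double lon) {}

    private final AtomicLong counter = new AtomicLong();

    /** Producer side: stamp every inbound update before it goes onto the queue. */
    Update tag(double lat, double lon) {
        return new Update(counter.incrementAndGet(), lat, lon);
    }

    /** Consumer side: after parallel processing, restore the original arrival order. */
    List<Update> restoreOrder(List<Update> processed) {
        return processed.stream()
                .sorted(Comparator.comparingLong(Update::sequence))
                .toList();
    }
}
```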
I have a scenario where I need to send bulk data (more than a million records) to a Kafka topic: fetch the data from the database and iterate through the records, publishing each one to the Kafka topic. Currently I have added Kafka transactionality and achieved atomicity (publish everything or nothing), but I get a gateway timeout exception when I try to publish more than 35k records within the same transaction.
Is there a better way to handle this scenario in Spring Kafka?
You may be able to tweak the configuration to send more records in a single transaction, or to improve performance, for example by adjusting batch.size or transaction.timeout.ms.
But in my experience, a one-by-one or all-at-once approach is usually not a good idea: the former tends to be inefficient, and the latter is likely to hit some limit as the data grows.
So, unless you really need the 1 million messages to be sent atomically, what I would try in your case is to deliver the records in batches - maybe 30K records is a good batch size - and commit each batch. You fetch 30K records from the database, send them, commit, and repeat until all records are sent.
That would probably be more performant and also more scalable as there's no limit on how many records you can send this way.
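A rough sketch of that batching approach, assuming Spring Kafka with a transaction-capable KafkaTemplate (i.e. a producer factory configured with a transaction id prefix); RecordSource, fetchPage, and the topic name are placeholders, not part of the question:

```java
import java.util.List;
import org.springframework.kafka.core.KafkaTemplate;

public class BatchedPublisher {

    /** Placeholder for whatever fetches one page of records from the database. */
    interface RecordSource {
        List<String> fetchPage(int page, int size);
    }

    /**
     * Sends the records in transactional batches of ~30K instead of one huge
     * transaction; each executeInTransaction call commits independently.
     */
    public void publishInBatches(KafkaTemplate<String, String> template, RecordSource source) {
        final int batchSize = 30_000;
        int page = 0;
        List<String> batch;
        while (!(batch = source.fetchPage(page++, batchSize)).isEmpty()) {
            final List<String> current = batch;
            template.executeInTransaction(ops -> {
                current.forEach(record -> ops.send("records-topic", record));
                return null; // the transaction commits when the callback returns normally
            });
        }
    }
}
```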
I'm considering creating a Spring Batch job that uses rows from a table to create events and push them to a Pub/Sub implementation. The problem is that the order of the events must match the order of the rows in the table used as the source for the event-creation process.
It seems to me that Spring Batch is not designed for such order preservation, as batches are processed and then written in parallel. The only ugly but probably workable solution I see is to do everything in the reader (so the reader would do reading + processing + writing to Pub/Sub); that might keep the order inside each paginated batch, but even then the order of the batches doesn't seem to be guaranteed, according to the docs.
Any ideas how the transition from ordered rows to ordered events could be implemented using Spring Batch or, at least, Spring Boot? Thank you in advance!
It seems to me that Spring Batch is not designed for such order preservation, as batches are processed and then written in parallel.
This is true only for a multi-threaded or a partitioned step. The default (single-threaded) chunk-oriented step implementation processes items in the same order returned by the item reader. So if you make your database reader return items in the order you want, those items will be written to your Pub/Sub broker in the same order.
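For illustration, a reader that sorts by a key, combined with the default single-threaded chunk-oriented step, preserves that order; this is only a sketch, and SourceRow, the table name, and the column names are invented:

```java
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcPagingItemReader;
import org.springframework.batch.item.database.Order;
import org.springframework.batch.item.database.builder.JdbcPagingItemReaderBuilder;
import org.springframework.jdbc.core.BeanPropertyRowMapper;

public class OrderedReaderConfig {

    /** Hypothetical row type mapped from the source table. */
    public static class SourceRow {
        private Long id;
        private String payload;
        public Long getId() { return id; }
        public void setId(Long id) { this.id = id; }
        public String getPayload() { return payload; }
        public void setPayload(String payload) { this.payload = payload; }
    }

    /**
     * Returns rows in a deterministic order (by id). With the default
     * single-threaded step, items reach the writer in exactly this order.
     */
    public JdbcPagingItemReader<SourceRow> reader(DataSource dataSource) {
        return new JdbcPagingItemReaderBuilder<SourceRow>()
                .name("orderedRowReader")
                .dataSource(dataSource)
                .selectClause("SELECT id, payload")
                .fromClause("FROM source_table")
                .sortKeys(Map.of("id", Order.ASCENDING))
                .rowMapper(new BeanPropertyRowMapper<>(SourceRow.class))
                .pageSize(1000)
                .build();
    }
}
```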
Within my product I use Elasticsearch for storing CDRs (call them txn logs, if you will). My transactions are asynchronous and happen at a very fast rate, around 5000 txns/sec. A transaction involves submitting a request to a network entity; at some later point in time I receive the response.
The data ingestion into ES originally involved a two-phase operation: 1) add an entry to ES as soon as I submit to the network layer; 2) when I get the response, update the previous entry with additional status such as "delivery succeeded".
I am doing this with the bulk insertion method, where the bulk requests contain both inserts and updates. As a result the ingestion is very slow, which ended up hogging/halting my application. Later we changed the ingestion so that we only insert into Elasticsearch when we get the final response; until then we store the data in a Redis store. But this has the disadvantages of possible data loss and non-realtime reports.
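For context, a bulk request mixing the initial insert with the later status update (the pattern described above) might look roughly like this with the Elasticsearch Java high-level client; the index name, field names, and ids here are purely illustrative:

```java
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.update.UpdateRequest;

public class CdrBulkExample {

    /**
     * Phase 1: index the CDR when the request is submitted to the network layer.
     * Phase 2: update the same document with the delivery status once the
     * asynchronous response arrives. Mixing both kinds of actions in one bulk
     * request is the pattern that became a bottleneck.
     */
    BulkRequest buildBulk(String txnId) {
        BulkRequest bulk = new BulkRequest();
        bulk.add(new IndexRequest("cdr-index")
                .id(txnId)
                .source("txnId", txnId, "status", "SUBMITTED"));
        bulk.add(new UpdateRequest("cdr-index", txnId)
                .doc("status", "DELIVERED"));
        return bulk;
    }
}
```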
So I was looking at options such as having two indexes for the same record: a parent index holding all the data, and a child record holding the delivery status. I don't know if this is possible. I have studied nested queries and has-child/has-parent queries. What I am unsure of is whether I can insert the parent and child data at separate points in time, without having to use an update. Or should I just create two different records with a common txn-id, without worrying about parent/child?
What is the best way?
I am building microservices. One of my microservices uses CQRS and event sourcing. Integration events are raised in the system; I save my aggregates in the event store and also update my read model.
My question is: why do we need a version on the aggregate when we update the event stream for that aggregate? I have read that we need it for consistency, that events have to be replayed in sequence, and that we need to check the version before saving (https://blog.leifbattermann.de/2017/04/21/12-things-you-should-know-about-event-sourcing/). I still can't get my head around this, since events are raised and saved in order, so I really need a concrete example to understand what benefit we get from versions and why we even need them.
Many thanks,
Imran
Let me describe a case where aggregate versions are useful:
In our reSolve framework the aggregate version is used for optimistic concurrency control.
I'll explain it by example. Let's say an InventoryItem aggregate accepts the commands AddItems and OrderItems. AddItems increases the number of items in stock; OrderItems decreases it.
Suppose you have InventoryItem aggregate #123 with one event - ITEMS_ADDED with a quantity of 5. The state of aggregate #123 says there are 5 items in stock.
So your UI shows users that there are 5 items in stock. User A decides to order 3 items, user B - 4 items. Both issue OrderItems commands almost at the same time; let's say user A is first by a couple of milliseconds.
Now, if you have a single instance of aggregate #123 in memory, in a single thread, you don't have a problem: the first command from user A succeeds, its event is applied, the state says the quantity is 2, and so the second command from user B fails.
In a distributed or serverless system, where the commands from A and B are handled in separate processes, both commands would succeed and bring the aggregate into an incorrect state if we don't use some form of concurrency control. There are several ways to do this - pessimistic locking, a command queue, an aggregate repository, or optimistic locking.
Optimistic locking seems to be the simplest and most practical solution:
We say that every aggregate has a version - the number of events in its stream. So our aggregate #123 has version 1.
When the aggregate emits an event, that event carries the aggregate version. In our case the ITEMS_ORDERED events from users A and B will both have an aggregate version of 2. Since aggregate events obviously need sequentially increasing versions, all we have to do is put a database constraint that the tuple {aggregateId, aggregateVersion} must be unique on write to the event store.
Let's see how our example would work in a distributed system with optimistic concurrency control:
User A issues an OrderItems command for aggregate #123.
Aggregate #123 is restored from its events: {version 1, quantity 5}.
User B issues an OrderItems command for aggregate #123.
Another instance of aggregate #123 is restored from its events: {version 1, quantity 5}.
The aggregate instance for user A performs the command; it succeeds, and the event ITEMS_ORDERED {aggregateId 123, version 2} is written to the event store.
The aggregate instance for user B performs the command; it succeeds and produces the event ITEMS_ORDERED {aggregateId 123, version 2}, but the attempt to write it to the event store fails with a concurrency exception.
On such an exception the command handler for user B simply repeats the whole procedure - this time aggregate #123 is restored in state {version 2, quantity 2}, and the command is handled correctly (here it would be rejected, since only 2 items remain).
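A bare-bones sketch of that optimistic check, assuming the event store is a relational table with a unique constraint on (aggregate_id, aggregate_version); the table and column names are invented for illustration:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class EventStoreAppend {

    /*
     * Assumes an events table with a unique constraint such as:
     *   ALTER TABLE events ADD CONSTRAINT uq_stream UNIQUE (aggregate_id, aggregate_version);
     */
    void append(Connection conn, String aggregateId, long expectedVersion, String payload)
            throws SQLException {
        String sql = "INSERT INTO events (aggregate_id, aggregate_version, payload) VALUES (?, ?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, aggregateId);
            ps.setLong(2, expectedVersion + 1); // the version this event claims
            ps.setString(3, payload);
            ps.executeUpdate(); // a duplicate (aggregate_id, aggregate_version) violates
                                // the unique constraint, surfacing the concurrency conflict
        }
        // The caller catches the constraint violation, reloads the aggregate,
        // and retries the command - the flow described in the steps above.
    }
}
```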
I hope this clears the case where aggregate versions are useful.
Yes, this is right. You need the version or a sequence number for consistency.
Two things you want:
Correct ordering
Usually events are idempotent in nature, because idempotent messages or events are easier to deal with in a distributed system. Idempotent messages are those that give the same result even when applied multiple times: setting a register to a fixed value (say one) is idempotent, but incrementing a counter by one is not. In a distributed system, when A sends a message to B, B acknowledges A. But if B consumes the message and the acknowledgement to A is lost due to some network error, A doesn't know whether B received the message, so it sends the message again. Now B applies the message a second time, and if the message is not idempotent, the final state is wrong. So you want idempotent messages.
But if you fail to apply these idempotent messages in the same order as they were produced, your state will again be wrong. That ordering can be achieved using a version id or a sequence number. If your event store is an RDBMS, you cannot order your events without some such sort key. In Kafka, likewise, you have the offset, and the client keeps track of the offset up to which it has consumed.
Deduplication
Secondly, what if your messages are not idempotent? Or what if they are idempotent, but the consumer invokes some external service in a non-deterministic way? In such cases you need exactly-once semantics, because applying the same message twice makes your state wrong. Here, too, you need the version id or sequence number: if the consumer keeps track of the version id it has already processed, it can dedupe based on that id. In Kafka you might then want to store the offset at the consumer end.
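A small sketch of version-based ordering and deduplication at the consumer; in a real system the last-applied version would live in the consumer's durable store rather than in memory, and the names here are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class Deduplicator {

    // Last version applied per aggregate (in-memory only for this sketch).
    private final Map<String, Long> lastApplied = new ConcurrentHashMap<>();

    /**
     * Applies an event only if it is the next expected version: lower versions
     * are duplicates (drop them), higher versions are out of order (defer them).
     */
    boolean tryApply(String aggregateId, long version, Runnable apply) {
        long last = lastApplied.getOrDefault(aggregateId, 0L);
        if (version <= last) {
            return false; // duplicate, already processed
        }
        if (version > last + 1) {
            return false; // gap - an earlier event has not arrived yet
        }
        apply.run();
        lastApplied.put(aggregateId, version);
        return true;
    }
}
```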
Further clarifications based on comments:
The author of the article in question assumed an RDBMS as the event store. The version id or event sequence is expected to be generated by the producer. Therefore, in your example, the "delivered" event will have a higher sequence number than the "in transit" event.
The problem arises when you want to process your events in parallel. What if one consumer gets the "delivered" event and another consumer gets the "in transit" event? Clearly you have to ensure that all events of a particular order are processed by the same consumer. In Kafka you solve this by choosing the order id as the partition key. Since one partition is processed by only one consumer, you know you'll always get "in transit" before "delivered". Multiple orders, however, are spread across the different consumers within the same consumer group, and that is how you still get parallel processing.
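For example, keying the records by order id is enough for Kafka to route every event of one order to the same partition; the topic name and serializers below are just illustrative:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventProducer implements AutoCloseable {

    private final KafkaProducer<String, String> producer;

    OrderEventProducer(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
    }

    /** Keyed by order id, so all events of one order land in the same partition. */
    void send(String orderId, String eventJson) {
        producer.send(new ProducerRecord<>("order-events", orderId, eventJson));
    }

    @Override
    public void close() {
        producer.close();
    }
}
```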
Regarding the aggregate id, I think it is roughly analogous to a topic in Kafka. Since the author assumed an RDBMS store, he needs some identifier to segregate the different categories of messages. In Kafka you do that by creating separate topics, and also a consumer group per aggregate.
I have a system that collects session data. A session consists of a number of distinct events, for example "session started" and "action X performed". There is no way to determine when a session ends, so instead heartbeat events are sent at regular intervals.
This is the main complication: without a way to determine if a session has ended the only way is to try to react to the absence of an event, i.e. no more heartbeats. How can I do this efficiently and correctly in a distributed system?
Here is some more background to the problem:
The events must then be assembled into objects representing sessions. The session objects are later updated with additional data from other systems, and eventually they are used to calculate things like the number of sessions, average session length, etc.
The system must scale horizontally, so there are multiple servers that receive the events and multiple servers that process them. Events belonging to the same session can be sent to and processed by different servers. This means there's no guarantee that they will be processed in order, and there are additional complications that mean events can be duplicated (and there's always the risk that some are lost, either before they reach our servers or while being processed).
Most of this exists already, but I have no good solution for how to efficiently and correctly determine when a session has ended. The way I do it now is to periodically search the collection of "incomplete" session objects for any that have not been updated within an amount of time equal to two heartbeat intervals, and move those to another collection of "complete" sessions. This operation is time-consuming and inefficient, and it doesn't scale well horizontally. Basically it consists of sorting a table on a column representing the last timestamp and filtering out any rows that aren't old enough. That sounds simple, but it's hard to parallelize, and there is a tuning problem: run it too often and the database does nothing but filter your data; run it too rarely and each run is slow because there's too much to process.
I'd like to react to when a session has not been updated for a while, not poll every session to see if it's been updated.
Update: Just to give you a sense of scale; there are hundreds of thousands of sessions active at any time, and eventually there will be millions.
One possibility that comes to mind:
In your database table that keeps track of sessions, add a timestamp field (if you don't have one already) that records the last time the session was "active". Update the timestamp whenever you get a heartbeat.
When you create a session, schedule a "timer event" to fire after some suitable delay to check whether the session should be expired. When the timer event fires, check the session's timestamp to see if there's been more activity during the interval that the timer was waiting. If so, the session is still active, so schedule another timer event to check again later. If not, the session has timed out, so remove it.
If you use this approach, each session will always have one server responsible for checking whether it's expired, but different servers can be responsible for different sessions, so the workload can be spread around evenly. When a heartbeat comes in, it doesn't matter which server handles it, because it just updates a timestamp in a database that's (presumably) shared between all the servers.
There's still some polling involved since you'll get periodic timer events that make you check whether a session is expired even when it hasn't expired. That could be avoided if you could just cancel the pending timer event each time a heartbeat arrives, but with multiple servers that's tricky: the server that handles the heartbeat may not be the same one that has the timer scheduled. At any rate, the database query involved is lightweight: just looking up one row (the session record) by its primary key, with no sorting or inequality comparisons.
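A minimal sketch of that timer-and-recheck loop, assuming a shared session store that can be queried by primary key; SessionStore, the timeout, and the method names are placeholders, not part of the original answer:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SessionExpiryChecker {

    /** Placeholder for the session store shared between all servers. */
    interface SessionStore {
        Instant lastHeartbeat(String sessionId); // looked up by primary key
        void markComplete(String sessionId);
    }

    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(4);
    private final SessionStore store;
    private final Duration timeout = Duration.ofSeconds(60); // roughly two heartbeat intervals

    SessionExpiryChecker(SessionStore store) {
        this.store = store;
    }

    /** Called when a session is created; reschedules itself while heartbeats keep arriving. */
    void scheduleCheck(String sessionId) {
        scheduler.schedule(() -> {
            Instant last = store.lastHeartbeat(sessionId);
            if (Duration.between(last, Instant.now()).compareTo(timeout) < 0) {
                scheduleCheck(sessionId); // still active - check again later
            } else {
                store.markComplete(sessionId); // timed out - move to completed sessions
            }
        }, timeout.toSeconds(), TimeUnit.SECONDS);
    }
}
```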
So you're collecting heartbeats; I'm wondering if you could have a batch process (or something) that runs across the collected heartbeats looking for patterns that imply the end of a session.
The level of accuracy is governed by how regular the heartbeats are and how often you scan across the collected heartbeats.
The advantage is that you're processing all heartbeats through a single mechanism, in one spot - you don't have to poll each heartbeat on its own - so it should be able to scale; if it's a database-centric solution, it should be able to cope with lots of data, right?
There might be a more elegant solution, but my brain's a bit full just now :)