Version number in event sourcing aggregate?

I am building microservices. One of my microservices uses CQRS and event sourcing. Integration events are raised in the system, and I am saving my aggregates in an event store as well as updating my read model.
My question is: why do we need a version in the aggregate when we are updating the event stream for that aggregate? I read that we need it for consistency, that events are to be replayed in sequence, and that we need to check the version before saving (https://blog.leifbattermann.de/2017/04/21/12-things-you-should-know-about-event-sourcing/). I still can't get my head around this, since events are raised and saved in order, so I really need a concrete example to understand what benefit we get from versions and why we even need them.
Many thanks,
Imran

Let me describe a case where aggregate versions are useful:
In our reSolve framework, the aggregate version is used for optimistic concurrency control.
I'll explain it by example. Let's say an InventoryItem aggregate accepts the commands AddItems and OrderItems. AddItems increases the number of items in stock; OrderItems decreases it.
Suppose you have an InventoryItem aggregate #123 with one event, ITEMS_ADDED with a quantity of 5. The state of aggregate #123 says there are 5 items in stock.
So your UI shows users that there are 5 items in stock. User A decides to order 3 items, user B 4 items. Both issue OrderItems commands almost at the same time; let's say user A is first by a couple of milliseconds.
Now, if you have a single instance of aggregate #123 in memory, in a single thread, you don't have a problem: the first command from user A would succeed, its event would be applied, the state would say the quantity is 2, and the second command from user B would fail.
In a distributed or serverless system, where the commands from A and B run in separate processes, both commands would succeed and bring the aggregate into an incorrect state unless we use some form of concurrency control. There are several ways to do this: pessimistic locking, a command queue, an aggregate repository, or optimistic locking.
Optimistic locking seems to be the simplest and most practical solution:
We say that every aggregate has a version, which is the number of events in its stream. So our aggregate #123 has version 1.
When an aggregate emits an event, the event data carries the aggregate version. In our case the ITEMS_ORDERED events from users A and B will both have an aggregate version of 2. Obviously, an aggregate's event versions should be sequentially increasing, so all we need to do is add a database constraint that the tuple {aggregateId, aggregateVersion} must be unique on write to the event store.
Let's see how our example would work in a distributed system with optimistic concurrency control:
User A issues an OrderItems command for aggregate #123.
Aggregate #123 is restored from its events: {version 1, quantity 5}.
User B issues an OrderItems command for aggregate #123.
Another instance of aggregate #123 is restored from its events: {version 1, quantity 5}.
The aggregate instance for user A performs the command; it succeeds, and the event ITEMS_ORDERED {aggregateId 123, version 2} is written to the event store.
The aggregate instance for user B performs the command; it succeeds and produces the event ITEMS_ORDERED {aggregateId 123, version 2}, but the attempt to write it to the event store fails with a concurrency exception.
On such an exception, the command handler for user B simply repeats the whole procedure; this time aggregate #123 is restored in the state {version 2, quantity 2}, and the command is evaluated against the correct state (here it would be rejected, since only 2 items remain).
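A minimal sketch of such an optimistic append, assuming a relational event store with a UNIQUE (aggregate_id, version) constraint; the table layout and method names are illustrative, not the reSolve API:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.SQLIntegrityConstraintViolationException;

    public class JdbcEventStore {

        private final Connection connection;

        public JdbcEventStore(Connection connection) {
            this.connection = connection;
        }

        // Appends an event, relying on a UNIQUE (aggregate_id, version) constraint.
        // Returns false when another writer already used this version, so the caller
        // can reload the aggregate and retry the command.
        public boolean tryAppend(String aggregateId, long version, String payload) throws SQLException {
            String sql = "INSERT INTO events (aggregate_id, version, payload) VALUES (?, ?, ?)";
            try (PreparedStatement stmt = connection.prepareStatement(sql)) {
                stmt.setString(1, aggregateId);
                stmt.setLong(2, version);
                stmt.setString(3, payload);
                stmt.executeUpdate();
                return true;
            } catch (SQLIntegrityConstraintViolationException concurrentWrite) {
                return false; // someone else appended this version first: concurrency conflict
            }
        }
    }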
I hope this clears the case where aggregate versions are useful.

Yes, this is right. You need the version or a sequence number for consistency.
Two things you want:
Correct ordering
Usually events are idempotent in nature, because in a distributed system idempotent messages or events are easier to deal with. Idempotent messages are ones that give the same result even when applied multiple times: updating a register to a fixed value (say one) is idempotent, but incrementing a counter by one is not. In a distributed system, when A sends a message to B, B acknowledges A. But if B consumes the message and the acknowledgement to A is lost due to some network error, A doesn't know whether B received the message, so it sends the message again. Now B applies the message again, and if the message is not idempotent, the final state will be wrong. So you want idempotent messages. But if you fail to apply these idempotent messages in the same order as they were produced, your state will again be wrong. This ordering can be achieved using the version id or a sequence number. If your event store is an RDBMS, you cannot order your events without some such sort key. Kafka likewise has the offset id, and the client keeps track of the offset up to which it has consumed.
Deduplication
Secondly, what if your messages are not idempotent? Or what if your messages are idempotent but the consumer invokes some external service in a non-deterministic way? In such cases you need exactly-once semantics, because if you apply the same message twice your state will be wrong. Here too you need the version id or sequence number. If, at the consumer end, you keep track of the version id you have already processed, you can dedupe based on it. In Kafka, you would then store the offset id at the consumer end.
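As a rough illustration of version-based deduplication on the consumer side, here is a minimal sketch; the in-memory map stands in for whatever persistent store you actually use to track processed sequence numbers:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class DedupingConsumer {

        // Last sequence number processed per aggregate; in a real system this would be
        // persisted (e.g. alongside the read model) so it survives restarts.
        private final Map<String, Long> lastProcessed = new ConcurrentHashMap<>();

        public void handle(String aggregateId, long sequence, Runnable applyEvent) {
            long last = lastProcessed.getOrDefault(aggregateId, 0L);
            if (sequence <= last) {
                return; // duplicate delivery: drop it instead of applying it twice
            }
            if (sequence != last + 1) {
                throw new IllegalStateException("out-of-order event, expected " + (last + 1) + " got " + sequence);
            }
            applyEvent.run(); // apply exactly once, in order
            lastProcessed.put(aggregateId, sequence);
        }
    }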
Further clarifications based on comments:
The author of the article in question assumed an RDBMS as an event store. The version id or the event sequence is expected to be generated by the producer. Therefore, in your example, the "delivered" event will have a higher sequence than the "in transit" event.
The problem arises when you want to process your events in parallel. What if one consumer gets the "delivered" event and another consumer gets the "in transit" event? Clearly you have to ensure that all events for a particular order are processed by the same consumer. In Kafka, you solve this by choosing the order id as the partition key. Since one partition is processed by only one consumer, you know you'll always get "in transit" before "delivered". But multiple orders will be spread across different consumers within the same consumer group, so you still get parallel processing.
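A minimal producer sketch of that partition-key idea, assuming an illustrative order-events topic and order id (these names are not from the article in question):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class OrderEventProducer {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Using the order id as the record key means every event for order-42 lands on
                // the same partition, so "in transit" is always consumed before "delivered".
                producer.send(new ProducerRecord<>("order-events", "order-42", "{\"status\":\"IN_TRANSIT\"}"));
                producer.send(new ProducerRecord<>("order-events", "order-42", "{\"status\":\"DELIVERED\"}"));
            }
        }
    }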
Regarding the aggregate id, I think it is analogous to a topic in Kafka. Since the author assumed an RDBMS store, he needs some identifier to segregate different categories of messages. In Kafka you do that by creating separate topics, and also consumer groups per aggregate.

Related

How to deal with concurrent events in an event-driven architecture

Suppose I have an eCommerce application designed in an event-driven architecture. I would publish events like ProductCreated and ProductPriceUpdated. Typically both events are published on separate channels.
Now a consumer of those events comes into play and reacts to them, for example to generate a price chart for specific products.
In fact, this consumer is required to consume the ProductCreated event first, to create a Product entity with the necessary information in its own bounded context. Only once a product has been created can price points be added to the chart. Depending on the consumer's performance, it can easily happen that those events arrive "out of order".
What are the possible strategies to fulfill this requirement?
The following came to my mind:
Publish both events onto the same channel with ordering guarantees. For example, in Kafka both events would be published to the same partition. However, this would mean that the topic/partition grows with its events, I would have to deal with different schemas, and the documentation would grow.
Use documents over events. Simply publish every state change of the product entity as a single ProductUpdated event or similar. This way I would lose semantics from the message and would need to figure out on the consumer side what exactly changed.
Defer event consumption. If my consumer receives a ProductPriceUpdated event and no such product has been created yet, I postpone consumption by storing the event in a database and coming back to it later, or by using retry topics in Kafka terms.
Create a minimal entity. When I receive a ProductPriceUpdated event, I would probably have a correlation id or something to identify the entity, so I simply create an entity with just this id, and once the ProductCreated event arrives I fill in the missing information.
Just thought I'd give you some inline comments, based on my understanding of your requirements (#1, #3 and #4).
Publish both events onto the same channel with ordering guarantees. For example, in Kafka both events would be published to the same partition. However, this would mean that the topic/partition grows with its events, I would have to deal with different schemas, and the documentation would grow.
[Chris]: Apache Kafka preserves the order of messages within a partition. But the mapping of keys to partitions is consistent only as long as the number of partitions in a topic does not change. So as long as the number of partitions is constant, you can be sure the order is guaranteed. When the partitioning key is important, the easiest solution is to create topics with sufficient partitions and never add partitions.
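A minimal sketch of creating such a topic up front with a fixed partition count (topic name, partition count, and replication factor are illustrative):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateProductTopic {

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // 12 partitions, replication factor 3; keeping the partition count fixed keeps
                // the key -> partition mapping (and therefore per-key ordering) stable over time.
                NewTopic topic = new NewTopic("product-events", 12, (short) 3);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }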
Defer event consumption. If my consumer receives a ProductPriceUpdated event and no such product has been created yet, I postpone consumption by storing the event in a database and coming back to it later, or by using retry topics in Kafka terms.
[Chris]: If latency is not a concern, and if we are okay with the additional operational overhead of adding a new component to the solution, such as a storage layer, this pattern looks fine.
Create a minimal entity. When I receive a ProductPriceUpdated event, I would probably have a correlation id or something to identify the entity, so I simply create an entity with just this id, and once the ProductCreated event arrives I fill in the missing information.
[Chris]: This is a fairly common integration pattern (messaging layer -> backend REST API) that works over a unique identifier, in this case a correlation id.
This can easily be achieved if you have separate topics and consumers per event and the order of messages from the producer is guaranteed. Thus, option #1 becomes unnecessary.
From my perspective, options #3 and #4 look much the same, and #4 would be the ideal choice.
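A minimal in-memory sketch of option #4, assuming illustrative event fields, with a map standing in for the consumer's real store:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ProductProjection {

        static class Product {
            final String id;
            volatile String name;        // filled in when ProductCreated arrives
            volatile Double latestPrice; // filled in when ProductPriceUpdated arrives

            Product(String id) {
                this.id = id;
            }
        }

        // Stand-in for the consumer's own store (a table, a document collection, ...).
        private final Map<String, Product> store = new ConcurrentHashMap<>();

        public void onProductPriceUpdated(String productId, double price) {
            // Create a placeholder entity if ProductCreated has not arrived yet.
            store.computeIfAbsent(productId, Product::new).latestPrice = price;
        }

        public void onProductCreated(String productId, String name) {
            // Fill in the missing information, whether or not a placeholder already exists.
            store.computeIfAbsent(productId, Product::new).name = name;
        }
    }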
On another note, if you are thinking of bringing Kafka Streams/Tables into your solution, just go for it, as there is a strong relationship between streams and tables called duality.
The duality of streams and tables makes your application more elastic, supports fault-tolerant stateful processing, and lets you run interactive queries. And KSQL adds more flavour to it, because this use case is just one of data enrichment at the integration layer.

Analytics Microservices

I'm implementing a microservice responsible for generating analytics, retrieving data asynchronously from other microservices through RabbitMQ.
I'm trying to understand whether, every time there is an event on domain data, it should be sent over RabbitMQ and used to update the analytics database (MongoDB).
This approach would update the same document (retrieved from the database) every time there is an event that touches that document.
-- Example:
{
  "date": "2022-06-15",
  "day": "Monday",
  "restaurantId": 2,
  "totalSpent": 250,
  "nOfLogin": 84,
  "categories": [
    { "category": "wine", "total": 100 },
    { "category": "burgers", "total": 150 }
  ],
  "payment": [
    { "method": "POS", "total": 180 },
    { "method": "Online", "total": 20 },
    { "method": "Cash", "total": 50 }
  ],
  ...
}
So if an event with some data arrives, it updates the related data and saves it to MongoDB:
{
  "category": "wine",
  "total": 2
}
The service should update that category, adding the event's amount to the stored total and saving it.
--End Example
The part I'm struggling with is that if there are a lot of events on the same document, it would be retrieved twice (or more, depending on the events) from the database, generating a concurrency error.
At first I thought the best approach would be to use Spring Batch (retrieving data from different databases, transforming it, and sending it over RabbitMQ), but that is not real-time and it would have to be scheduled with Quartz.
To give you an idea, the kinds of data involved are:
quantity of products ordered (real-time and from the database)
number of customers logged in (daily and subdivided by hour, also in real time)
This is not all of the data, but these are the values that would be sent many times during the day.
I don't want to flood RabbitMQ, but I'm struggling to understand which approach is best (including which design pattern fits this kind of situation).
Thanks in advance
I see several possible solutions to this.
Optimistic locking.
This strategy maintains a version on your document.
On each document read, the version attribute is fetched alongside the other attributes.
The document update is performed as usual, but what's different in this approach is that the update query must check whether the version has changed since the read (i.e. the document was updated by another concurrent event).
If the version did change, you have to handle an optimistic locking failure.
How you do that largely depends on your needs, e.g. log and discard the event, retry, etc.
Otherwise, the query increments the version and updates the rest of the attributes.
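A minimal sketch of this with the MongoDB Java driver, assuming the analytics document carries a numeric version field; collection and field names are illustrative:

    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.result.UpdateResult;
    import org.bson.Document;

    import static com.mongodb.client.model.Filters.and;
    import static com.mongodb.client.model.Filters.eq;
    import static com.mongodb.client.model.Updates.combine;
    import static com.mongodb.client.model.Updates.inc;
    import static com.mongodb.client.model.Updates.set;

    public class AnalyticsUpdater {

        public boolean applyEvent(MongoCollection<Document> analytics, Object docId,
                                  long readVersion, double amount) {
            // The filter matches only if nobody has bumped the version since we read the document.
            UpdateResult result = analytics.updateOne(
                    and(eq("_id", docId), eq("version", readVersion)),
                    combine(inc("totalSpent", amount), set("version", readVersion + 1)));

            // modifiedCount == 0 means a concurrent event won the race: handle the optimistic
            // locking failure (re-read and retry, log and discard, etc.).
            return result.getModifiedCount() == 1;
        }
    }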
Different events update different attributes. In this case all you need to do is update the individual attributes atomically. This approach is simpler and more efficient, since there is no need for a read operation and no extra effort of maintaining and checking the version attribute on each update.
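And a minimal sketch of this second approach with the MongoDB Java driver, using an atomic $inc with the positional $ operator on the matching category entry; field names follow the example document in the question:

    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import static com.mongodb.client.model.Filters.and;
    import static com.mongodb.client.model.Filters.eq;
    import static com.mongodb.client.model.Updates.inc;

    public class CategoryUpdater {

        public void addToCategory(MongoCollection<Document> analytics, String date,
                                  int restaurantId, String category, double amount) {
            // No read needed: MongoDB applies the increment atomically on the server, so
            // concurrent "wine" events cannot overwrite each other's totals.
            analytics.updateOne(
                    and(eq("date", date), eq("restaurantId", restaurantId),
                            eq("categories.category", category)),
                    inc("categories.$.total", amount));
        }
    }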
Message ordering. For this to work properly, all of the events must come from the same domain object (aggregate) and the message broker has to support this mechanism. Kafka, for example, does this through topic partitioning, where the partition is chosen by hashing a key, which might be the domain object's id or some other identifier.

Plugin to trigger only once on create / update of multiple records in entity / table x

We have an entity which holds 3 types of records: A, B, and C. Record type A is the parent of B (each A can be the parent of multiple B records), and B in turn is the parent of C (each B can be the parent of multiple C records). On creation/update of every C record, the CalculateCommercials plugin runs: it pulls all the sibling C records under the given B record, aggregates/rolls up the totals, and updates that parent B record with the rolled-up sums/totals. The same thing happens on update of a B record's totals, which rolls the totals up to the grandparent A record. The problem is that when we create/update multiple C records under a given parent B, this triggers multiple CalculateCommercials plugin instances, which is inefficient and resource intensive. Is there a better approach that lets CalculateCommercials trigger only once, irrespective of the number of C records being created/updated?
For example, if we create/update 10 C records at a given time, we want the CalculateCommercials plugin to run only once to roll up the totals to B, and then only once more to update A, instead of 10 times (currently it triggers 10 CalculateCommercials plugin instances to roll up to B, which in turn trigger another 10 instances to update the grandparent record of type A).
Sometimes this chain of automatic triggers results in exceeding the 2-minute time limit for a plugin instance. Is there a better approach to simplify rolling up the totals to parent B and then on to A?
A lot depends upon how the rollup fields are used, and their consistency requirements.
The simplest approach is to plan for eventual consistency rather than strict consistency. The update-grandchild plugin would merely mark a "dirty" field on the grandchild record, which requires no additional DB locking. Separately, a scheduled workflow or Power Automate flow runs every N minutes, finds all the dirty grandchildren, updates their parents' rollup fields, and resets the dirty fields. Because this workflow is scheduled and nothing else takes a write lock on the rollup fields, there is little contention on the parent and grandparent records. However, it means that rollup values are always a few minutes out of date if the grandchildren are constantly changing.
D365 seems to implement a "rollup field" concept that does exactly this: https://www.encorebusiness.com/blog/rollup-fields-in-microsoft-dynamics-365-crm/ The scheduled task runs every hour by default, but can be configured by an admin.
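As a rough, platform-agnostic sketch of that dirty-flag sweep (the repository interface here is hypothetical; in Dynamics you would go through the organization service or Web API instead):

    import java.util.List;

    public class RollupSweeper {

        // Hypothetical data-access interface standing in for the Dataverse organization
        // service / Web API calls you would actually make.
        interface RollupRepository {
            List<String> findDirtyGrandchildren();
            void recalculateParentRollups(String grandchildId);
            void clearDirtyFlag(String grandchildId);
        }

        private final RollupRepository repository;

        public RollupSweeper(RollupRepository repository) {
            this.repository = repository;
        }

        // Runs on a schedule (e.g. every N minutes). Because nothing else writes the rollup
        // fields, there is little lock contention on the parent and grandparent records.
        public void sweep() {
            for (String id : repository.findDirtyGrandchildren()) {
                repository.recalculateParentRollups(id);
                repository.clearDirtyFlag(id);
            }
        }
    }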
To improve the update latency, you could make the flow on-demand or triggered by updates to the relevant grandchild fields, but have it check & update a "flow already triggered" table to ensure that multiple flow instances aren't trying to do the updates simultaneously. The "winning" instance of the flow sets the "running" flag, loops repeatedly until it finds no more dirty records, and then clears the "running" flag. The "losing" instances terminate immediately when they see that another instance is active.
This pattern requires a lot more care to handle potential race conditions related to records being marked dirty after the running flag is cleared, as well as transient errors that might terminate the flow before it completes. Again, a simple solution is to schedule a "sweeper" instance once per hour, or day.
If you need it to be strictly consistent, then this whitepaper might help: https://download.microsoft.com/download/D/6/6/D66E61BA-3D18-49E8-B042-8434E64FAFCA/ScalableDynamicsCRMCustomizations.docx It discusses the tradeoffs inherent in various locking and registration options for plugins.
If the rollup fields are rarely consumed, then it might be more efficient to recalculate the field only when a client requests that field while reading the parent or grandparent entity, using a separate plugin on those entities. You'd want to ensure that clients don't request that field unless they need it.

How to avoid concurrent requests to a lambda

I have a ReportGeneration Lambda that takes requests from clients and adds an entry with the following attributes to a DDB table.
Customer ID <hash key>
ReportGenerationRequestID(UUID) <sort key>
ExecutionStartTime
ReportExecutionStatus <workflow status>
I have enabled a DDB stream trigger on this table, and creating an entry in this table triggers the report generation workflow. This is a multi-step workflow that takes a while to complete.
Here, ReportExecutionStatus is the status of the report processing workflow.
I am supposed to maintain the history of all report generation requests that a customer has initiated.
What I am trying to do now is avoid concurrent processing requests by the same customer: if a report for a customer is already being generated, don't create another record in DDB.
Option considered:
Query DDB for the customer ID (consistent read):
- From the list, see if any entry is either InProgress or Scheduled.
- If not, create a new one (consistent write).
- Otherwise, return the already existing entry.
Issue: If the customer clicks twice within a split second to generate a report, two Lambdas can be triggered, creating two entries in DDB and initiating two parallel workflows, which is something I don't want.
Can someone recommend the best approach to ensure that there are no concurrent executions (two workflows) of the same report for the same customer?
In short, while one execution is in progress, another should not start.
You can use a ConditionExpression to only create the entry if it doesn't already exist; if you need to check different items, then you can use DynamoDB transactions to check whether another item already exists and, if not, create your item.
Those would be the ways to do it with DynamoDB itself, getting higher consistency.
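A minimal sketch of the ConditionExpression approach with the AWS SDK for Java v2, assuming a per-customer lock item in an illustrative table (names are not from your schema):

    import java.util.Map;
    import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
    import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
    import software.amazon.awssdk.services.dynamodb.model.ConditionalCheckFailedException;
    import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

    public class ReportLock {

        private final DynamoDbClient dynamo = DynamoDbClient.create();

        // Tries to claim the "one report at a time" lock for a customer.
        public boolean tryAcquire(String customerId) {
            try {
                dynamo.putItem(PutItemRequest.builder()
                        .tableName("ReportGenerationLocks") // illustrative lock table, one item per customer
                        .item(Map.of("customerId", AttributeValue.builder().s(customerId).build()))
                        .conditionExpression("attribute_not_exists(customerId)")
                        .build());
                return true;  // lock item created, safe to start the workflow
            } catch (ConditionalCheckFailedException alreadyRunning) {
                return false; // another report for this customer is already in progress
            }
        }
    }

The workflow would delete the lock item once it finishes, so the next request for that customer can acquire it again.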
Another option would be to use SQS FIFO queues. You can group them by the customer ID, then you wouldn't have concurrent processing of messages for the same customer. Additionally with this SQS solution you get all the advantages of using SQS - like automated retry mechanisms or a dead letter queue.
Limiting the number of concurrent Lambda executions is not possible as far as I know. That is the whole point of AWS Lambda, to easily scale and run multiple Lambdas concurrently.
That said, there is probably a better solution for your problem using a DynamoDB feature called "strongly consistent reads".
By default, reads from DynamoDB (if you use the AWS SDK) are eventually consistent, which can cause the behaviour you observed: two writes to the same table are made, but your Lambda was only able to notice one of them.
If you use Strongly consistent reads, the documentation states:
When you request a strongly consistent read, DynamoDB returns a response with the most up-to-date data, reflecting the updates from all prior write operations that were successful.
So your Lambda needs to do a strongly consistent read on your table to check whether the customer already has a job running. If there is already a job running, the Lambda does not create a new one.
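A minimal sketch of that strongly consistent check with the AWS SDK for Java v2; the table name, attribute names, and status values are assumed from the question and may differ in your setup:

    import java.util.Map;
    import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
    import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
    import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
    import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

    public class ReportStatusCheck {

        private final DynamoDbClient dynamo = DynamoDbClient.create();

        // Returns true if the customer already has a report InProgress or Scheduled.
        public boolean hasActiveReport(String customerId) {
            QueryResponse response = dynamo.query(QueryRequest.builder()
                    .tableName("ReportGenerationRequests") // illustrative table name
                    .keyConditionExpression("CustomerId = :cid")
                    .filterExpression("ReportExecutionStatus IN (:inProgress, :scheduled)")
                    .expressionAttributeValues(Map.of(
                            ":cid", AttributeValue.builder().s(customerId).build(),
                            ":inProgress", AttributeValue.builder().s("InProgress").build(),
                            ":scheduled", AttributeValue.builder().s("Scheduled").build()))
                    .consistentRead(true) // strongly consistent read
                    .build());
            return response.count() > 0;
        }
    }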

ActiveMQ message grouping performance

Has anyone used the Message Grouping feature in ActiveMQ?
http://activemq.apache.org/message-groups.html
This would be a really useful feature for a project I'm working on, but I'm curious how well this feature scales and performs. In our system, we would need to group messages into groups of about 3-5 messages, so we would be continuously adding groups as the process runs. In this case, it seems like we'd eventually just run out of memory trying to store all the groups.
I'm interested in any experiences/thoughts/pros/cons.
I've used Message Groups on many projects and it works great. Though for full disclosure I was one of the folks pushing for Message Groups and did much of the initial implementation work.
The use case for Message Groups came from partitioning large topic hierarchies, such as dealing with financial stock symbols and the like. We wanted message groups to be able to use very fine-grained correlation expressions (JMSXGroupID strings), so you could use the date, stock symbol, and product type as the group ID, or the customer or business transaction ID, or whatever.
To avoid having to keep every group ID string in memory, the default provider uses hash buckets, so we only store the mapping of hash buckets to consumers, not the individual strings. That means it scales to as many group IDs as you want to use! It also means we don't have to 'clean out' old message group IDs, etc.
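A minimal JMS sketch of assigning messages to a group via JMSXGroupID; the broker URL, queue name, and group id format are illustrative:

    import javax.jms.Connection;
    import javax.jms.MessageProducer;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class GroupedSender {

        public static void main(String[] args) throws Exception {
            ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
            Connection connection = factory.createConnection();
            try {
                connection.start();
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                MessageProducer producer = session.createProducer(session.createQueue("ORDERS"));

                TextMessage message = session.createTextMessage("order payload");
                // All messages carrying the same JMSXGroupID are delivered to the same consumer, in order.
                message.setStringProperty("JMSXGroupID", "2023-11-07:IBM:FUTURES");
                producer.send(message);
            } finally {
                connection.close();
            }
        }
    }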
