Message idempotence - ordering considerations - events

Let's say we have a system where one producer queues messages in a queue and multiple instances of the same consumer process these events.
Since we are in a Competing Consumers pattern, we know that ordering is no longer guaranteed. This means that we must ensure that our messages are idempotent.
From what I read here (under the Message ordering bullet point), we must ensure that message processing is idempotent.
Here are the questions:
How can we design our message processing to be idempotent?
If we are saving every event in an event store, are there any considerations to take into account when designing each event's payload and the event aggregation used to get the aggregate state?
An example: let's say that we have "User Created" and "User Deleted" messages (or any other pair of events which NEED to be processed in order). If we process "User Deleted" before "User Created", the user won't be deleted, even if the events are ordered in the event queue. Can idempotent processing/idempotent events really give us a deleted user?
Another example.
Let's suppose that we have an entity that has a score attribute. A user can modify the score. A second service consumes events of the "score entity" service; if the score reaches 100, the entity (or an entity reference) is inserted by the second service in the "Best category" entity, and if the score reaches -20, the second service inserts the score entity in the "Worst category". Having multiple instances of the second service can give an unpredictable result if the "score 100" and "score -20" events arrive within a tiny interval of time. Any ideas on how to design the "score x" events or how to process these events?
Thank you so much for your help!

How can we design our message processing to be idempotent?
You should:
ignore messages that you have already "seen"; this means that the consumer needs a way of detecting that. It could, for example, keep a list of the message IDs it has processed, which means that every message should have a unique ID (see the sketch below).
not throw an exception if a message does not change the state; for example, if you receive a second DeleteUser event (the user is already deleted, so a second delete should have no side effect), you simply ignore it. Not every event can be idempotent; UpdateUserName, for example, should not be idempotent.
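A minimal sketch of both points in TypeScript, assuming every message carries a unique id; the in-memory set is a stand-in for a durable store:

```typescript
// Hypothetical message shape: every message carries a unique ID.
interface Message {
  id: string;
  type: string;
  payload: unknown;
}

// In production this set would live in a durable store (e.g. a table
// keyed by message ID); an in-memory Set is used here for brevity.
const processedIds = new Set<string>();

function handleMessage(msg: Message): void {
  // First point: ignore messages we have already seen.
  if (processedIds.has(msg.id)) {
    return;
  }
  applyToState(msg); // domain-specific processing
  processedIds.add(msg.id);
}

function applyToState(msg: Message): void {
  // Second point: e.g. a second "DeleteUser" for an already-deleted user
  // is treated as a no-op, not an exception.
  console.log(`processing ${msg.type} (${msg.id})`);
}
```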
If we are saving every event in an event store, are there any considerations to take into account when designing each event's payload and the event aggregation used to get the aggregate state?
You should design the events based on your Domain; the payload should not contain more information than the domain needs. If additional information would make your read models a lot easier to implement, you can add it to the payload, but be careful to mark it somehow as redundant.
An example: let's say that we have "User Created" and "User Deleted" messages (or any other pair of events which NEED to be processed in order). If we process "User Deleted" before "User Created", the user won't be deleted, even if the events are ordered in the event queue. Can idempotent processing/idempotent events really give us a deleted user?
In this particular case, you can keep an additional collection of deleted users, holding only their IDs. When a CreateUser event arrives, you check whether the user has already been deleted by looking in the DeletedUsers collection, and ignore the event if the user is there. You can ignore every other event that arrives for that user in the same way.
This solution is very dependent on the domain.
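In code, the tombstone idea might look like this sketch (TypeScript, with hypothetical handlers):

```typescript
// Tombstone approach: remember deleted user IDs so a late-arriving
// UserCreated (or any other event) for that user can be ignored.
const users = new Map<string, { name: string }>();
const deletedUsers = new Set<string>(); // only the IDs are kept

function onUserDeleted(userId: string): void {
  users.delete(userId);     // no-op if the user never existed
  deletedUsers.add(userId); // the tombstone survives either way
}

function onUserCreated(userId: string, name: string): void {
  if (deletedUsers.has(userId)) {
    return; // the delete already happened; ignore the late create
  }
  users.set(userId, { name });
}

// Out-of-order delivery now converges on the same state:
onUserDeleted("42");
onUserCreated("42", "Alice"); // ignored
console.log(users.has("42")); // false
```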
Another example. Let's suppose that we have an entity that has a score attribute. A user can modify the score. A second service consumes events of the "score entity" service; if the score reaches 100, the entity (or an entity reference) is inserted by the second service in the "Best category" entity, and if the score reaches -20, the second service inserts the score entity in the "Worst category". Having multiple instances of the second service can give an unpredictable result if the "score 100" and "score -20" events arrive within a tiny interval of time. Any ideas on how to design the "score x" events or how to process these events?
This situation can be resolved by keeping, in (or attached to) the read models (the two collections, best and worst), the timestamp/order/stream-version of the last processed event, and ignoring every event that is less than or equal to it. That way, if "score 100" is emitted after "score -20" but arrives first, you ignore the "score -20" because it has a lower timestamp, even though it arrives last.
This solution is generic but it relies on the existence of some ordering.
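A minimal sketch of that check in TypeScript; the field names are hypothetical:

```typescript
// Each read model keeps the version (stream position, timestamp, ...) of
// the last event it applied, and ignores anything at or below it.
interface ReadModel {
  lastAppliedVersion: number;
  category: "best" | "worst" | null;
}

interface ScoreEvent {
  entityId: string;
  version: number; // ordering key from the event stream
  score: number;
}

function apply(model: ReadModel, event: ScoreEvent): void {
  if (event.version <= model.lastAppliedVersion) {
    return; // stale or duplicate event: skip it
  }
  if (event.score >= 100) model.category = "best";
  else if (event.score <= -20) model.category = "worst";
  model.lastAppliedVersion = event.version;
}

const model: ReadModel = { lastAppliedVersion: 0, category: null };
apply(model, { entityId: "e1", version: 2, score: 100 }); // applied
apply(model, { entityId: "e1", version: 1, score: -20 }); // ignored: stale
console.log(model.category); // "best"
```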

Related

Fire Message Event Only when These other Messages have been sent

I'm working on architecting a microservice solution where most code will be C#, and most likely Angular for any front end. My question is about message chaining. I am still figuring out which message broker to use: Azure Service Bus, RabbitMQ, etc. There is a concept I haven't found much about.
How do I handle cases where I want to fire a message when a specific set of messages has fired? An example (not part of my actual solution): I want to notify someone when they pay a bill. We send a message "PAIDBILL"
which fires off microservices, each processing it independently:
FinanceService: debits the ledger and fires "PaymentPosted"
EmailService: emails the customer saying thank you for paying the bill, then fires "CustomerPaymentEmailSent"
DiscountService: checks whether they get a discount for paying on time, then sends "CustomerCanGetPaymentDiscount"
If all three messages have fired for the same PAIDBILL ("PaymentPosted", "CustomerPaymentEmailSent", "CustomerCanGetPaymentDiscount"),
then I want to email the customer that they will get a discount on their next bill. It must be done AFTER all three have triggered, and the order doesn't matter. How do I schedule a new "EmailNextTimeDiscount" message to be sent, without having to poll for which messages have fired every minute, hour, or day?
All I can think of is a SQL table which marks each one as complete (by locking the table); when the last one is filled in, send off the message. Would this be a good solution? It strikes me as an anti-pattern for the microservice & message queue design.
If you're using messages (e.g. Service Bus / RabbitMQ), then I think the solution you have described is the best one. This type of design - where services have knowledge about the other domains in the system - is typically known as choreography.
You'll want to pick a service which will be responsible for this business logic. That service will need to receive all the preceding types of messages so that it can determine when (and whether) all the conditions have been met, which it probably wants to do by recording in a database which of the gates have already passed, as sketched below.
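A sketch of that gate-recording idea in TypeScript; the message names come from the question, while the in-memory map is a stand-in for the database table:

```typescript
// Track which of the three expected messages have arrived for each bill.
// In real code this map would be a database table updated atomically
// (the locking the question mentions, or optimistic concurrency).
type Gate =
  | "PaymentPosted"
  | "CustomerPaymentEmailSent"
  | "CustomerCanGetPaymentDiscount";

const REQUIRED: Gate[] = [
  "PaymentPosted",
  "CustomerPaymentEmailSent",
  "CustomerCanGetPaymentDiscount",
];

const gatesByBill = new Map<string, Set<Gate>>();

function onGateMessage(billId: string, gate: Gate): void {
  const seen = gatesByBill.get(billId) ?? new Set<Gate>();
  seen.add(gate);
  gatesByBill.set(billId, seen);
  if (REQUIRED.every((g) => seen.has(g))) {
    publish("EmailNextTimeDiscount", billId); // all gates have passed
    gatesByBill.delete(billId); // so a duplicate message won't re-send
  }
}

function publish(type: string, billId: string): void {
  console.log(`publishing ${type} for bill ${billId}`);
}
```

Because the gate state is updated as each message arrives, no polling is needed: the last arriving message is the one that triggers the follow-up.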
One alternative you could consider is chaining the business processes instead of doing them in parallel. So...
PAIDBILL causes FinanceService to debit the ledger and fire "PaymentPosted"
"PaymentPosted" causes EmailService to email the customer saying thank you for paying the bill and to broadcast "CustomerPaymentEmailSent"
"CustomerPaymentEmailSent" causes DiscountService to check whether they get a discount for paying on time, then it sends "CustomerCanGetPaymentDiscount"
The email you want to send is just triggered by "CustomerCanGetPaymentDiscount".
If I'm honest, I would switch around the dependency model you're using at this last stage. So, instead of some component listening for "CustomerCanGetPaymentDiscount" events from DiscountService and sending an email, I think I would instead have the DiscountService tell some other component to send an email. It seems natural to me for something that calculates discounts to know that an email should be sent. It seems less natural for something that sends emails to know about discounts (and everything else that needs emails sent). This is why I don't like architectures where the assumption is that every message should be an event and every action should be triggered by an event: it removes a lot of decisions about where domain logic can live, because the message receiver always has to know about the domain of the message sender, never vice versa.

CQRS - out of order messages

Suppose we have 3 different services producing events, each of them publishing to its own event store.
Each of these services consumes the other producer services' events.
This is because each service has to process other services' events AND create its own projection. Each of the services runs on multiple instances.
The most straightforward way to do it (for me) was to put "something" in front of each ES which picks events and publishes them (pub/sub) to queues for every other service.
This is perfect because every service can subscribe to whichever topics it likes, the event publisher does the job, and if a service is unavailable events are still delivered. This seems to me to guarantee high scalability and availability.
My problem is the queue. I can't find an easily scalable queue that guarantees ordering of the messages. What it actually guarantees is "slightly out of order" with at-least-once delivery: to be clear, it's AWS SQS.
So, the ordering problems are:
No order guaranteed across events from the same event stream.
No order guaranteed across events from the same ES.
No order guaranteed across events from different ES (different services).
I thought I could solve the first two problems just by keeping track of the "sequence number" of the events coming from the same ES.
This would be done by tracking the last sequence number of each topic from which we are consuming events.
This should be easy both for reacting to events and for building our projection.
Then, when I pop an event from the queue, if eventSequenceNumber > previousAppliedEventSequenceNumber + 1, I re-enqueue it (or make it invisible for a certain time).
But it turns out that this solution destroys performance when events are produced at high rates (whether I use a visibility timeout or something else, the result should be the same).
This is because while I'm expecting event 10 and set event 11 aside for a moment, I have to set aside all the later events (from that ES) as well, until event 11 shows up again and is effectively processed.
Other difficulties were:
where to keep track of the event's sequence number for building the projection.
how to keep track of the event's sequence number for building the projection so that, when applying it, I have a consistent lastSequenceNumber.
What am I missing?
P.S.: for the third problem, consider the following scenario. We have a UserService and a CartService. The CartService has a projection which, for each user, keeps track of the products in the cart. Each cart projection must also contain the user's name and other info coming from the UserCreated event published by the UserService. If UserCreated comes after ProductAddedToCart, the normal flow requires throwing an exception because the user doesn't exist yet.
What am I missing?
You are missing flow -- consumers pull messages from sources, rather than having sources push the messages to the consumers.
When I wake up, I check my bookmark to find out which of your messages I read last, and then ask you if there have been any since. If there have, I retrieve them from you in order (think "document message"), also writing down the new bookmarks. Then I go back to sleep.
The primary purpose of push notifications is to interrupt the sleep period (thereby reducing latency).
With SQS acting as a queue, the idea is that you read all of the enqueued messages at once. If there are no gaps, you can order the collection, then start processing and acking them. If there are gaps, you either wait (leaving the messages in the queue) or go to the event store to fetch copies of the missing messages.
There's no magic -- if the message pipeline is promising "at least once" delivery, then the consumers must take steps to recognize duplicate messages as they arrive.
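A sketch of that gap check in TypeScript, assuming each message carries a per-stream sequence number (the SQS client and the acking are elided):

```typescript
// Given a batch read from the queue, apply the contiguous run that follows
// the last applied sequence number. Duplicates are dropped; anything past
// a gap is left in the queue (not acked) until the gap fills in, or is
// fetched from the event store instead.
interface QueuedEvent {
  sequence: number;
  body: string;
}

function processBatch(batch: QueuedEvent[], lastApplied: number): number {
  batch.sort((a, b) => a.sequence - b.sequence);
  for (const event of batch) {
    if (event.sequence <= lastApplied) continue;   // duplicate: drop it
    if (event.sequence !== lastApplied + 1) break; // gap: stop here
    apply(event);
    lastApplied = event.sequence;
  }
  return lastApplied; // persist this alongside the projection
}

function apply(event: QueuedEvent): void {
  console.log(`applying event #${event.sequence}: ${event.body}`);
}

// Events 3 and 4 arrive before 2: nothing is applied until the gap fills.
console.log(processBatch([{ sequence: 3, body: "c" }, { sequence: 4, body: "d" }], 1)); // 1
console.log(processBatch([{ sequence: 2, body: "b" }, { sequence: 3, body: "c" }], 1)); // 3
```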
If UserCreated comes after ProductAddedToCart, the normal flow requires throwing an exception because the user doesn't exist yet.
Review Race Conditions Don't Exist, by Udi Dahan: "A microsecond difference in timing shouldn’t make a difference to core business behaviors."
The basic issue is assuming we can get messages IN ORDER...
This is a fallacy in distributed computing...
I suggest you design for no message ordering in your system.
As for your issues, try to use a UTC timestamp, created by the originator, in the message body/header, and work around that data point. Sequence numbers are going to fail unless you have a central deterministic sequence creator (which would be a non-scalable single point of failure).
Using sagas/state machines is a path that can help to make sense of (business) event ordering.

CQRS + Microservices Handling event rollback

We are using microservices, CQRS, and an event store (using nodejs cqrs-domain); everything works like a charm, and the typical flow goes like:
1. REST
2. Service
3. Command validation
4. Command
5. Aggregate
6. Event
7. Event store (transactional data)
8. Return aggregate with aggregate ID
9. Store in microservice local DB (essentially the read DB)
10. Publish event to the queue
The problem with the flow above is that the transactional save (persistence to the event store) and the storage to the microservice's read DB happen in different transaction contexts. If there is a failure at step 9, how should I handle the event, which has already been persisted to the event store, and the aggregate, which has already been updated?
Any suggestions would be highly appreciated.
The problem with the flow above is that the transactional save (persistence to the event store) and the storage to the microservice's read DB happen in different transaction contexts. If there is a failure at step 9, how should I handle the event, which has already been persisted to the event store, and the aggregate, which has already been updated?
You retry it later.
The "book of record" is the event store. The downstream views (the "published events", the read models) are derived from the book of record. They are typically behind the book of record in time (eventual consistency) and are not typically synchronized with each other.
So you might have, at some point in time, 105 events written to the book of record, but only 100 published to the queue, and a representation in your service database constructed from only 98.
Updating a view is typically done in one of two ways. You can, of course, start with a brand new representation and replay all of the events into it as part of each update. Alternatively, you track in the metadata of the view how far along in the event history you have already gotten, and use that information to determine where the next read of the event history begins.
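The second approach might look like this sketch, where readEventsAfter is a hypothetical stand-in for the event store query:

```typescript
// The view stores, in its own metadata, how far into the event history it
// has read; each update resumes from that checkpoint.
interface StoredEvent { position: number; type: string; data: unknown; }
interface View { checkpoint: number; state: Record<string, unknown>; }

// Stand-in for the book of record and a query against it (hypothetical API).
const bookOfRecord: StoredEvent[] = [
  { position: 1, type: "AccountOpened", data: {} },
  { position: 2, type: "FundsDeposited", data: { amount: 50 } },
];
function readEventsAfter(position: number): StoredEvent[] {
  return bookOfRecord.filter((e) => e.position > position);
}

function updateView(view: View): void {
  for (const event of readEventsAfter(view.checkpoint)) {
    view.state[event.type] = event.data; // fold the event into the view
    view.checkpoint = event.position;    // advance the checkpoint with it
  }
}

const view: View = { checkpoint: 0, state: {} };
updateView(view); // processes events 1 and 2
updateView(view); // no-op: the checkpoint is already at 2
```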
Inside your event store, you could track whether read-side replication was successful.
As soon as step 9 succeeds, you can flag the event as 'replicated'.
That way, you could introduce a component that watches for unreplicated events and re-triggers step 9. You could also track whether the replication has failed multiple times.
Updating the read side (step 9) and flagging an event as replicated should happen consistently. You could use a saga pattern here.
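A sketch of such a watcher in TypeScript, assuming the stored events carry a hypothetical replicated flag and attempt counter:

```typescript
// Periodically look for events that were written to the event store but
// never made it to the read side, and retry step 9 for them.
interface TrackedEvent {
  id: string;
  replicated: boolean; // hypothetical flag set when step 9 succeeds
  attempts: number;    // for spotting events that keep failing
}

async function retryUnreplicated(events: TrackedEvent[]): Promise<void> {
  for (const event of events.filter((e) => !e.replicated)) {
    try {
      await updateReadSide(event); // step 9
      event.replicated = true;     // must happen consistently with step 9
    } catch {
      event.attempts += 1;
    }
  }
}

async function updateReadSide(event: TrackedEvent): Promise<void> {
  console.log(`replicating event ${event.id} to the read DB`);
}
```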
I think I have now understood it to a better extent.
The aggregate would still be created. The answer is that all validations for any type of consistency should happen before the aggregate is constructed; it is only for failures beyond the purview of the code, such as a failure while updating the microservice's read-side DB, that handling is needed.
So in the ideal case the aggregate is created, but the associated event remains undispatched until all the read dependencies are updated; if they are not, it stays undispatched and can be handled separately.
The event store will still have all the events, and eventual consistency is maintained this way.

Event-sourcing: Dealing with derived data

How does an event-sourcing system deal with derived data? All the examples I've read on event-sourcing demonstrate services reacting to fact events. A popular example seems to be:
Bank Account System
Events
Funds deposited
Funds withdrawn
Services
Balance Service
They then show how the Balance service can, at any point, derive a state (i.e. the balance) from the events. That makes sense; those events are facts. There's no question that they happened - they are external to the system.
However, how do we deal with data calculated BY the system?
E.g.
Overdrawn service:
A service which is responsible for monitoring the balance and performing some action when it goes below zero.
Does the event-sourcing approach dictate how we should use (or not use) derived data, i.e. the balance? Perhaps one of the following?
1) Use: [Funds Withdrawn event] + [Balance service query]
Listen for the "Funds withdrawn" event and then ask the Balance service for the current balance.
2) Use: [Balance changed event]
Get the balance service to emit a "Balance changed" event containing the current balance. Presumably this isn't a "fact", as it's not external to the system, and is therefore prone to miscalculation.
3) Use: [Funds withdrawn event] + [Funds deposited event]
We could just skip the Balance service and have each service maintain its own balance directly from the facts. ...though that would result in each service having its own (potentially different) version of the balance.
A service which is responsible for monitoring the balance and performing some action when it goes below zero.
Executive summary: the way this is handled in event sourced systems is not actually all that different from the alternatives.
Stepping back a second - the advantage of having a domain model is to ensure that all proposed changes satisfy the business rules. Borrowing from the CQRS language: we send command messages to a command handler. The handler loads the state of the model and tries to apply the command. If the command is allowed, the state of the domain model is updated and saved.
After persisting the state of the model, the command handler can query that state to determine whether there are outstanding actions to be performed. Udi Dahan describes this in detail in his talk on Reliable messaging.
So the most straightforward way to describe your service is one that updates the model each time the account balance changes, and sets the "account overdrawn" flag if the balance is negative. After the model is saved, we schedule any actions related to that state.
Part of the justification for event sourcing is that the state of the domain model is derivable from the history. Which is to say, when we are trying to determine if the model allows a command, we load the history, and compute from the history the current state, and then use that state to determine whether the command is permitted.
What this means, in practice, is that we can write an AccountOverdrawn event at the same time that we write the AccountDebited event.
That AccountDebited event can be subscribed to - pub/sub. The typical handling is that new events are published after they are successfully written to the book of record. An event listener subscribing to the events coming out of the domain model observes the event and schedules the command to be run.
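As a sketch, the command handler's decision step could look like this; the event names follow the answer, the rest is hypothetical:

```typescript
// Applying a debit can yield two events in the same write: the debit
// itself, plus an AccountOverdrawn fact if the resulting balance is negative.
type AccountEvent =
  | { type: "FundsDeposited"; amount: number }
  | { type: "AccountDebited"; amount: number }
  | { type: "AccountOverdrawn"; balance: number };

// The current balance is derived from the history, as the answer describes.
function balanceOf(history: AccountEvent[]): number {
  let balance = 0;
  for (const e of history) {
    if (e.type === "FundsDeposited") balance += e.amount;
    if (e.type === "AccountDebited") balance -= e.amount;
  }
  return balance;
}

function debit(history: AccountEvent[], amount: number): AccountEvent[] {
  const newBalance = balanceOf(history) - amount;
  const events: AccountEvent[] = [{ type: "AccountDebited", amount }];
  if (newBalance < 0) {
    events.push({ type: "AccountOverdrawn", balance: newBalance });
  }
  return events; // both events go into the same write to the book of record
}

console.log(debit([{ type: "FundsDeposited", amount: 30 }], 50));
// -> AccountDebited(50) and AccountOverdrawn(balance: -20)
```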
Digression: typically, we'll want at-least-once execution of these activities. That means keeping track of acknowledgements.
Therefore, the event handler is also a thing with state. It doesn't have any business state in it, and certainly no rules that would allow it to reject events. What it does track is which events it has seen and which actions need to be scheduled. The rules for loading this event handler (more commonly called a process manager) are just like those of the domain model: load events from the book of record to obtain the current state, then see if the event being handled changes anything.
So it is really subscribing to two events - the AccountDebited event, and whatever event returns from the activity to acknowledge that it has completed.
This same mechanic can be used to update the domain model in response to events from elsewhere.
Example: suppose we get a FundsWithdrawn event from an ATM, and we need to update the account history to match it. So our event handler gets loaded, updates itself, and schedules a RecordATMWithdrawal command to be run. When the command runs, it loads the account, updates the balances, and writes out the AccountDebited and AccountOverdrawn events as before. The event handler sees these events, loads the correct process state based on the metadata, and updates the state of the process.
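A compressed sketch of such a process manager; the acknowledgement event name is hypothetical:

```typescript
// A process manager holds no business rules: only which events it has seen
// (for at-least-once delivery) and which commands it has scheduled but not
// yet seen acknowledged.
interface ProcessState {
  seenEventIds: Set<string>;
  pendingCommands: string[];
}

function handle(state: ProcessState, eventId: string, eventType: string): void {
  if (state.seenEventIds.has(eventId)) return; // duplicate delivery: no-op
  state.seenEventIds.add(eventId);

  if (eventType === "FundsWithdrawn") {
    // schedule the follow-up command; the actual dispatch is elided
    state.pendingCommands.push("RecordATMWithdrawal");
  }
  if (eventType === "RecordATMWithdrawalAcknowledged") {
    // the activity completed; it no longer needs to be (re)scheduled
    state.pendingCommands = state.pendingCommands.filter(
      (c) => c !== "RecordATMWithdrawal",
    );
  }
}
```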
In CQRS terms, this is all taking place in the "write models"; these processes are all about updating the book of record.
The balance query itself is easy - we already showed that the balance can be derived from the history of the domain model, and that's just how your balance service is expected to do it.
To sum up; at any given time you can load the history of the domain model, to query its state, and you can load up the history of the event processor, to determine what work has yet to be acknowledged.
Event sourcing is an evolving discipline with a bunch of diverse practices, practitioners and charismatic people. You can't expect them to provide you with a single consistent modelling technique for all the scenarios you described. Each one of those scenarios has its pros and cons, and you've spotted some of them. Things also vary dramatically from one project to another, because business requirements (the evolutionary pressures of the market) differ.
If you are working on a mission-critical system and you want a balance that is consistent at all times, it's better to use an RDBMS and ACID transactions.
If you need maximum speed, and you are okay with eventually consistent states and not very anxious about the precision of your balances (some events may be missing here and there for a bunch of reasons), then you can derive your balance projections from events asynchronously.
In both scenarios you can use event sourcing, but you don't necessarily have to generate your projections asynchronously. It's okay to generate a projection in the same transaction scope as the changes to your write model if you really need to.
Will it make Greg Young happy? I have no idea, but who cares about such things if your balances may one day go out of sync in a mission-critical system...

Dequeuing events during discrete event simulation

I have a question regarding the dequeue mechanism during discrete event simulation.
Most implementations use some kind of priority queue which can quickly retrieve the event with the earliest timestamp. What happens when such an event cannot be scheduled because, say, it needs a resource to be able to run?
There may be another event in the queue whose timestamp is greater than the timestamp of the event that is blocked on a resource.
For example, let us assume we are modelling a grocery store with separate checkout lines and a cashier per line. A shopper entering a checkout line is an event. We enqueue this event based on the time the shopper entered the checkout line. However, the order in which our simulation should execute two such events is not necessarily the time order in which they entered the checkout lines, because the cashiers might free up in a different order.
In such a scenario, how does using a priority queue based solely on timestamp --- and independent of resource availability --- work out?
You need a queue for each cashier, or at least a count of waiting customers if customer identity is not important in your simulation (e.g. I would join a queue of three people with one item each over a queue with one person with a full trolley, so a queue length alone may not capture the information needed to incorporate that heuristic).
When a customer joins the line, the count of queuing customers is incremented, or the customer is pushed onto the cashier's queue.
When the cashier is ready to serve, the first customer is popped off the cashier's queue. So the customer-service event depends not on the time the customer arrives, but on when the cashier is ready.
These queues or counters are independent of the scheduling mechanism for events - the scheduled events manipulate these queues, but they aren't dependent on them for scheduling.
As Pete Kirkham pointed out, it's important to be aware that the lines (queues) that customers wait in are completely separate things from the priority queue that's used to determine event ordering.
In discrete-event simulation an event is a point in time at which the system state changes. When an event occurs you figure out what to do next based on the state. Joining the line of customers is an event, but so is becoming eligible for service. Once a customer becomes eligible for service, the logic of that event has to check whether service is possible or not. If so, schedule a new event for when the service will end. If there are resource constraints, then nothing gets scheduled and that customer is on hold. However, at some point in the future the required resource will become available. That's an event too, and that event's logic should check to see if there are customers on hold due to lack of the resource. If not, there's no need to schedule anything, but if so, you can now schedule the actual service for the customer. You can see that customer delays in the queue will increase with resource constraints.
For a much fuller explanation of how discrete-event simulations work, please look at this introductory tutorial paper.
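For a concrete feel, here is a minimal sketch of that arrival/service-end logic in TypeScript; a sorted array stands in for the priority queue, and all names are hypothetical:

```typescript
// Minimal discrete-event loop: a time-ordered event list, a count of free
// cashiers, and a count of customers on hold. Arrivals with no free cashier
// schedule nothing; each service-end event releases the cashier and, if
// anyone is on hold, immediately schedules the next service.
type SimEvent = { time: number; kind: "arrival" | "serviceEnd" };

const events: SimEvent[] = [
  { time: 0, kind: "arrival" },
  { time: 1, kind: "arrival" },
];
let freeCashiers = 1;
let waiting = 0;
const SERVICE_TIME = 5;

while (events.length > 0) {
  events.sort((a, b) => a.time - b.time); // stand-in for a priority queue
  const e = events.shift()!;
  if (e.kind === "arrival") {
    if (freeCashiers > 0) {
      freeCashiers -= 1;
      events.push({ time: e.time + SERVICE_TIME, kind: "serviceEnd" });
    } else {
      waiting += 1; // resource constrained: the customer is on hold
    }
  } else {
    freeCashiers += 1;
    if (waiting > 0) {
      waiting -= 1;
      freeCashiers -= 1;
      events.push({ time: e.time + SERVICE_TIME, kind: "serviceEnd" });
    }
  }
  console.log(`t=${e.time} ${e.kind} waiting=${waiting} free=${freeCashiers}`);
}
```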
