EventSourcing fix business logic bugs - event-sourcing

I'm making some study of eventsourcing before applying it (or not).
Quick question : When using EventSourcing pattern we can imagine this scenario to handle an event :
command sent
command handler receive the previous command, validate it then
command handler persist this event and publish it
business model apply (business logic algorithm v1 for example) this event mutating its internal state
We can replay all the events and reconstruct the business object state.
How to handle business logic bugs (business logic algorithm v1 contains a nasty bugs).
I read we can fix the bug and replay the events and then we got the business model in a valid state once again.
But what happens if when fixing the business rule when applying event#1 would have caused the 'futurs' commands to fails ? In other words, the event#2, event#3, event#n was dependend of the state of the domain model after applying event#0. How can we fix the cascading events failure ?
I don't have a specific usecase : but we can imagine an account where balance is currently positive. Applying Event#0 increment the balance but this was a bug, the developer wanted to reduce the balance. Event#1 is a purchase that was valid because of the positive balance at this time.
The developer fixes the bug and replay the events. Event#0 decrease the balance which becomes negative. Event#1 is replayed : what happens ?
Do we need to handle this case with 'compensation' ? how ?
thanks in advance for your comments, external ressources that can be of any help (articles, blogs).
bye

Minor correction
When using EventSourcing pattern we can imagine this scenario to handle an event
command sent
command handler receive the previous command, validate it then
business model verifies that the command can be satisfied without violating the business invariant, and calculates the ensuing events
command handler persist this events and publish them
The command handler (specifically, the anti-corruption layer) is responsible for making sure that the command is well formed. The business model decides if the command is permitted by the business.
The good news: the events are just state changes; all of the rule validation is already done. When you fix the bug in the domain object so that it produces the correct events in response to the command, you aren't changing the way the event is applied.
And you certainly aren't changing the history -- if the ATM gave away $20 that it wasn't supposed to, you can't get the money back by editing the record.
What that means is that deploying the bug fix keeps the problem from getting worse; but it doesn't do anything for the event histories that are incorrect.
Compensating events are the right answer here. Ever have a grocery clerk double scan an item, and have to back one of them out? If you look closely, you'll see all three items
+1 candy bar
+1 candy bar
-1 candy bar
That's the idiom of the compensating event being appended to end of the stream.
So if the error showed first appeared in event #0, and then [event #1 .. event #99] have been played on top of that, the remedy for the error is to publish a compensating event #100.
Notice that this is exactly what book keepers would do. You put the wrong sign on the entry on line #1, add a bunch more entries, realize your mistake, and add a new entry that compensates for the earlier mistake.
More good news: in mature business processes, there are already mitigation procedures in place to handle various contingencies. So you can grab a meeting with your domain experts, and doodle on the whiteboard explaining the problem, and your experts should be able to show you the right way to compensate for it. Everything after that is feature management (does the mitigation need to be automated? Does the system need to do the mitigation automatically, or can it let human experts tell it what mitigation to apply, etc. etc.)

Related

Is there any way to replay events in a date range?

I am implementing an example of spring-boot and axon. I have two events
(deposit and withdraw account balance). I want to know is there any way to get the state of the Account Aggregate by a given date ?
I want to get not just the final state, but to replay events in a range of dates.
I think I can help with this.
In the context of Axon Framework, you can start a replay of events by telling a given TrackingEventProcessor to 'reset' it's Tokens. By the way, the current description on this in the Reference Guide can be found here.
These TrackingTokens are the objects which know how far a given TrackingEventProcessor is in terms of handling events from the Event Stream. Thus resetting/adjusting these TrackingTokens is what will issue a Replay of events.
Knowing all these, the second step is to look at the methods the TrackingEventProcessor provides to 'reset tokens', which is threefold:
TrackingEventProcessor#resetTokens()
TrackingEventProcessor#resetTokens(Function<StreamableMessageSource, TrackingToken>)
TrackingEventProcessor#resetTokens(TrackingToken)
Option one will reset your tokens to the beginning of the event stream, which will thus replay everything.
Option two and three however give you the opportunity to provide a TrackingToken.
Thus, you could provide a TrackingToken starting from several points on the Event Stream. So, how do you go about to creating such a TrackingToken at a specific point in time? To that end, you should take a look at the StreamableMessageSource interface, which has the following operations:
StreamableMessageSource#createTailToken()
StreamableMessageSource#createHeadToken()
StreamableMessageSource#createTokenAt(Instant)
StreamableMessageSource#createTokenSince(Duration)
Option 1 is what's used to create a token at the start of the stream, whilst 2 will create a token at the head of the stream.
Option 3 and 4 will however allow you to create a token at a specific point in time, thus allowing you to replay all the events since the defined instance up to now.
There is one caveat in this scenario however. You're asking to replay an Aggregate. From Axon's perspective by default the Aggregate is the Command Model in a CQRS set up, thus dealing with Commands going in to your system. In the majority of the applications, you want Commands (e.g. the requests to change something) to occur on the current state of the application. As such, the Repository provided to retrieve an Aggregate does not allow specifying a point in time.
The above described solution in regards to replaying is thus solely tied to Query Model creation, as the TrackingEventProcessor is part of the Event Handling side in your application most often used to create views. This idea also ties in with your questions, that you want to know the "state of the Account Aggregate" at a given point in time. That's not a command, but a query, as you have 'a request for data' instead of 'the request to change state'.
Hope this helps you out #Safe!

Compensating Events on CQRS/ES Architecture

So, I'm working on a CQRS/ES project in which we are having some doubts about how to handle trivial problems that would be easy to handle in other architectures
My scenario is the following:
I have a customer CRUD REST API and each customer has unique document(number), so when I'm registering a new customer I have to verify if there is another customer with that document to avoid duplicity, but when it comes to a CQRS/ES architecture where we have eventual consistency, I found out that this kind of validations can be very hard to address.
It is important to notice that my problem is not across microservices, but between the command application and the query application of the same microservice.
Also we are using eventstore.
My current solution:
So what I do today is, in my command application, before saving the CustomerCreated event, I ask the query application (using PostgreSQL) if there is a customer with that document, and if not, I allow the event to go on. But that doesn't guarantee 100%, right? Because my query can be desynchronized, so I cannot trust it 100%. That's when my second validation kicks in, when my query application is processing the events and saving them to my PostgreSQL, I check again if there is a customer with that document and if there is, I reject that event and emit a compensating event to undo/cancel/inactivate the customer with the duplicated document, therefore finishing that customer stream on eventstore.
Altough this works, there are 2 things that bother me here, the first thing is my command application relying on the query application, so if my query application is down, my command is affected (today I just return false on my validation if query is down but still...) and second thing is, should a query/read model really be able to emit events? And if so, what is the correct way of doing it? Should the command have some kind of API for that? Or should the query emit the event directly to eventstore using some common shared library? And if I have more than one view/read? Which one should I choose to handle this?
Really hope someone could shine a light into these questions and help me this these matters.
For reference, you may want to be reviewing what Greg Young has written about Set Validation.
I ask the query application (using PostgreSQL) if there is a customer with that document, and if not, I allow the event to go on. But that doesn't guarantee 100%, right?
That's exactly right - your read model is stale copy, and may not have all of the information collected by the write model.
That's when my second validation kicks in, when my query application is processing the events and saving them to my PostgreSQL, I check again if there is a customer with that document and if there is, I reject that event and emit a compensating event to undo/cancel/inactivate the customer with the duplicated document, therefore finishing that customer stream on eventstore.
This spelling doesn't quite match the usual designs. The more common implementation is that, if we detect a problem when reading data, we send a command message to the write model, telling it to straighten things out.
This is commonly referred to as a process manager, but you can think of it as the automation of a human supervisor of the system. Conceptually, a process manager is an event sourced collection of messages to be sent to the command model.
You might also want to consider whether you are modeling your domain correctly. If documents are supposed to be unique, then maybe the command model should be using the document number as a key in the book of record, rather than using the customer. Or perhaps the document id should be a function of the customer data, rather than being an arbitrary input.
as far as I know, eventstore doesn't have transactions across different streams
Right - one of the things you really need to be thinking about in general is where your stream boundaries lie. If set validation has significant business value, then you really need to be thinking about getting the entire set into a single stream (or by finding a way to constrain uniqueness without using a set).
How should I send a command message to the write model? via API? via a message broker like Kafka?
That's plumbing; it doesn't really matter how you do it, so long as you are sure that the command runs within its own transaction/unit of work.
So what I do today is, in my command application, before saving the CustomerCreated event, I ask the query application (using PostgreSQL) if there is a customer with that document, and if not, I allow the event to go on. But that doesn't guarantee 100%, right? Because my query can be desynchronized, so I cannot trust it 100%.
No, you cannot safely rely on the query side, which is eventually consistent, to prevent the system to step into an invalid state.
You have two options:
You permit the system to enter in a temporary, pending state and then, eventually, you will bring it into a valid permanent state; for this you could allow the command to pass, yield CustomerRegistered event and using a Saga/Process manager you verify against a uniquely-indexed-by-document-collection and issue a compensating command (not event!), i.e. UnregisterCustomer.
Instead of sending a command, you create&start a Saga/Process that preallocates the document in a uniquely-indexed-by-document-collection and if successfully then send the RegisterCustomer command. You can model the Saga as an entity.
So, in both solution you use a Saga/Process manager. In order for the system to be resilient you should make sure that RegisterCustomer command is idempotent (so you can resend it if the Saga fails/is restarted)
You've butted up against a fairly common problem. I think the other answer by VoicOfUnreason is worth reading. I just wanted to make you aware of a few more options.
A simple approach I have used in the past is to create a lookup table. Your command tries to register the key in a unique constraint table. If it can reserve the key the command can go ahead.
Depending on the nature of the data and the domain you could let this 'problem' occur and raise additional events to mark it. If it is something that's important to the business/the way the application works then you can deal with it either manually or at the time via compensating commands. if the latter then it would make sense to use a process manager.
In some (rare) cases where speed/capacity is less of an issue then you could consider old-fashioned locking and transactions. Admittedly these are much better suited to CRUD style implementations but they can be used in CQRS/ES.
I have more detail on this in my blog post: How to Handle Set Based Consistency Validation in CQRS
I hope you find it helpful.

Is it ok to have FAT events with event sourcing?

I have recently been building an application on top of Greg Young EventStore as my peristance layer and I have been pondering how big should I allow an event to get?
For example I have an UK Address Aggregate with the following fields
UK_Address
-BuildingName
-Street
-Locality
-Town
-Postcode
Now I'm building the UI using React/Redux and was thinking should I create a single FAT addressUpdated Event contatining all the above fields?
Or should I Create a event for each of the different fields? and batch them within the client until the Save event is fired? buildingNameUpdated Event, streetUpdated Event, localityUpdated Event.
I'm not sure if the answer is as black and white ask I have asked it what I really would like to know is what conditions/constraints could you use to make the decision?
should I create a event for each of the different fields?
No. The representations of your events are part of the API -- so you want to use spellings that make sense at the level of the business, not at the level of the implementation.
Now I'm building the UI using React/Redux and was thinking should I create a single FAT updateAddress Event containing all the above fields?
You don't need to constrain the data that you send to your UI to match that which is in the persistence store. The UI is just a cached representation of a read model; there's no reason that representation needs to have the same form as what is in your event store.
Consider the React model itself -- your code makes changes to the "in memory" representation of your data, and then the library computes the new DOM and replaces it, which in turn causes the browser to update its view, which in turn causes the pixels on the screen to change.
So taking a fat event from the store, and breaking it into field level events for the UI is fine. Taking multiple events from the store and aggregating them into a single message for the UI is also fine. Taking events from the event store and transforming them into a spelling that the UI will recognize is also fine.
Do you have any comment regarding Arien answer regarding keeping fields that need to be consistent together? so regardless of when your snapshop the current state of the world it would be in a valid state?
I don't believe that this makes sense, and I'm not sure if it is possible in general.
It doesn't make sense, because "valid state" is a write model concern only; events are things that have happened, its too late to vote on whether they are valid or not. For instance, if you deploy a new model, with a new invariant, it still needs to respect the history of what happened before. So you can build a snapshot for that new model, but the snapshot may not be "valid". Too bad.
Given that, I don't think it makes sense to worry over whether each individual event in a commit leaves the snapshot in a valid state.
In particular, if a particular transaction involves multiple entities, it is very likely that the domain language will suggest an event for each entity (we "debit cash" and "credit accounts receivable"). The entities themselves, of course, are capable of changing independently of each other -- it's the aggregate that maintains the balance.
You have to bundle al the information together in one event when this data has to be consistent with each other.
So when you update one field of an address you probably get an unwanted address.
This will happen when the client has not processed all the events at a certain time due to eventual consistency.
Example:
Change address (City=1, Street=1, Housenumber=1) to (City=2, Street=2, Housenumber=2)
When you do this with 3 events and you have just processed one at the time of reading you could get the address: (City=2, Street=1, Housenumber=1).
If puzzled, give a try to a solution that is easier to implement. I guess "FAT" event will be easier: you will end up spending less time for implementing/debugging/supporting.
It is usually referred as YAGNI-KISS-Occam's Razor principles.
In theory and I find it to be a good rule of thumb is to have your commands and events reflecting the intent of the user staying true to DDD. You can find a good explanation of the pros and cons about event granularity here: https://medium.com/#hugo.oliveira.rocha/what-they-dont-tell-you-about-event-sourcing-6afc23c69e9a

Is it acceptable to have an invalid state in eventsourcing after event upgrading and before patching?

Let say I have a stream of persisted events that build a valid state according to some "schema" I have defined.
I change the schema and the events are upgraded to reflect this.
However, some state could not be made valid just by upgrading events, I also needed to add more events to patch the state to make it fully valid.
Firstly, is this reasoning at all valid in terms of event sourcing?
If so, how do I handle cases where a specific version of a state no longer becomes valid? I mean is this acceptable? Should it still be possible to rehydrate a version with invalid state? If this is a write model and it's not the latest version, I could not modify this state anyway so maybee it's no big deal?
However, some state could not be made valid just by upgrading events, I also needed to add more events to patch the state to make it fully valid.
"Compensating events" is the usual term; there is a clerical error in the book of record, so we need to add a new event to the history that corrects the mistake.
If so, how do I handle cases where a specific version of a state no longer becomes valid?
As a rule, you want to be wary, extremely wary, of introducing any automated validation that prevents you from loading an invalid history. Remember, state is just state; the business rules constrain the way the domain is allowed to change. Leaving broken states readable, but broken, is safe.
In particular, if you allow the state to load, it is a straight forward exercise to enumerate your event streams, test the final state of the object, and produce an exception report for any streams that produce an invalid state, escalating them to operators/management for handling, and so on.
Assuming that you are reasonable careful about input validation, and comparing whether your proposed command is consistent with latest known state (aggregates enforce business rules, but they don't need to hoard those rules for themselves), then you can probably achieve error rates low enough that you don't need aggressive data validation. That's especially true when the errors are easy to detect and cheap to fix.
Failing that, freezing any aggregates while they are in an invalid state is a good way to prevent further damage.
But if you really need the state to stay valid, there's a trick that you can play with compensating events.
Consider: the basic pattern of event sourcing looks something like
History history = repository.getHistoryById(id)
State current = State.SEED
for (Event e : history) {
current = current.apply(e)
}
There's actually a hidden concept here, which encapsulates the logic for processing the events prior to passing them to the state. Hidden, because the null case just passes the enumerated events straight through to the target.
History history = repository.getHistoryById(id)
Historian historian = new Historian();
State current = State.SEED
for (Event e : historian.reviewEvents(history)) {
current = current.apply(e)
}
The historian gives you a place to put your compensating event logic - based on its own state, the historian passes through most events, but fixes the ones that knows needs edits/compensation/redactions
Where does the historian state come from? Why, from the history of the historian, of course. You load the history of the event corrections, which will typically be short, into the historian, and then let the historian clean up the events for the aggregate.
And if you need corrections for the historian? It's turtles all the way down! Each stream has a unique historian; the identifier for the historian's stream is calculated from the stream it filters (named UUID's, for example, would allow you to do this). So for each stream, you check to see if a historian stream exists; when you find one that doesn't, you know to stop searching and use the null historian, roll up the changes, process the final sequence of events to regenerate the state of your real object, and off you go.
Mind you, I haven't seen a reference implementation of this idea anywhere; it's whiteboard sound, but the truth is I've been deferring this requirement in my own designs.

Events changing state in CQRS

This should be easy to follow, but after some reading I still can find an answer.
So, say that the user needs to change his mobile number, to accomplished that, we might have a command as: ChangedUserMobileNumber
holding the new number. The domain responsible for handling the command will perform the change in the aggregate and publish an event: UserMobilePhoneChanged
There is a subscriber for that event in another domain, which also holds the user mobile number in its aggregate but according to our software architect, events can not old any data so what we end up is rather stupid to say the least:
The Domain 1, receives the command to update the mobile number, the number is updated and one event is published, also, because the event cannot hold data, the command handler in the Domain 1 issues yet another command which is sent to Domain 2. The subscriber of that event lives in Domain 2 too, we then have a Saga to handle both the event and the command.
In terms of implementation we are using NServiceBus, so we have this saga to handle these message and in it we have this line of code, where the entity.IsMobilePhoneUpdated field stored in a saga entity is changed when the event is handeled.
bool isReady = (entity.IsMobilePhoneUpdated && entity.MobilePhoneNumber != null);
Effectively the Saga is started by both the command and the event raised in the Domain 1, and until this condition is met, the saga is kept alive.
If it was up to me, I would be sending the mobile number in the event itself, I just want to get a few other opinions on this.
Thanks
I'm not sure how a UserMobilePhoneChanged event could be useful in any way unless it contained the new phone number. User asks to change a number, the event shoots out that it has. Should be very simple indeed. Why does your architect say that events shouldn't contain any information?
In the first event based system i've designed events also had no data. I also did enforce that rule. At the time that sounded like a clever decision. After a while i realised that it was dumb, and i was making a lot of workarounds because of it. Also this caused a lot of querying form the event subscribers, even for trivial data. I had no problem changing this "rule" after i realised i'm doing it wrong.
Events should have all the data required to make them meaningful. Also they should only have the data that makes sense for that event. ( No point in having the user address in a ChangePhoneNumber message )
If your architect imposes such a restriction, it's not going to be easy to develop a CQRS system. How are the read models updated? Since the events have no data then you either query something to get the data ( the write side ? ) of find some way of sending a command to the read model ( then what's the point of publishing events? ). To fix your problem you should try to have a professional discussion with this architect, preferably including other tech heads and without offending anybody try to get him to relax this constraint.
On argument you could use is Event Sourcing. Event Sourcing is complementary to CQRS and would not make sense without events that have data. Even more when using event sourcing, the only data you have is the data stored in the events. Even if you don't actually implement event sourcing you can use it's existence as a reason for events to have data.
There is little point in finding a technical solution to a people problem.

Resources