Event sourcing - error handling when events are not created

To my understanding, in event sourcing events are recorded. However, that would also mean a state change happened first and the event is recorded afterwards. For example, assume:
1. A client sends a command to a server to "Create user".
2. The server validates the command and creates the user, i.e. stores the new user in a database.
3. The server then logs/stores a Created User event, i.e. event sourcing.
4. The Created User event is propagated to subscribers.
In the scenario above, how do we handle the case where step (2) succeeded but step (3) failed due to, say, network failures, the database being offline, etc.? The whole system would be in an indeterminate state, since a new user was created but the event was never logged. How do we mitigate these types of failures? Or are the steps I've listed above not the way to do event sourcing?
Thanks!

This is not exactly what happens in Event sourcing, nor even in plain CQRS.
In Event sourcing, after the command is validated, the domain events are generated by the source (the Aggregate in DDD) and then appended to the Event store as the first step. Only after that do the subscribers (read models, projections, Sagas, external systems) receive and process the new domain events.
In CQRS, after the domain events are generated, they are applied to the Aggregate, and then the Aggregate's state and the new events are persisted in the same local transaction as the first step. Only after that do the subscribers receive the events.
So you see, your situation cannot happen: steps 2 and 3 are persisted atomically; they succeed or fail together.
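To make that concrete, here is a minimal sketch in TypeScript (the store, event name and handler are illustrative and not tied to any particular framework): appending the events to the event store is the single write, and subscribers are only notified once that append has succeeded.

// Illustrative event-sourced command handler: appending to the event store
// is the only write; it succeeds or fails as a whole.
type UserEvent = { type: "CreatedUser"; userId: string; name: string };

class InMemoryEventStore {
  private streams = new Map<string, UserEvent[]>();
  private subscribers: Array<(e: UserEvent) => void> = [];

  subscribe(handler: (e: UserEvent) => void): void {
    this.subscribers.push(handler);
  }

  append(streamId: string, events: UserEvent[]): void {
    // If this throws, nothing is recorded and no subscriber is notified.
    const stream = this.streams.get(streamId) ?? [];
    this.streams.set(streamId, [...stream, ...events]);
    // Propagation to subscribers happens only after the append succeeded.
    for (const e of events) this.subscribers.forEach((s) => s(e));
  }
}

function handleCreateUser(store: InMemoryEventStore, cmd: { userId: string; name: string }): void {
  if (!cmd.name) throw new Error("validation failed");   // validate the command
  const events: UserEvent[] = [{ type: "CreatedUser", userId: cmd.userId, name: cmd.name }];
  store.append("user-" + cmd.userId, events);             // first and only write
}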

Related

Order of wl_display_dispatch and wl_display_roundtrip call

I am trying to make sense of which of wl_display_dispatch and wl_display_roundtrip should be called first and which later. I have seen both orders, so I am wondering which one is correct.
1st order:
wl_display_get_registry(display); wl_registry_add_listener() // this call is just informational
wl_display_dispatch();
wl_display_roundtrip();
What I think: wl_display_dispatch() will read and dispatch whatever events the server has sent on the display fd, but in the meantime the server might still be processing requests, so for a brief time the fd might be empty.
wl_display_dispatch() returns assuming all events are dispatched. Then wl_display_roundtrip() is called and blocks until the server has processed all requests and put the resulting events in the event queue. So after this the event queue still has pending events, but there is no further call to wl_display_dispatch(). How will those pending events be dispatched? Or does wl_display_dispatch() wait for the server to process all requests and then dispatch all events?
2nd order:
wl_display_get_registry(display); wl_registry_add_listener() // this call is just informational
wl_display_roundtrip();
wl_display_dispatch();
In this case, wl_display_roundtrip() waits for the server to process all requests and put the resulting events in the event queue, so once it returns we can assume all events sent by the server are available in the queue. Then wl_display_dispatch() is called, which will dispatch all pending events.
The 2nd order looks correct and logical to me, as there is no chance of leftover pending events in the queue, but I have seen the 1st order in many places, including the weston client example code, so I am confused about the correct calling order.
It would be great if someone could clarify here.
Thanks in advance
The 2nd order is correct.
The client can't do much without getting a proxy (a handle for a global object). What I mean is that the client can only send requests by binding to the global objects advertised by the server, so the client has to block until all global objects are bound in the registry listener callback.
For example, for the client to create a surface you need to bind the wl_compositor interface, then a shell interface to give the surface a role, then shm (for shared memory), and so on. wl_display_dispatch() cannot guarantee that all events have been processed; if you are lucky it may dispatch all of them, but it cannot guarantee that every time. So you should use wl_display_roundtrip() at least for the registry.

How to handle side effects based on multiple events in a message driven microservice system?

We are currently working in a message-driven microservice environment and some of our messages/events are event sourced (using Apache Kafka). Now we are struggling with implementing more complex business requirements, where we have to take multiple events into account to create new events and side effects.
In the current situation we are working with devices that can produce errors. We already process them and have a single topic which contains ERROR_OCCURRED and ERROR_RESOLVED events (so they are in order). We also make sure that all messages regarding a specific device always go onto the same partition, and both messages share an ID that identifies that specific error incident. We already have a projection that consumes those events and provides an API for our customers, so that they can see all occurred errors and their current state.
Now we have to deal with the following requirement:
Reporting Errors
We need a push system that reports errors of devices to our external partners, but only after 15 minutes and if they have not been resolved in that timeframe. Our first approach was to consume all ERROR_RESOLVED events, store the IDs and have another consumer that is handling the ERROR_OCCURRED events in a delayed fashion (e.g. by only consuming the next ERROR_OCCURRED event on the topic if its timestamp is at least 15 minutes old). We would then be able to know if that particular error has already been resolved and does not need to be reported (since they share a common ID with the corresponding ERROR_RESOLVED event). Otherwise we send an HTTP request to our external partner and create an ERROR_REPORTED event on a new topic. Is there any better approach for delayed and conditional message processing?
We also have to take the following special use cases into account:
Service restarts: currently we are planning to keep the list of resolved errors in memory, so if a service restarts, that list has to be created from scratch. We could just replay the ERROR_RESOLVED messages, but that may take some time, and in that time no ERROR_OCCURRED events should be processed, because that may result in reporting errors that were resolved in less than 15 minutes but that we are just not aware of yet. Are there any good practices regarding replay vs. "normal" processing?
Scaling: we may increase or decrease the number of instances of our service at any time, so the partition assignment may change at runtime. That should not be a problem if we create a consumer group for each service instance when consuming the ERROR_RESOLVED events, so that every instance knows all resolved errors while still only handling the ERROR_OCCURRED events of its assigned partitions (in another consumer group which is shared by all instances). Is there a better approach for handling partition reassignment and internal state?
Thanks in advance!
For side effects, I would record all "side" actions in the event store. In your particular example, when it is time to send a notification, I would issue a SEND_NOTIFICATION command that emits a NOTIFICATION_SENT event. These events would be processed by some worker process that performs the actual HTTP request.
Actually, I would take this even further: since notifications could fail, I would have, say, two events, NOTIFICATION_REQUIRED and NOTIFICATION_SENT, so we can retry failed notifications.
And finally your logic would be: "if the error was not resolved within 15 minutes and a notification was not sent, send a notification (or just discard it if it missed its timeframe)".
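A rough sketch of that decision in TypeScript (the event shapes and the 15-minute window come from the question; the function and field names are made up for illustration): it folds the incident events into "resolved" and "already sent" sets and returns the incidents that still need to be reported, so a failed HTTP call is simply picked up again on the next run.

// Illustrative: decide which incidents still need to be reported to the
// external partner, given the full list of events for a device/partition.
type DeviceEvent =
  | { type: "ERROR_OCCURRED"; incidentId: string; at: number }
  | { type: "ERROR_RESOLVED"; incidentId: string; at: number }
  | { type: "NOTIFICATION_SENT"; incidentId: string };

function incidentsToReport(events: DeviceEvent[], now: number, windowMs = 15 * 60 * 1000): string[] {
  const resolved = new Set<string>();
  const sent = new Set<string>();
  for (const e of events) {
    if (e.type === "ERROR_RESOLVED") resolved.add(e.incidentId);
    if (e.type === "NOTIFICATION_SENT") sent.add(e.incidentId);
  }
  const toReport: string[] = [];
  for (const e of events) {
    if (e.type !== "ERROR_OCCURRED") continue;
    if (now - e.at < windowMs) continue;        // not old enough to report yet
    if (resolved.has(e.incidentId)) continue;   // resolved in time, nothing to report
    if (sent.has(e.incidentId)) continue;       // already reported, don't repeat
    toReport.push(e.incidentId);
  }
  return toReport;
}

The worker that performs the HTTP request would append NOTIFICATION_SENT only on success, which is what makes the retry safe.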

CQRS + Microservices Handling event rollback

We are using microservices, CQRS, and an event store with nodejs cqrs-domain. Everything works like a charm and the typical flow goes like:
1. REST
2. Service
3. Command validation
4. Command
5. Aggregate
6. Event
7. Event store (transactional data)
8. Returns aggregate with aggregate ID
9. Store in microservice local DB (essentially the read DB)
10. Publish event to the queue
The problem with the flow above is that the transactional data save (i.e. persistence to the event store) and the storage to the microservice's read DB happen in different transaction contexts. If there is any failure at step 9, how should I handle the event which has already been appended to the event store and the aggregate which has already been updated?
Any suggestions would be highly appreciated.
The problem with the flow above is that the transactional data save (i.e. persistence to the event store) and the storage to the microservice's read DB happen in different transaction contexts. If there is any failure at step 9, how should I handle the event which has already been appended to the event store and the aggregate which has already been updated?
You retry it later.
The "book of record" is the event store. The downstream views (the "published events", the read models) are derived from the book of record. They are typically behind the book of record in time (eventual consistency) and are not typically synchronized with each other.
So you might have, at some point in time, 105 events written to the book of record, but only 100 published to the queue, and a representation in your service database constructed from only 98.
Updating a view is typically done in one of two ways. You can, of course, start with a brand new representation and replay all of the events into it as part of each update. Alternatively, you track in the metadata of the view how far along in the event history you have already gotten, and use that information to determine where the next read of the event history begins.
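A sketch of the second option in TypeScript (readEventsFrom and the ViewStore interface are assumed names, not a specific library): the view records the position of the last event it applied, and each update resumes from there.

// Illustrative catch-up update of a read model: the view remembers how far
// along the event history it has gotten and continues from that point.
type StoredEvent = { position: number; type: string; data: unknown };

interface ViewStore {
  lastAppliedPosition(): Promise<number>;
  apply(event: StoredEvent): Promise<void>; // updates rows and position together
}

async function catchUpView(
  readEventsFrom: (position: number) => Promise<StoredEvent[]>, // assumed event store accessor
  view: ViewStore
): Promise<void> {
  const from = await view.lastAppliedPosition();
  for (const event of await readEventsFrom(from + 1)) {
    await view.apply(event); // re-running after a crash is safe: the stored position gates duplicates
  }
}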
Inside your event store, you could track whether read-side replication was successful.
As soon as step 9 succeeds, you can flag the event as 'replicated'.
That way, you could introduce a component watching for unreplicated events and trigger step 9. You could also track whether the replication failed multiple times.
Updating the read side (step 9) and flagging an event as replicated should happen consistently. You could use a saga pattern here.
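For illustration, a possible shape of that watcher in TypeScript (the store methods are hypothetical): it periodically loads the events not yet flagged as replicated, retries step 9, and only flags them once the read side was updated.

// Illustrative watcher: retry the read-side update for unreplicated events,
// flag them on success, and record failures for alerting.
interface EventRecord { id: string; payload: unknown }

interface EventStoreWithFlags {
  loadUnreplicated(): Promise<EventRecord[]>;
  markReplicated(id: string): Promise<void>;
  recordFailedAttempt(id: string): Promise<void>;
}

async function replicateOnce(
  store: EventStoreWithFlags,
  updateReadModel: (payload: unknown) => Promise<void> // this is step 9
): Promise<void> {
  for (const event of await store.loadUnreplicated()) {
    try {
      await updateReadModel(event.payload);
      await store.markReplicated(event.id);
    } catch {
      await store.recordFailedAttempt(event.id); // repeated failures become visible
    }
  }
}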
I think I have now understood it better.
The aggregate would still be created. The answer is that all validations for any type of consistency should happen before my aggregate is constructed; a failure while updating the microservice's read-side DB is beyond the purview of that code and needs to be handled separately.
So in the ideal case the aggregate would be created, but the associated event would remain undispatched until all the read dependencies are updated; if they are not, it stays undispatched and can be handled separately.
The event store will still have all the events, and eventual consistency is maintained this way.

ES,CQRS messaging flow

I was trying to understand ES+CQRS and which tech stack can be used.
As per my understanding, the flow should be as below:
UI sends a request to the Controller (HTTP Adapter).
Controller calls the application service, passing the Request Object as a parameter.
Application Service creates a Command from the Request Object passed from the controller.
Application Service passes this Command to the Message Consumer.
Message Consumer publishes the Command to the message broker (RabbitMQ).
Two subscribers will be listening for the above command:
a. One subscriber will rebuild the Aggregate from the event store using the command and will apply the command; the generated event will then be stored in the event store.
b. Another subscriber will be at the VIEW end, and will populate the data in the view database/cache.
Kindly suggest whether my understanding is correct.
Kindly suggest whether my understanding is correct.
I think you've gotten a bit tangled in your middleware.
As a rule, CQRS means that the writes happen to one data model, and reads in another. So the views aren't watching commands, they are watching the book of record.
So in the subscriber that actually processes the command, the command handler will load the current state from the book of record into memory, update the copy in memory according to the domain model, and then replace the state in the book of record with the updated version.
Having updated the book of record, we can now trigger a refresh of the data model that backs the view; no business logic is run here, this is purely a transform of the data from the model we use for writes to the model we use for reads.
When we add event sourcing, this pattern is the same -- the distinction is that the data model we use for writes is a history of events.
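A minimal sketch of such a transform in TypeScript (event and row shapes are invented for illustration): each write-side event is simply mapped onto the read model, with no domain rules evaluated.

// Illustrative projection: pure data transform from write-side events to a
// read-side table; no business rules run here.
type DomainEvent =
  | { type: "UserCreated"; userId: string; name: string }
  | { type: "UserRenamed"; userId: string; name: string };

type UserRow = { userId: string; name: string };

function project(view: Map<string, UserRow>, event: DomainEvent): void {
  switch (event.type) {
    case "UserCreated":
      view.set(event.userId, { userId: event.userId, name: event.name });
      break;
    case "UserRenamed": {
      const row = view.get(event.userId);
      if (row) row.name = event.name;
      break;
    }
  }
}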
How is atomicity achieved between writing data to the event store and writing data to the VIEW model?
It's not -- we don't try to make those two actions atomic.
How do we handle the case where the event is stored in the event store but the system crashed before we sent the event to the message queue?
The key idea is to realize that we typically build new views by reading events out of the event store; not by reading the events out of the message queue. The events in the queue just tell us that an update is available. In the absence of events appearing in the message queue, we can still poll the event store watching for updates.
Therefore, if the event store is unreachable, you just leave the stale copy of the view in place, and wait for the system to recover.
If the event store is reachable, but the message queue isn't, then you update the view (if necessary) on some predetermined schedule.
This is where the eventual consistency part comes in. Given a successful write into the event store, we are promising that the effects of that write will be visible in a finite amount of time.
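For illustration, a sketch of that fallback in TypeScript (refreshFromEventStore stands in for whatever updates the view from the event store): the view is refreshed on a predetermined schedule, and a failed attempt simply leaves the stale copy in place until the next tick.

// Illustrative scheduled refresh: the message queue only signals that an
// update is available; the view itself is rebuilt from the event store.
function startPolling(
  refreshFromEventStore: () => Promise<void>,
  intervalMs = 5_000
): ReturnType<typeof setInterval> {
  return setInterval(() => {
    refreshFromEventStore().catch(() => {
      // event store unreachable: keep the stale view and retry on the next tick
    });
  }, intervalMs);
}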

Event-sourcing: Dealing with derived data

How does an event-sourcing system deal with derived data? All the examples I've read on event-sourcing demonstrate services reacting to fact events. A popular example seems to be:
Bank Account System
Events: Funds deposited, Funds withdrawn
Services: Balance service
They then show how the Balance service can, at any point, derive a state (i.e. the balance) from the events. That makes sense; those events are facts. There's no question that they happened - they are external to the system.
However, how do we deal with data calculated BY the system?
E.g.
Overdrawn service:
A service which is responsible for monitoring the balance and performing some action when it goes below zero.
Does the event-sourcing approach dictate how we should use (or not use) derived data, i.e. the balance? Perhaps one of the following?
1) Use: [Funds Withdrawn event] + [Balance service query]
Listen for the "Funds withdrawn" event and then ask the Balance service for the current balance.
2) Use: [Balance changed event]
Get the balance service to throw a "Balance changed" event containing the current balance. Presumably this isn't a "fact" as it's not external to the system, therefore prone to miscalculation.
3) Use: [Funds withdrawn event] + [Funds deposited event]
We could just skip the Balance service and have each service maintain its own balance directly from the facts. ...though that would result in each service having its own (potentially different) version of the balance.
A service which is responsible for monitoring the balance and performing some action when it goes below zero.
Executive summary: the way this is handled in event sourced systems is not actually all that different from the alternatives.
Stepping back a second - the advantage of having a domain model is to ensure that all proposed changes satisfy the business rules. Borrowing from the CQRS language: we send command messages to a command handler. The handler loads the state of the model, and tries to apply the command. If the command is allowed, the state of the domain model is updated and saved.
After persisting the state of the model, the command handler can query that state to determine if there are outstanding actions to be performed. Udi Dahan describes this in detail in his talk on Reliable messaging.
So the most straightforward way to describe your service is one that updates the model each time the account balance changes, and sets the "account overdrawn" flag if the balance is negative. After the model is saved, we schedule any actions related to that state.
Part of the justification for event sourcing is that the state of the domain model is derivable from the history. Which is to say, when we are trying to determine if the model allows a command, we load the history, and compute from the history the current state, and then use that state to determine whether the command is permitted.
What this means, in practice, is that we can write an AccountOverdrawn event at the same time that we write the AccountDebited event.
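To illustrate (with invented event names and a plain fold, not tied to any particular framework): the balance is derived from the history, and when a debit takes it below zero, AccountOverdrawn is produced in the same write as AccountDebited.

// Illustrative: fold the history to get the current balance, then decide
// which events a debit produces; AccountOverdrawn is appended in the same
// write as AccountDebited when the balance goes negative.
type AccountEvent =
  | { type: "FundsDeposited"; amount: number }
  | { type: "AccountDebited"; amount: number }
  | { type: "AccountOverdrawn" };

function balanceOf(history: AccountEvent[]): number {
  return history.reduce((balance, e) => {
    if (e.type === "FundsDeposited") return balance + e.amount;
    if (e.type === "AccountDebited") return balance - e.amount;
    return balance;
  }, 0);
}

function debit(history: AccountEvent[], amount: number): AccountEvent[] {
  const events: AccountEvent[] = [{ type: "AccountDebited", amount }];
  if (balanceOf(history) - amount < 0) events.push({ type: "AccountOverdrawn" });
  return events; // both events go into the book of record in a single append
}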
That AccountDebited event can be subscribed to - Pub/Sub. The typical handling is that the new events get published after they are successfully written to the book of record. An event listener subscribing to the events coming out of the domain model observes the event, and schedules the command to be run.
Digression: typically, we'll want at-least-once execution of these activities. That means keeping track of acknowledgements.
Therefore, the event handler is also a thing with state. It doesn't have any business state in it, and certainly no rules that would allow it to reject events. What it does track is which events it has seen, and which actions need to be scheduled. The rules for loading this event handler (more commonly called a process manager) are just like those of the domain model - load events from the book of record to obtain the current state, then see if the event being handled changes anything.
So it is really subscribing to two events - the AccountDebited event, and whatever event returns from the activity to acknowledge that it has completed.
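A sketch of such a process manager in TypeScript (the event names and the acknowledgement event are illustrative): it carries no business rules, only a record of which debits it has seen and which scheduled actions still lack an acknowledgement.

// Illustrative process manager state: rebuilt from its own event history,
// tracking which actions still need to be (re)scheduled.
type ProcessEvent =
  | { type: "AccountDebited"; accountId: string }
  | { type: "ActionAcknowledged"; accountId: string }; // returned by the completed activity

interface ProcessState {
  pendingActions: Set<string>; // accounts whose scheduled action is not yet acknowledged
}

function evolve(state: ProcessState, event: ProcessEvent): ProcessState {
  if (event.type === "AccountDebited") state.pendingActions.add(event.accountId);
  if (event.type === "ActionAcknowledged") state.pendingActions.delete(event.accountId);
  return state;
}

// Load the process manager like a domain model: replay its events, then
// anything still pending gets scheduled again (at-least-once execution).
function loadProcess(history: ProcessEvent[]): ProcessState {
  return history.reduce(evolve, { pendingActions: new Set<string>() });
}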
This same mechanic can be used to update the domain model in response to events from elsewhere.
Example: suppose we get a FundsWithdrawn event from an ATM, and we need to update the account history to match it. So our event handler gets loaded, updates itself, and schedules a RecordATMWithdrawal command to be run. When the command runs, it loads the account, updates the balances, and writes out the AccountDebited and AccountOverdrawn events as before. The event handler sees these events, loads the correct process state based on the metadata, and updates the state of the process.
In CQRS terms, this is all taking place in the "write models"; these processes are all about updating the book of record.
The balance query itself is easy - we already showed that the balance can be derived from the history of the domain model, and that's just how your balance service is expected to do it.
To sum up: at any given time you can load the history of the domain model to query its state, and you can load the history of the event processor to determine what work has yet to be acknowledged.
Event sourcing is an evolving discipline with a bunch of diverse practices, practitioners and charismatic people. You can't expect them to provide you with a very consistent modelling technique for all the scenarios you described. Each one of those scenarios has its pros and cons, and you specified some of them. It may also vary dramatically from one project to another, because business requirements (evolutionary pressures of the market) will be different.
If you are working on some mission-critical system and you want to have a very consistent balance all the time, it's better to use an RDBMS and ACID transactions.
If you need maximum speed and you are okay with eventually consistent states and not very anxious about the precision of your balances (some events may be missing here and there for a bunch of reasons), then you can derive your projections for balances from events asynchronously.
In both scenarios you can use event sourcing, but you don't necessarily have to generate your projections asynchronously. It's okay to generate the projection in the same transaction scope in which you make changes to your write model, if you really need to.
Will it make Greg Young happy? I have no idea, but who cares about such things if your balances may one day go out of sync in a mission-critical system ...

Resources