What does it mean when someone says to store the event itself but not the state/data? - event-sourcing

What does it mean when someone says to store the event itself in the journal but not the state/data, such that one can replay the events to build the state? Can someone explain what the data in the journal looks like, since no state is stored? Does one store the functions themselves?

It means that you persist only what has changed in the state, not the whole state. I will give you an example.
Let's suppose that you have an InventoryAggregate that stores items, like Products. Internally, this Aggregate keeps a list of all items, but when a command to add item X arrives (AddItemXCommand), it produces an event saying that item X was added (ItemXWasAddedEvent). The next time a similar command arrives, the Aggregate Repository loads all the previously generated events from the Event Store (from the Journal, if you like) and applies them one by one onto the Aggregate in order to rebuild the state (in this particular example, to rebuild the list of all added items).
So, after the command arrives and is executed, the state changes (an item is added to the list), but you don't persist the state (you don't persist the item list); you persist only the generated event (the fact that an item was added). After you persist that event you can discard the Aggregate and its internal state.
As an optimization, you can also persist the internal state, so that when a command arrives you don't have to load and apply all the events, only the events that were generated after the state was cached. This cached state at a particular moment is known as a Snapshot.
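For example, here is a minimal sketch (in TypeScript, with made-up names; no particular framework implied) of what the journal holds and how the state is rebuilt from it:

// Events are plain facts; the journal stores only these, never the item list itself.
type ItemAdded = { type: 'ItemAdded'; sku: string };
type InventoryEvent = ItemAdded;

// The "journal": an append-only list of events per aggregate id.
const journal = new Map<string, InventoryEvent[]>();

// State is derived by replaying the events one by one.
function replay(events: InventoryEvent[]): string[] {
  const items: string[] = [];
  for (const e of events) {
    if (e.type === 'ItemAdded') items.push(e.sku);
  }
  return items;
}

// Handling a command: load past events, rebuild state, emit and persist a new event.
function addItem(aggregateId: string, sku: string): void {
  const past = journal.get(aggregateId) ?? [];
  const state = replay(past);            // current list of items, never persisted
  // ...command validation against `state` would go here...
  past.push({ type: 'ItemAdded', sku }); // only the event is persisted
  journal.set(aggregateId, past);
}

// Optional snapshot: cache {version, items} so later loads replay only newer events.
type Snapshot = { version: number; items: string[] };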

Related

Filter out duplicates and only generate events when something has changed

We have a system that sends us an event whenever any attribute has changed on an order. The source has a full order structure with, say, 500 attributes, and we store only a subset of them, say 50 attributes, plus some transformation into our internal order data model. Now, whenever any of these 500 attributes changes in the source system, it triggers an event to us (note that they share the complete order state every time). Most of the time nothing changes on the 50 attributes we have in our model, so many of these are duplicate events for us. Currently we are storing all the events we receive from the source in a database.
The requirement now is that we have to generate events to other consumers when the order data in our system has changed. So we will have to eliminate the duplicates from the source and only share the event when some attribute on our model has changed.
Possible solution:
For every event we receive from the source, we could compare it with our internal data to see if anything has changed on any attribute. We could use an in-memory database like Redis, but there is going to be a lot of IO on the database. Also, I'm not sure what the best way is to compare whether two JSON objects are exactly the same (see the sketch below for one possible comparison).
Another solution is that, instead of comparing all attributes, we could store and compare only some important "meta data". That would be good for performance, but it would not be fully correct if we have to generate an event for any attribute change.
I would like to know how you would design this scenario.
Thanks.
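One possible sketch of that comparison (the projection function and attribute names are hypothetical; it assumes the incoming order is first projected down to the tracked subset, whose values serialise consistently):

import { createHash } from 'crypto';

// Hypothetical projection: keep only the ~50 attributes our model cares about.
function projectToInternalModel(sourceOrder: Record<string, unknown>): Record<string, unknown> {
  const tracked = ['orderId', 'status', 'quantity' /* ...the other tracked attributes */];
  const projected: Record<string, unknown> = {};
  for (const key of tracked) projected[key] = sourceOrder[key];
  return projected;
}

// Stable fingerprint of the projected model: sort keys so equal content hashes equally.
function fingerprint(model: Record<string, unknown>): string {
  const canonical = JSON.stringify(Object.keys(model).sort().map((k) => [k, model[k]]));
  return createHash('sha256').update(canonical).digest('hex');
}

// Store only the last fingerprint per order (in memory here, Redis in practice);
// emit a downstream event only when the fingerprint changes.
const lastSeen = new Map<string, string>();

function handleSourceEvent(sourceOrder: { orderId: string } & Record<string, unknown>): boolean {
  const hash = fingerprint(projectToInternalModel(sourceOrder));
  if (lastSeen.get(sourceOrder.orderId) === hash) return false; // duplicate: nothing we track changed
  lastSeen.set(sourceOrder.orderId, hash);
  return true; // something in the tracked subset changed; publish the downstream event
}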

Version number in event sourcing aggregate?

I am building microservices. One of my microservices uses CQRS and event sourcing. Integration events are raised in the system, and I am saving my aggregates in an event store while also updating my read model.
My question is: why do we need a version on the aggregate when we are updating the event stream for that aggregate? I read that we need this for consistency, that events have to be replayed in sequence, and that we need to check the version before saving (https://blog.leifbattermann.de/2017/04/21/12-things-you-should-know-about-event-sourcing/). I still can't get my head around this, since events are raised and saved in order, so I really need a concrete example to understand what benefit we get from versions and why we even need them.
Many thanks,
Imran
Let me describe a case where aggregate versions are useful:
In our reSolve framework, the aggregate version is used for optimistic concurrency control.
I'll explain it by example. Let's say an InventoryItem aggregate accepts the commands AddItems and OrderItems. AddItems increases the number of items in stock; OrderItems decreases it.
Suppose you have an InventoryItem aggregate #123 with one event, ITEMS_ADDED with a quantity of 5. The state of aggregate #123 says there are 5 items in stock.
So your UI is showing users that there are 5 items in stock. User A decides to order 3 items, user B 4 items. Both issue OrderItems commands almost at the same time; let's say user A is first by a couple of milliseconds.
Now, if you have a single instance of aggregate #123 in memory, in a single thread, you don't have a problem: the first command from user A would succeed, the event would be applied, the state would say the quantity is 2, and so the second command from user B would fail.
In a distributed or serverless system, where the commands from A and B are handled in separate processes, both commands would succeed and bring the aggregate into an incorrect state if we don't use some form of concurrency control. There are several ways to do this: pessimistic locking, a command queue, an aggregate repository, or optimistic locking.
Optimistic locking seems to be the simplest and most practical solution:
We say that every aggregate has a version: the number of events in its stream. So our aggregate #123 has version 1.
When an aggregate emits an event, the event data carries the aggregate version. In our case, the ITEMS_ORDERED events from users A and B will both have an aggregate version of 2. Aggregate event versions should obviously be sequentially increasing, so all we need to do is put a database constraint on the event store saying that the tuple {aggregateId, aggregateVersion} must be unique on write.
Let's see how our example would work in a distributed system with optimistic concurrency control:
User A issues an OrderItems command for aggregate #123
Aggregate #123 is restored from its events {version 1, quantity 5}
User B issues an OrderItems command for aggregate #123
Another instance of aggregate #123 is restored from its events {version 1, quantity 5}
The instance of the aggregate for user A performs the command; it succeeds, and the event ITEMS_ORDERED {aggregateId 123, version 2} is written to the event store.
The instance of the aggregate for user B performs the command; it succeeds locally, but when it attempts to write its event ITEMS_ORDERED {aggregateId 123, version 2} to the event store, the write fails with a concurrency exception.
On such an exception, the command handler for user B simply repeats the whole procedure: this time aggregate #123 is restored in the state {version 2, quantity 2} and the command is handled correctly (in this case rejected, since only 2 items remain).
I hope this clarifies the case where aggregate versions are useful.
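A minimal sketch of that optimistic check, assuming an in-memory event store and hypothetical ITEMS_ADDED / ITEMS_ORDERED payloads (a real store would enforce the same rule with a unique constraint on {aggregateId, version}):

type StoredEvent = { aggregateId: string; version: number; type: string; payload: unknown };

class ConcurrencyError extends Error {}

class EventStore {
  private streams = new Map<string, StoredEvent[]>();

  load(aggregateId: string): StoredEvent[] {
    return this.streams.get(aggregateId) ?? [];
  }

  // Append only if the caller's expected version is still the latest one.
  append(event: StoredEvent, expectedVersion: number): void {
    const stream = this.load(event.aggregateId);
    if (stream.length !== expectedVersion) {
      throw new ConcurrencyError(`expected version ${expectedVersion}, stream is at ${stream.length}`);
    }
    this.streams.set(event.aggregateId, [...stream, event]);
  }
}

// Command handler: rebuild state, decide, write; on conflict, reload and retry.
function orderItems(store: EventStore, aggregateId: string, quantity: number): void {
  for (;;) {
    const events = store.load(aggregateId);
    const inStock = events.reduce((n, e) =>
      e.type === 'ITEMS_ADDED' ? n + (e.payload as number)
      : e.type === 'ITEMS_ORDERED' ? n - (e.payload as number)
      : n, 0);
    if (quantity > inStock) throw new Error('not enough items in stock');
    try {
      store.append(
        { aggregateId, version: events.length + 1, type: 'ITEMS_ORDERED', payload: quantity },
        events.length // expected version: the state our decision was based on
      );
      return;
    } catch (e) {
      if (!(e instanceof ConcurrencyError)) throw e;
      // Someone else wrote that version first; retry with fresh state.
    }
  }
}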
Yes, this is right. You need the version or a sequence number for consistency.
Two things you want:
Correct ordering
Usually events are idempotent in nature, because in a distributed system idempotent messages or events are easier to deal with. Idempotent messages are the ones that, even when applied multiple times, give the same result. Updating a register with a fixed value (say one) is idempotent, but incrementing a counter by one is not. In distributed systems, when A sends a message to B, B acknowledges it. But if B consumes the message and, due to some network error, the acknowledgement to A is lost, A doesn't know whether B received the message, so it sends the message again. Now B applies the message again, and if the message is not idempotent, the final state goes wrong. So you want idempotent messages. But if you fail to apply these idempotent messages in the same order as they were produced, your state will again be wrong. This ordering can be achieved using the version id or a sequence number. If your event store is an RDBMS, you cannot order your events without some such sort key. In Kafka you likewise have the offset id, and the client keeps track of the offset up to which it has consumed.
Deduplication
Secondly, what if your messages are not idempotent? Or what if your messages are idempotent but the consumer invokes some external service in a non-deterministic way? In such cases you need exactly-once semantics, because if you apply the same message twice, your state will be wrong. Here, too, you need the version id or sequence number. If, at the consumer end, you keep track of the version id you have already processed, you can dedupe based on that id. In Kafka, you might then want to store the offset at the consumer end.
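As a hedged sketch, a consumer might track the highest version it has applied per aggregate (the names here are illustrative); note that applying the event and recording the version should happen atomically for this to give true exactly-once behaviour:

// Highest version already applied, per aggregate (kept in a durable store in practice).
const lastApplied = new Map<string, number>();

function handle(event: { aggregateId: string; version: number; payload: unknown }): void {
  const seen = lastApplied.get(event.aggregateId) ?? 0;
  if (event.version <= seen) return;            // duplicate delivery: drop it
  if (event.version !== seen + 1) {
    throw new Error(`out of order: expected ${seen + 1}, got ${event.version}`);
  }
  applyToReadModel(event);                      // the possibly non-idempotent work
  lastApplied.set(event.aggregateId, event.version);
}

function applyToReadModel(event: { aggregateId: string; version: number; payload: unknown }): void {
  // ...update the projection, call external services, etc...
}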
Further clarifications based on comments:
The author of the article in question assumed an RDBMS as the event store. The version id or the event sequence is expected to be generated by the producer. Therefore, in your example, the "delivered" event will have a higher sequence number than the "in transit" event.
The problem happens when you want to process your events in parallel. What if one consumer gets the "delivered" event and another consumer gets the "in transit" event? Clearly you have to ensure that all events of a particular order are processed by the same consumer. In Kafka, you solve this problem by choosing the order id as the partition key. Since one partition is processed by only one consumer, you know you'll always get the "in transit" event before the "delivered" one. But multiple orders will be spread across different consumers within the same consumer group, so you still get parallel processing.
Regarding the aggregate id, I think it is roughly analogous to a topic in Kafka. Since the author assumed an RDBMS store, he needs some identifier to segregate different categories of messages. You do that by creating separate topics in Kafka, and also consumer groups per aggregate.
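For illustration, a sketch using the kafkajs client (the topic name and broker address are placeholders) of keying messages by order id so that all events for one order land in the same partition:

import { Kafka } from 'kafkajs';

// The message key determines the partition, so every event for a given order
// goes to the same partition and is consumed in order by a single consumer.
const kafka = new Kafka({ clientId: 'orders-service', brokers: ['localhost:9092'] });
const producer = kafka.producer();

async function publishOrderEvent(orderId: string, event: object): Promise<void> {
  await producer.connect();
  await producer.send({
    topic: 'order-events',
    messages: [{ key: orderId, value: JSON.stringify(event) }],
  });
}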

Addressing CRUD "tables" in event sourcing

I'm starting down an ES journey and want to know whether traditional support tables should be stored in the event log, or should those be handled differently? These tables would typically have a CRUD page. In other words, would it be common to have two approaches in the same application, one for support tables and one for transactional data?
A support table would be like "Account" in an accounting application or "Product Type" or the actual "Product" table in an ERP application (I'm not writing an ERP application - that's an example of the type of table I'm talking about).
If we store CRUD-type data in the event log, then we might have events:
ProductCreated
ProductUpdated
ProductDeleted (which would just mark it as deleted)
Then, do we attempt to find out what changed (in the ProductUpdated event) and store just the change, replaying to get the latest image of the Product?
Mostly, I'm after what approach to use for CRUD tables - traditional or store in the event log? Additional information would be great!
Suppose you start purely with an event log, including for events like ProductCreated, etc., and no other data store. What happens then is that every time your application starts up, it has to replay all the events in the log to build its current state.
Now, suppose you create a traditional SQL table to store the current state of your app (say a products table) and the ID of the last event that was processed to get to that state (say a last_event table). What happens then is every time your app starts up, it has to replay only the events with higher IDs than the stored ID and process those to build its new state.
On the flip side, your app now has to be careful to keep these two states synchronised. If you need concurrency, you'll need to be careful to perform atomic operations only on your SQL tables, but that should be reasonably easy with transactions.
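A rough sketch of that catch-up step on startup, assuming a hypothetical storage interface with a last_event checkpoint (all the names are illustrative):

type LoggedEvent = { id: number; type: 'ProductCreated' | 'ProductUpdated' | 'ProductDeleted'; data: unknown };

async function catchUp(db: {
  lastEventId(): Promise<number>;
  eventsAfter(id: number): Promise<LoggedEvent[]>;
  applyToProductsTable(e: LoggedEvent): Promise<void>;
  saveLastEventId(id: number): Promise<void>;
}): Promise<void> {
  const since = await db.lastEventId();
  for (const event of await db.eventsAfter(since)) {
    // Ideally apply the event and advance last_event in the same transaction,
    // so the products table and the checkpoint cannot drift apart.
    await db.applyToProductsTable(event);
    await db.saveLastEventId(event.id);
  }
}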
Your support tables are just a read model / projection of the event stream. In general you don't create those support models just in case you might need them; you create a read model only if you use it somewhere in the UI.
Anyway, one important benefit of event sourcing is that you won't need to use joins in your queries. That is, you create a table for each read model that contains all the data it needs: full denormalisation. You keep that table super-optimised for its query.
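As a sketch, a projection that maintains such a denormalised read model from the Product events could look like this (the event shapes are assumptions, not a prescribed schema):

type ProductEvent =
  | { type: 'ProductCreated'; id: string; name: string; price: number }
  | { type: 'ProductUpdated'; id: string; name?: string; price?: number }
  | { type: 'ProductDeleted'; id: string };

// Denormalised "product_list" read model, exactly what the UI query needs.
const productList = new Map<string, { id: string; name: string; price: number; deleted: boolean }>();

function project(event: ProductEvent): void {
  switch (event.type) {
    case 'ProductCreated':
      productList.set(event.id, { id: event.id, name: event.name, price: event.price, deleted: false });
      break;
    case 'ProductUpdated': {
      const row = productList.get(event.id);
      if (row) {
        if (event.name !== undefined) row.name = event.name;
        if (event.price !== undefined) row.price = event.price;
      }
      break;
    }
    case 'ProductDeleted': {
      const row = productList.get(event.id);
      if (row) row.deleted = true; // soft delete, as in the question
      break;
    }
  }
}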

CouchDB change API on view

Say I have a document "schema" that includes a show_from field containing a timestamp as a Unix epoch. I then create a view keyed by this show_from date that only returns those documents with a key on or before the current timestamp (per the request). Thus documents will appear in the view "passively", rather than as a result of any update request.
Is it possible to use the CouchDB change API to monitor this change of view state, or would I have to poll the view to watch for changes? (My guess is the latter, because the change API seems only to be triggered by updates, but just for the sake of confirmation!)
The _changes feed can be filtered in a number of ways.
One of the ways of filtering the _changes feed is reusing a view's map function.
GET /[DB]/_changes?filter=_view&view=[DESIGN_DOC]/[VIEW_NAME]
Note:
For every _changes request, CouchDB is going to look at each change and run it through the filter function (or, in this case, the view's map function). None of this is cached for subsequent requests (as it is for map/reduce views), so it can be quite taxing on resources unless the changeset is small.
For a large dataset (with many changes) it can be useful to bootstrap with the view, and only incrementally keep track of changes.
Additional info:
Using _changes you can poll for changes since a given sequence point, for the latest N changes, etc. You can also use long polling, or a continuous feed. As long as the changeset to consider (and filter through) is small, it makes sense to use _changes.
But if the view is itself ordered chronologically, as seems to be the case here, it may be pointless to use _changes. Just query the view.
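For the "just query the view" route, a hedged sketch (DB, DESIGN_DOC and VIEW_NAME are placeholders, as in the filter example above) that asks the view for everything whose show_from key is on or before now:

// Ask the view for all documents whose show_from key is <= the current time.
async function visibleDocuments(baseUrl: string): Promise<unknown[]> {
  const now = Math.floor(Date.now() / 1000); // Unix epoch seconds, matching show_from
  const url = `${baseUrl}/DB/_design/DESIGN_DOC/_view/VIEW_NAME` +
              `?endkey=${now}&include_docs=true`;
  const response = await fetch(url);
  const body = await response.json() as { rows: Array<{ doc?: unknown }> };
  return body.rows.map((row) => row.doc);
}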

What are all the changes other than config changes which will not be captured in update sets

I see that updates to scheduled script executions are not captured in the update set.
What are the criteria for changes to be captured?
Can we manually configure which items are and are not captured in update sets?
Tables with the attribute update_synch set to true are captured in update sets. This is the attribute set on the collection entry in sys_dictionary.
Scheduled script execution definitions (sysauto_script) should actually be captured in update sets, but the sys_trigger record that actually causes the scheduled script to be executed per the schedule is NOT update_synch'd, and that's by design. The sys_trigger table is modified heavily by the scheduler service (e.g. resetting the next action on every execution, and run-once jobs being created and destroyed for things like workflow timers).
Technically, you could add the update_synch attribute to a sys_dictionary collection entry to cause it to be captured by update sets, but that is highly ill-advised, unless you really know what you're doing.
You can manually add non-update-synch'd records to your update set ad-hoc by way of a script described on the servicenowguru website.
