system design - How to update cache only after persisted to database? - caching

After watching this awesome talk by Martin Kleppmann about how Kafka can be used to stream events so that we can get rid of two-phase commits, I have a couple of questions about updating a cache only when the database has been updated successfully.
Problem Statement
Let's say you have a Redis cache which stores the user's profile pic and a Postgres database which is used for all the user-related operations (creation, updates, deletion, etc.).
I want to update my Redis cache only when a new user has been successfully added to my database.
How can I do that using Kafka?
If I am to take the example given in the video then the workflow would follow something like this:
User registers
Request is handled by the User Registration Microservice
User Registration Microservice inserts a new entry into the user table.
It then publishes a User Creation Event to the user_created topic.
Cache population microservice consumes the newly created User Creation Event
Cache population microservice updates the redis cache.
The problem: what happens if the User Registration Microservice crashes just after writing to the database, before it sends the event to Kafka?
What would be the correct way of handling this?
Does the User Registration Microservice keep track of the last event it published? How can it do that reliably? Does it write to a DB? Then the problem starts all over again: what if it publishes the event to Kafka but fails before it can update its last known offset?

There are three broad approaches one can take for this:
There's the transactional outbox pattern, wherein, in the same transaction as inserting the new entry into the user table, a corresponding user creation event is inserted into an outbox table. Some process then eventually queries that outbox table, publishes the events in that table to Kafka, and deletes the events in the table. Since the inserts are in the same transaction, they either both occur or neither occurs; barring a bug in the process which publishes the outbox to Kafka, this guarantees that every user insert eventually has an associated event published (at least once) to Kafka.
There's a more event-sourcingish pattern, where you publish the user creation event to Kafka and then some consuming process inserts into the user table based on the event. Since this happens with a delay, this strongly suggests that the user registration service needs to keep state of which users it has published creation events for (with the combination of Kafka and Postgres being the source of truth for this). Since Kafka allows a message to be consumed by arbitrarily many consumers, a different consumer can then update Redis.
Change data capture (e.g. Debezium) can be used to tie into Postgres' write-ahead log (as Postgres actually event sources under the hood...) and publish an event that essentially says "this row was inserted into the user table" to Kafka. A consumer of that event can then translate that into a user created event.
CDC in some sense moves the transactional outbox into the infrastructure, at the cost of requiring that the context it inherently throws away be reconstructed later (which is not always possible).
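As a rough illustration of the transactional outbox approach described above, here is a minimal sketch. It assumes SQLite as a stand-in for Postgres and a plain callable `publish` as a stand-in for a Kafka producer; the table layout and the names `register_user` and `relay_outbox` are made up for the example.

```python
import json
import sqlite3

# Assumption: SQLite stands in for Postgres; `publish` stands in for a Kafka producer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        topic TEXT NOT NULL,
        payload TEXT NOT NULL
    );
""")


def register_user(conn, user_id, name):
    # Both inserts run in one transaction: they commit together or roll back together.
    with conn:
        conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (user_id, name))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("user_created", json.dumps({"id": user_id, "name": name})),
        )


def relay_outbox(conn, publish):
    # Publish first, delete second. A crash in between leads to a re-publish
    # (at-least-once delivery) rather than a lost event.
    rows = conn.execute("SELECT id, topic, payload FROM outbox ORDER BY id").fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)
        with conn:
            conn.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
    return len(rows)


published = []
register_user(conn, 1, "alice")
relay_outbox(conn, lambda topic, payload: published.append((topic, json.loads(payload))))
```

Because delivery is at least once, the cache-populating consumer should tolerate seeing the same user creation event twice.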
That said, I'd strongly advise against having ____ creation be a microservice and I'd likewise strongly advise against a RInK store like Redis. Both of these smell like attempts to paper over architectural deficiencies by adding microservices and caches.
The one-foot-on-the-way-to-event-sourcing approach isn't one I'd recommend, but if one starts there, the requirement to make the registration service stateful suddenly opens up possibilities which may remove the need for Redis, limit the need for a Kafka-like thing, and allow you to treat the existence of a DB as an implementation detail.

Related

Migrating an asynchronous business flow to an event-driven system

In an effort to redesign a functional service based on an asynchronous flow into an event-driven one, we have come up with changes to different parts of the system. The service receives various statuses from external services through the API, does computations, and persists the result to the data store. The core logic has now been moved out of the API by introducing a queue (Kafka). Similarly, the query functionality is provided through another interface (API) fronted by a web UI. With this, the command and query sides are separated. See the diagram below.
I have a few questions on the approach:
Is it right to have the query API (read) service & the event-complete-handler (write) operate on the same database with both dependent on the DB schema? Or is it better to have the query-api read from the replica DB?
The core-business-logic, at the end of computation, writes only to database and not to db+Kafka in a single transaction. Persisting to the database is handled by the event-complete-handler. Is this approach better?
Say in the future, if the core-business-logic needs to query the database to do the computation on every event, can it directly read from the database? Again, does it not create DB schema dependency between the services?
Is it right to have the query API (read) service & the event-complete-handler (write) operate on the same database with both dependent on the DB schema? Or is it better to have the query-api read from the replica DB?
"Right" is a loaded term. The idea behind CQRS is that the pattern can allow you to separate commands and queries so that your system can be distributed and scaled out. Typically they would be using different databases in a SOA/Microservice architecture. One service would process the command which produces an event on the service bus. Query handlers would listen to this event to change their data for querying.
For example:
A service which processes the CreateWidgetCommand would produce an event onto the bus with the properties of the command.
Any query services which are interested in widgets for producing their data views would subscribe to this event type.
When the event is produced, the subscribed query handlers will consume the event and update their respective databases.
When the query is invoked, they interrogate their own database.
This means you could, in theory, make the command handler as simple as throwing the event onto the bus.
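The widget example above can be sketched roughly as follows. This is only an in-memory illustration: `Bus`, `WidgetListView`, and `handle_create_widget` are hypothetical names, and a real system would use an actual message broker and a separate database per query service.

```python
from collections import defaultdict


class Bus:
    """In-memory stand-in for a service bus (hypothetical, for illustration only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._subscribers[event_type]:
            handler(payload)


class WidgetListView:
    """Query side: owns its own store, which is updated only from events."""

    def __init__(self, bus):
        self.names_by_id = {}
        bus.subscribe("WidgetCreated", self.on_widget_created)

    def on_widget_created(self, event):
        self.names_by_id[event["id"]] = event["name"]

    def query_name(self, widget_id):
        # Queries read the view's own data, never the command side's.
        return self.names_by_id.get(widget_id)


def handle_create_widget(bus, command):
    # Command side: in the simplest case it only turns the command into an event.
    bus.publish("WidgetCreated", {"id": command["id"], "name": command["name"]})


bus = Bus()
view = WidgetListView(bus)
handle_create_widget(bus, {"id": 7, "name": "sprocket"})
```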
The core-business-logic, at the end of computation, writes only to database and not to db+Kafka in a single transaction. Persisting to the database is handled by the event-complete-handler. Is this approach better?
No. If your question is about the transactionality of distributed systems, you cannot rely on traditional transactions, since any command may affect any number of distributed data stores. Transactionality in distributed systems is often handled with a compensating transaction, where you code the steps to reverse the mutations made when consuming the bus messages.
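A compensating transaction can be sketched along these lines. The helper `run_saga` and the two stores are hypothetical, chosen only to show completed steps being reversed when a later step fails; real compensations would typically be triggered by messages and would need to be retried until they succeed.

```python
def run_saga(steps):
    """Run (action, compensation) pairs; on failure, undo completed steps in reverse order."""
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception:
        for compensation in reversed(completed):
            compensation()
        return False
    return True


# Hypothetical data owned by two different services.
inventory = {"widget": 5}
payments = []


def reserve_stock():
    inventory["widget"] -= 1


def release_stock():
    inventory["widget"] += 1


def charge_card():
    # Simulated failure in the second service.
    raise RuntimeError("payment service unavailable")


def refund_card():
    payments.append("refund")


ok = run_saga([(reserve_stock, release_stock), (charge_card, refund_card)])
```

Here the stock reservation is released again because the charge failed, while no refund is issued because the charge never completed.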
Say in the future, if the core-business-logic needs to query the database to do the computation on every event, can it directly read from the database? Again, does it not create DB schema dependency between the services?
If you follow the advice in the first response, the approach here should be obvious. All distinct queries are built from their own database, which are kept "eventually consistent" by consuming events from the bus.
Typically these architectures have major complexity downsides, especially if you are concerned with consistency and transactionality.
People don't generally implement this type of architecture unless there is a specific need.
You can however design your code around CQRS and DDD so that in the future, transitioning to this type of architecture can be relatively painless.
The topic of DDD is too dense for this answer. I encourage you to do some independent learning.

How to deal with concurrent events in an event-driven architecture

Suppose I have an eCommerce application designed with an event-driven architecture. I would publish events like ProductCreated and ProductPriceUpdated. Typically both events are published in separate channels.
Now a consumer of those events comes into play and reacts to them, for example to generate a price chart for specific products.
This consumer has to consume the ProductCreated event first in order to create a Product entity with the necessary information in its own bounded context. Only once a product has been created can price points be added to the chart. Depending on the consumer's performance, it can easily happen that those events arrive "out-of-order".
What are the possible strategies to fulfill this requirement?
The following came to my mind:
Publish both events onto the same channel with ordering guarantees. For example in Kafka both events would be published in the same partition. However this would mean that a topic/partition would grow with its events, I would have to deal with different schemas and the documentation would grow.
Use documents over events. Simply publishing every state change of the product entity as a single ProductUpdated event or similar. This way I would lose semantics from the message and need to figure out what exactly changed on consumer-side.
Defer event consumption. So if my consumer would consume a ProductPriceUpdated event and I don't have such a product created yet, I postpone the consumption by storing it in a database and come back at a later point or use retry-topics in Kafka terms.
Create a minimal entity. Once I receive a ProductPriceUpdated event I would probably have a correlation id or something to identify the entity, so I simply create an entity with just this id and, once a ProductCreated event arrives, fill in the missing information.
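To make the last option concrete, here is a minimal sketch of a consumer that creates a placeholder entity when a price update arrives first. The event field names (`product_id`, `price`, `name`) and the in-memory `products` store are assumptions made for the example.

```python
products = {}  # consumer-local store, keyed by product id


def _get_or_create(product_id):
    # A placeholder holds only the id until ProductCreated fills in the rest.
    return products.setdefault(product_id, {"name": None, "prices": []})


def on_price_updated(event):
    _get_or_create(event["product_id"])["prices"].append(event["price"])


def on_product_created(event):
    # Fill in the details and keep any price points collected so far.
    _get_or_create(event["product_id"])["name"] = event["name"]


# The price update arrives before the creation event.
on_price_updated({"product_id": "p-1", "price": 9.99})
on_product_created({"product_id": "p-1", "name": "Teapot"})
on_price_updated({"product_id": "p-1", "price": 8.49})
```

The price chart would then skip or flag products whose name is still missing.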
I thought I'd give you some inline comments, based on my understanding of your requirements (#1, #3 and #4).
Publish both events onto the same channel with ordering guarantees. For example in Kafka both events would be published in the same partition. However this would mean that a topic/partition would grow with its events, I would have to deal with different schemas and the documentation would grow.
[Chris] : Apache Kafka preserves the order of messages within a partition. But, the mapping of keys to partitions is consistent only as long as the number of partitions in a topic does not change. So as long as the number of partitions is constant, you can be sure the order is guaranteed. When partitioning keys is important, the easiest solution is to create topics with sufficient partitions and never add partitions.
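The point about partition counts can be illustrated with a small sketch. It assumes a hash-modulo scheme, with CRC32 standing in for whatever hash a real Kafka partitioner uses, so the concrete partition numbers are not what Kafka would produce.

```python
import zlib


def partition_for(key, num_partitions):
    # Assumption: CRC32 is only a stand-in for the partitioner's real hash function.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"product-{i}" for i in range(100)]

# With a fixed partition count, a key always lands on the same partition.
before = {key: partition_for(key, 6) for key in keys}

# After adding partitions, many keys map elsewhere, so new events for a key
# can end up on a different partition than its older events.
after = {key: partition_for(key, 8) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
```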
Defer event consumption. So if my consumer would consume a ProductPriceUpdated event and I don't have such a product created yet, I postpone the consumption by storing it in a database and come back at a later point or use retry-topics in Kafka terms.
[Chris]: If latency is not a concern, and if we are okay with the additional operational overhead of adding a new component, such as a storage layer, to your solution, this pattern looks fine.
Create a minimal entity. Once I receive a ProductPriceUpdated event I would probably have a correlation id or something to identify the entity, so I simply create an entity with just this id and, once a ProductCreated event arrives, fill in the missing information.
[Chris]: This is a fairly common integration pattern (Messaging Layer -> Backend REST API) that we adopt; it works off a unique identifier, in this case a correlation id.
This can be easily achieved if you have a separate topic and consumer per event type and the order of messages from the producer is guaranteed. Thus, option #1 becomes obsolete.
From my perspective, options #3 and #4 look much the same, and #4 would be ideal.
On another note, if you are thinking of bringing Kafka Streams/Tables into your solution, just go for it, as there is a strong relationship between streams and tables, called duality.
The duality of streams and tables lets your application support more elastic, fault-tolerant stateful transactions and run interactive queries. KSQL adds more flavour to it, because this use case is just one of data enrichment at the integration layer.

CQRS Event-sourcing and own database per microservice

I have some questions about event sourcing and CQRS in a microservices architecture.
I understand that after a command is sent, some microservice executes it and emits an event. The event store subscribes to it and saves it in its database. Also, some ReadModel, based on this event, generates and saves optimized data in a read database.
My first question is: can a microservice have its own database and store data in it too? Or, in the event-sourcing approach, do microservices not have their own databases, with everything stored only in the event store?
My second question is: when I execute a command in a microservice and need some data for validation purposes, do I need to call the ReadModel, or what? Assuming microservices don't have their own databases, do I have no choice?
Can a microservice have its own database and store data in it too?
Definitely, a microservice can have its own database. But let's use terms from ES/CQRS. A database can be an Event Store (an append-only log of immutable events) or a Read Model (a database used to answer queries, populated by processing events).
So, a microservice can have its own Read Model, populated from the events of other microservices.
Or a microservice can process commands and save events to a shared Event Store.
Or a microservice can process commands and save events to its own Event Store.
The choice is yours, and it depends on the degree of separation you want to achieve among microservices.
I would put all events that are usually consumed together into the same Event Store, which means I should be able to query for these events and get a single ordered stream as a result.
When I execute a command in a microservice and need some data for validation purposes, do I need to call the ReadModel, or what?
A command is executed by an aggregate, which has its own state. This state is built by processing all events for this aggregate, and it is this state that should be used to validate a command.
You cannot (and should not) talk to Read Models in the command handler, primarily because those read models are not consistent with the aggregate state. The aggregate state is consistent.
You can query a Read Model before sending a command (to make sure it can be sent). But in the command handler you need to rely on the aggregate state only.
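A minimal sketch of that idea, assuming an in-memory dictionary as the event store and illustrative event and command names (`USER_CREATED`, `USER_SUSPENDED`, `CREATE_USER`, `SUSPEND_USER`):

```python
class UserAggregate:
    """State is rebuilt only from this aggregate's own events."""

    def __init__(self, events):
        self.exists = False
        self.suspended = False
        for event in events:
            self.apply(event)

    def apply(self, event):
        if event["type"] == "USER_CREATED":
            self.exists = True
        elif event["type"] == "USER_SUSPENDED":
            self.suspended = True

    def handle(self, command):
        # Validation relies on aggregate state only, never on a read model.
        if command["type"] == "CREATE_USER":
            if self.exists:
                raise ValueError("user already exists")
            return [{"type": "USER_CREATED", "id": command["id"], "name": command["name"]}]
        if command["type"] == "SUSPEND_USER":
            if not self.exists or self.suspended:
                raise ValueError("user cannot be suspended")
            return [{"type": "USER_SUSPENDED", "id": command["id"]}]
        raise ValueError("unknown command")


event_store = {}  # hypothetical in-memory store: aggregate id -> list of events


def dispatch(command):
    # Load the history, rebuild the aggregate, validate, then append new events.
    history = event_store.setdefault(command["id"], [])
    new_events = UserAggregate(history).handle(command)
    history.extend(new_events)
    return new_events


dispatch({"id": 123, "type": "CREATE_USER", "name": "somename"})
```

A second `CREATE_USER` for id 123 would be rejected, because the rebuilt aggregate already knows the user exists.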
There is a famous case of registering a user with the requirement of a unique name. As a preliminary validation, your UI code can query the read model and tell the user that the entered name is taken. If the name is not taken, the UI lets the user issue a command. I'm assuming your aggregate root is the user.
But when processing this command ({id:123, type:CREATE_USER, name:somename}) you cannot check whether "somename" is taken, because the aggregate state for user 123 does not contain a list of taken names. You could query some AllUsernames read model, but it can be milliseconds old, and some other user could already have taken "somename". So in this scenario, you will find the duplicate while adding names to the read model. At that point you can take a compensating action: usually, issue a command to suspend the user with the duplicated name and ask them to re-register or change their name.
It may seem strange, but if you have a truly distributed system with several replicas of the user list, you'll have the same problem. So why not embrace the fact that data is never fully consistent, and just deal with it?
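The duplicate-name scenario can be sketched as a projection that detects the clash and issues a compensating command. The `taken_names` read model, the `issued_commands` queue, and the `SUSPEND_USER` command shape are assumptions for illustration.

```python
taken_names = {}      # read model: name -> id of the user who holds it
issued_commands = []  # hypothetical outgoing command queue


def project_user_created(event):
    # The first user to be projected with a given name keeps it.
    owner = taken_names.setdefault(event["name"], event["id"])
    if owner != event["id"]:
        # Duplicate found while updating the read model: compensate.
        issued_commands.append(
            {"type": "SUSPEND_USER", "id": event["id"], "reason": "duplicate name"}
        )


project_user_created({"type": "USER_CREATED", "id": 123, "name": "somename"})
project_user_created({"type": "USER_CREATED", "id": 456, "name": "somename"})
```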

microservice messaging db-assigned identifiers

The company I work for is investigating moving from our current monolithic API to microservices. Our current API is heavily dependent on spring and we use SQL server for most persistence. Our microservice investigation is leaning toward spring-cloud, spring-cloud-stream, kafka, and polyglot persistence (isolated database per microservice).
I have a question about how messaging via kafka is typically done in a microservice architecture. We're planning to have a coordination layer between the set of microservices and our client applications, which will coordinate activities across different microservices and isolate clients from changes to microservice APIs. Most of the stuff we've read about using spring-cloud-stream and kafka indicate that we should use streams at the coordination layer (source) for resource change operations (inserts, updates, deletes), with the microservice being one consumer of the messages.
Where I've been having trouble with this is inserts. We make heavy use of database-assigned identifiers (identity columns/auto-increment columns/sequences/surrogate keys), and they're usually assigned as part of a post request and returned to the caller. The coordination layer may be saving multiple things using different microservices and often needs the assigned identifier from one insert before it can move on to the next operation. Using messaging between the coordination layer and microservices for inserts makes it so the coordination layer can't get a response from the insert operation, so it can't get the assigned identifier that it needs. Additionally, other consumers on the stream (i.e. consumers that publish the data to a data warehouse) really need the message to contain the assigned identifier.
How are people dealing with this problem? Are database-assigned identifiers an anti-pattern in microservices? Should we expose separate microservice endpoints that return database-assigned identifiers so that the coordination layer can make a synchronous call to get an identifier before calling the asynchronous insert? We could use UUIDs but our DBAs hate those as primary keys, and they couldn't be used as an order number or other user-facing generated ids.
If you can programmatically create the identifier earlier while receiving from the message source, you can embed the identifier as part of the message header and subsequently use the message header information during database inserts and in any other consumers.
But this approach requires a separate verification by the other consumers against the database to process only the committed transactions (if you are concerned about processing only the inserts).
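A rough sketch of this idea follows. It assumes a UUID as the pre-generated identifier, which is only one option (any scheme that produces the id before the insert would do), and uses in-memory dictionaries and a hypothetical `entity-id` header in place of a real database and message headers.

```python
import uuid


def build_insert_message(payload):
    # Assumption: a UUID is acceptable here; the id is created before any insert happens.
    return {"headers": {"entity-id": str(uuid.uuid4())}, "payload": payload}


database = {}   # stand-in for the microservice's own database
warehouse = []  # stand-in for a downstream consumer's store


def insert_consumer(message):
    # The insert uses the id carried in the header instead of a DB-assigned one.
    database[message["headers"]["entity-id"]] = message["payload"]


def warehouse_consumer(message):
    entity_id = message["headers"]["entity-id"]
    # Verify against the database so that only committed inserts are processed.
    if entity_id not in database:
        return False  # not committed yet; the caller may retry later
    warehouse.append((entity_id, message["payload"]))
    return True
```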
At our company, we built a dedicated service responsible for generating unique ids, and every other service grabs the ids it needs from there.
These generated ids can't be used as an order number, but I think they shouldn't be used for that anyway. If you need to sort by creation date, it's better to have a created_date field.
One more thing that used to bug me about this approach is that the primary resource might be persisted after another resource that references it by id. For example, an insert-user payload and an insert-user-address payload are sent asynchronously. The insert-user payload contains a generated unique id, and the user address payload contains that id as a foreign reference back to the user. The insert-user-address request might be processed before the insert-user request, but that's totally fine. I think it's called eventual consistency.

CQRS How to handle tasks of users / Stale data

I understand that data is always stale.
What is a way to handle a workflow task, like Approve Invoice? The user is allowed to execute this task only once. When it is processed by an async service it can take some seconds (or longer). In the meantime the user can approve the same invoice again, because the task has not been updated in the DB yet.
Any ideas about this are appreciated.
The domain model must enforce consistency. The model on the write side should not be considered stale, only the projections on the read side.
It doesn't matter if the approval event hasn't been projected into the read model. But if the user sends an invalid command based on stale data, the domain model needs to know that the approval had already happened.
Your domain's repository should always return the aggregate root in its latest state (no matter whether you use event sourcing or state-based persistence such as a SQL DB).
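As a small illustration, assuming a state-based in-memory repository and hypothetical names (`Invoice`, `InvoiceRepository`, `handle_approve_invoice`), the write side can reject a repeated approval regardless of what the read side currently shows:

```python
class Invoice:
    def __init__(self, invoice_id):
        self.invoice_id = invoice_id
        self.approved_by = None

    def approve(self, user):
        # The domain model enforces the "approve only once" rule.
        if self.approved_by is not None:
            raise ValueError(f"invoice {self.invoice_id} is already approved")
        self.approved_by = user


class InvoiceRepository:
    """Hypothetical in-memory repository that always returns the latest state."""

    def __init__(self):
        self._invoices = {}

    def save(self, invoice):
        self._invoices[invoice.invoice_id] = invoice

    def get(self, invoice_id):
        return self._invoices[invoice_id]


def handle_approve_invoice(repo, invoice_id, user):
    invoice = repo.get(invoice_id)  # latest write-side state, not a projection
    invoice.approve(user)
    repo.save(invoice)


repo = InvoiceRepository()
repo.save(Invoice("inv-1"))
handle_approve_invoice(repo, "inv-1", "alice")
```

A second approval command for the same invoice raises an error, even if the UI still showed the task as open.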

Resources