DocumentDB unique concurrent insert? - event-sourcing

I have a horizontally scaled, event-sourced application that runs on an Azure Service Bus topic and a Service Bus queue. The events that build up my domain model's state are received through the topic by all of my servers, while the ones on the queue (which arrive far more often and do not mutate domain model state) are distributed among the servers to spread the load.
Every time one of my servers receives an event through the queue or topic, it stores it in DocumentDB, which it uses as an event store.
Here's the problem: how can I be sure the same document is not inserted twice? Say three servers receive the same event and all try to store it. How can I make the write fail for two of them when they all attempt it at the same time? Is there any form of unique constraint I can set in DocumentDB, or some kind of transaction scope, to prevent the document from being inserted twice?

The id property of each document carries a uniqueness constraint within a collection (scoped to the partition key value in partitioned collections). You can use this constraint to ensure that duplicate documents are not written: derive the document id deterministically from the event, and concurrent inserts of the same event will be rejected with a conflict.
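A minimal sketch of that, assuming the Azure Cosmos DB (formerly DocumentDB) Java SDK and illustrative database/container names: set the document's id to a value derived deterministically from the event (for example the Service Bus MessageId) and let the losing servers handle the 409 Conflict.

    import com.azure.cosmos.CosmosClient;
    import com.azure.cosmos.CosmosClientBuilder;
    import com.azure.cosmos.CosmosContainer;
    import com.azure.cosmos.CosmosException;

    public class EventStore {

        private final CosmosContainer events;

        public EventStore(String endpoint, String key) {
            CosmosClient client = new CosmosClientBuilder()
                    .endpoint(endpoint)
                    .key(key)
                    .buildClient();
            // "eventstore" and "events" are placeholder database/container names.
            this.events = client.getDatabase("eventstore").getContainer("events");
        }

        // The document's id is the deterministic event id shared by all servers.
        public static class StoredEvent {
            public String id;
            public String type;
            public String payload;
        }

        // Returns true if this call stored the event, false if another server beat us to it.
        public boolean tryAppend(StoredEvent event) {
            try {
                events.createItem(event); // id is unique, so a duplicate insert is rejected
                return true;
            } catch (CosmosException e) {
                if (e.getStatusCode() == 409) {
                    return false; // conflict: a document with this id already exists
                }
                throw e;
            }
        }
    }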

Related

How to deal with concurrent events in an event-driven architecture

Suppose I have an eCommerce application designed in an event-driven architecture. I would publish events like ProductCreated and ProductPriceUpdated. Typically both events are published on separate channels.
Now a consumer of those events comes into play and reacts to them, for example to generate a price chart for specific products.
In fact, this consumer must first consume the ProductCreated event to create a Product entity with the necessary information in its own bounded context. Only once a product has been created can price points be added to the chart. Depending on the consumer's performance, it can easily happen that those events arrive "out of order".
What are the possible strategies to fulfill this requirement?
The following came to my mind:
Publish both events onto the same channel with ordering guarantees. For example, in Kafka both events would be published to the same partition. However, this would mean the topic/partition grows with all of these events, I would have to deal with different schemas, and the documentation would grow.
Use documents over events. Simply publish every state change of the product entity as a single ProductUpdated event or similar. This way I would lose the semantics of the message and need to figure out on the consumer side what exactly changed.
Defer event consumption. If my consumer receives a ProductPriceUpdated event and no such product has been created yet, I postpone consumption by storing the event in a database and coming back to it later, or use retry topics in Kafka terms.
Create a minimal entity. Once I receive a ProductPriceUpdated event I would probably have a correlation id or something to identify the entity, so I simply create an entity with just this id, and once the ProductCreated event arrives I fill in the missing information.
Just thought I'd give you some inline comments, based on my understanding of your requirements (#1, #3 and #4).
Publish both events onto the same channel with ordering guarantees. For example, in Kafka both events would be published to the same partition. However, this would mean the topic/partition grows with all of these events, I would have to deal with different schemas, and the documentation would grow.
[Chris]: Apache Kafka preserves the order of messages within a partition, but the mapping of keys to partitions is consistent only as long as the number of partitions in a topic does not change. So as long as the partition count is constant, you can be sure the per-key order is guaranteed. When key-based partitioning is important, the easiest solution is to create topics with sufficient partitions up front and never add partitions.
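A minimal sketch of option #1 under that constraint (topic name and payloads are illustrative): keying both event types by the product id routes them to the same partition, so a consumer sees ProductCreated before ProductPriceUpdated.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Properties;

    public class ProductEventPublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String productId = "product-42"; // the partitioning key for both events
                // Same key => same partition => consumption in publish order.
                producer.send(new ProducerRecord<>("product-events", productId,
                        "{\"type\":\"ProductCreated\",\"productId\":\"product-42\",\"name\":\"Widget\"}"));
                producer.send(new ProducerRecord<>("product-events", productId,
                        "{\"type\":\"ProductPriceUpdated\",\"productId\":\"product-42\",\"price\":9.99}"));
            }
        }
    }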
Defer event consumption. If my consumer receives a ProductPriceUpdated event and no such product has been created yet, I postpone consumption by storing the event in a database and coming back to it later, or use retry topics in Kafka terms.
[Chris]: If latency is not a concern, and if you are okay with the additional operational overhead of adding a new component to your solution, such as a storage layer, this pattern looks fine.
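A rough sketch of that deferral (all types here are illustrative placeholders, not a real API): when the referenced product is not known yet, the event is parked and retried later instead of being dropped.

    import java.util.Optional;

    public class ProductPriceConsumer {

        // Illustrative collaborators; real ones would be backed by a database or a retry topic.
        interface ProductRepository { Optional<String> findById(String productId); }
        interface ParkedEventStore { void park(String productId, double price); }
        interface PriceChart { void addPricePoint(String productId, double price); }

        private final ProductRepository products;
        private final ParkedEventStore parked;
        private final PriceChart chart;

        public ProductPriceConsumer(ProductRepository products, ParkedEventStore parked, PriceChart chart) {
            this.products = products;
            this.parked = parked;
            this.chart = chart;
        }

        public void onProductPriceUpdated(String productId, double price) {
            if (products.findById(productId).isEmpty()) {
                // Product not created yet: park the event and let a scheduled job retry it,
                // or republish it to a retry topic in Kafka terms.
                parked.park(productId, price);
                return;
            }
            chart.addPricePoint(productId, price);
        }
    }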
Create a minimal entity. Once I receive a ProductPriceUpdated event I would probably have a correlation id or something to identify the entity, so I simply create an entity with just this id, and once the ProductCreated event arrives I fill in the missing information.
[Chris]: This is a fairly common integration pattern (messaging layer -> backend REST API) that works over a unique identifier, in this case a correlation id.
This can easily be achieved if you have a separate topic and consumer per event type and the order of messages from the producer is guaranteed. That makes option #1 unnecessary.
From my perspective, options #3 and #4 amount to much the same thing, and #4 would be ideal.
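A sketch of option #4 along those lines (in-memory only for brevity; a real projection would use a persistent store): a price update that arrives first creates a placeholder entity keyed by the correlation id, and ProductCreated fills in the rest later.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ProductProjection {

        static class Product {
            String id;          // correlation id shared by both events
            String name;        // filled in by ProductCreated
            Double latestPrice; // filled in by ProductPriceUpdated
        }

        private final Map<String, Product> products = new ConcurrentHashMap<>();

        private Product minimalEntity(String id) {
            Product p = new Product();
            p.id = id;
            return p;
        }

        public void onProductPriceUpdated(String productId, double price) {
            // Create a minimal placeholder if the product is not known yet.
            products.computeIfAbsent(productId, this::minimalEntity).latestPrice = price;
        }

        public void onProductCreated(String productId, String name) {
            // Fill in the missing information whenever the creation event arrives.
            products.computeIfAbsent(productId, this::minimalEntity).name = name;
        }
    }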
On another note, if you are thinking of bringing Kafka Streams/KTables into your solution, just go for it: there is a strong relationship between streams and tables known as the stream-table duality. This duality lets your application be more elastic, support fault-tolerant stateful processing, and run interactive queries. KSQL adds more flavour on top, because this use case is really just data enrichment at the integration layer.
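If you do go down the Kafka Streams route, here is a minimal sketch of that enrichment (topic names are illustrative, both topics must be keyed by productId, and serdes are left to the application config): the products topic is read as a KTable and price updates are joined against it.

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    public class PriceEnrichmentTopology {

        public static Topology build() {
            StreamsBuilder builder = new StreamsBuilder();
            KTable<String, String> products = builder.table("products");             // latest product state per key
            KStream<String, String> priceUpdates = builder.stream("product-prices"); // stream of price changes

            // Enrich each price update with its product. With this inner join, updates for
            // products not yet in the table are dropped; use leftJoin to keep them instead.
            priceUpdates
                    .join(products, (price, product) -> product + " @ " + price)
                    .to("product-price-points");

            return builder.build();
        }
    }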

Kafka connect - connector per event type

I'm using Kafka to transfer application events to a historical SQL database. The events are structured differently depending on their type, e.g. OrderEvent and ProductEvent, which are related through Order.productId = Product.id. I want to store these events in separate SQL tables. I came up with two approaches to transfer this data, but each has a technical problem.
Topic per event type - this approach is easy to configure, but the order of events is not guaranteed across multiple topics, so there may be a problem when the product doesn't exist yet at the time the order is consumed. This might be handled with foreign keys in the database, so the consumer of the order topic keeps failing (and retrying) until the product is available in the database.
One topic with multiple event types - using the schema registry it is possible to store multiple event types in one topic. Events are now properly ordered, but I'm stuck on the JDBC connector configuration. I haven't found a way to set the SQL table name depending on the event type. Is it possible to configure a connector per event type?
Is the first approach with foreign keys correct? Is it possible to configure a connector per event type in the second approach? Or is there another solution?

system design - How to update cache only after persisted to database?

After watching this awesome talk by Martin Kleppmann about how Kafka can be used to stream events so that we can get rid of two-phase commits, I have a couple of questions related to updating a cache only when the database has been updated properly.
Problem Statement
Let's say you have a Redis cache which stores the user's profile pic and a Postgres database which is used for all the user-related operations (creation, update, deletion, etc.).
I want to update my Redis cache if and only if a new user has been successfully added to my database.
How can I do that using Kafka?
If I am to take the example given in the video then the workflow would follow something like this:
User registers
Request is handled by the User Registration Microservice.
User Registration Microservice inserts a new entry into the Users table.
It then generates a User Creation Event on the user_created topic.
Cache Population Microservice consumes the newly created User Creation Event.
Cache Population Microservice updates the Redis cache.
The problem starts here: what happens if the User Registration Microservice crashes just after writing to the database but before it can send the event to Kafka?
What would be the correct way of handling this?
Does the User Registration Microservice keep track of the last event it published? How can it do that reliably? Does it write to a DB? Then the problem starts all over again: what if it published the event to Kafka but failed before it could update its last known offset?
There are three broad approaches one can take for this:
There's the transactional outbox pattern, wherein, in the same transaction as inserting the new entry into the user table, a corresponding user creation event is inserted into an outbox table. Some process then eventually queries that outbox table, publishes the events in that table to Kafka, and deletes the events in the table. Since the inserts are in the same transaction, they either both occur or neither occurs; barring a bug in the process which publishes the outbox to Kafka, this guarantees that every user insert eventually has an associated event published (at least once) to Kafka. (A sketch of the outbox write follows after these three approaches.)
There's a more event-sourcingish pattern, where you publish the user creation event to Kafka and then some consuming process inserts into the user table based on the event. Since this happens with a delay, this strongly suggests that the user registration service needs to keep state of which users it has published creation events for (with the combination of Kafka and Postgres being the source of truth for this). Since Kafka allows a message to be consumed by arbitrarily many consumers, a different consumer can then update Redis.
Change data capture (e.g. Debezium) can be used to tie into Postgres' write-ahead log (as Postgres actually event sources under the hood...) and publish an event that essentially says "this row was inserted into the user table" to Kafka. A consumer of that event can then translate that into a user created event.
CDC in some sense moves the transactional outbox into the infrastructure, at the cost of requiring that the context it inherently throws away be reconstructed later (which is not always possible).
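As referenced above, a minimal sketch of the outbox write with plain JDBC (table and column names are illustrative; a separate relay process reads the outbox table, publishes its rows to Kafka, and then deletes them): both inserts share one transaction, so they commit or roll back together.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class UserRegistration {

        public void registerUser(Connection conn, String userId, String email) throws SQLException {
            conn.setAutoCommit(false);
            try (PreparedStatement insertUser = conn.prepareStatement(
                         "INSERT INTO users (id, email) VALUES (?, ?)");
                 PreparedStatement insertOutbox = conn.prepareStatement(
                         "INSERT INTO outbox (event_type, payload) VALUES (?, ?)")) {

                insertUser.setString(1, userId);
                insertUser.setString(2, email);
                insertUser.executeUpdate();

                insertOutbox.setString(1, "UserCreated");
                insertOutbox.setString(2, "{\"userId\":\"" + userId + "\",\"email\":\"" + email + "\"}");
                insertOutbox.executeUpdate();

                conn.commit();   // both rows become visible atomically
            } catch (SQLException e) {
                conn.rollback(); // neither the user row nor the event is persisted
                throw e;
            }
        }
    }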
That said, I'd strongly advise against having ____ creation be a microservice and I'd likewise strongly advise against a RInK store like Redis. Both of these smell like attempts to paper over architectural deficiencies by adding microservices and caches.
The one-foot-on-the-way-to-event-sourcing approach isn't one I'd recommend, but if one starts there, the requirement to make the registration service stateful suddenly opens up possibilities which may remove the need for Redis, limit the need for a Kafka-like thing, and allow you to treat the existence of a DB as an implementation detail.

microservice messaging db-assigned identifiers

The company I work for is investigating moving from our current monolithic API to microservices. Our current API is heavily dependent on Spring and we use SQL Server for most persistence. Our microservice investigation is leaning toward Spring Cloud, Spring Cloud Stream, Kafka, and polyglot persistence (an isolated database per microservice).
I have a question about how messaging via Kafka is typically done in a microservice architecture. We're planning to have a coordination layer between the set of microservices and our client applications, which will coordinate activities across different microservices and isolate clients from changes to microservice APIs. Most of the material we've read about using spring-cloud-stream and Kafka indicates that we should use streams at the coordination layer (source) for resource change operations (inserts, updates, deletes), with the microservice being one consumer of the messages.
Where I've been having trouble with this is inserts. We make heavy use of database-assigned identifiers (identity columns/auto-increment columns/sequences/surrogate keys), and they're usually assigned as part of a post request and returned to the caller. The coordination layer may be saving multiple things using different microservices and often needs the assigned identifier from one insert before it can move on to the next operation. Using messaging between the coordination layer and microservices for inserts makes it so the coordination layer can't get a response from the insert operation, so it can't get the assigned identifier that it needs. Additionally, other consumers on the stream (i.e. consumers that publish the data to a data warehouse) really need the message to contain the assigned identifier.
How are people dealing with this problem? Are database-assigned identifiers an anti-pattern in microservices? Should we expose separate microservice endpoints that return database-assigned identifiers so that the coordination layer can make a synchronous call to get an identifier before calling the asynchronous insert? We could use UUIDs but our DBAs hate those as primary keys, and they couldn't be used as an order number or other user-facing generated ids.
If you can programmatically create the identifier earlier, when receiving the message from the source, you can embed the identifier in the message header and subsequently use that header during the database insert and in any other consumers.
This approach does require the other consumers to verify against the database separately so that they process only committed transactions (if you are concerned about processing only committed inserts).
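A sketch of that idea, assuming the identifier can be generated up front (here a UUID; the topic and header names are illustrative): the coordination layer assigns the id before publishing, and it travels in a record header so the owning microservice and any other consumers all see the same value.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.nio.charset.StandardCharsets;
    import java.util.Properties;
    import java.util.UUID;

    public class OrderCommandPublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // The coordination layer assigns the identifier instead of the database.
                String orderId = UUID.randomUUID().toString();
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("order-commands", orderId, "{\"item\":\"abc\",\"qty\":2}");
                record.headers().add("entity-id", orderId.getBytes(StandardCharsets.UTF_8));
                producer.send(record);
                // The caller already knows orderId and can continue without waiting for the insert.
            }
        }
    }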
At our company, we built a dedicated service responsible for generating unique ids, and every other service grabs the ids it needs from there.
These generated ids can't be used as an order number, but I don't think database ids should be used for that job anyway. If you need to sort by creation date, it's better to have a created_date field.
One more thing that used to bug me about this approach is that the primary resource might be persisted after another resource that references it by id. For example, an insert-user payload and an insert-user-address payload are sent asynchronously. The insert-user payload contains a generated unique id, and the user-address payload contains that id as a foreign reference back to the user. The insert-user-address request might be processed before the insert-user request, but that's totally fine. I think it's called eventual consistency.

How to handle set based consistency validation in CQRS?

I have a fairly simple domain model involving a list of Facility aggregate roots. Given that I'm using CQRS and an event-bus to handle events raised from the domain, how could you handle validation on sets? For example, say I have the following requirement:
Facilities must have a unique name.
Since I'm using an eventually consistent database on the query side, the data in it is not guaranteed to be accurate at the time the event processor processes an event.
For example, a FacilityCreatedEvent is sitting in the query database's event processing queue, waiting to be processed and written into the database. A new CreateFacilityCommand is sent to the domain to be processed. The domain services query the read database to see whether any other Facility is already registered with that name, but the query finds nothing because the earlier FacilityCreatedEvent has not yet been processed and written to the store. The new CreateFacilityCommand therefore succeeds and raises another FacilityCreatedEvent, which blows up when the event processor tries to write it into the database and finds that another Facility already exists with that name.
The solution I went with was to add a System aggregate root that could maintain a list of the current Facility names. When creating a new Facility, I use the System aggregate (only one System as a global object / singleton) as a factory for it. If the given facility name already exists, then it will throw a validation error.
This keeps the validation constraints within the domain and does not rely on the eventually consistent query store.
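A rough sketch of that factory approach (types and the exception choice are illustrative, not taken from the question's codebase):

    import java.util.HashSet;
    import java.util.Set;

    // A single System aggregate acting as the factory for Facility, enforcing name
    // uniqueness inside the write model instead of relying on the query store.
    public class SystemAggregate {

        private final Set<String> facilityNames = new HashSet<>();

        public Facility createFacility(String name) {
            if (!facilityNames.add(name)) { // add() returns false if the name is already taken
                throw new IllegalArgumentException("Facility name already in use: " + name);
            }
            return new Facility(name); // a real model would also raise FacilityCreatedEvent here
        }

        public static class Facility {
            private final String name;
            Facility(String name) { this.name = name; }
            public String getName() { return name; }
        }
    }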
Three approaches are outlined in Eventual Consistency and Set Validation:
If the problem is rare or not important, deal with it administratively, possibly by sending a notification to an admin.
Dispatch a DuplicateFacilityNameDetected event, which could kick off an automated resolution process.
Maintain a Service that knows about used Facility names, maybe by listening to domain events and maintaining a persistent list of names. Before creating any new Facility, check with this service first.
Also see this related question: Uniqueness validation when using CQRS and Event sourcing
In this case, you may implement a simple CRUD-style service that basically does an insert into a SQL table with a primary key constraint.
The insert will only happen once. When a duplicate command arrives carrying a value that should exist only once, the aggregate calls the service, the insert fails with a primary key violation, an error is thrown, and the whole process fails: no events are generated and nothing is written to the query side. Optionally, the failure can be recorded in a table for eventual-consistency checking, so the user can query the status of command processing by polling the Command Status view model with the command's GUID.
Obviously, when the command holds a value that does not yet exist in the primary key table, the operation succeeds.
The primary key table should only be accessed through this service, but because you have implemented event sourcing, you can replay the events to rebuild it.
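A minimal sketch of such a service with plain JDBC (the facility_names table, with the name as primary key, is illustrative): the constraint makes the second insert of the same name fail, which the caller turns into a validation error.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class FacilityNameRegistry {

        private final Connection conn;

        public FacilityNameRegistry(Connection conn) {
            this.conn = conn;
        }

        // Returns true if the name was claimed, false if it is already taken.
        public boolean tryClaim(String name) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO facility_names (name) VALUES (?)")) {
                ps.setString(1, name);
                ps.executeUpdate();
                return true;
            } catch (SQLException e) {
                // SQLState class 23 = integrity constraint violation (duplicate primary key)
                if (e.getSQLState() != null && e.getSQLState().startsWith("23")) {
                    return false;
                }
                throw e;
            }
        }
    }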
Because the uniqueness check is done before the data is written, a better approach is to build an event-tracking service that sends a notification when the process has finished or been terminated.
