microservice messaging db-assigned identifiers - microservices

The company I work for is investigating moving from our current monolithic API to microservices. Our current API is heavily dependent on spring and we use SQL server for most persistence. Our microservice investigation is leaning toward spring-cloud, spring-cloud-stream, kafka, and polyglot persistence (isolated database per microservice).
I have a question about how messaging via Kafka is typically done in a microservice architecture. We're planning to have a coordination layer between the set of microservices and our client applications, which will coordinate activities across different microservices and isolate clients from changes to microservice APIs. Most of what we've read about using spring-cloud-stream and Kafka indicates that we should use streams at the coordination layer (source) for resource change operations (inserts, updates, deletes), with the microservice being one consumer of the messages.
Where I've been having trouble with this is inserts. We make heavy use of database-assigned identifiers (identity columns/auto-increment columns/sequences/surrogate keys), and they're usually assigned as part of a POST request and returned to the caller. The coordination layer may be saving multiple things using different microservices and often needs the assigned identifier from one insert before it can move on to the next operation. If the coordination layer uses messaging to send inserts to the microservices, it gets no response from the insert operation, so it can't get the assigned identifier that it needs. Additionally, other consumers on the stream (e.g. consumers that publish the data to a data warehouse) really need the message to contain the assigned identifier.
How are people dealing with this problem? Are database-assigned identifiers an anti-pattern in microservices? Should we expose separate microservice endpoints that return database-assigned identifiers so that the coordination layer can make a synchronous call to get an identifier before calling the asynchronous insert? We could use UUIDs but our DBAs hate those as primary keys, and they couldn't be used as an order number or other user-facing generated ids.

If you can programmatically create the identifier earlier while receiving from the message source, you can embed the identifier as part of the message header and subsequently use the message header information during database inserts and in any other consumers.
But this approach requires a separate verification by the other consumers against the database to process only the committed transactions (if you are concerned about processing only the inserts).

At our company, we built a dedicated service responsible for generating unique ids, and every other service grabs the ids it needs from there.
These generated ids can't be used as an order number, but I don't think they should be used for that anyway. If you need to sort by creation date, it's better to have a created_date field.
One thing that used to bother me about this approach is that the primary resource might be persisted after another resource that references it by id. For example, an insert-user payload and an insert-user-address payload are sent asynchronously. The insert-user payload contains a generated unique id, and the user-address payload contains that id as a foreign reference back to the user. The insert-user-address request might be processed before the insert-user request, but that's totally fine. I think it's called eventual consistency.
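As a rough illustration of the idea above, here is a minimal in-memory sketch. Everything in it is an assumption made for illustration: `IdService` stands in for the dedicated id service (a real one would be a networked service), the message shapes are invented, and `UserStore` is a hypothetical consumer that accepts an address before its user has arrived.

```python
import itertools
import threading


class IdService:
    """Hypothetical stand-in for a central id service (in-memory counter)."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            return next(self._counter)


class UserStore:
    """Hypothetical consumer that tolerates an address arriving before its user."""

    def __init__(self):
        self.users = {}
        self.addresses = {}

    def handle(self, message):
        if message["type"] == "insert_user":
            self.users[message["id"]] = message["name"]
        elif message["type"] == "insert_user_address":
            # No check that the user exists yet: the reference resolves later.
            self.addresses[message["id"]] = (message["user_id"], message["street"])

    def dangling_addresses(self):
        """Ids of addresses whose user has not been persisted yet."""
        return [
            address_id
            for address_id, (user_id, _) in self.addresses.items()
            if user_id not in self.users
        ]


ids = IdService()
user_id = ids.next_id()
address_id = ids.next_id()

# Both payloads are built up front because the ids are known before any insert.
user_msg = {"type": "insert_user", "id": user_id, "name": "Ada"}
address_msg = {
    "type": "insert_user_address",
    "id": address_id,
    "user_id": user_id,
    "street": "1 Main St",
}

store = UserStore()
store.handle(address_msg)              # the child arrives first
pending = store.dangling_addresses()   # the address is temporarily dangling
store.handle(user_msg)                 # the parent arrives later
resolved = store.dangling_addresses()  # nothing is dangling any more
```

The trade-off is that the consumer's store cannot enforce a strict foreign key at insert time; it has to accept a reference that only becomes valid once the parent message is processed.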

Related

Migrating an asynchronous business flow to an event-driven system

In an effort to redesign a functional service based on an asynchronous flow into an event-driven one, we have come up with changes to different parts of the system. The service receives various statuses from external services through the API, performs computations, and persists the results to the data store. The core logic has now been moved out of the API by introducing a queue (Kafka). Similarly, the query functionality is provided through another interface (API) fronted by a web UI. With this, commands and queries are separated. See the diagram below.
I have a few questions about the approach:
Is it right to have the query API (read) service & the event-complete-handler (write) operate on the same database with both dependent on the DB schema? Or is it better to have the query-api read from the replica DB?
The core-business-logic, at the end of computation, writes only to database and not to db+Kafka in a single transaction. Persisting to the database is handled by the event-complete-handler. Is this approach better?
Say in the future, if the core-business-logic needs to query the database to do the computation on every event, can it directly read from the database? Again, does it not create DB schema dependency between the services?
Is it right to have the query API (read) service & the event-complete-handler (write) operate on the same database with both dependent on the DB schema? Or is it better to have the query-api read from the replica DB?
"Right" is a loaded term. The idea behind CQRS is that the pattern can allow you to separate commands and queries so that your system can be distributed and scaled out. Typically they would be using different databases in a SOA/Microservice architecture. One service would process the command which produces an event on the service bus. Query handlers would listen to this event to change their data for querying.
For example:
A service which processes the CreateWidgetCommand would produce an event onto the bus with the properties of the command.
Any query services which are interested in widgets for producing their data views would subscribe to this event type.
When the event is produced, the subscribed query handlers will consume the event and update their respective databases.
When a query is invoked, they interrogate their own database.
This means you could, in theory, make the command handler as simple as throwing the event onto the bus.
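The flow in the example above can be sketched roughly as follows. This is a simplified, synchronous, in-process illustration: `EventBus`, `WidgetListView`, `WidgetCountView`, and the `WidgetCreated` event name are hypothetical stand-ins, and a real service bus would deliver events asynchronously to separate services with their own databases.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-process bus; a real one would be Kafka, RabbitMQ, etc."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._subscribers[event_type]:
            handler(payload)


class WidgetListView:
    """Query side: keeps its own store shaped for listing widget names."""

    def __init__(self):
        self.rows = {}

    def on_widget_created(self, event):
        self.rows[event["widget_id"]] = event["name"]

    def list_names(self):
        return sorted(self.rows.values())


class WidgetCountView:
    """A second query side with a different shape: just a counter."""

    def __init__(self):
        self.count = 0

    def on_widget_created(self, event):
        self.count += 1


def handle_create_widget_command(bus, command):
    # The command handler only throws an event with the command's properties onto the bus.
    bus.publish("WidgetCreated", dict(command))


bus = EventBus()
list_view = WidgetListView()
count_view = WidgetCountView()
bus.subscribe("WidgetCreated", list_view.on_widget_created)
bus.subscribe("WidgetCreated", count_view.on_widget_created)

handle_create_widget_command(bus, {"widget_id": 1, "name": "sprocket"})
handle_create_widget_command(bus, {"widget_id": 2, "name": "gear"})

names = list_view.list_names()
total = count_view.count
```

Each view answers queries from its own data only, which is what lets the read side be scaled and reshaped independently of the command side.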
The core-business-logic, at the end of computation, writes only to database and not to db+Kafka in a single transaction. Persisting to the database is handled by the event-complete-handler. Is this approach better?
No. If your question is about the transactionality of distributed systems, you cannot rely on traditional transactions, since any command may affect any number of distributed data stores. Transactionality in distributed systems is often handled with a compensating transaction, where you code the steps to reverse the mutations made from consuming the bus messages.
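A compensating transaction can be sketched, under heavy simplification, as a list of (action, compensation) pairs. The `run_saga` helper and the stock/payment steps below are hypothetical and run in one process; in a real system each step would be a message consumed by a different service.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; undo completed steps on failure."""
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception:
        # Reverse the mutations already made, most recent first.
        for compensation in reversed(completed):
            compensation()
        return False
    return True


inventory = {"widget": 5}
ledger = []


def reserve_stock():
    inventory["widget"] -= 1


def release_stock():
    inventory["widget"] += 1


def charge_card():
    # Simulated failure in the second step.
    raise RuntimeError("payment service unavailable")


def refund_card():
    ledger.append("refund")


succeeded = run_saga([
    (reserve_stock, release_stock),
    (charge_card, refund_card),
])
```

Only steps that actually completed are compensated: here the stock reservation is released, but no refund is issued because the charge never went through.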
Say in the future, if the core-business-logic needs to query the database to do the computation on every event, can it directly read from the database? Again, does it not create DB schema dependency between the services?
If you follow the advice in the first response, the approach here should be obvious. All distinct queries are built from their own database, which are kept "eventually consistent" by consuming events from the bus.
Typically these architectures have major complexity downsides, especially if you are concerned with consistency and transactionality.
People don't generally implement this type of architecture unless there is a specific need.
You can however design your code around CQRS and DDD so that in the future, transitioning to this type of architecture can be relatively painless.
The topic of DDD is too dense for this answer. I encourage you to do some independent learning.

system design - How to update cache only after persisted to database?

After watching this awesome talk by Martin Kleppmann about how Kafka can be used to stream events so that we can get rid of two-phase commits, I have a couple of questions related to updating a cache only when the database has been updated properly.
Problem Statement
Let's say you have a Redis cache which stores the user's profile pic and a Postgres database which is used for all the user-related operations (creation, update, deletion, etc.).
I want to update my Redis cache if and only if a new user has been successfully added to my database.
How can I do that using Kafka ?
If I am to take the example given in the video then the workflow would follow something like this:
User registers
Request is handled by User Registration Micro service
User Registration Microservice inserts a new entry into the User's table.
Then generates a User Creation Event in the user_created topic.
Cache population microservice consumes the newly created User Creation Event
Cache population microservice updates the redis cache.
The problem is: what happens if the User Registration Microservice crashes just after writing to the database, before it has sent the event to Kafka?
What would be the correct way of handling this ?
Does the User Registration Microservice maintain the last event it published ? How can it reliably do that ? Does it write to a DB ? Then the problem starts all over again, what if it published the event to Kafka but failed before it could update its last known offset.
There are three broad approaches one can take for this:
There's the transactional outbox pattern, wherein, in the same transaction as inserting the new entry into the user table, a corresponding user creation event is inserted into an outbox table. Some process then eventually queries that outbox table, publishes the events in that table to Kafka, and deletes the events in the table. Since the inserts are in the same transaction, they either both occur or neither occurs; barring a bug in the process which publishes the outbox to Kafka, this guarantees that every user insert eventually has an associated event published (at least once) to Kafka.
There's a more event-sourcingish pattern, where you publish the user creation event to Kafka and then some consuming process inserts into the user table based on the event. Since this happens with a delay, this strongly suggests that the user registration service needs to keep state of which users it has published creation events for (with the combination of Kafka and Postgres being the source of truth for this). Since Kafka allows a message to be consumed by arbitrarily many consumers, a different consumer can then update Redis.
Change data capture (e.g. Debezium) can be used to tie into Postgres' write-ahead log (as Postgres actually event sources under the hood...) and publish an event that essentially says "this row was inserted into the user table" to Kafka. A consumer of that event can then translate that into a user created event.
CDC in some sense moves the transactional outbox into the infrastructure, at the cost of requiring that the context it inherently throws away be reconstructed later (which is not always possible).
That said, I'd strongly advise against having ____ creation be a microservice and I'd likewise strongly advise against a RInK store like Redis. Both of these smell like attempts to paper over architectural deficiencies by adding microservices and caches.
The one-foot-on-the-way-to-event-sourcing approach isn't one I'd recommend, but if one starts there, the requirement to make the registration service stateful suddenly opens up possibilities which may remove the need for Redis, limit the need for a Kafka-like thing, and allow you to treat the existence of a DB as an implementation detail.
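To make the transactional outbox pattern described above more concrete, here is a minimal sketch. It makes several assumptions: an in-memory SQLite database stands in for Postgres, the `publish` callable stands in for a Kafka producer, and the table names, column names, and function names are invented for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE outbox ("
    " id INTEGER PRIMARY KEY,"
    " topic TEXT NOT NULL,"
    " payload TEXT NOT NULL)"
)


def register_user(name):
    # One transaction: the user row and the outbox row commit together or not at all.
    with conn:
        cur = conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
        user_id = cur.lastrowid
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("user_created", json.dumps({"user_id": user_id, "name": name})),
        )
    return user_id


def relay_outbox(publish):
    """Publish pending outbox rows, deleting each one after it has been published."""
    rows = conn.execute("SELECT id, topic, payload FROM outbox ORDER BY id").fetchall()
    for row_id, topic, payload in rows:
        # A crash between publish and delete causes a re-publish: at-least-once delivery.
        publish(topic, json.loads(payload))
        with conn:
            conn.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
    return len(rows)


published = []
new_id = register_user("ada")
sent = relay_outbox(lambda topic, event: published.append((topic, event)))
remaining = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
```

Because delivery is at-least-once, the consumer that updates the cache should be idempotent, for example by keying its writes on `user_id`.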

How to sync data between databases (each database for each instance of a service) in Microservices?

If each instance of a service has a separate database in a microservices architecture, how can we keep the data synced? For instance, if instance#1 serves a request and stores data in its database db#1, and another request on instance#2 wants the data that was inserted into db#1 through instance#1, how can the database db#2 of instance#2 get the data from the database db#1 of instance#1? I think z-scaling is the solution here!
The microservice architecture uses a pattern called 'eventual consistency'. As you described, newly inserted data won't be directly available in all databases. You can read more about it here
That being said, the CQRS pattern is a popular way to solve the data distribution / eventual consistency problem.
By using a message broker / bus, you can publish so-called 'events' on a queue.
Microservices interested in changes to certain entities can subscribe to those entities and save them in their own database.
This enables loosely coupled microservices, and the data a microservice needs for certain entities is stored in its own database. Data duplication is OK, since we use eventual consistency to make sure that (eventually) everything is in sync across all microservices.
More information about the CQRS pattern using microservices can be found here
Here's a more practical example of something I'm working on right now. The language is Dutch, but the flow should be self-explanatory:
Hope this helps!
I suggest reading up on the following topics: CQRS, microservices, eventual consistency, and message brokers (RabbitMQ, Kafka, etc.)

Microservice cross-db referential integrity

We have a database that manages codes, such as a list of valid currencies, a list of country codes, etc (hereinafter known as CodesDB).
We also have multiple microservices that in a monolithic app + database would have foreign key constraints to rows in tables in the CodesDB.
When a microservice receives a request to modify data, what are my options for ensuring the codes passed in the request are valid?
I am currently leaning towards having the CodesDB microservice post an event onto a service bus announcing when a code is added or modified - and then each other microservice interested in that type of code (country / currency / etc) can then issue an API request to the CodeDB microservice to grab the state it needs and reflect the changes in its own local DB. That way we get referential integrity within each microservice DB.
Is this the correct approach? Are there any other recommended approaches?
Asynchronous event-based notification is a pattern commonly used in the microservices world to achieve eventual consistency. Depending on how strict your consistency requirements are, you may need additional checks.
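The event-driven approach described in the question could look roughly like the sketch below. It is only an illustration under assumptions: an in-memory SQLite database stands in for the microservice's local DB, `fetch_code` is a hypothetical stand-in for the API request to the CodesDB microservice, and the table and event shapes are invented. The upsert syntax requires SQLite 3.24 or newer.

```python
import sqlite3

local_db = sqlite3.connect(":memory:")
local_db.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
local_db.execute("CREATE TABLE currency_codes (code TEXT PRIMARY KEY, name TEXT NOT NULL)")
local_db.execute(
    "CREATE TABLE payments ("
    " id INTEGER PRIMARY KEY,"
    " amount_cents INTEGER NOT NULL,"
    " currency TEXT NOT NULL REFERENCES currency_codes(code))"
)


def on_code_changed(event, fetch_code):
    """Event handler: fetch the current state of the code and upsert the local copy."""
    details = fetch_code(event["code"])
    with local_db:
        local_db.execute(
            "INSERT INTO currency_codes (code, name) VALUES (?, ?) "
            "ON CONFLICT(code) DO UPDATE SET name = excluded.name",
            (details["code"], details["name"]),
        )


def record_payment(amount_cents, currency):
    """Return True if stored, False if the local foreign key rejects the code."""
    try:
        with local_db:
            local_db.execute(
                "INSERT INTO payments (amount_cents, currency) VALUES (?, ?)",
                (amount_cents, currency),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Hypothetical CodesDB state, reachable only through fetch_code.
codes_service = {"EUR": {"code": "EUR", "name": "Euro"}}

before_sync = record_payment(1000, "EUR")   # code not replicated yet
on_code_changed({"type": "CodeChanged", "code": "EUR"}, codes_service.__getitem__)
after_sync = record_payment(1000, "EUR")    # local copy now satisfies the FK
unknown = record_payment(1000, "XXX")       # never announced by CodesDB
```

The first call shows the cost of eventual consistency: a valid code is rejected until the event has been processed, so the caller needs a retry or a fallback lookup if that window matters.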
Other possible approaches:
Read-only data stores using materialized views. This is a form of the CQRS pattern in which data from multiple services is stored in denormalized form in a read-only data store. The data is updated asynchronously using the approach mentioned above. Consumers get fast access to the data without having to query multiple services.
Caching. You could also use a distributed or replicated cache, depending on your performance and consistency requirements.

Distributed database design style for microservice-oriented architecture

I am trying to convert a monolithic application into a microservice-oriented architecture. On the back end I am using the Spring and Spring Boot frameworks, on the front end Angular 2, and PostgreSQL as the database.
Here my confusion is that, when I am designing my databases as distributed, according to functionalities it may contain 5 databases. That is, I would be partitioning vertically. I would then implement inter-microservice communication to deliver the complete functionality.
The other option I am considering is to partition the current structure horizontally. My domain is universities, so half of the universities would go into one DB and the rest into another, with services deployed per region (one deployment for each set of universities).
Currently I am decided to continue with the last mentioned approach. I am new to this kind of architectural work, and a beginner in the world of microservices and distributed databases. Can someone confirm whether this approach will solve my problem? Can I continue with my second approach - horizontal partitioning of databases according to domain object?
Can I continue with my second approach - Horizontal partitioning of databases according to domain object?
Temporarily yes, if based on that you are able to scale your current system to meet your needs.
Now let's think about why you want to move to microservices as a development style in the first place:
Small components - easier to manage
Independently deployable - continuous delivery
Multiple Languages
The code is organized around business capabilities
and .....
When moving to microservices, you should not have multiple services reading directly from each other's databases, as that makes them tightly coupled.
One service should be completely ignorant of how another service has designed its internal structure.
Now, if you want to move towards microservices and take full advantage of them, you should partition vertically, as you describe, and have the services talk to each other.
Also, while moving towards microservices you will run into lots of other problems. I tried to compile how one should get started with microservices at this link.
How to separate services which are reading data from same table:
Now let's first create a dummy example: we have three services, Order, Shipping, and Customer, each a separate microservice.
Following are the ways in which multiple services require data from same table:
Service one needs to read data from other service for things like validation.
Order and shipping service might need some data from customer service to complete their operation.
E.g.: when placing an order, one calls the Order Service API with a customer id, and the Order Service might need to validate whether it is a valid customer or not.
One approach: database-level exposure -- not recommended -- use the same customer table -- which binds the Order Service to the Customer Service implementation
Another approach: call another service to get the data
Variation 1: call the Customer Service to check whether the customer exists and get some customer data such as the name, and save this in the Order Service
Variation 2: do not validate while placing the order; on the OrderPlaced event, check asynchronously against the Customer Service, validate, and update the state of the order if required
I recommend the second approach (calling another service to get the data), choosing the variation based on the consistency you need.
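Variation 2 above can be sketched roughly as follows. This is an in-process illustration with invented names: `CustomerService.exists` stands in for an API call to the Customer Service, the `events` list stands in for a message broker topic, and the `PENDING`, `CONFIRMED`, and `REJECTED` states are assumptions.

```python
class CustomerService:
    """Hypothetical stand-in for the Customer Service API."""

    def __init__(self, known_ids):
        self._known_ids = set(known_ids)

    def exists(self, customer_id):
        return customer_id in self._known_ids


class OrderService:
    def __init__(self):
        self.orders = {}
        self.events = []  # stands in for a message broker topic
        self._next_id = 1

    def place_order(self, customer_id):
        # No validation here: the order is accepted immediately as PENDING.
        order_id = self._next_id
        self._next_id += 1
        self.orders[order_id] = {"customer_id": customer_id, "status": "PENDING"}
        self.events.append(
            {"type": "OrderPlaced", "order_id": order_id, "customer_id": customer_id}
        )
        return order_id

    def on_order_placed(self, event, customer_service):
        # Asynchronous validation: update the order state once the check completes.
        valid = customer_service.exists(event["customer_id"])
        self.orders[event["order_id"]]["status"] = "CONFIRMED" if valid else "REJECTED"


customers = CustomerService(known_ids=[42])
orders = OrderService()

good_order = orders.place_order(customer_id=42)
bad_order = orders.place_order(customer_id=99)
status_before = orders.orders[good_order]["status"]

# Later, a consumer drains the events and validates each order.
for event in orders.events:
    orders.on_order_placed(event, customers)

status_good = orders.orders[good_order]["status"]
status_bad = orders.orders[bad_order]["status"]
```

Placing an order never blocks on the Customer Service, at the price of orders that are visible for a while in a `PENDING` state and may later be rejected.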
In some use cases you want a single transaction spanning data from multiple services.
E.g.: deleting a customer. You might want all of the customer's orders to be deleted as well.
In this case you need to deal with eventual consistency: service one will raise an event and service two will react accordingly.
If this answers your question, then OK; otherwise, specify in what kind of scenario multiple services need to call another service.
If still not solved, you could email me on puneetjindal.11#gmail.com, will answer you
Currently I am decided to continue with the last mentioned approach.
If you want horizontal scalability (scaling for an increasingly large number of client connections) for your database, you may be better off with a technology that was designed to work as a scalable, distributed system, something like CockroachDB or a NoSQL database. CockroachDB, for example, has built-in data sharding and replication and allows you to grow by adding server nodes as required.
when I am designing my databases as distributed, according to functionalities it may contain 5 databases
This sounds like you had the right general idea - split by domain functionality. Here's a link to a previous answer regarding general DB design with micro services.
In the Microservices world, each Microservice owns a set of functionalities and the data manipulated by these functionalities. If a microservice needs data owned by another microservice, it cannot directly go to the database maintained/owned by the other microservice rather it would call an API exposed by the other microservice.
Now, regarding the placement of data, there are various options - you can store data owned by a microservice in a NoSQL database like MongoDB, DynamoDB, Cassandra (it really depends on the microservice's use-case) OR you can have a different table for each micro-service in a single instance of a SQL database. BUT remember, if you choose a single instance of a SQL Database with multiple tables, then there would be no joins (basically no interaction) between tables owned by different microservices.
I would suggest you start small and then think about database scaling issues when the usage of the system grows.
