Should database primary keys be used to identify entities across microservices? - microservices

Given I have two microservices: Service A and Service B.
Service A owns the full customer data and Service B requires a small subset of this data (which it gets from Service A through some bulk load, say).
Both services store customers in their own database.
If Service B then needs to interact with Service A, say to get additional data (e.g. GET /customers/{id}), it clearly needs a unique identifier that is shared between the two services.
Because the IDs are GUIDs, I could simply use the PK from Service A when creating the record in Service B, so both PKs match.
However, this sounds extremely fragile. One option is to store the 'external id' (or 'source id') as a separate field in Service B and use that to interact with Service A. This would probably be a string, as one day it may not be a GUID.
Is there a 'best practice' around this?
Update
So I've done some more research and found a few related discussions:
Should you expose a primary key in REST API URLs?
Is it a bad practice to expose the database ID to the client in your REST API?
Slugs as Primary Keys
Conclusion
I think my idea of trying to keep both Primary Keys for Customer the same across Service A and B was just wrong. This is because:
Clearly PKs are service-implementation specific, so they may be totally incompatible, e.g. UUID vs auto-incremented INT.
Even if you can guarantee compatibility, although the two entities happen to be called 'Customer', they are effectively two (potentially very different) concepts, and Service A and Service B each 'own' their own 'Customer' record. What you may want to do, though, is synchronise some Customer data across those services.
So I now think that each service can expose customer data via its own unique id (in my case the PK GUID), and if one service needs to obtain additional customer data from another service, it must store the other service's identifier/key and use that. So essentially this is back to my 'external id' or 'source id' idea, but perhaps named more specifically, as 'Service B id'.
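To make that concrete, here is a minimal sketch of what Service B's Customer might look like, assuming JPA and GUID keys; the field name 'serviceACustomerId' is hypothetical:

```java
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import java.util.UUID;

// Service B's own notion of a Customer: its PK is private to Service B,
// while the identifier issued by Service A is stored as ordinary data.
@Entity
public class Customer {

    @Id
    private UUID id = UUID.randomUUID();     // Service B's own primary key

    // Identifier under which Service A knows this customer.
    // Kept as a string so Service B does not depend on Service A's key format.
    @Column(name = "service_a_customer_id", nullable = false, unique = true)
    private String serviceACustomerId;

    private String name;                      // subset of data synchronised from Service A

    protected Customer() { }                  // required by JPA

    public Customer(String serviceACustomerId, String name) {
        this.serviceACustomerId = serviceACustomerId;
        this.name = name;
    }

    public String serviceACustomerId() {
        return serviceACustomerId;            // used when calling GET /customers/{id} on Service A
    }
}
```

Calls to Service A (e.g. GET /customers/{id}) would then always use serviceACustomerId, never Service B's own primary key.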

I think it depends a bit on the data source and your design. But one thing I would avoid is sharing a primary key, whether a GUID or an auto-increment integer, with an external service. Those are internal details of your service and not something other services should take a dependency on.
I would rather have an external id which is better understood by other services and perhaps the business as a whole. It could be a unique customer number, order number or policy number, as opposed to an id. You can also think of it as a "business id". One thing to keep in mind is that an external id can also be exposed to an end user. Hence, it is a ubiquitous way of identifying that "entity" across the entire organization and its services, irrespective of whether you have an event-driven design or your services talk through APIs. I would only expose the DB ids to the infrastructure or repository layer. Beyond that, it is only the business/external id.
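As a rough illustration of that separation (the 'customerNumber' field and the DTO names are invented for this sketch), the surrogate key stays behind the repository and only the business id ever crosses the service boundary:

```java
import java.util.UUID;

// Internal persistence record: the surrogate key never leaves the service.
record CustomerRecord(UUID dbId, String customerNumber, String name) { }

// What the API / other services see: identified purely by the business id.
record CustomerDto(String customerNumber, String name) { }

class CustomerMapper {
    // The mapping deliberately drops the DB id so no other service can depend on it.
    static CustomerDto toDto(CustomerRecord record) {
        return new CustomerDto(record.customerNumber(), record.name());
    }
}
```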

Well, if you are familiar with the Value Object concept, a business ID is much better for the design.
DDD focuses on the business, and a pure UUID/auto-increment ID can't express that.
Use an ID with business meaning (e.g. a ULID), like a Customer ID, instead of a plain surrogate ID.
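For example, a minimal sketch of such a business id modelled as a Value Object; the format check is a made-up example:

```java
// A Value Object for the business id: equality is by value and the format is validated on creation.
record CustomerId(String value) {
    CustomerId {
        if (value == null || !value.matches("CUST-\\d{8}")) {   // hypothetical format
            throw new IllegalArgumentException("Not a valid customer id: " + value);
        }
    }
}
```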

Related

How to handle data migrations in distributed microservice databases

So I'm learning about microservices and common patterns, and I can't seem to find how to address this one issue.
Let's say that my customer needs a module managing customers and a module managing purchase orders.
I believe that when dealing with microservices it's pretty natural to split these two functionalities into separate services, each having its own data:
CustomerService
PurchaseOrderService
Also, he wants to have a table of purchase orders displaying data from both customers and purchase orders, i.e.: Customer name, Order number.
Now, I don't want to use the API Composition pattern, because the user must be able to sort by any column he wants, which (AFAIK) is impossible to do with that pattern without slaughtering performance.
Instead, I chose the CQRS pattern:
after every purchase order / customer update, a message is sent to the message broker
the message broker notifies the third service of that message
the third service updates its projection in its own database
So, our third service:
PurchaseOrderTableService
It stores all the required data in a single database; now we can query it and sort by any column we like while still maintaining good performance.
And now, the tricky part:
In the future, the client can change his mind and say, "Hey, I need the purchase orders table to display an additional column: 'Customer country'."
How does one handle that data migration? So far, the PurchaseOrderTableService knows only about two columns: 'Customer name' and 'Order number'.
I imagine this is probably a pretty common problem, so what can I do to avoid reinventing the wheel?
I can of course make CustomerService generate a 'CustomerUpdatedMessage' for every existing customer, which would force PurchaseOrderTableService to update all its projections, but that seems like a workaround.
If that matters, the stack I had in mind is Java, Spring, Kafka, and PostgreSQL.
Divide the problem in two:
Keeping live data in sync: your projection service from now on also needs to persist Customer Country, so all new orders will have the country as expected.
Backfill the older orders: this is a one-off operation, so how you implement it really depends on your organization, technologies, etc. For example, you or a DBA can use whatever database tools you have to extract the data from the source database and do a bulk update to the target database. In other cases, you might have to solve it programmatically, for example creating a process in the projection microservice that queries the Customer microservice's API to get the data and update the local copy.
Also note that in most cases you will already have a process to backfill data, because the need for the projection microservice might arrive months or years after the orders and customers services were created. Other times, the search service is a third-party search engine, like Elasticsearch, instead of a database. In those cases, I would always keep on hand a process to fully reindex the data.
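A minimal sketch of the 'keep live data in sync' half, assuming Spring Kafka with JSON payload deserialization already configured; the 'CustomerUpdated' message shape, topic, and table names are invented for illustration. The backfill half would be a one-off job along the same lines, reading customers from the CustomerService API and issuing the same UPDATE:

```java
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

// Hypothetical message shape published by CustomerService, now carrying the country.
record CustomerUpdated(String customerId, String name, String country) { }

@Component
class PurchaseOrderProjectionUpdater {

    private final JdbcTemplate jdbc;

    PurchaseOrderProjectionUpdater(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Every customer change updates the denormalised rows,
    // including the newly added 'customer_country' column.
    @KafkaListener(topics = "customer-events", groupId = "purchase-order-table")
    void onCustomerUpdated(CustomerUpdated event) {
        jdbc.update(
            "UPDATE purchase_order_view SET customer_name = ?, customer_country = ? WHERE customer_id = ?",
            event.name(), event.country(), event.customerId());
    }
}
```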

Can I keep a copy of a table of one database in another database in a microservice architecture?

I am new to microservice architecture, so thanks in advance.
I have two different services, a User service and a Footballer service, each with its own database (User database and Footballer database).
The Footballer service has a database with a single table storing footballer information.
The User service has a database which stores user details along with other user-related data.
Now a user can add footballers to their team by querying the Footballer service, and I need to store them somewhere so they can be displayed later.
Currently I'm storing the footballers for each user in a table in the User database: I make a call to the Footballer service to get the details of a specific footballer by ID and save them in the User database, mapped against the user ID.
So is this a good idea, and does it mean I'm replicating data between the two services? And if it does, what other ways can I achieve the same functionality?
Currently I'm storing the footballers for each user in a table in the User database: I make a call to the Footballer service to get the details of a specific footballer by ID and save them in the User database, mapped against the user ID.
"Caching" is a fairly common pattern. From the perspective of the User microservice, the data from Footballer is just another input which you might save or not. If you are caching, you'll usually want to have some sort of timestamp/version on the cached data.
Caching identifiers is pretty normal - we often need some kind of correlation identifier to connect data in two different places.
If you find yourself using Footballer data in your User domain logic (that is to say, the way that User changes depends on the Footballer data available)... that's more suspicious, and may indicate that your boundaries are incorrectly drawn / some of your capabilities are in the wrong place.
If you are expecting the User Service to be autonomous - that is to say, to be able to continue serving its purpose even when Footballer is out of service, then your code needs to be able to work from cached copies of the data from Footballer and/or be able to suspend some parts of its work until fresh copies of that data are available.
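As a rough sketch of that caching idea (the entity and field names are made up), the User service keeps the footballer id as the correlation identifier plus a copy of the fields it displays and a timestamp for judging staleness:

```java
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.Id;
import java.time.Duration;
import java.time.Instant;

// Cached copy, in the User database, of a footballer a user has added to their team.
// 'footballerId' is the correlation identifier owned by the Footballer service;
// the name is just a display copy and is never treated as authoritative.
@Entity
public class TeamEntry {

    @Id @GeneratedValue
    private Long id;                 // the User service's own key

    private Long userId;
    private Long footballerId;       // id issued by the Footballer service
    private String footballerName;   // cached for display
    private Instant cachedAt;        // used to decide when to refresh

    protected TeamEntry() { }        // required by JPA

    public TeamEntry(Long userId, Long footballerId, String footballerName) {
        this.userId = userId;
        this.footballerId = footballerId;
        this.footballerName = footballerName;
        this.cachedAt = Instant.now();
    }

    public boolean isStale(Duration maxAge) {
        return cachedAt.plus(maxAge).isBefore(Instant.now());
    }
}
```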
People usually follow DDD (Domain-Driven Design) in the case of microservices.
So here, in your case, there are two domains, i.e. two services:
Users
Footballers
So the User service should only do user-specific tasks; it should not be concerned with footballer data.
Hence, according to DDD, the footballers that are linked to the user should be stored in the Footballer service.
Replicating the ID wouldn't be considered replication in a microservices architecture.

How can I divide one database to multi databases?

I want to decompose my application to adopt a microservices architecture, and I will need to come up with a solid strategy to split my database (MySQL) into multiple small databases (MySQL) aligned with my applications.
TL;DR: It depends on the scenario and on what each service will do.
Although there is no clear answer to this, since it really depends on your needs and on what each service should do, you can come up with a general starting point (assuming you don't need to keep the existing database type).
Let's assume you have a monolithic e-commerce application, and you want to split it into smaller services, each with its own database.
The approach you could use is to create services that handle different parts of the website: for example, you could have one service that handles user authentication, one for orders, one for products, one for invoices, and so on...
Now, each service will have its own database, and here comes another question: which database should a specific service have? One of the advantages of this kind of architecture is that each service can have its own kind of database, so for example the products service can have a non-relational database, such as MongoDB, since all it does is get details about products, so you don't have to manage any relations.
The orders service, on the other hand, could have a relational database, since you want to keep a relation between the order and the invoice for that order. But wait, invoices are handled by the invoice service, so how can you keep the relation between these two without sharing the database? Well, that's one of the "issues" of this approach: you have to keep services independent while also letting them communicate with each other. How can we do this? There is no clear answer here either... One approach could be to just pass all invoice details to the orders service as well, or you can just pass the invoice ID when saving the order and later retrieve the invoice via an API call to the invoice service, or you can pass all the relevant details you need for the invoice to an API endpoint in the order service that stores this data in a specific table in the database (since most of the time you don't need the entire actual object), etc... The possibilities are endless...
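As a small sketch of the 'store only the invoice ID and fetch the rest on demand' option (class names, the base URL, and the endpoint are invented, assuming Spring's RestTemplate), the order service keeps just the reference and asks the invoice service only when it actually needs the details:

```java
import org.springframework.web.client.RestTemplate;

// What the order service persists: only a reference to the invoice, not a copy of it.
record Order(String orderId, String invoiceId, double total) { }

// What the invoice service returns from its API.
record InvoiceDto(String invoiceId, String pdfUrl, String status) { }

class InvoiceClient {

    private final RestTemplate http = new RestTemplate();
    private final String invoiceServiceUrl = "http://invoice-service";  // hypothetical base URL

    // Called only when a screen or process really needs the invoice details.
    InvoiceDto fetchInvoice(String invoiceId) {
        return http.getForObject(invoiceServiceUrl + "/invoices/" + invoiceId, InvoiceDto.class);
    }
}
```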

microservice messaging db-assigned identifiers

The company I work for is investigating moving from our current monolithic API to microservices. Our current API is heavily dependent on Spring, and we use SQL Server for most persistence. Our microservice investigation is leaning toward spring-cloud, spring-cloud-stream, Kafka, and polyglot persistence (isolated database per microservice).
I have a question about how messaging via kafka is typically done in a microservice architecture. We're planning to have a coordination layer between the set of microservices and our client applications, which will coordinate activities across different microservices and isolate clients from changes to microservice APIs. Most of the stuff we've read about using spring-cloud-stream and kafka indicate that we should use streams at the coordination layer (source) for resource change operations (inserts, updates, deletes), with the microservice being one consumer of the messages.
Where I've been having trouble with this is inserts. We make heavy use of database-assigned identifiers (identity columns/auto-increment columns/sequences/surrogate keys), and they're usually assigned as part of a post request and returned to the caller. The coordination layer may be saving multiple things using different microservices and often needs the assigned identifier from one insert before it can move on to the next operation. Using messaging between the coordination layer and microservices for inserts makes it so the coordination layer can't get a response from the insert operation, so it can't get the assigned identifier that it needs. Additionally, other consumers on the stream (i.e. consumers that publish the data to a data warehouse) really need the message to contain the assigned identifier.
How are people dealing with this problem? Are database-assigned identifiers an anti-pattern in microservices? Should we expose separate microservice endpoints that return database-assigned identifiers so that the coordination layer can make a synchronous call to get an identifier before calling the asynchronous insert? We could use UUIDs but our DBAs hate those as primary keys, and they couldn't be used as an order number or other user-facing generated ids.
If you can programmatically create the identifier earlier, while receiving from the message source, you can embed the identifier in the message header and subsequently use that header information during database inserts and in any other consumers.
But this approach requires a separate verification by the other consumers against the database to process only the committed transactions (if you are concerned about processing only the inserts).
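A minimal sketch of that idea using Spring Kafka's KafkaTemplate (the header name and topic are made up): the caller generates the identifier itself, sends it as a message header, and can use it immediately without waiting for the insert to complete:

```java
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.KafkaHeaders;
import org.springframework.messaging.Message;
import org.springframework.messaging.support.MessageBuilder;
import java.util.UUID;

class CustomerCommands {

    private final KafkaTemplate<String, String> kafka;

    CustomerCommands(KafkaTemplate<String, String> kafka) {
        this.kafka = kafka;
    }

    // The coordination layer assigns the id up front, so it (and every other
    // consumer of the stream) knows it before the database insert happens.
    String createCustomer(String customerJson) {
        String customerId = UUID.randomUUID().toString();
        Message<String> message = MessageBuilder.withPayload(customerJson)
                .setHeader("customer-id", customerId)            // hypothetical header name
                .setHeader(KafkaHeaders.TOPIC, "customer-commands")
                .build();
        kafka.send(message);
        return customerId;                                       // available immediately to the caller
    }
}
```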
At our company, we built a dedicated service responsible for unique ID generation, and every other service grabs the IDs it needs from there.
These generated IDs couldn't be used as an order number, but I don't think they should be used for that job anyway. If you need to sort by creation date, it's better to have a created_date field.
One more thing that used to bug me about this approach is that the primary resource might be persisted after another resource that references it by its ID. For example, an insert-user payload and an insert-user-address payload are sent asynchronously. The insert-user payload contains a generated unique ID, and the user-address payload contains that ID as a foreign reference back to the user. The insert-user-address request might be processed before the insert-user request, but that's totally fine. I think it's called eventual consistency.

optimization of db queries when implementing bounded contexts

In our project we're trying to apply the Bounded Context approach, and we've run into a fairly obvious performance problem. E.g., we have different classes (in different contexts) representing a user in the system: Person in our core domain's context and User in the security context. So we have two different repositories, one for each aggregate, but they use the same table in the DB and sometimes access the same data.
Is there a common solution to minimize DB round trips in this case? Are there ORMs which deal with it, or should we code some caching system ourselves?
Update: the DB is from a legacy app, and we'll have to use it "as is".
So we have two different repositories, one for each aggregate, but they use the same table in the DB and sometimes access the same data.
The fact that you have two aggregates stored in the same table is an indication of a problem with the design. In this case, it seems you have two bounded contexts - a BC for the core domain (Person is here) and an identity/access BC (User is here). The BCs are related and the latter can be seen as upstream from the former. A Person in the core domain has a corresponding User in the identity BC, but they are not exactly the same thing.
Beyond this relationship between the BCs there are questions regarding ownership of behavior. For example, both a Person and a User may have a name, and what is to be determined is who owns the behavior of changing a name. This can be implemented in several ways. Person may own its name, and changes should be propagated to the identity BC. Similarly, User may own changes to the name, in which case they must be propagated to Person via a synchronization mechanism.
Overall, your problem could be addressed in two ways. First, you can store the Person and User aggregates in different tables. Any given query should only use one of these tables, and they can be synchronized in an eventually consistent manner. Another approach is to decouple the behavioral domain model from a model designed for queries (a read-model). This way, you can create a read-model designed to serve a specific screen (or screens) with a customized query, perhaps even outside of an ORM.
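For the second approach, here is a minimal sketch (the screen-specific DTO, table, and columns are invented for illustration): the read-model bypasses both aggregates and hits the shared legacy table once, returning exactly what one screen needs:

```java
import org.springframework.jdbc.core.JdbcTemplate;

// Read-model tailored to one screen; it is not an aggregate and carries no behavior.
record PersonAccountView(String personId, String fullName, String username, boolean locked) { }

class PersonAccountQuery {

    private final JdbcTemplate jdbc;

    PersonAccountQuery(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // One round trip against the legacy table, instead of loading Person and User separately.
    PersonAccountView byId(String personId) {
        return jdbc.queryForObject(
            "SELECT id, full_name, username, locked FROM legacy_users WHERE id = ?",
            (rs, rowNum) -> new PersonAccountView(
                rs.getString("id"), rs.getString("full_name"),
                rs.getString("username"), rs.getBoolean("locked")),
            personId);
    }
}
```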
If all Users are Persons too (sometimes external services are modeled as special users as well), the only data that User and Person should share in the database is their identifier.
Indeed, each entity in a domain model should hold references only to the data it needs to enforce its invariants.
Moreover, I guess that Users are identified by username and Persons are identified by something else (a VAT code or similar).
Thus, the simplest optimization technique is to avoid encapsulating in an entity information that is not required to enforce its invariants.
Furthermore, you simply need an effective context-mapping technique to easily pass from User to Person when needed. I use shared identifiers for this.
As an example, you can expose the Person's identifier in the User class, so that a simple query to the Person repository can provide the data you need.
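A minimal sketch of that shared-identifier mapping (all names here are hypothetical): User holds only the Person's identifier, and crossing contexts is a repository lookup rather than shared state:

```java
import java.util.Optional;

// Identity/access context: User knows only the identifier of the Person it belongs to.
record PersonId(String value) { }
record User(String username, String passwordHash, PersonId personId) { }

// Core domain context.
record Person(PersonId id, String fullName, String vatCode) { }

interface PersonRepository {
    Optional<Person> findById(PersonId id);
}

class ContextMapping {
    // Crossing from the security context to the core domain is just a lookup by the shared identifier.
    static Optional<Person> personFor(User user, PersonRepository persons) {
        return persons.findById(user.personId());
    }
}
```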
Finally, I suggest the Vaughn Vernon series on Aggregate Root Design.
