Seeding microservices databases - microservices

Given service A (CMS) that controls a model (Product, let's assume the only fields that it has are id, title, price) and services B (Shipping) and C (Emails) that have to display given model what should the approach be to synchronize given model information across those services in event sourcing approach? Let's assume that product catalog rarely changes (but does change) and that there are admins that can access data of shipments and emails very often (example functionalities are: B:display titles of products the order contained and C:display content of email about shipping that is going to be sent). Each of the services has their own DB.
Solution 1
Send all required information about Product within event - this means following structure for order_placed:
{
  order_id: [guid],
  product: {
    id: [guid],
    title: 'Foo',
    price: 1000
  }
}
On service B and C product information is stored in product JSON attribute on orders table
As such, to display necessary information only data retrieved from the event is used
Problems: depending on what other information needs to be presented in B and C, the amount of data in the event can grow. B and C might not require the same information about Product, but the event will have to contain both sets of fields (unless we split it into two events). If a piece of data is not present in an event, code cannot use it: if we add a color option to Product, then for existing orders in B and C the product will be colorless unless we update the events and rerun them.
Solution 2
Send only guid of product within event - this means following structure for order_placed:
{
  order_id: [guid],
  product_id: [guid]
}
On services B and C product information is stored in product_id attribute on orders table
Product information is retrieved by services B and C when required by performing an API call to A/product/[guid] endpoint
Problems: this makes B and C dependent on A (at all times). If the schema of Product changes on A, changes have to be made on all services that depend on it (suddenly).
Solution 3
Send only guid of product within event - this means following structure for order_placed:
{
  order_id: [guid],
  product_id: [guid]
}
On services B and C product information is stored in products table; there's still product_id on orders table, but there's replication of products data between A, B and C; B and C might contain different information about Product than A
Product information is seeded when services B and C are created and is updated whenever information about Products changes, either by calling the A/product endpoint (which returns the required information for all products) or by accessing A's DB directly and copying the product information required by the given service.
Problems: this makes B and C dependent on A (when seeding). If the schema of Product changes on A, changes have to be made on all services that depend on it (when seeding).
From my understanding, the correct approach would be to go with solution 1, and either update events history per certain logic (if Product catalog hasn't changed and we want to add color to be displayed, we can safely update history to get current state of Products and fill missing data within the events) or cater for nonexistence of given data (if Product catalog has changed and we want to add color to be displayed, we can't be sure if at that point in time in the past given Product had a color or not - we can assume that all Products in previous catalog were black and cater for by updating events or code)
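The "cater for nonexistence of given data" branch can be sketched roughly as follows. This is only an illustration, assuming events are plain dictionaries shaped like the order_placed example in Solution 1; the names ProductSnapshot and product_from_event are made up for this sketch. A missing color is kept as "unknown" rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProductSnapshot:
    """Product data as it was copied into an order_placed event (hypothetical type)."""
    id: str
    title: str
    price: int
    color: Optional[str]  # None means "not recorded when the order was placed"


def product_from_event(event: dict) -> ProductSnapshot:
    """Build the snapshot stored with an order from an order_placed event."""
    product = event["product"]
    return ProductSnapshot(
        id=product["id"],
        title=product["title"],
        price=product["price"],
        # Events written before the color field existed simply lack the key.
        color=product.get("color"),
    )


old_event = {"order_id": "o-1", "product": {"id": "p-1", "title": "Foo", "price": 1000}}
new_event = {
    "order_id": "o-2",
    "product": {"id": "p-1", "title": "Foo", "price": 1000, "color": "red"},
}
```

Whether None should later be displayed as "black", left blank, or back-filled by rewriting history is the business decision described above; the code only keeps the distinction visible.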

Solution #3 is really close to the right idea.
A way to think about this: B and C are each caching "local" copies of the data that they need. Messages processed at B (and likewise at C) use the locally cached information. Likewise, reports are produced using the locally cached information.
The data is replicated from the source to the caches via a stable API. B and C don't even need to be using the same API - they use whatever fetch protocol is appropriate for their needs. In effect, we define a contract -- protocol and message schema -- which constrain the provider and the consumer. Then any consumer for that contract can be connected to any supplier. Backward incompatible changes require a new contract.
Services choose the appropriate cache invalidation strategy for their needs. This might mean pulling changes from the source on a regular schedule, or in response to a notification that things may have changed, or even "on demand" -- acting as a read through cache, falling back to the stored copy of the data when the source is not available.
This gives you "autonomy", in the sense that B and C can continue to deliver business value when A is temporarily unavailable.
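A minimal sketch of the "on demand" variant (a read-through cache that falls back to the stored copy) might look like this. Everything here is hypothetical: ProductCache, SourceUnavailable and FakeServiceA are illustrative names, and the in-memory dictionary stands in for B's own products table.

```python
from typing import Callable, Dict

Product = Dict[str, object]


class SourceUnavailable(Exception):
    """Raised by the fetch function when service A cannot be reached."""


class ProductCache:
    def __init__(self, fetch: Callable[[str], Product]) -> None:
        self._fetch = fetch                     # hypothetical call to A's API
        self._stored: Dict[str, Product] = {}   # stands in for B's products table

    def get(self, product_id: str) -> Product:
        try:
            product = self._fetch(product_id)
        except SourceUnavailable:
            # A is down: serve the last known copy, which may be stale.
            if product_id in self._stored:
                return self._stored[product_id]
            raise
        self._stored[product_id] = product
        return product


class FakeServiceA:
    """In-memory stand-in for service A, used only to exercise the cache."""

    def __init__(self) -> None:
        self.up = True
        self.products = {"p-1": {"id": "p-1", "title": "Foo", "price": 1000}}

    def get_product(self, product_id: str) -> Product:
        if not self.up:
            raise SourceUnavailable(product_id)
        return dict(self.products[product_id])


service_a = FakeServiceA()
cache = ProductCache(service_a.get_product)
first = cache.get("p-1")    # fetched from A and stored locally
service_a.up = False
second = cache.get("p-1")   # A is down, so the local copy is served
```

A product that was never fetched still fails while A is down, which is why seeding or scheduled pulls are usually combined with this approach.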
Recommended reading: Data on the Outside, Data on the Inside, Pat Helland 2005.

Generally speaking, I'd strongly recommend against option 2 because of the temporal coupling between those services (unless communication between these services is super stable, and not very frequent). Temporal coupling is what you describe as B and C depending on A "at all times", and it means that if A is down or unreachable from B or C, B and C cannot fulfill their function.
I personally believe that both options 1 and 3 have situations where they are valid options.
If the volume of communication between A and B & C is high, or the amount of data that needs to go into the event is large enough to be a concern, then option 3 is the best option, because the burden on the network is much lower, and the latency of operations decreases as the message size decreases. Other concerns to consider here are:
Stability of contract: if the contract of message leaving A changed often, then putting a lot of properties in the message would result in lots of changes in consumers. However, in this case I believe this to not be a big problem because:
You mentioned that system A is a CMS. This means that you're working on a stable domain and as such I don't believe you'll be seeing frequent changes
Since the B and C are shipping and email, and you're receiving data from A, I believe you'll be experiencing additive changes instead of breaking ones, which are safe to add whenever you discover them with no rework.
Coupling: There is very little to no coupling here. First since the communication is via messages, there is no coupling between the services other than a short temporal one during seeding of the data, and the contract of that operation (which is not a coupling you can or should try to avoid)
Option 1 is not something I'd dismiss though. There is the same amount of coupling, but development-wise it should be easy to do (no need for special actions), and stability of the domain should mean that these won't change often (as I mentioned already).
Another option I'd suggest is a slight variation of 3, which is not to run the process during start-up, but instead to observe "ProductAdded" and "ProductDetailsChanged" events on B and C whenever there is a change in the product catalogue in A. This would make your deployments faster (and so make it easier to fix a problem/bug if you find any).
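As a rough sketch of that variation, a consumer on B could apply the two events to its local products table like this. The event shape (type, product_id, title) and the function names are assumptions made for illustration, and an in-memory SQLite database stands in for B's own DB.

```python
import sqlite3


def create_schema(db: sqlite3.Connection) -> None:
    # B (Shipping) only needs the title, so only the title is stored.
    db.execute("CREATE TABLE products (id TEXT PRIMARY KEY, title TEXT NOT NULL)")


def apply_event(db: sqlite3.Connection, event: dict) -> None:
    """Apply one product event published by A to the local products table."""
    if event["type"] in ("ProductAdded", "ProductDetailsChanged"):
        # INSERT OR REPLACE keeps the handler idempotent when an event is redelivered.
        db.execute(
            "INSERT OR REPLACE INTO products (id, title) VALUES (?, ?)",
            (event["product_id"], event["title"]),
        )
    # Any other event type is ignored by this service.


db = sqlite3.connect(":memory:")
create_schema(db)
apply_event(db, {"type": "ProductAdded", "product_id": "p-1", "title": "Foo"})
apply_event(db, {"type": "ProductDetailsChanged", "product_id": "p-1", "title": "Foo v2"})
title = db.execute("SELECT title FROM products WHERE id = ?", ("p-1",)).fetchone()[0]
```

This sketch does not deal with events arriving out of order; a version or timestamp on the event would be needed for that.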
Edit 2020-03-03
I have a specific order of priorities when determining the integration approach:
What is the cost of consistency? Can we accept some milliseconds of inconsistency between things changed in A and them being reflected in B & C?
Do you need point-in-time queries (also called temporal queries)?
Is there any source of truth for the data? A service which owns them and is considered upstream?
If there is an owner / single source of truth is that stable? Or do we expect to see frequent breaking changes?
If the cost of inconsistency is high (basically, the product data in A needs to be consistent as soon as possible with the product data cached in B and C), then you cannot avoid accepting unavailability and making a synchronous request (like a web/REST request) from B & C to A to fetch the data. Be aware! This still does not mean transactionally consistent; it just minimizes the window for inconsistency. If you absolutely, positively have to be immediately consistent, you need to rethink your service boundaries. However, I very strongly believe this should not be a problem. From experience, it's actually extremely rare that a company can't accept some seconds of inconsistency, so you shouldn't even need to make synchronous requests.
If you do need point-in-time queries (which I didn't notice in your question and hence didn't include above, maybe wrongly), the cost of maintaining this on downstream services is so high (you'd need to duplicate internal event projection logic in all downstream services) that it makes the decision clear: you should leave ownership to A, and query A ad hoc over a web request (or similar), and A should use event sourcing to retrieve all the events known at the time, project them to the state, and return it. I guess this may be option 2 (if I understood correctly?), but the costs are such that temporal coupling is better than the maintenance cost of duplicated events and projection logic.
If you don't need a point in time, and there isn't a clear, single owner of the data (which in my initial answer I did assume this based on your question), then a very reasonable pattern would be to hold representations of the product in each service separately. When you update the data for products, you update A, B and C in parallel by making parallel web requests to each one, or you have a command API which send multiple commands to each of A, B and C. B & C use their local version of the data to do their job, which may or may not be stale. This isn't any of the options above (although it could be made to be close to option 3), as data in A, B and C may differ, and the "whole" of the product may be a composition of all three data sources.
Knowing whether the source of truth has a stable contract is useful because it tells you whether you can use the domain/internal events (or the events you store in A when using event sourcing as a storage pattern) for integration between A and services B and C. If the contract is stable you can integrate through the domain events. However, you then have an additional concern in the case where changes are frequent, or where the message contract is large enough to make transport a concern.
If you have a clear owner, with a contract that is expected to be stable, the best option would be option 1: an order would contain all the necessary information, and B and C would do their job using the data in the event.
If the contract is liable to change or break often, following your option 3, that is, falling back to web requests to fetch product data, is actually a better option, since it's much easier to maintain multiple versions. So B would make a request to v3 of product.

There are two hard things in Computer Science, and one of them is cache invalidation.
Solution 2 is absolutely my default position, and you should generally only consider implementing caching if you run into one of the following scenarios:
The API call to Service A is causing performance problems.
The cost of Service A being down and being unable to retrieve the data is significant to the business.
Performance problems are really the main driver. There are many ways of solving #2 that don't involve caching, like ensuring Service A is highly available.
Caching adds significant complexity to a system, and can create edge cases that are hard to reason about, and bugs that are very hard to replicate. You also have to mitigate the risk of providing stale data when newer data exists, which can be much worse from a business perspective than (for example) displaying a message that "Service A is down--please try again later."
From this excellent article by Udi Dahan:
These dependencies creep up on you slowly, tying your shoelaces
together, gradually slowing down the pace of development, undermining
the stability of your codebase where changes to one part of the system
break other parts. It’s a slow death by a thousand cuts, and as a
result nobody is exactly sure what big decision we made that caused
everything to go so bad.
Also, if you need point-in-time querying of product data, this should be handled in the way the data is stored in the Product database (e.g. start/end dates) and should be clearly exposed in the API (the effective date needs to be an input to the API call that queries the data).
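To make the start/end date idea concrete, here is a small sketch of a lookup that takes the effective date as an input. The ProductVersion type, the valid_from/valid_to field names and the sample history are assumptions for illustration, not something taken from the question.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ProductVersion:
    product_id: str
    title: str
    price: int
    valid_from: date           # inclusive
    valid_to: Optional[date]   # exclusive; None means "still current"


def product_as_of(
    versions: List[ProductVersion], product_id: str, effective: date
) -> Optional[ProductVersion]:
    """Return the version of the product that was in effect on the given date."""
    for version in versions:
        if version.product_id != product_id:
            continue
        started = version.valid_from <= effective
        not_ended = version.valid_to is None or effective < version.valid_to
        if started and not_ended:
            return version
    return None


history = [
    ProductVersion("p-1", "Foo", 1000, date(2020, 1, 1), date(2020, 6, 1)),
    ProductVersion("p-1", "Foo", 1200, date(2020, 6, 1), None),
]
```

Using a half-open interval (inclusive start, exclusive end) means exactly one version matches on the day a price changes.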

It is very hard to simply say one solution is better than the other. Choosing one among Solution #2 and #3 depends on other factors (cache duration, consistency tolerance, ...)
My 2 cents:
Cache invalidation might be hard, but the problem statement mentions that the product catalog changes rarely. This fact makes product data a good candidate for caching.
Solution #1 (NOK)
Data is duplicated across multiple systems
Solution #2 (OK)
Offers strong consistency
Works only when product service is highly available and offers good performance
If email service prepares a summary (with lot of products), then the overall response time could be longer
Solution #3 (Complex but preferred)
Prefer API approach instead of direct DB access to retrieve product information
Resilient - consuming services are not impacted when product service is down
Consuming applications (shipping and email services) retrieve product details immediately after an event is published. The possibility of product service going down within these few milliseconds is very remote.

Related

How to deal with "Foreign Key" in microservice architecture?

Quick question on Foreign key in Microservices. I already tried looking for answer. But, they did not give me the exact answer I was looking for.
Usecase : Every blog post will have many comments. Traditional monolith will have comments table with foreign key to blog post. However in microservice, we will have two services.
Service 1 : Post Microservice with these table fields (PostID, Name, Content)
Service 2 : Comments Microservice with these table fields (CommentID, PostID, Comment)
The question is, Do we need "PostID" in service 2 (Comments Microservice) ? I guess the answer is yes, as we need to know which comment belongs to which post. But then, it will create tight coupling? I mean if I delete service 1(Blog post service), it will impact service 2(Comments service) ?
I'm going to use another example I'm more familiar with to explain how I believe most people would do this.
Consider an Order Management System (OMS) and an Inventory Management System (IMS).
When a customer places an order in the company web site, we ask the OMS to create an order entry in the backend (e.g. via an HTTP endpoint).
The OMS system then broadcasts an event e.g. OrderPlaced containing all the details of the customer order. We may have a pub/sub (e.g. Redis), or a queue (e.g. RabbitMQ), or an event stream (e.g. Kafka) where we place the event (although this can be done in many other ways).
The thing is that we have one or more subscribers interested in this event. One of those could be the IMS, which has the responsibility of assigning the best inventory available every time an order is placed.
We can expect that the IMS will keep a copy of the relevant order information it received when it processed the OrderPlaced event such that it does not ask every little detail of the order to the OMS all the time. So, if the IMS needed a join with the order, instead of calling an endpoint in the Order API, it would probably just do a join with its local copy of the orders table.
Say now that our customer called to cancel her order. A customer service representative then cancelled it in the OMS Web User Interface. At that point an event OrderCanceled is broadcast. Guess who is listening for that event? Correct, the IMS receives notification and acts accordingly reversing the inventory assignation and probably even deleting the order record because it is no longer necessary on this domain.
So, as you can see, the best way to do this is by using events and making copies of the relevant details on the other domain.
Since events need time to get broadcast and processed by interested parties, we say that the order data in the IMS is eventually consistent.
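The IMS side of this flow could be sketched as below. It is only an illustration under assumed event shapes (type, order_id, items, shipping_address, billing_address); InventoryOrders is a hypothetical name, and a dictionary stands in for the IMS's local orders table.

```python
from typing import Dict, Optional


class InventoryOrders:
    """Local, eventually consistent copy of the order details the IMS needs."""

    def __init__(self) -> None:
        self._orders: Dict[str, dict] = {}

    def handle(self, event: dict) -> None:
        if event["type"] == "OrderPlaced":
            # Keep only the fields this context cares about.
            self._orders[event["order_id"]] = {
                "items": list(event["items"]),
                "shipping_address": event["shipping_address"],
            }
        elif event["type"] == "OrderCanceled":
            # pop with a default makes a duplicate cancellation a no-op.
            self._orders.pop(event["order_id"], None)

    def get(self, order_id: str) -> Optional[dict]:
        return self._orders.get(order_id)


ims = InventoryOrders()
ims.handle({
    "type": "OrderPlaced",
    "order_id": "o-1",
    "items": ["sku-1", "sku-2"],
    "shipping_address": "1 Main St",
    "billing_address": "2 Side St",
})
placed = ims.get("o-1")
ims.handle({"type": "OrderCanceled", "order_id": "o-1"})
after_cancel = ims.get("o-1")
```

Between the moment the OMS publishes an event and the moment handle runs, the two systems disagree; that gap is the eventual consistency described above.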
Followup Questions
Q: So, if I understood right in microservices we prefer to duplicate data and get better performance? That is the concept? I mean I know the concept is scaling and flexibility but when we must share data we will just duplicate it?
Not really. That's definitely not what I meant, although it may have sounded like that due to my poor choice of words in the original explanation. It appears to me that at the heart of your question lies a lack of sufficient understanding of the concept of a bounded context.
In my explanation I meant to indicate that the OMS has a domain concept known as the order, but so does the IMS. Therefore, they both have an entity within their domain that represents it. There is a good chance that the order entity in the OMS is much richer than the corresponding representation of the same concept in the IMS.
For example, if the system I was describing was not for retail, but for wholesale, then the same concept of a "sales order" in our system corresponds to the concept of a "purchase order" in that of our customers. So you see, the same data, mapped under a different name, simply because under a different bounded context the data may have a different perspective and meaning.
So, this is the realization that a given concept from our model may be represented in multiple bounded contexts, perhaps from a different perspective and names from our ubiquitous language.
Just to give another example, the OMS needs to know about the customer, but the representation of the idea of a customer in the OMS is probably different than the same representation of such a concept or entity in the CRM. In the OMS the customer's name, email, shipping and billing addresses are probably enough representation of this idea, but for the CRM the customer encompasses much more.
Another example: the IMS needs to know the shipping address of the customer to choose the best inventory (e.g. the one in a facility closest to its final destination), but probably does not care much about the billing address. On the other hand, the billing address is fundamental for the Payment Management System (PMS). So, both the IMS and PMS may have a concept of an "order"; it is just not exactly the same concept, nor does it have the same meaning or perspective, even if we store the same data.
One final example: the accounting system cares about the inventory for accounting purposes, to be able to tell how much we own, but perhaps accounting does not care about the specific location of the inventory within the warehouse, that's a detail only the IMS cares about.
In conclusion, I would not say this is about "copying data", this is about appropriately representing a fundamental concept within your bounded context and the realization that some concepts from the model may overlap between systems and have different representations, sometimes even under different names and levels of details. That's why I suggested that you investigate the idea of context mapping some more.
In other words, from my perspective, it would be a mistake to assume that the concept of an "order" only exists in the OMS. I could probably say that the OMS is the master of record of orders and that if something happens to an order we should let other interested systems know about those events since they care about some of that data because those other systems could have mapping concepts related to orders and when reacting to the changes in the master of record, they probably want to change their data as well.
From this point of view, copying some data is a side effect of having a proper design for the bounded context and not a goal in itself.
I hope that answers your question.

CQRS DDD: How to validate products existence before adding them to order?

CQRS states: command should not query read side.
Ok. Let's take following example:
The user needs to create orders with order lines, each order line contains product_id, price, quantity.
It sends requests to the server with order information and the list of order lines.
The server (command handler) should not trust the client and needs to validate if provided products (product_ids) exist (otherwise, there will be a lot of garbage).
Since command handler is not allowed to query read side, it should somehow validate this information on the write side.
What we have on the write side: Repositories. In terms of DDD, repositories operate only with Aggregate Roots, the repository can only GET BY ID, and SAVE.
In this case, the only option is to load all product aggregates, one by one (repository has only GET BY ID method).
Note: Event sourcing is used as a persistence, so it would be problematic and not efficient to load multiple aggregates at once to avoid multiple requests to the repository).
What is the best solution for this case?
P.S.: One solution is to redesign UI (more like task based UI), e.g.: User first creates order (with general info), then adds products one by one (each addition separate http request), but still I need to support bulk operations (api for third party applications as an example).
The short answer: pass a domain service (see Evans, chapter 5) to the aggregate along with the other command arguments.
CQRS states: command should not query read side.
That's not an absolute -- there are trade offs involved when you include a query in your command handler; that doesn't mean that you cannot do it.
In domain-driven-design, we have the concept of a domain service, which is a stateless mechanism by which the aggregate can learn information from data outside of its own consistency boundary.
So you can define a service that validates whether or not a product exists, and pass that service to the aggregate as an argument when you add the item. The work of computing whether the product exists would be abstracted behind the service interface.
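A minimal sketch of that idea, assuming a hypothetical ProductCatalog interface with a single exists method (how it is implemented, whether by a read model lookup or a remote call, is hidden from the aggregate):

```python
from typing import List, Protocol, Set


class ProductCatalog(Protocol):
    """Domain service interface; the implementation lives outside the aggregate."""

    def exists(self, product_id: str) -> bool: ...


class UnknownProduct(Exception):
    pass


class Order:
    def __init__(self, order_id: str) -> None:
        self.order_id = order_id
        self.lines: List[dict] = []

    def add_line(
        self, product_id: str, price: int, quantity: int, catalog: ProductCatalog
    ) -> None:
        # The service is passed in as an argument; the aggregate holds no reference to it.
        if not catalog.exists(product_id):
            raise UnknownProduct(product_id)
        self.lines.append(
            {"product_id": product_id, "price": price, "quantity": quantity}
        )


class InMemoryCatalog:
    """Test double; a real implementation might query a read model instead."""

    def __init__(self, known: Set[str]) -> None:
        self._known = known

    def exists(self, product_id: str) -> bool:
        return product_id in self._known


order = Order("o-1")
order.add_line("p-1", 1000, 2, InMemoryCatalog({"p-1"}))
```

For bulk operations the same shape works with an interface that accepts a list of ids, so the lookup can be done in one call rather than one per line.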
But what you need to keep in mind is this: products, presumably, are defined outside of the order aggregate. That means that they can be changing concurrently with your check to verify the product_id. From the point of view of correctness, there's no real difference between checking the validity of the product_id in the aggregate, or in the application's command handler, or in the client code. In all three places, the product state that you are validating against can be stale.
Udi Dahan shared an interesting observation years ago:
A microsecond difference in timing shouldn’t make a difference to core business behaviors.
If the client validated the data one hundred milliseconds ago when composing the command, and the data was valid then, what should the behavior of the aggregate be?
Think about a command to add a product that is composed concurrently with an order of that same product - should the correctness of the system, from a business perspective, depend on the order that those two commands happen to arrive?
Another thing to keep in mind is that, by introducing this check into your aggregate, you are coupling the ability to change the aggregate to the availability of the domain service. What is supposed to happen if the domain service can't reach the data it needs (because the read model is down, or whatever)? Does it block? Throw an exception? Make a guess? Does this choice ripple back into the design of the aggregate? And so on.

Validate Command in CQRS that related to other domain

I am learning to develop microservices using DDD, CQRS, and ES. It is HTTP RESTful service. The microservices is about online shop. There are several domains like products, orders, suppliers, customers, and so on. The domains built in separate services. How to do the validation if the command payload relates to other domains?
For example, here is the addOrderItemCommand payload in the order service (command-side).
{
"customerId": "CUST111",
"productId": "SKU222",
"orderId":"SO333"
}
How to validate the command above? How to know that the customer is really exists in database (query-side customer service) and still active? How to know that the product is exists in database and the status of the product is published? How to know whether the customer eligible to get the promo price from the related product?
Is it ok to call the API directly (like point-to-point / ajax / request promise) to validate this payload in the order command-side service? I think the performance will get worse if the API is called directly just for validation, because we have developed an event processor outside the command service that listens for events and applies them to the materialized view.
Thank you.
As more than one bounded context needs to be queried for the validation to pass, you need to consider eventual consistency. That being said, there is always a chance that the process as a whole can be in an invalid state for a "small" amount of time. For example, the user could be deactivated after the command is accepted and before the order is shipped. An online shop is a complex system and exceptions could appear in any of its subsystems. However, being implemented as an event-driven system helps; every time the ordering process enters an invalid state you can take compensatory actions/commands. For example, if the user is deactivated in the meantime you can cancel all their standing orders, release the reserved products, notify the potential customers that have those products in their wishlist that they are not available, and so on.
There are many kinds of validation in DDD, but I follow the general rule that validation should be done as early as possible without compromising data consistency. So, in order to be early, you could query the read model to reject the commands that couldn't possibly be valid, and in order for the system to be consistent you need to make another check just before the order is shipped.
Now let's talk about your specific questions:
How to know that the customer is really exists in database (query-side customer service) and still active?
You can query the read model to verify that the user exists and is still active. You should do this because a command that comes from an invalid user is a strong indication of some kind of attack, and you don't want those kinds of commands passing through your system. However, even if a command passes this check, it does not necessarily mean that the order will be shipped, as other exceptions could be raised in between.
How to know that the product is exists in database and the status of the product is published?
Again, you can query the readmodel in order to notify the user that the product is not available at the moment. Or, depending on your business, you could allow the command to pass if you know that those products will be available in less than 24 hours based on some previous statistics (for example you know that TV sets arrive daily in your stock). Or you could let the customer choose whether it waits or not. In this case, if the products are not in stock at the final phase of the ordering (the shipping) you notify the customer that the products are not in stock anymore.
How to know whether the customer eligible to get the promo price from the related product?
You will probably have to query another bounded context like Promotions BC to check this. This depends on how promotions are validated/used.
Is it ok to call API directly (like point-to-point / ajax / request promise) to validate this payload in order command-side service? But I think, the performance will get worse if the API called directly just for validation.
This depends on how resilient you want your system to be and how fast you want to reject invalid commands.
Synchronous calls are simpler to implement but they lead to a less resilient system (you should be aware of cascading failures and use techniques like a circuit breaker to stop them).
Asynchronous calls (i.e. using events) are harder to implement but make your system more resilient. In order to have async calls, the ordering system can subscribe to other systems' events and maintain a private state that can be queried for validation purposes as the commands arrive. In this way, the ordering system continues to work even if the link to the inventory or customer management systems is down.
In any case, it really depends on your business and none of us can tell you exactly what to do.
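For the synchronous route, the circuit breaker mentioned above can be sketched in a few lines. This is a deliberately simplified illustration, not a production implementation: the class and parameter names (CircuitBreaker, max_failures, reset_after) are made up here, and real libraries add timeouts, metrics and thread safety.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpen(Exception):
    """Raised when calls are short-circuited without contacting the remote service."""


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int,
        reset_after: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_failures = max_failures
        self._reset_after = reset_after   # seconds to wait before a trial call
        self._clock = clock               # injectable so the sketch is testable
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpen()
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


breaker = CircuitBreaker(max_failures=3, reset_after=30.0)
status = breaker.call(lambda: "customer CUST111 is active")
```

While the circuit is open the order service has to decide what to do with the command: reject it, queue it, or accept it and re-validate later, which is again a business decision.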
As always everything depends on the specifics of the domain but as a general principle cross domain validation should be done via the read model.
In this case, I would maintain a read model within each microservice for use in validation. Of course, that brings with it the question of eventual consistency.
How you handle that should come from your understanding of the domain. Factors such as the length of the eventual consistency compared to the frequency of updates should be considered. The cost of getting it wrong for the business compared to the cost of development to minimise the problem. In many cases, just recording the fact there has been a problem is more than adequate for the business.
I have a blog post dedicated to validation which you can find here: How To Validate Commands in a CQRS Application

Eventual Consistency in microservice-based architecture temporarily limits functionality

I'll illustrate my question with Twitter. For example, Twitter has microservice-based architecture which means that different processes are in different servers and have different databases.
A new tweet appears, server A stored in its own database some data, generated new events and fired them. Server B and C didn't get these events at this point and didn't store anything in their databases nor processed anything.
The user that created the tweet wants to edit that tweet. To achieve that, all three services A, B, C should have processed all events and stored to db all required data, but service B and C aren't consistent yet. That means that we are not able to provide edit functionality at the moment.
As I can see, a possible workaround could be in switching to immediate consistency, but that will take away all microservice-based architecture benefits and probably could cause problems with tight coupling.
Another workaround is to restrict the user's actions for some time until the data are consistent across all necessary services. Probably a solution, depending on the customer and their business requirements.
And another workaround is to add additional logic or probably service D that will store edits as user's actions and apply them to data only when they will be consistent. Drawback is very increased complexity of the system.
And there are two-phase commits, but that's 1) not really reliable 2) slow.
I think slowness is a huge drawback in case of such loads as Twitter has. But probably it could be solved, whereas lack of reliability cannot, again, without increased complexity of a solution.
So, the questions are:
Are there any nice solutions to the illustrated situation or only things that I mentioned as workarounds? Maybe some programming platforms or databases?
Did I misunderstand something, and are some of the workarounds incorrect?
Is there any other approach except Eventual Consistency that will guarantee that all data will be stored and all necessary actions will be executed by other services?
Why Eventual Consistency has been picked for this use case? As I can see, right now it is the only way to guarantee that some data will be stored or some action will be performed if we are talking about event-driven approach when some of services will start their work when some event is fired, and following my example, that event would be “tweet is created”. So, in case if services B and C go down, I need to be able to perform action successfully when they will be up again.
Things I would like to achieve are: reliability, ability to bear high loads, adequate complexity of solution. Any links on any related subjects will be very much appreciated.
If there are natural limitations of this approach and what I want cannot be achieved using this paradigm, it is okay too. I just need to know that this problem really isn't solved yet.
It is all about tradeoffs. With eventual consistency in your example, it may mean that the user cannot edit for maybe a few seconds, since most eventually consistent technologies do not take long to replicate the data across nodes. So in this use case it is absolutely acceptable, since users are pretty slow in their actions.
For example:
MongoDB is consistent by default: reads and writes are issued to the
primary member of a replica set. Applications can optionally read from
secondary replicas, where data is eventually consistent by default.
from official MongoDB FAQ
Another alternative that is getting more popular is to use a streaming platform such as Apache Kafka where it is up to your architecture design how fast the stream consumer will process the data (for eventual consistency). Since the stream platform is very fast it is mostly only up to the speed of your stream processor to make the data available at the right place. So we are talking about milliseconds and not even seconds in most cases.
The key thing in these sorts of architectures is to have each service be autonomous when it comes to writes: it can take the write even if none of the other application-level services are up.
So in the example of a Twitter-like service, you would model it as:
Service A manages the content of a post
So when a user makes a post, a write happens in Service A's DB and from that instant the post can be edited because editing is just a request to A.
If there's some other service that consumes the "post content" change events from A and after a "new post" event exposes some functionality, that functionality isn't going to be exposed until that service sees the event (yay tautologies). But that's just physics: the sun could have gone supernova five minutes ago and we can't take any action (not that we could have) until we "see the light".
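A minimal sketch of that autonomy, assuming an in-memory store in place of Service A's real database and a plain list in place of a broker. The names `PostService`, `outbox`, `post_created`, and `post_edited` are hypothetical and only illustrate that an edit is served entirely from A's own data:

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class PostService:
    """Hypothetical Service A: owns post content and accepts writes on its own."""
    posts: dict = field(default_factory=dict)   # stands in for A's database
    outbox: list = field(default_factory=list)  # events waiting to be published

    def create(self, text: str) -> str:
        post_id = str(uuid.uuid4())
        self.posts[post_id] = text
        self.outbox.append({"type": "post_created", "id": post_id, "text": text})
        return post_id

    def edit(self, post_id: str, text: str) -> None:
        # Editing only touches A's own data; no other service is consulted.
        if post_id not in self.posts:
            raise KeyError(post_id)
        self.posts[post_id] = text
        self.outbox.append({"type": "post_edited", "id": post_id, "text": text})


service_a = PostService()
pid = service_a.create("hello")
service_a.edit(pid, "hello, world")  # works even if no consumer has run yet
```

Services B and C would read the events later, at their own pace; nothing in `edit` waits for them.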

Micro Services and noSQL - Best practice to enrich data in micro service architecture

I want to plan a solution that manages enriched data in my architecture.
To be more specific, I have dozens of microservices,
let's say Country, Building, Floor, Worker,
each running over a separate NoSQL data store.
When I get the data from the worker service, I also want to present the name of the floor the worker is working on, the building name, and the country name.
Solution 1.
Client will query all microservices.
Problem - multiple requests and making the client be aware of the structure.
I know multiple requests shouldn't bother me but I believe that returning a json describing the entity in one single call is better.
Solution 2.
Create an orchestration that retrieves the data from multiple services.
Problem - if the data (entity names, for example) is not stored in the same document in the DB it is very hard to sort and filter by these fields.
Solution 3.
Before saving the entity, e.g. worker, call all the other services and fill in the related data (building name, country name).
Problem - when the building name is changed, it doesn't reflect in the worker service.
Solution 4.
(This is the best one I can come up with).
Create a process that subscribes to a broker and receives all entity changes.
For each entity it updates all the relevant entities.
When an entity changes, let's say a building name changes, it updates all the documents that hold the building name.
Problem:
Each service has to know what can be updated.
When a trailing update happens, it shouldn't publish to the broker again (recursive update), so this can complicate the microservices.
Solution 5.
Keeping everything normalized. Filter and sort in ElasticSearch.
Problem: keeping normalized data in ES is too expensive performance-wise
One thing I saw Netflix do (which I like) is create intermediary services for stuff like this. So maybe a new intermediary service that can call the other services to gather all the data and then create the unified output with the Country, Building, Floor, Worker.
You can even go one step further and try to come up with a scheme for providing as input which resources you want to include in the output.
So I guess this closely matches your solution 2. I notice that you mention for solution 2 that there are concerns with sorting/filtering in the DBs. I think that if you are using NoSQL then it has to be for a reason, and more often than not the reason is performance. If this is done wrong then yes, you will have problems, but if all the searchable fields are properly keyed and indexed (as @Roman Susi mentioned in his bullet points 1 and 2) then I don't see this as being a problem. This service will only be as fast as the combination of your other services and data stores, so they have to be fast.
Now you keep your individual microservices as they are, keep the client calling one service, and encapsulate the complexity of merging the data into this new service.
This is the video that I saw this in (https://www.youtube.com/watch?v=StCrm572aEs)... it's a long video but very informative.
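As a rough sketch of such an intermediary service, assuming in-memory dictionaries in place of real calls to the Worker, Floor, Building, and Country services. The function `get_worker_view` and its `include` parameter are hypothetical; they illustrate the idea of letting the caller name which resources to merge into the output:

```python
from typing import Iterable

# Hypothetical stand-ins for calls to the individual microservices.
WORKERS = {"w1": {"name": "Ann", "floor_id": "f1"}}
FLOORS = {"f1": {"name": "3rd floor", "building_id": "b1"}}
BUILDINGS = {"b1": {"name": "HQ", "country_id": "c1"}}
COUNTRIES = {"c1": {"name": "Norway"}}


def get_worker_view(worker_id: str, include: Iterable[str] = ()) -> dict:
    """Merge data from several services; `include` selects optional resources."""
    include = set(include)
    worker = WORKERS[worker_id]
    view = {"id": worker_id, "name": worker["name"]}
    floor = FLOORS[worker["floor_id"]]
    building = BUILDINGS[floor["building_id"]]
    if "floor" in include:
        view["floor"] = floor["name"]
    if "building" in include:
        view["building"] = building["name"]
    if "country" in include:
        view["country"] = COUNTRIES[building["country_id"]]["name"]
    return view


view = get_worker_view("w1", include=["building", "country"])
```

The client makes one call and never learns how the entities are linked; that knowledge stays inside the intermediary.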
It is hard to advise at the "Solution N" level, but certain problems can be avoided by following this advice:
Use globally unique identifiers for entities. For example, by assigning key values some kind of URI.
The global ids also simplify updates, because you track what has actually changed, the name or the entity. (entity has one-to-one relation with global URI)
CAP theorem says you can choose only two from CAP. Do you want a CA architecture? Or CP? Or maybe AP? This will strongly affect the way you distribute data.
For "sort and filter" there is MapReduce approach, which can distribute the load of figuring out those things.
Think carefully about the balance of normalization / denormalization. If your services operate on URIs, you can have a service which turns URIs into labels (names, descriptions, etc.), so you do not need to keep the redundant information everywhere and update it. Do not optimize prematurely; try to keep data normalized as long as possible. This way, the worker may not even need the building name, only its global id, and the microservice looks up the metadata from another microservice.
In other words, minimize the number of keys, shared between services, as part of separation of concerns.
Focus on the underlying model, not the JSON to and from. Right modelling of the data in your system(s) gains you more than saving JSON calls.
As for NoSQL, take a look at the Riak database: it has adjustable CAP properties, IIRC. Even if you do not use it as such, reading its documentation may help you come up with a suitable architecture for your distributed microservices system. (Of course, this applies if you have an essentially parallel system.)
First of all, thanks for your question. It is similar to the main problem of document DBs: how to sort a collection by a field from another collection? I have my own answer for that, so I'll try to comment on all your solutions:
Solution 1: It is good if the client wants to work with Countries/Buildings/Floors independently. But it does not solve the problem you mentioned in Solution 2: sorting 10k workers by building is going to be slow.
Solution 2: Similar to Solution 1, for the case where all the client wants is a list of enriched workers without knowing how to combine it from multiple pieces.
Solution 3: As you said, unacceptable because of inconsistent data.
Solution 4: It is going to work, most of the time. But:
Huge data duplication. If you have 20 entities, you are going to have 20x the data.
High complexity. 20 entities -> 20 different procedures to update related data.
High coupling. All your services must know each other. A data model change will propagate to every service because of the update procedures.
Questionable eventual consistency. It can be done so data will be consistent after failures but it is not going to be easy
Solution 5: Kind of answer :-)
But you do not want everything in one place. Keep separate services that serve separate entities, and build other services on top of them.
If client wants enriched data - build service that returns enriched data, as in Solution 2.
If client wants to display list of enriched data with filtering and sorting - build a service that provides enriched data with filtering and sorting capability! Likely, implementation of such service will contain ES instance that contains cached and indexed data from lower-level services. Point here is that ES does not have to contain everything or be shared between every service - it is up to you to decide better balance between performance and infrastructure resources.
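A minimal sketch of such a service, with a plain in-memory index standing in for the ES instance. The class `WorkerSearchIndex` and the event names `building_renamed` and `worker_saved` are hypothetical; the point is only that the read model caches just the fields it needs for sorting and is fed by change events from the lower-level services:

```python
class WorkerSearchIndex:
    """Hypothetical read model: caches only the fields needed to sort and filter."""

    def __init__(self):
        self.building_names = {}  # building_id -> name
        self.workers = {}         # worker_id -> {"name", "building_id"}

    def handle(self, event: dict) -> None:
        # Events are assumed to come from a broker fed by the lower-level services.
        if event["type"] == "building_renamed":
            self.building_names[event["id"]] = event["name"]
        elif event["type"] == "worker_saved":
            self.workers[event["id"]] = {
                "name": event["name"],
                "building_id": event["building_id"],
            }

    def list_sorted_by_building(self) -> list:
        # The building name is joined at read time, so a rename needs one update.
        rows = [
            {"worker": w["name"], "building": self.building_names.get(w["building_id"], "")}
            for w in self.workers.values()
        ]
        return sorted(rows, key=lambda r: (r["building"], r["worker"]))


index = WorkerSearchIndex()
index.handle({"type": "building_renamed", "id": "b1", "name": "Zeta Tower"})
index.handle({"type": "building_renamed", "id": "b2", "name": "Alpha House"})
index.handle({"type": "worker_saved", "id": "w1", "name": "Ann", "building_id": "b1"})
index.handle({"type": "worker_saved", "id": "w2", "name": "Bob", "building_id": "b2"})
rows = index.list_sorted_by_building()
```

Only this service holds the duplicated names, so the entity services themselves stay normalized and unaware of each other.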
This is a case where Linked Data can help you.
Basically, the floor attribute of the worker would be a URI (a link) to the floor itself. Any other linked data should be expressed as URIs as well.
Modeled with some JSON-LD it would look like this:
worker = {
  '@id': '/workers/87373',
  name: 'John',
  floor: {
    '@id': '/floors/123'
  }
}
floor = {
  '@id': '/floors/123',
  level: 12,
  building: { '@id': '/buildings/87' }
}
building = {
  '@id': '/buildings/87',
  name: "John's home",
  city: { '@id': '/cities/908' }
}
This way all the client has to do is prepend the base URL (like api.example.com) to the @id and make a simple GET call.
To remove the extra calls burden from the client (in case it's a slow mobile device), we use the gateway pattern with micro-services. The gateway can expand those links with very little effort and augment the return object. It can also do multiple calls in parallel.
So the gateway will make a GET /floors/123 call and replace the floor object on the worker with the reply.
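The link expansion done by the gateway could be sketched roughly like this. It assumes link objects are identified by the JSON-LD `@id` keyword and uses a dictionary lookup, `fetch`, as a hypothetical stand-in for the real HTTP GET calls; `expand` and `is_link` are illustrative names, and only one level of links is expanded:

```python
from concurrent.futures import ThreadPoolExecutor

ID_KEY = "@id"  # JSON-LD keyword that holds a node's URI

# Hypothetical stand-in for the responses of the backing services.
RESOURCES = {
    "/floors/123": {ID_KEY: "/floors/123", "level": 12, "building": {ID_KEY: "/buildings/87"}},
}


def fetch(uri: str) -> dict:
    # A real gateway would issue an HTTP GET against the base URL here.
    return RESOURCES[uri]


def is_link(value) -> bool:
    # A bare link is an object whose only key is the id.
    return isinstance(value, dict) and set(value) == {ID_KEY}


def expand(resource: dict) -> dict:
    """Replace each direct link in `resource` with the fetched object (one level deep)."""
    links = {key: value[ID_KEY] for key, value in resource.items() if is_link(value)}
    with ThreadPoolExecutor() as pool:
        # pool.map preserves input order, so keys and replies line up.
        fetched = dict(zip(links, pool.map(fetch, links.values())))
    return {**resource, **fetched}


worker = {ID_KEY: "/workers/87373", "name": "John", "floor": {ID_KEY: "/floors/123"}}
expanded = expand(worker)
```

Nested links, such as the floor's building, are left as links here; a gateway could apply `expand` again if the client asks for deeper data.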

Resources