Microservices and isolated persistence - how should the data be stored/fetched?

At my company, we're about to move to a microservices architecture. I have read a lot about it, and many areas remain murky because they depend on the specific project, but there is one point everyone seems to agree on: microservices need isolated persistence or, put another way, they need to have their own database.
Now I love the idea: it means every microservice has its own database schema and its own domain objects, and is 100% independent of any other microservice's data structure.
There are things I don't quite understand though.
The "Customer Service" is obviously central to the application, and we can see that basically any other microservice will need some data about the user at some point, whether it's the user's credit amount, ID, or name.
But since other microservices can't directly read into the Customer Service database, they'll need to query this service over and over again. This is fine (I guess) for simple stuff like getting the name of current logged user, but when we need to display 60 users on a page and we can't do any SQL join, it feels like we're missing something. This is even worse when microservices depend upon tons of microservices.
So I found out that some people actually queried microservices X times a day to get data into their own microservices.
So if microservice "Search" needs data from "Product", "Customer", it'll actually query these microservices and will persist the data with its own data structure.
The question I have is: should "Search" be the one that queries "Product" and "Customer", or should "Product" and "Customer" send data to "Search"?
The first option looks a bit easier: the logic lives on one side only, where the data is needed. But the data would only be as fresh as the last scheduled query, which is not very smart, although it could definitely work.
The second option looks a bit more difficult but also more scalable: because the data is sent from where it changes, we could have very fresh data when we need it, and the updates could also be more granular.

I think you correctly identified downsides to the microservices approach! And there are no elegant solutions to these specific problems. You will have to eat the additional work and architecture deterioration that this brings.
Concretely addressing your question now:
The question I have is should it be "Search" that queries "Product" and "Customer", or should "Product" and "Customer" send data to "Search" ?
You seem to be looking for a data synchronization service. You want to decide between push and pull. You are concerned about data freshness and logic duplication.
The key point here is that the source service cannot know about its consumers. This is to prevent an unwanted reverse dependency. This would break architectural isolation. Any data sync process that maintains this is fine. You can do what is most convenient.
For example, you could make the data source expose two APIs:
An API to get the whole data set. This would be called periodically by the destination (e.g. nightly). It can also be used to seed the destination at will and to fix data errors there.
A feed of changes in the source database keyed by the date and time the change occurred. The destination can now poll that change feed very frequently (e.g. every few seconds or minutes) and apply the small delta that occurred.
You can even build a realtime change feed through a publish-subscribe middleware. Many message queue products can do that. The source would just send out changes to the middleware.
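To make the two-API idea above concrete, here is a minimal in-memory sketch. All names (`SourceService`, `Destination`, `snapshot`, `changes_since`) are hypothetical, and the feed is keyed by a sequence number rather than by date and time, which is my own simplification to avoid clock issues. Note that the source never references its consumers.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Change:
    seq: int                # position in the feed (assumption: replaces a timestamp)
    key: str
    value: Optional[dict]   # None marks a deletion


@dataclass
class SourceService:
    """Hypothetical data owner; it knows nothing about who consumes it."""
    rows: dict = field(default_factory=dict)
    feed: list = field(default_factory=list)

    def write(self, key, value):
        if value is None:
            self.rows.pop(key, None)
        else:
            self.rows[key] = value
        self.feed.append(Change(len(self.feed) + 1, key, value))

    def snapshot(self):
        """API 1: the whole data set plus the feed position it reflects."""
        return dict(self.rows), len(self.feed)

    def changes_since(self, cursor):
        """API 2: every change recorded after the given feed position."""
        return self.feed[cursor:]


@dataclass
class Destination:
    copy: dict = field(default_factory=dict)
    cursor: int = 0

    def seed(self, source):
        # Full reload, e.g. nightly or to repair data errors.
        self.copy, self.cursor = source.snapshot()

    def poll(self, source):
        # Frequent small delta, e.g. every few seconds.
        for change in source.changes_since(self.cursor):
            if change.value is None:
                self.copy.pop(change.key, None)
            else:
                self.copy[change.key] = change.value
            self.cursor = change.seq
```

The destination would call `seed` occasionally and `poll` often; in a real system both would be HTTP calls rather than method calls.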
Building all of this is conceptually simple but takes a lot of work. It also creates lots of ongoing work and increases the potential for bugs. Debugging becomes much harder. I have worked on systems like that.
I'm going to add a subjective note: Microservices are not well understood by many teams. The downsides are often ignored. You identified a few of the downsides correctly and they are nasty! Given what I read on the web I believe many teams do not realize the mess they are getting themselves into. Managing disparate data stores can be a nightmare. This is not a one-time "mess" but an ongoing one.
As an alternative I'd recommend using a common data store and building services simply as classes or projects that live in the same process. This gives you the microservices code structuring with the convenience of normal development. It also leaves a few of the upsides of microservices on the table.

Your identification of the problem is correct.
But the solution will vary from use case to use case.
In your example, the Product and Customer services should publish their events to Kafka or a similar messaging system, and the Search service should listen to them and update its own data.
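As a rough illustration of that flow, here is a sketch with an in-memory broker standing in for Kafka. The topic names, event fields and class names are all made up for the example; a real setup would use an actual Kafka client and handle retries and ordering.

```python
from collections import defaultdict


class Broker:
    """In-memory stand-in for Kafka or a similar broker (assumption)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers[topic]:
            handler(event)


class SearchService:
    """Keeps its own denormalised index, fed only by events."""

    def __init__(self, broker):
        self.index = {}
        broker.subscribe("product.changed", self._on_product)
        broker.subscribe("customer.changed", self._on_customer)

    def _on_product(self, event):
        self.index[("product", event["id"])] = event["name"]

    def _on_customer(self, event):
        self.index[("customer", event["id"])] = event["name"]

    def search(self, term):
        # Case-insensitive substring match over the local copy only.
        return sorted(key for key, name in self.index.items()
                      if term.lower() in name.lower())


broker = Broker()
search = SearchService(broker)
# The Product and Customer services would publish these on every change.
broker.publish("product.changed", {"id": 1, "name": "Red bicycle"})
broker.publish("customer.changed", {"id": 7, "name": "Alice"})
```

The publishers only know the topic, not who is listening, so Search can be added or removed without touching Product or Customer.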
In the Order service, say, when creating an order for a customer you may want to check that the customer exists. You might do that by calling the Customer service's synchronous API, but there are various other approaches for that too; I have answered here: linking Microservices and allowing for one to be unavailable
From my perspective, synchronous communication between services should be avoided, and there are ways around it; the link above should help.
You can use domain-driven design to break up your services and define their contracts correctly.

Related

How do I access data that my microservice does not own?

I have a microservice that needs some data it does not own. It needs a read-only cache of data that is owned by another service. I am looking for guidance on how to implement this.
I don't want my microservice to call another microservice. I have too much data that is used in a join for this to be successful. In addition, I don't want my service to be dependent on another service (which may be dependent on another ...).
Currently, I am publishing an event to a queue. Then my service subscribes and maintains a copy of the data. I am having problems staying in sync with the source system. Plus, our DBAs are complaining about data duplication. I don't see a lot of information on this topic.
Is there a pattern for this? What is it called?
First of all, there are a couple of ways to share data, and you mention two of them.
One service calls another service to get the data when it is required. This is good because you get up-to-date data and no extra management is required in the consuming service. The problem is that if you call it too many times, the other service's performance may suffer.
Another solution is to maintain a local copy of that data in the consuming service using a pub/sub mechanism.
Depending on your requirements and architecture, you can keep this in the actual database of the consuming service or in some type of (persisted) cache.
The drawback here is consistency. When working with a distributed architecture you will not get strong consistency; you have to rely on eventual consistency.
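One common reason a local copy drifts out of sync is that events arrive twice or out of order. A possible mitigation, sketched below, is to have the source attach a per-record version and let the consumer ignore anything that is not newer. This is only an assumption about your setup: the `key`, `version` and `data` fields are hypothetical and the source system would have to provide them.

```python
class LocalReplica:
    """Read-only copy of another service's data, updated from events."""

    def __init__(self):
        self.rows = {}      # key -> latest known payload
        self.versions = {}  # key -> version of that payload

    def apply(self, event):
        """Apply an event; return False if it was stale or a duplicate."""
        key, version = event["key"], event["version"]
        if version <= self.versions.get(key, 0):
            return False
        self.rows[key] = event["data"]
        self.versions[key] = version
        return True


replica = LocalReplica()
replica.apply({"key": "cust-1", "version": 1, "data": {"name": "Ann"}})
replica.apply({"key": "cust-1", "version": 3, "data": {"name": "Ann B."}})
# A late, older event is ignored instead of overwriting newer data.
replica.apply({"key": "cust-1", "version": 2, "data": {"name": "Anne"}})
```

Because applying the same event twice has no effect, the consumer can safely re-read the queue after a crash. The copy is still only eventually consistent with the source.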
Another solution, depending on your requirements, is to move the tables that need to be joined into a separate service. It depends on your use case.
If you still need consistency, then instead of having the first service update the data and then publish, create a mediator component that calls the two services synchronously. Things get complicated here, because you are now trying to implement a transaction over a distributed system.
One more point: when a product is built around a microservice architecture, it is not only a technical move. As an organization and as a team, you need to understand that something that works in a monolith does not work the same way in microservices. The DBAs need to understand this too: in microservices, duplication of data across schemas (and of other things, like code) is preferred over reusability.
Last but not least, if a service always has to call another service to get data, it is worth checking the service boundaries as well. Sometimes services need to be merged because the business functionality belongs together.

Why does each microservice get its own database?

It seems that in the traditional microservice architecture, each service gets its own database with a different understanding of the data (described here). Sometimes it is considered permissible for databases to duplicate data. For instance, the "Users" service might know essentially everything about a user, whereas the "Posts" service might just store primary keys and usernames (so that the author of a post can have their name displayed, for instance). This page talks about eventual consistency, sources of truth, and other related concepts when data is duplicated. I understand that microservice architectures sometimes include a shared database, but most places I look suggest that this is a rare strategy.
As for why each service typically gets its own database, all I've seen so far is "so that each service owns its own resources," but I'm not convinced that a) the service layer in any way "owns" the persisted resources accessed through the database to begin with, or that b) services even need to own the resources they require rather than accessing necessary subsets of the master resources through a shared database.
So what are some of the justifications that each service in a microservice architecture should get its own database?
There are a few reasons why it does make sense to use a separate database per micro-service. Some of them are:
Scaling
Splitting your domain into micro-services is fine. You can scale a particular micro-service up on its web server on demand, or scale it out as needed. That is obviously one of the benefits of using micro-services. More importantly, you can have micro-service-1 running on, for example, 10 servers because its traffic demands it, while micro-service-2 only requires 1 web server, so you deploy it on 1 server. The good thing is that you control this and can manage your computing resources to save money, as cloud providers are not cheap.
Considering this what about the database?
If you have one database for multiple services you could not do this. You could not scale the databases individually as they would be on one server.
Data partitioning to reduce size
As you split your domain into micro-services, each with its own database, you automatically split the amount of data stored in each database. Ideally, this lets you use smaller database servers with less computing power and/or RAM.
In general, paying for multiple small servers is cheaper than paying for one large one.
So in this case you could make use of this fact and save some resources as well.
If a database that has already been split by domain still holds a large amount of data, techniques like data sharding or data partitioning could be applied in addition, but that is another topic.
Which db technology fits the business requirement
This is a very important argument for having multiple databases. It allows you to pick the database technology that best fits your business requirement, in order to get the best performance or use out of it. For example, some specific micro-service might have read-heavy operations with very complex filter options and a full-text search requirement. Using Elasticsearch in this case would be a good choice. Some other micro-service might use SQL Server because it requires SQL-specific features like transactional behavior or similar. If for some reason you have one database for all services, you would be stuck with that particular database technology, which might not perform well for those requirements. It is a compromise for sure.
Developer discipline
If for some reason you had a couple of micro-services sharing their database, you would need to deal with the human factor. The developers would need to be disciplined enough not to cross domains and access or modify the other micro-services' data (tables, collections, etc.), which would be hard to achieve and control. In large organisations with a lot of developers this could be a serious problem. With a hard/physical split this is not an issue.
Summary
There are some arguments for having a database per micro-service, but also some against it. In general, the guidelines and suggestions for micro-services are to keep each micro-service autonomous together with its data, so that in the ideal case it can work independently (this is not always the case). It is definitely a compromise, as is using micro-services in general. As always, the rule is the rule but there are exceptions to it. A micro-services architecture is flexible and very dependent on your domain needs and requirements. If you and your team find that it makes sense to merge multiple micro-service databases into one, and that it solves a lot of your problems, then go for it.
Microservices
Microservices advocate design constraints where each service is developed, deployed and scaled independently. This philosophy is only possible if you have a database per service. How can I continue my business if I have a DB failure, and what steps can I take to mitigate this? The DB is an essential part of any enterprise application. I agree there are a number of challenges when each service has its own database.
Why Independent database?
Unlike other approaches, this one not only keeps your code base clean and extensible, it also truly removes the single point of failure in your business. To achieve this, services can sometimes hold duplicated data, as long as each service is autonomous, and a service can only be autonomous if it has its own database.
From a business point of view, take an eCommerce application. You have microservices like Booking, Order, Payment, Recommendation, Search and so on. The database is shared. What happens if the DB is down? All your services are down, and there is no point in using a microservices architecture other than having a clean code base.
If each service has its own database, I don't mind if my recommendation service is not working: I can still search and place the order, and I haven't lost the customer. That's the whole point.
It comes at cost and challenges, but in longer run it pays off.
SQL / NoSQL
Each service has its own needs. To get the best performance I can use SQL for the payment service (transactions) and I can (and should) use NoSQL for the recommendation service. A shared database wouldn't help me in this case. In modern cloud architectures using patterns like CQRS, event sourcing and materialized views, we sometimes use two different databases for the same service to get the performance out of it.
Again, database per service is not only about resources or how much data a service should own; we really have to see the bigger picture. Yes, there are certain practices regarding how much data and duplication is good or bad, but that's another debate.
Hope that helps !

Need defense against wacky challenge to Event Sourcing architecture w/CosmosDB

In the current plan, incoming commands are handled via Function Apps, resulting in Events being sent to an Event Hub, and then materializing the views
Someone is arguing that instead of storing events in something like table storage, and materializing views based on events and snapshots, that we should:
Just stream events to a log in Azure Monitor to have auditing
We can make changes to a domain object immediately in response to a command and use the change feed as our source of events for materialized views.
He doesn’t see the advantage of even having a materialized view. Why not just use a query? Argument is we don’t expect a lot of traffic.
He wants to fulfill the whole audit log by saving events to the azure monitor log - Just an application log. Instead, that commands should just directly modify the representation of an entity in cosmos, and we'd use the change feed from CosmosDB as our domain object events, or we would create new events off of that via subscribers to that stream.
Is this actually an advantageous approach? Can ya'll think of any reasons why we wouldn't want to do that? Seems like we'd be losing something here.
He's saying we'd no longer need to be concerned with eventual consistency, as we'd have immediate consistency.
Every reference implementation I've evaluated does NOT do it the way he's suggesting. I'm not deeply versed in the advantages/disadvantages of the event sourcing / CQRS paradigm so I'm at a loss at the moment.. Currently researching furiously
This is a conceptual issue so there's not so much a code example. However, here's some references that seem to back up the approach I'm taking..
https://medium.com/#thomasweiss_io/planet-scale-event-sourcing-with-azure-cosmos-db-48a557757c8d
https://sajeetharan.com/2019/02/03/event-sourcing-with-azure-eventhub-and-cosmosdb/
https://learn.microsoft.com/en-us/azure/architecture/patterns/event-sourcing
If your goal is only to have the audit log, state-based persistence could be a good choice. Event sourcing adds some complexity to the implementation side and unless you can identify more advantages of using it, you might not convince your team to bring this complexity to the system. There are numerous questions and answers on SO, as well as in some blog posts, about pros and cons of event sourcing, so I won't get into that discussion here.
I can warn you, though, that the second article in your list is very weak and would most probably lead you to many difficulties. The role of Event Hub there is completely unclear and it doesn't explain anything about projections and read-models (what you call "materialised views"). Only a very limited number of use-cases can live with only getting one entity by id and without being able to execute a query across multiple entities. That also probably answers your concern of having read-models at all. You will need them very soon when for the first time you will start figuring out how to get a list of entities based on some condition (query).
Using CosmosDb as the event store is completely feasible, as described in the first article if you can manage the costs involved. Just remember to set the change feed TTL to -1, otherwise, you won't be able to replay your projections when you need to.
To summarise:
Keeping the audit log can be done without event sourcing, but you need to ensure that events are published reliably, preferably in the same transaction as the entity state update. That is often hard or impossible, but you might accept the risk if your audit requirement is not strict. You can also base your audit log on the CosmosDb change feed, just collecting document changes and logging them somewhere.
Event sourcing is a powerful technique but it has both pros and cons. The most common prejudice against using event sourcing is its implementation complexity. It might not be a big issue if you have a team that is somewhat experienced in building event-sourced systems. If you don't have such a team, you might want to build a small-scale spike to get some experience.
If you don't get full buy-in from the team to use event sourcing, you will later get all the blame if anything goes wrong. And it will go wrong at some point, especially with little experience in this area.
Spend some time reading books and trying out things yourself, before going wild in production.
Don't use Event Hub for anything that it is not designed for. Event Hub is a powerful event ingestion transport with limited TTL and it should be used for that purpose.
Don't use Table Storage as the event store, unless you only read entities by id. I used it in production for such a scenario and it worked (to some extent) but you can't project read-models from there.
A simple rule of thumb is to not use products for tasks they weren't designed for.
Azure Monitor was not designed to store application domain data. Azure Monitor is designed to store telemetry data from your applications and services and provides features such as alerts and other types of integration into DevOps tools for managing the operation and health of your apps.
There is a simple reason why you were able to find articles on event sourcing using Cosmos DB and why our own docs talk about it. Because it was designed to be used this way. It is simple to set up Cosmos DB to be an append only event store for your applications and use Change Feed to fire off messages in other apps or services or, in your case, to maintain a materialized view state of domain objects within your app.
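To show what "append-only event store plus materialized view" means in the smallest possible form, here is an in-memory sketch. It does not use the Cosmos DB SDK or the change feed; `EventStore`, `project_balances` and the account events are all hypothetical and only illustrate the shape of the pattern.

```python
from collections import defaultdict


class EventStore:
    """Append-only log; an in-memory stand-in for a real event store."""

    def __init__(self):
        self._events = []

    def append(self, stream_id, event_type, data):
        # Events are only ever added, never updated or deleted.
        self._events.append({"stream": stream_id, "type": event_type, "data": data})

    def read_from(self, position=0):
        return self._events[position:]


def project_balances(events):
    """Fold events into a read model: account id -> current balance."""
    balances = defaultdict(int)
    for e in events:
        if e["type"] == "Deposited":
            balances[e["stream"]] += e["data"]["amount"]
        elif e["type"] == "Withdrawn":
            balances[e["stream"]] -= e["data"]["amount"]
    return dict(balances)


store = EventStore()
store.append("acc-1", "Deposited", {"amount": 100})
store.append("acc-1", "Withdrawn", {"amount": 30})
store.append("acc-2", "Deposited", {"amount": 5})

# The view can be thrown away and rebuilt at any time by replaying the log.
view = project_balances(store.read_from())
```

What you would lose with the "just mutate the document" approach is exactly this ability to rebuild or add a new view from the full history; whether you need that is the real question for your team.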

Distributed Database Design Architecture Use Case for Users & Authentication

I am trying to design the databases for my microservice-oriented application in a distributed way. My application is for managing universities. I have different universities, say A, B and C. Each university has separate users working with its business data. I am planning to use separate databases per university for storing user data. So each university has its own database for its users, plus one additional database for its application tables. If I have 2 universities, then I have 2 user-details DBs and another 2 DBs for application tables.
My confusion is that when I search for database designs, I only see the approach of keeping one common database for storing all users (here, one DB for all users of all universities). So all users are mixed within one database.
If I use a separate database for each university, is it possible to support a distributed DB architecture pattern and the microservice-oriented standard? Or do I need to keep one DB for all users?
How can I find out which method is appropriate for a microservice / distributed database design pattern?
There can be multiple solutions and no single one is best; the best solution is the one that is appropriate for your product's requirements.
I think it would be a better idea to go with a separate database for each of your clients (universities), so that the data always stays isolated even if something goes wrong. Also, over time a single shared database could grow so huge that it becomes a problem to configure and manage separate backups, cleanups, etc. for individual clients.
With separate databases comes the challenge of managing distributed transactions across databases, as you don't know which part among many is going to fail. To manage that, you may have to implement a message/event-driven mechanism across all your microservices and ensure consistency.
Regarding message/event mechanism, here is a simple use case scenario, suppose there are two services "A" (user-registration) and "B" (email-service)
"A" registers a user temporarily and publishes an event of sending confirmation email.
The message goes to message broker
The message is received by "B".
The confirmation email is sent to the user.
The user confirms the email to "B"
The "B" publishes event of user confirmation to the broker
"A" receives the event of confirmation and the process is completed.
The above is the best-case scenario; problems can still happen along the way, even with the broker itself.
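The happy path of those steps can be sketched as follows, with an in-memory queue standing in for the message broker. The topic names, statuses and class names are invented for the example, and no failure handling (retries, timeouts, lost messages) is shown.

```python
from collections import defaultdict, deque


class Broker:
    """In-memory stand-in for a message broker (assumption)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, payload):
        self.queue.append((topic, payload))

    def deliver_all(self):
        while self.queue:
            topic, payload = self.queue.popleft()
            for handler in self.handlers[topic]:
                handler(payload)


class RegistrationService:  # service "A"
    def __init__(self, broker):
        self.broker = broker
        self.users = {}
        broker.subscribe("email.confirmed", self.on_confirmed)

    def register(self, email):
        self.users[email] = "pending"  # temporary registration
        self.broker.publish("confirmation.requested", {"email": email})

    def on_confirmed(self, event):
        self.users[event["email"]] = "active"


class EmailService:  # service "B"
    def __init__(self, broker):
        self.broker = broker
        self.outbox = []
        broker.subscribe("confirmation.requested", self.on_requested)

    def on_requested(self, event):
        self.outbox.append(event["email"])  # stands in for sending the mail

    def confirm(self, email):
        # Called when the user clicks the confirmation link.
        self.broker.publish("email.confirmed", {"email": email})


broker = Broker()
a = RegistrationService(broker)
b = EmailService(broker)
a.register("student@example.edu")
broker.deliver_all()   # "B" receives the request and sends the mail
b.confirm("student@example.edu")
broker.deliver_all()   # "A" receives the confirmation
```

Each service only touches its own state, and the user stays "pending" until the confirmation event arrives, which is where the eventual consistency shows up.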
You have to go deep into it if you think you need this.
Some links that may help.
http://how-to-implement-a-microservice-event-driven-architecture-with-spring-cloud-stre
A Guide to Transactions Across Microservices
I don't think this is a valid design. A database per client is a multi-tenant architecture practice, and a database per microservice is a microservice architecture practice. You are mixing things up.
If you use a microservice architecture, you had better design it around bounded contexts, with each context having its own database, to achieve the main rule of microservices: autonomy.

Sharing huge data between microservices

I am designing a review analysis platform with a microservices architecture.
The application works as follows:
all product reviews are retrieved from ecommerce-site-a (site-a) as an Excel file
reviews are uploaded to the system from the Excel file
Analysis agent can list all reviews, edit some of them, delete or approve
Analysis agent can export all reviews for site-a
Automated regexp based checks are applied for each review on upload and editing.
I have 3 microservices.
Reviews: Responsible for Review Crud operations plus operations similar to approve/reject..
Validations: Responsible for defining and applying validation rules on review.
Export/Import: Export service exports huge files given site name (like site-a)
The problem is that at some point, the validation service needs to get all reviews for site-a, apply the validation rules and generate errors if there are any. I know that sharing database schemas and entities breaks the microservices architecture.
One possible solution is
Whenever the validation service needs the reviews for a site, it sends a request to the gateway, the gateway redirects the request to the Reviews service, and the response is returned.
Two possible drawbacks of this approach are:
the validation service knows about the gateway; does that introduce a dependency?
if I have 1b reviews for a site, getting all reviews via a REST request may be a problem (or not, since I can make paginated requests from the validation service to the gateway).
So what is the best practice for sharing huge amounts of data between microservices without
sharing entities
and duplicating data
I have read a lot about using message queues, but I think in my case a message queue is not a good way to share gigabytes of data.
edit 1: Instead of sharing entities, could using data stores with a REST API be a solution? Assume I am using MongoDB: instead of sharing my entity object between microservices, I can use a REST interface for Mongo (http://restheart.org/) and query data whenever needed.
Your problem here is not "sharing huge data", but rather the boundaries you choose to separate your micro services based on.
I can tell from your requirements that the 3 micro services you chose to separate (Reviews, Validations, Import/Export) are actually operating on the same context and business domain .. which is Reviews.
I would encourage you to reconsider your design decision and consider Reviews, as a single micro service, that handles all reviews operations and logic as a black box.
I assume that reviews are independent from each other and that validating a review therefore requires only that review, and no other reviews.
You don't want to share entities, which rules out things like shared databases, Hadoop clusters or data stores like Redis. You also don't want to duplicate data, thereby ruling out plain file copies or trigger-based replication on database level.
In summary, I'd say your aim should be a stream. Let the Validator request everything about Site A from Reviews, not in one bulk transfer but as a stream of single reviews or small packages of reviews.
The Validator can then process the reviews one after the other, with constant memory and processor consumption. To gain performance, you can run multiple instances of the Validator that pull different, disjoint pieces of the stream at the same time. Similarly, you can create multiple instances of the Reviews microservice if one alone can't answer the pulls fast enough.
The Validator does not persist this stream, it produces only the errors and stores or sends them somewhere; this should fulfill your no-sharing no-duplication requirements.
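A small sketch of the streaming idea, using a generator so that only one batch is in memory at a time. The function names, the review fields and the single regexp rule are all hypothetical; in practice `stream_reviews` would wrap paginated HTTP calls to the Reviews service.

```python
import re
from typing import Iterable, Iterator


def stream_reviews(reviews: Iterable[dict], site: str,
                   batch_size: int = 2) -> Iterator[list]:
    """Reviews side: yield small batches for one site instead of one bulk."""
    batch = []
    for review in reviews:
        if review["site"] != site:
            continue
        batch.append(review)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # last, possibly smaller, batch


FORBIDDEN = re.compile(r"https?://", re.IGNORECASE)  # hypothetical rule


def validate(batches):
    """Validator side: keep nothing but the errors it produces."""
    for batch in batches:
        for review in batch:
            if FORBIDDEN.search(review["text"]):
                yield {"review_id": review["id"], "error": "contains a link"}


sample = [
    {"id": 1, "site": "site-a", "text": "Great product"},
    {"id": 2, "site": "site-b", "text": "see http://spam.example"},
    {"id": 3, "site": "site-a", "text": "Buy at https://spam.example"},
    {"id": 4, "site": "site-a", "text": "Okay"},
]
errors = list(validate(stream_reviews(sample, "site-a")))
```

Memory use depends on `batch_size`, not on the total number of reviews, and the Validator never stores the reviews themselves.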

Resources