Data seeding of microservices - microservices

To test feature branches, we sometimes need to recreate data from scratch, e.g.:
An experimental branch irreversibly transforms data in a testing environment, so the databases/buckets must be refilled before another branch can be tested.
During testing, one microservice calls another microservice's API and changes that other service's data, so we always want to start testing on a clean environment.
With a monolith, this is usually as easy as:
Create a dump.
When needed, drop the database and apply the dump after the migrations (or partway through them, depending on the last migration the dump contains).
But it gets much harder with a microservice architecture:
Each microservice uses its own database(s).
That means we have N databases with some denormalized data.
Some microservices hold links to entities owned by another microservice, so all the dumps must be coherent and consistent with each other.
Different teams own different services, so it's much harder to agree on how and when to update the initial data dumps so that they stay consistent across services.
I'm looking for some best practices on this. Do we have anything better than trying to make snapshots of all databases at once?
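One alternative to simultaneous snapshots, offered only as a minimal sketch: seed every service's database from one shared, version-controlled fixture set, so that cross-service references agree by construction, and then check them. The two services (users, orders), their tables, and the fixture contents below are hypothetical, and in-memory SQLite stands in for each service's real database.

```python
import sqlite3

# Hypothetical shared fixtures: the single source of truth for seed data.
FIXTURES = {
    "users": [(1, "alice"), (2, "bob")],
    "orders": [(100, 1, "book"), (101, 2, "lamp")],  # (id, user_id, item)
}


def seed_users_db(conn):
    """Recreate the hypothetical users-service table from scratch."""
    conn.execute("DROP TABLE IF EXISTS users")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", FIXTURES["users"])
    conn.commit()


def seed_orders_db(conn):
    """Recreate the hypothetical orders-service table from scratch."""
    conn.execute("DROP TABLE IF EXISTS orders")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, item TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", FIXTURES["orders"])
    conn.commit()


def dangling_user_refs(users_conn, orders_conn):
    """Return user ids referenced by orders but missing from the users database."""
    known = {row[0] for row in users_conn.execute("SELECT id FROM users")}
    referenced = {row[0] for row in orders_conn.execute("SELECT user_id FROM orders")}
    return sorted(referenced - known)


users_db = sqlite3.connect(":memory:")
orders_db = sqlite3.connect(":memory:")
seed_users_db(users_db)
seed_orders_db(orders_db)
print(dangling_user_refs(users_db, orders_db))  # prints [] when the seeds are coherent
```

Because each seeding function drops and recreates its table, rerunning it returns that service to a clean state. The cross-database check can run in CI whenever any team changes the fixtures.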

Related

Sharing entity with other microservice

There is a Spring microservice with a PostgreSQL database that is responsible for a Product entity.
As there are a lot of Products and their number is still growing exponentially, we want to archive this data to another database (also PostgreSQL, as we know it best and are limited by what another tool supports). A lot is already happening in our main microservice (Product), so we want to extract data archiving into a separate job/microservice. We use a migration tool in the main microservice, which is responsible for Product table changes.
Question: how do we keep our Product entity in sync with this new technical (archiving) microservice, so that it can always read data from the DB and push it in the same state to another DB with the same schema?
Don't.
The point of microservices is that each service has a narrow, clearly defined set of responsibilities, allowing them to be deployed independently of each other.
If you do this then your entity code will have two sets of responsibilities, and changes that might help it do something in one service might be unneeded or even cause issues in another. It complicates deployment and testing.
Better to keep separate code bases, allow the two services to evolve independently, and live with some duplication.
There is also the question of why an archive job would need JPA entities at all; this sounds more like a job for a bulk copy tool or replication service than for JPA. Very likely this isn't the right technical choice: you'll end up with a very slow archive process that gets rewritten to not use JPA, and the effort to reuse the entity will have been wasted.
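To illustrate what an entity-free archive job might look like, here is a minimal sketch that copies rows in batches with plain SQL. The `product` table, its columns, and the batch size are assumptions, and in-memory SQLite stands in for both PostgreSQL databases; with real PostgreSQL, `COPY` or logical replication would likely be better choices.

```python
import sqlite3


def archive_products(source, archive, batch_size=1000):
    """Copy every row of source.product into archive.product, batch by batch.

    Pages by primary key (keyset pagination) instead of loading entities.
    Assumes ids are non-negative integers. Returns the number of rows copied.
    """
    copied = 0
    last_id = -1
    while True:
        rows = source.execute(
            "SELECT id, name, price FROM product WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return copied
        # INSERT OR REPLACE is SQLite syntax; it makes reruns idempotent.
        archive.executemany(
            "INSERT OR REPLACE INTO product (id, name, price) VALUES (?, ?, ?)",
            rows,
        )
        archive.commit()
        last_id = rows[-1][0]
        copied += len(rows)


SCHEMA = "CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price REAL)"
source = sqlite3.connect(":memory:")
archive = sqlite3.connect(":memory:")
source.execute(SCHEMA)
archive.execute(SCHEMA)
source.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [(i, f"product-{i}", i * 1.5) for i in range(1, 2501)],
)
source.commit()

print(archive_products(source, archive))  # prints 2500
```

The job depends only on the table's column list, not on the main service's entity classes, so the two code bases can evolve separately as the answer recommends.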

Microservice architecture - is database shared across all instances of the service?

I understand that microservice architecture suggests that each service should have its own private database. But when such a service is scaled, then is it one db per service instance or one db shared by all service instances?
Your first statement may be misleading to some: "each service should have its own private database."
Your architecture should be careful about sharing a single set of tables across multiple services. That sharing frequently leads to a shared schema dependency, which creates tight coupling: it becomes difficult to update the schema without simultaneously updating many of the services that share it.
However, sharing a single database instance (or database cluster) doesn't mean your services are accessing the same tables or even the same schema within the database. And if they aren't accessing the same tables, they aren't coupled. (Relying on the same database instance isn't coupling any more than relying on the same network. Don't confuse coupling with shared infrastructure.)
Frequently, multiple instances of the same service share the same database. In my opinion, there is nothing inherently wrong with this, but there are some things to be aware of. If you go this route, you need to be very careful when making changes to the data schema. Because multiple versions of that service may be accessing the data at the same time during updates, any schema change needs to be compatible with at least any two adjacent versions. If you add a column or table, that's fine. The older version won't attempt to use it, so there will be no problem. (Note, too, that the older version won't populate it either.) Removing a column or table is another problem entirely: to make that kind of breaking change, you will likely need to do it in several smaller steps to ensure that the older version of the service isn't broken. It can be done, it's just tougher.
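The additive case can be shown in a small sketch. The `account` table, the `display_name` column, and the two "versions" of the service are hypothetical, and SQLite stands in for the shared database. The point is only that an old version which names its columns explicitly keeps working after a nullable column is added.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")


def v1_create_account(conn, email):
    """Old service version: names its columns explicitly, knows nothing newer."""
    conn.execute("INSERT INTO account (email) VALUES (?)", (email,))


def v2_create_account(conn, email, display_name):
    """New service version: also writes the column added by its migration."""
    conn.execute(
        "INSERT INTO account (email, display_name) VALUES (?, ?)",
        (email, display_name),
    )


v1_create_account(db, "old@example.com")

# Migration shipped with version 2: additive and nullable, so v1 is unaffected.
db.execute("ALTER TABLE account ADD COLUMN display_name TEXT")

# During a rolling update both versions write to the same table.
v1_create_account(db, "still-v1@example.com")
v2_create_account(db, "new@example.com", "New User")

rows = db.execute("SELECT email, display_name FROM account ORDER BY id").fetchall()
print(rows)
```

Rows written by the old version carry NULL in the new column, which is the "won't populate it either" caveat above. The new version has to tolerate that until a backfill runs.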
A general rule of microservice development is that each microservice
should manage its own data. In an ideal world, the data managed by
each service would be completely independent. There would be no need
to propagate data changes made in one service to other services.
In the real world, however, complete data independence is impossible.
There will always be overlaps between the data used in different
services. Consequently, as an architect, you need to think carefully about
sharing data and managing data consistency. You need to think about
the microservices as an interacting system rather than as individual
units.
This means:
You should isolate data within each system service with as little
data sharing as possible.
If data sharing is unavoidable, you should design microservices so
that most sharing is read-only, with a minimal number of
services responsible for data updates.
If services are replicated in your system, you must include a
mechanism that can keep the database copies used by replica
services consistent.
Good question indeed. I would answer it like: "at least a database per microservice (not instance)"
A concern is the scalability of the database itself, i.e. can service instances outscale the database?
If so, you could opt for e.g. an in-memory database or a sidecar for your microservice. The database would be ephemeral, and you would need to populate it after the pod/container (re)starts. So the state does not really live in the database.
Apache Kafka is a tool that fits this spot: it would allow you to populate the database after the service comes up, and it also provides the tooling to synchronize state across all currently running and future instances. Successfully implementing Event Sourcing with Kafka is not a trivial task, though, and you could even come to the conclusion that you don't need databases at all.
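The idea of populating an ephemeral store after startup can be sketched without any Kafka client at all. Below, a plain Python list stands in for a retained, ordered topic, and the `Event` shape and function names are hypothetical; real consumer code, offsets, and failure handling are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    kind: str                      # "upsert" or "delete"
    key: str
    value: Optional[dict] = None   # payload for upserts, unused for deletes


def apply_event(state, event):
    """Apply one event to an in-memory store (a plain dict)."""
    if event.kind == "upsert":
        state[event.key] = event.value
    elif event.kind == "delete":
        state.pop(event.key, None)
    else:
        raise ValueError(f"unknown event kind: {event.kind}")


def rebuild_state(log):
    """Populate an empty store by replaying the whole log in order."""
    state = {}
    for event in log:
        apply_event(state, event)
    return state


# A plain list stands in for the retained topic.
log = [
    Event("upsert", "p1", {"name": "lamp", "stock": 3}),
    Event("upsert", "p2", {"name": "desk", "stock": 1}),
    Event("upsert", "p1", {"name": "lamp", "stock": 2}),
    Event("delete", "p2"),
]

instance_a = rebuild_state(log)    # an instance that (re)starts now
log.append(Event("upsert", "p3", {"name": "chair", "stock": 7}))
apply_event(instance_a, log[-1])   # the running instance applies the live event
instance_b = rebuild_state(log)    # an instance started later replays everything

print(instance_a == instance_b)    # prints True
```

Both instances converge on the same state because they apply the same events in the same order. That ordering guarantee is what the log has to provide in a real deployment.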
So the question remains, can service instances really outscale the database?
The answer would be "no" more often than not.
So having a database instance per microservice (physically or logically) already gives you a lot in terms of "loose coupling and cohesive behaviour", as you don't share databases.
Another concern is breaking changes to the database between versions of the microservice. If things go wrong, you could find yourself unable to roll back. An ephemeral database could sync itself up in a compatible way.
Some say they change database technologies throughout the lifetime of a microservice. I never had the necessity to do so, but an in-memory/sidecar approach would fit here very well.
I presume you share one database among all instances of one microservice, so that an update is immediately available to every instance of that microservice. You may use one database instance per microservice instance to avoid the database being a single point of failure, but then you would have to keep every database in sync, which seems like unnecessary overhead for the database and the application. I assume the database is able to keep a group of db instances in sync (every insert, update, and delete is properly propagated).

Why does each microservice get its own database?

It seems that in the traditional microservice architecture, each service gets its own database with a different understanding of the data (described here). Sometimes it is considered permissible for databases to duplicate data. For instance, the "Users" service might know essentially everything about a user, whereas the "Posts" service might just store primary keys and usernames (so that the author of a post can have their name displayed, for instance). This page talks about eventual consistency, sources of truth, and other related concepts when data is duplicated. I understand that microservice architectures sometimes include a shared database, but most places I look suggest that this is a rare strategy.
As for why each service typically gets its own database, all I've seen so far is "so that each service owns its own resources," but I'm not convinced that a) the service layer in any way "owns" the persisted resources accessed through the database to begin with, or that b) services even need to own the resources they require rather than accessing necessary subsets of the master resources through a shared database.
So what are some of the justifications that each service in a microservice architecture should get its own database?
There are a few reasons why it does make sense to use a separate database per micro-service. Some of them are:
Scaling
Splitting your domain into micro-services is fine. You can scale a particular micro-service up on its web server on demand, or scale it out as needed. That is obviously one of the benefits of using micro-services. More importantly, you can have micro-service-1 running on, for example, 10 servers because its traffic demands it, while micro-service-2 only requires 1 web server, so you deploy it on 1 server. The good thing is that you control this and can manage your computing resources to save money, as cloud providers are not cheap.
Considering this, what about the database?
If you have one database for multiple services, you could not do this. You could not scale the databases individually, as they would be on one server.
Data partitioning to reduce size
As you split your domain into micro-services, each with its own database, you automatically split the amount of data stored in each database. Ideally, this lets you use smaller database servers with less computing power and/or RAM.
In general, paying for multiple small servers is cheaper than paying for one large one.
So in this case you could make use of this fact and save some resources as well.
If a database that is already split by domain still holds a large amount of data, techniques like data sharding or data partitioning could be applied in addition, but that is another topic.
Which db technology fits the business requirement
This is a very important argument for having multiple databases. It allows you to pick the database technology that best fits your business requirement, in order to get the best performance out of it. For example, a specific micro-service might have read-heavy operations with very complex filter options and a full-text search requirement. Using Elasticsearch in this case would be a good choice. Some other micro-service might use SQL Server because it requires SQL-specific features like transactional behavior or similar. If for some reason you have one database for all services, you would be stuck with one particular database technology, which might not perform well for those requirements. It is a compromise for sure.
Developer discipline
If for some reason a couple of micro-services share their database, you need to deal with the human factor. The developers would need to be disciplined enough not to cross domains and access or modify another micro-service's part of the database (tables, collections, etc.), which is hard to achieve and control. In large organisations with a lot of developers this could be a serious problem. With a hard/physical split this is not an issue.
Summary
There are some arguments for having a database per micro-service, but also some against it. In general, the guidelines and suggestions for micro-services are to keep each micro-service autonomous together with its data, so that in the ideal case it can work independently (this is not always the case). It is definitely a compromise, as is using micro-services in general. As always, the rule is the rule, but there are exceptions to it. Micro-services architecture is flexible and very dependent on your domain's needs and requirements. If you and your team identify that it makes sense to merge multiple micro-service databases into one and that it solves a lot of your problems, then go for it.
Microservices
Microservices advocate design constraints under which each service is developed, deployed, and scaled independently. This philosophy is only possible if you have a database per service. How can I continue my business if I have a DB failure, and what steps can I take to mitigate this? The DB is an essential part of any enterprise application. I agree there are a number of challenges when each service has its own database.
Why Independent database?
Unlike other approaches, this approach not only keeps your code base clean and extensible, but also truly removes the single point of failure in your business. To achieve this, services can sometimes hold duplicated data as well, as long as each service is autonomous, and services can only be autonomous if there is a database per service.
From a business point of view, let's take an eCommerce application. You have microservices like Booking, Order, Payment, Recommendation, Search, and so on. The database is shared. What happens if the DB is down? All your services are down! Then there is no point in using a microservices architecture, other than having a clean code base.
If each service has its own database, I don't mind if my recommendation service is not working, because I can still search and book the order and I haven't lost the customer. That's the whole point.
It comes with costs and challenges, but in the long run it pays off.
SQL / NoSQL
Each service has its own needs. To get the best performance I can use SQL for the payment service (transactions) and I can (I should) use NoSQL for the recommendation service. A shared database wouldn't help me in this case. In modern cloud architecture patterns like CQRS, Event Sourcing, and Materialized Views, we sometimes use 2 different databases for the same service to get the performance out of it.
Again, database per service is not only about resources or how much data a service should own; we really have to see the bigger picture. Yes, there are certain practices regarding how much data and duplication is good or bad, but that's another debate.
Hope that helps!

how to handle duplicated data in a micro service architecture

I am working on a jobs site where I am thinking of breaking out the jobs matching section into a micro service - everything else is a monolith.
But when thinking about how the microservice should have its own separate database, that would mean having the microservice have a separate copy of all the jobs, given the monolith would still handle all job crud functionality.
Am I thinking about this the right way and is it normal to have multiple copies of the same data spread out across different microservices?
The idea of having different databases with the same data scares me a bit, since that creates the potential for things to get out of sync.
You are trying to move away from a monolith, and the approach you are taking is very common: carve out a part of the monolith that can be converted into a microservice. Over time the monolith shrinks and the number of microservices grows.
Coming to your question about data duplication: yes, this is a challenge, and some data needs to be duplicated, but it varies from case to case and is difficult to judge without looking into the application.
You may expose an API so the monolith can get/create the data if needed, and I strongly suggest not sacrificing or compromising the microservice's data model to avoid duplication, because the microservice is going to be more important than your monolith in the future. Keep in mind that you should avoid adding any new code to the monolith, and even if you have to, ask the microservice for data instead of the monolith.
One more thing you can try: instead of REST API calls between microservices, you can use a caching mechanism with an event bus. Every microservice publishes its CRUD changes to the event bus; interested microservices consume those events and update their local caches accordingly.
The problem with REST calls is that when the service you depend on is down, you cannot query it, which can sometimes become a bottleneck.
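The event-bus approach might look roughly like the following sketch. Everything here is hypothetical: the `EventBus` class is a synchronous in-process stand-in for a real message broker, and the topic names and job fields are invented for illustration.

```python
class EventBus:
    """Tiny synchronous publish/subscribe bus; stands in for a real broker."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, topic, handler):
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic, payload):
        for handler in self._subscribers.get(topic, []):
            handler(payload)


class JobMatchingService:
    """Keeps a local copy of jobs, updated only through events."""

    def __init__(self, bus):
        self.jobs = {}  # local cache: job id -> job data
        bus.subscribe("job.created", self._on_upsert)
        bus.subscribe("job.updated", self._on_upsert)
        bus.subscribe("job.deleted", self._on_delete)

    def _on_upsert(self, job):
        self.jobs[job["id"]] = job

    def _on_delete(self, job):
        self.jobs.pop(job["id"], None)

    def match(self, keyword):
        """Answer from the local cache, with no call back to the monolith."""
        needle = keyword.lower()
        return sorted(
            job["id"] for job in self.jobs.values() if needle in job["title"].lower()
        )


bus = EventBus()
matcher = JobMatchingService(bus)

# The monolith still owns job CRUD and publishes every change.
bus.publish("job.created", {"id": 1, "title": "Python developer"})
bus.publish("job.created", {"id": 2, "title": "Java developer"})
bus.publish("job.updated", {"id": 2, "title": "Senior Python developer"})
bus.publish("job.deleted", {"id": 1})

print(matcher.match("python"))  # prints [2]
```

The matching service can keep answering queries from its cache while the monolith is down. The trade-off is that the copy is only eventually consistent with the source.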

Microservices, Dependencies and Events

I’ve been doing a lot of googling regarding managing dependencies between microservices. We’re trying to move away from big monolithic app into micro-services in order to scale organizationally and be able to develop faster and with multiple teams working in parallel.
However, as we’re trying to functionally partition the monolith into the microservices, we see how intertwined business logic and data really is. This was not a problem when we were sitting on top of one big DB and were able to do big relational joins. But with microservices, this becomes a problem.
One solution is to make microservice-A go to 5-10 other microservices to get the necessary data (this is the equivalent of a DB view with a join). Another solution is to make microservice-A listen to events from 5-10 other services and populate local storage with the relevant info (this is the equivalent of a materialized view). Either way, microservice-A is coupled with 5-10 other services, and if new info is needed in microservice-A, some of the services that it depends upon might need to be released prior to microservice-A. Please note that microservice-A is itself depended upon by other services. Bottom line, we end up with DISTRIBUTED dependency hell.
Many articles advocate for second solution – i.e. something along the lines of Event Sourcing, Choreography, etc.
I would appreciate any shared experiences, recommendations and insights.
Philometor.
While not technically an "answer", I can definitely share some of my observations and experiences. Your question concerning services calling other services for database operations reminded me of a project where an architect sold senior management on the idea of "decoupling" persistence from the rest of the applications by implementing hundreds of REST interfaces in what essentially was a distributed DAO pattern in front of a very large enterprise database. The project ended up exactly the way I predicted - a dismal failure.
Microservices aren't about turning a monolithic application into a distributed monolithic application. In my example project above, the monolith was turned into a stove-piped, fragile, chaotic mess, with the coupling only moved to service contracts instead of Java class method signatures, and with a performance hit so bad the application was unusable. Last I heard they are still running their original monolith.
Microservices should be more of a vertical partitioning of your application and not a horizontal one. In my opinion it's better to think in terms of business function partitioning rather than "converting" an existing monolith. There's no rule that determines how big a microservice must be, but it should be big enough to do one complete synchronous function without needing to directly depend on outside services (as much as possible) to complete its work. If a microservice performs a complex business function that affects 50 tables, so be it! It owns those many tables. Ideally if a service goes down, it should affect only that business functionality it's responsible for, and not directly affect other services. As you can see, this thinking is the complete opposite from that which produced the distributed mess in my project example.
Not only do you need to ensure that the motivation behind replacing monoliths with microservices is sound, but also you need to step outside the monolith and revisit the actual business and begin partitioning that instead. Like everything else, baby steps are the way to go. Start with one small complete business function, and convert that into a single microservice instead of trying to replace a monolith all at once.

Resources