Microservices - Maintaining Multiple Data stores, initial data load etc

Microservices - Maintaining Multiple Data stores, initial data load etc - microservices

On aspects of granulatiry of mictoservices have read about the 2 pizza rule, services that can be developed in 2 weeks etc. When the case studies of amazon, nelflix, gilt are read we hear about 100s of services. While the service granularity does make sense, what is still not clear to me is about the data stores of each of these microservices. Will there not be just too many data stores if each of the services store/maintain their own data ?? It might be the same logical entity like a product, customer etc that is sliced & the relevant portion/attributes stored/maintained by a corresponding microservice. There could be a service that maintains basic customer information, another that maintains the additional customer information like say his subscription information or his interests etc.
Couple of questions that come to mind around the data stores
Will this not be a huge maintenance issue in terms of backups,
restores etc?
How is the initial data populated into these stores ? Are there any best practices around this ? Organisations are bound to have huge volumes of customer or product data & they will most likely be mastered in other systems.
How does this approach of multiple data stores impact the 'omni-channel' approach where it implies getting a single view of all data? Organizations might have had data consolidation initiatives going on to achieve the same
Edit: Edited the subject a bit

1.Will this not be a huge maintenance issue in terms of backups, restores etc?
From your view yes it will. I mean at the end of day you will not have just one database server to backup but tens or hundreds of them. But mostly people -at least that is what we do - is using a cloud database service to get rid of all these maintenance effort.
2.How is the initial data populated into these stores ? Are there any best practices around this ? Organisations are bound to have huge volumes of customer or product data & they will most likely be mastered in other systems.
I am not sure if there is a best way but we created a client to read the data from legacy system then convert and split it into the parts for each microservices and push them to those microservices by consuming their services. We used message queues to to be sure about health of migration.
3.How does this approach of multiple data stores impact the 'omni-channel' approach where it implies getting a single view of all data? Organizations might have had data consolidation initiatives going on to achieve the same.
Well I don't know what "omni-channel" is so I can't answer that.
Lastly you were mentioning about logical entities shared between services. The real hardest part about implementing microservices is defining what each service will provide. And while doing that you should carefully examine data needs for each services and those services should share as little as possible like only entity ids etc. At least that is what we are doing.

Related

Is it possible to replicate tables from multiple databases in Google Cloud?

The company that I work at uses a microservices architecture with the 'database per service' pattern. This pattern makes it harder to query based on data from multiple services, since each service has its own database. Imagine a service for managing your products and one for managing stock. You would have to somehow combine the data from both services to query for products based on stock.
I know that event sourcing and API composition are potential solutions to the problem, but I was wondering if it is possible to continuously replicate specific tables from the product and stock databases based on database transaction logs. Wouldn't this be much simpler than say implementing an event based solution like event sourcing? One service that I am working with contains a lot of domain events, which would make implementing and maintaining event-based solution rather complex.
Another reason for why I am considering to look at the problem from a different angle is that there is a lot of data. In-memory joins with say API composition will most likely be slow.
To sum it all up, I would like to know if it is possible to continuously replicate specific tables from different databases into one database.
The technologies that my company uses are primarily Spring Framework and PostgreSQL.

I would step back and ask why you have microservices (including why you have multiple databases). This is because it's quite easy to make choices that are superficially easy but which achieve that ease by negating the reason you had the microservices to begin with, and in such a situation, it may in fact be easier to just not do microservices.
For example, you might be doing microservices because you want to be able to have the team maintaining your product service be able to make changes without coordinating with the stock service or vice versa. By setting up a direct replication of a table from service A's database into service B's database, you essentially require many changes service A might want to make to that table to be coordinated with service B. It's perhaps less operationally coupled than unifying the services into a monolith, but in terms of developer velocity, you're giving up a fair amount.
Alternatively, if the rationale is to allow one service to be down (failures, maintenance, releases: doesn't matter) without taking the others down, a replication which guarantees strong consistency implies that taking service B's database down prevents service A from updating its database (because if you allowed service A to update its database in that situation, you couldn't have strong consistency).
Rather than direct replication, it might make sense to use change data capture (e.g. with Debezium) to publish a stream of changes from the transaction logs (e.g. to Kafka). The critical difference from logical replication is that the consumer can, for instance, choose to ignore updates to columns it doesn't care about: the stock service might include details like where things are stocked in a warehouse, for instance, which is data you don't need for answering a query like "show me the products in this category which are in stock". This can be a nice middle ground between going full event-sourcing and other approaches.

Microservices "JOINS"

Let's say we want to create the app with microservices.
We have some page where we display some items (products).
These products have multiple joins(categories, tags, users, and so on).
If users, categories data are within another services, how can we manage and filter the results?
For example in SQL you create 3,4 joins and get.
With microservices - I have to filter the categories, then filter tags and then products - this could be 10 time slower than the speed of the SQL query.
Also if I have table "products_categories" which set categories for each product which service is responsible for that? Product service or Category service ?
Thank you

In Microservices architecture there are two ways to deal with it.
The API composition pattern— This is the simplest approach and should be used whenever possible. It works by making clients of the services that own the data responsible for invoking the services and combining the results.
The Command query responsibility segregation (CQRS) pattern— This is more powerful than the API composition pattern, but it’s also more complex. It maintains one or more view databases whose sole purpose is to support queries.
I will prefer to use CQRS, Define a view database, which is a read-only replica to support specifically that query. The rest of the services keeps the replica up to date by subscribing to (create, update, insert)events published by the data owner services.

This is a very standard problem whenever any micro-service is built.. People just always feel micro-service is the solution for everything which is not true.
Solution to this problem is designing better. Designing so that there is a balance between performance and redundancy of data. Higher performance ( lower latency numbers ) means more duplicacy of data across different databases of microservice. You should not target to achieve performance as good as SQL Joins ; but also do not duplicate data too much. A balance is needed..
Most importantly, dividing the requirement into right set of micro-services is needed.

I assume you created a "microservice" per database table. Those are not microservices, those are just HTTP-based CRUD interfaces to your database.
First, know why you need microservices. (Is there an actual reason?) Second, you have to create microservices that encompass at least one full (business) functionality for your software. Meaning it doesn't need other services to do it.
If you need a table that needs data from multiple microservices, you by definition made wrong microservices. If a microservice can't provide it's own UI without the help of other services, it doesn't fully contain it's own functionality.

What's stopping you from having multiple services for reading / writing to the same database / table? For example:
One service to write to categories
One service to write to tags
One service to write to products
You could then write another service to read from all three of these services, however, this might not be at a HTTP level, instead you could read from the same database within your read service and leverage the power of SQL.
The service that reads could encompass your join logic which would mean you wouldn't need to consume the other services around it.

Distributed database design style for microservice-oriented architecture

I am trying to convert one monolithic application into micro service oriented architecture style. Back end I am using spring , spring boot frameworks for development. Front-end I am using angular 2. And also using PostgreSQL as database.
Here my confusion is that, when I am designing my databases as distributed, according to functionalities it may contain 5 databases. Means I am designing according to vertical partition. Then I am thinking to implement inter-microservice communication services to achieve the entire functionality.
The other way I am thinking that to horizontally partition the current structure. So my domain is based on some educational university. So half of university go under one DB and remaining will go under another DB. And deploy services according to Two region (two for two set of university).
Currently I am decided to continue with the last mentioned approach. I am new to these types of tasks, since it referring some architecture task. Also I am beginner to this microservice and distributed database world. Would someone confirm that my approach will give solution to my issue? Can I continue with my second approach - horizontal partitioning of databases according to domain object?

Can I continue with my second approach - Horizontal partitioning of
databases according to domain object?
Temporarily yes, if based on that you are able to scale your current system to meet your needs.
Now lets think about why on the first place you want to move to Microserices as a development style.
Small Components - easier to manager
Independently Deployable - Continous Delivery
Multiple Languages
The code is organized around business capabilities
and .....
When moving to Microservices, you should not have multiple services reading directly from each other databases, which will make them tightly coupled.
One service should be completely ignorant on how the other service designed its internal structure.
Now if you want to move towards microservices and take complete advantage of that, you should have vertical partition as you say and services talk to each other.
Also while moving towards microservices your will get lots and lots of other problems. I tried compiling on how one should start on microservices on this link .
How to separate services which are reading data from same table:
Now lets first create a dummy example: we have three services Order , Shipping , Customer all are three different microservices.
Following are the ways in which multiple services require data from same table:
Service one needs to read data from other service for things like validation.
Order and shipping service might need some data from customer service to complete their operation.
Eg: While placing a order one will call Order Service API with customer id , now as Order Service might need to validate whether its a valid customer or not.
One approach Database level exposure -- not recommened -- use the same customer table -- which binds order service to customer service Impl
Another approach, Call another service to get data
Variation - 1 Call Customer service to check whether customer exists and get some customer data like name , and save this in order service
Variation - 2 do not validate while placing the order, on OrderPlaced event check in async from Customer Service and validate and update state of order if required
I recommend Call another service to get data based on the consistency you want.
In some use cases you want a single transaction between data from multiple services.
For eg: Delete a customer. you might want that all order of the customer also should get deleted.
In this case you need to deal with eventual consistency, service one will raise an event and then service 2 will react accordingly.
Now if this answers your question than ok, else specify in what kind of scenario multiple service require to call another service.
If still not solved, you could email me on puneetjindal.11#gmail.com, will answer you

Currently I am decided to continue with the last mentioned approach.
If you want horizontal scalability (scaling for increasingly large number of client connections) for your database you may be better of with a technology that was designed to work as a scalable, distributed system. Something like CockroachDB or NoSQL. Cockroachdb for example has built in data sharding and replication and allows you to grow with adding server nodes as required.
when I am designing my databases as distributed, according to functionalities it may contain 5 databases
This sounds like you had the right general idea - split by domain functionality. Here's a link to a previous answer regarding general DB design with micro services.

In the Microservices world, each Microservice owns a set of functionalities and the data manipulated by these functionalities. If a microservice needs data owned by another microservice, it cannot directly go to the database maintained/owned by the other microservice rather it would call an API exposed by the other microservice.
Now, regarding the placement of data, there are various options - you can store data owned by a microservice in a NoSQL database like MongoDB, DynamoDB, Cassandra (it really depends on the microservice's use-case) OR you can have a different table for each micro-service in a single instance of a SQL database. BUT remember, if you choose a single instance of a SQL Database with multiple tables, then there would be no joins (basically no interaction) between tables owned by different microservices.
I would suggest you start small and then think about database scaling issues when the usage of the system grows.

Oracle Materialized View for sensory data transfer

In an application we have to send sensory data stream from multiple clients to a central server over internet. One obvious solution is to use MOMs (Message Oriented Middlewares) such as Kafka, but I recently learned that we can do this with data base synchronization tools such as oracle Materialized View.
The later approach works in some application (sending data from a central server to multiple clients, inverse directin of our application), but what is the pros and cons of it in our application? Which one is better for sending sensory data stream from multiple (~100) clients to server in terms of speed, security, etc.?
Thanks.
P.S.
For more detail consider an application in which many (about 100) clients have to send streaming data (1MB data per minute) to a central server over internet. The data are needed in server for the sake of online monitoring, analysis and some computation such as machine learning and data mining tasks.
My question is about the difference between db-to-db connection and streaming solutions such as kafka for trasfering data from clients to server.

Prologue
I'm going to try and break your question down into in order to get a clearer understanding of your current requirements and then build it back up again. This has taken a long time to write so I'd really appreciate it if you do two things off the back of it:
Be sceptical - there's absolutely no substitute for testing things yourself. The internet is very useful as a guide but there's no guarantee that the help you receive (if this answer is even helpful!) is the best thing for your specific situation. It's impossible to completely describe your current situation in the space allotted and so any answer is, of necessity, going to be lacking somewhere.
Look again at how you explained yourself - this is a valid question that's been partially stopped by a lack of clarity in your description of the system and what you're trying to achieve. Getting someone unfamiliar with your system to look over your question before posting a complex question may help.
Problem definition
sensory data stream from multiple clients to a central server
You're sending data from multiple locations to a single persistence store
online monitoring
You're going to be triggering further actions based off the raw data and potentially some aggregated data
analysis and some computation such as machine learning and data mining tasks
You're going to be performing some aggregations on the clients' data, i.e. you require aggregations of all of the clients' data to be persisted (however temporarily) somewhere
Further assumptions
Because you're talking about materialized views we can assume that all the clients persist data in a database, probably Oracle.
The data coming in from your clients is about the same topic.
You've got ~100 clients, at that amount we can assume that:
the number of clients might change
you want to be able to add clients without increasing the number of methods of accessing data
You don't work for one of Google, Amazon, Facebook, Quantcast, Apple etc.
Architecture diagram
Here, I'm not making any comment on how it's actually going to work - it's the start of a discussion based on my lack of knowledge of your systems. The "raw data persistence" can be files, Kafka, a database etc. This is description of the components that are going to be required and a rough guess as to how they will have to connect.
Applying assumed architecture to materialized views
Materialized views are a persisted query. Therefore you have two choices:
Create a query that unions all 100 clients data together. If you add or remove a client you must change the query. If a network issue occurs at any one of your clients then everything fails
Write and maintain 100 materialized views. The Oracle database at your central location has 100 incoming connections.
As you can probably guess from the tradeoffs you'll have to make I do not like materialized views as the sole solution. We should be trying to reduce the amount of repeated code and single points of failure.
You can still use materialized views though. If we take our diagram and remove all the duplicated arrows in your central location it implies two things.
There is a single service that accepts incoming data
There is a single service that puts all the incoming data into a single place
You could then use a single materialized view for your aggregation layer (if your raw data persistence isn't in Oracle you'll first have to put the data into Oracle).
Consequences of changes
Now we've decided that you have a single data pipeline your decisions actually become harder. We've decoupled your clients from the central location and the aggregation layer from our raw data persistence. This means that the choices are now yours but they're also considerably easier to change.
Reimagining architecture
Here we need to work out what technologies aren't going to change.
Oracle databases are expensive and you're pushing 140GB/day into yours (that's 50TB/year by the way, quite a bit). I don't know if you're actually storing all the raw data but at those volumes it's less likely that you are - you're only storing the aggregations
I'm assuming you've got some preferred technologies where your machine learning and data mining happen. If you don't then consider getting some to prevent madness supporting everything
Putting all of this together we end up with the following. There's actually only one question that matters:
How many times do you want to read your raw data off your database.
If the answer to that is once then we've just described middleware of some description. If the answer is more than once then I would reconsider unless you've got some very good disks. Whether you use Kafka for this middle layer is completely up to you. Use whatever you're most familiar with and whatever you're most willing to invest the time into learning and supporting. The amount of data you're dealing with is non-trivial and there's going to be some trial and error getting this right.
One final point about this; we've defined a data pipeline. A single method of data flowing through your system. In doing so, we've increased the flexibility of the system. Want to add more clients, no need to do anything. Want to change the technology behind part of the system, as long as the interface remains the same there's no issue. Want to send data elsewhere, no problem, it's all in the raw data persistence layer.

Mongodb strategy for Multi-Company web app

I am developing a web app in Meteor, with Mongo, that will be running on cloud. Each user must belong to a Company.
Each Company can only access it's own data.
Each user can access it's own data and some data shared with other users of the same company.
Imagine 1.000 companies and 100 users per company, it could get very bad in performance and secutiry, if I use 1 Mongodb database for whole app.
So, because Mongo is "Schema-less and Database-less" I think I can define 1.000 dbs, lets say db_0001, db_0002, ... with same name collections, lets say tasks, messages, ..., so the app can be efficient and more secure (same code for every Company and isolation of data).
Also, on hosting side (let's say for example with Digital Ocean), I think its easier to distribute the dbs if the are already atomized.
Is this a good approach? Or should I not worry about it and let the hosting do this job?
Any thoughts are wellcome.

You are currently only looking at one side of the coin. That's fine to start with.
Think about how you are going to be displaying that data and what query does it translate to. Do a thorough due diligence on all the potential query. For example, how often would user/getbyid be called and how often would you have to show a user their info and their relationship with other users. What other meta data would be required beside user info, would you have to perform a join to get that data? or is it stored as an embedded document? What fields are you going to be searching and sorting by most? Which types of data are write heavy and what are read heavy?
Now lets get back to your database shading approach. It's great that you are thinking ahead of time on this front rather than having to rewrite your component later. Data volume/storage does not worry me here. How many concurrent users would be using at application and what are primary use cases should be the first place to look at to think about scale.
Additionally, you need to understand the nature of the business and project growth. Is it like Instragram type of hyper growth? or is it more predictable. A big Mongo cluster can handle thousands of concurrent read/write requests (assuming your design and query are optimized) so that does not bother me. If you want to keep it flexible MongoDB has a sharding mechanism and you can shard on a key and it takes care all the fancy stuff for ya.
MongoDB has eventual consistency (look up MongoDB CAP theorem) if you enable read from secondaries and you have a high volume business critical app you need to be careful because you can be reading out of date result.
As far as hosting is concerned, DO is fine but always have a backup in another region to maintain geographic redundancy so in case if a region goes down (Hello AWS!) you have something to fall back on.
Good luck on your project!

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio