As per my understanding -
CQRS is like separating reads from writes. But we can do the same with a master-slave architecture as well, assuming the master handles writes and the slaves handle read requests.
So I want to understand when should we use CQRS and master-slave architecture?
A good starting point would be Greg Young (2010):
CQRS is simply the creation of two objects where there was previously only one.
When most people talk about CQRS they are really speaking about applying the CQRS pattern to the object that represents the service boundary of the application.
This separation, however, enables us to do many interesting things architecturally; the largest is that it forces a break from the assumption that because the two use the same data they should also use the same data model.
That last point is, to my mind, the key distinction: we're not talking about multiple replicas of the same data model, with some dedicated to supporting reads, but about having the same information stored in two different data models.
A common example would be to use a data model based on event histories for writes, with reads instead being served by generating result sets from an RDBMS.
Master-slave architecture often deals with the physical separation and replication of data, for better availability and reliability, while the data model itself remains the same for the master and the slaves. We can divide the responsibility for reads and writes between master and slaves, but they still deal with the same underlying data model and interfaces.
On the other hand, the CQRS design pattern is more about decoupling the read and write data models and interfaces themselves, allowing them to evolve separately and avoiding an overly complex model that does too much. The read and write operations in CQRS may even be powered by two completely different data stores with very different schemas.
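To make that concrete, here is a minimal sketch (all class and field names are illustrative, not taken from the linked articles) of a write side that records events and a separate read side that serves a denormalized view:

```python
# A minimal sketch of the "two objects where there was previously one":
# a write model that records commands as events, and a separate read model
# that serves a denormalized view. All names here are illustrative.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class OrderWriteModel:
    """Command side: validates commands and appends events to a history."""
    events: List[dict] = field(default_factory=list)

    def place_order(self, order_id: str, amount: float) -> None:
        self.events.append({"type": "OrderPlaced", "id": order_id, "amount": amount})

    def ship_order(self, order_id: str) -> None:
        self.events.append({"type": "OrderShipped", "id": order_id})

@dataclass
class OrderReadModel:
    """Query side: a flat view optimized for reads, rebuilt from the events."""
    status: Dict[str, str] = field(default_factory=dict)

    def apply(self, event: dict) -> None:
        if event["type"] == "OrderPlaced":
            self.status[event["id"]] = "placed"
        elif event["type"] == "OrderShipped":
            self.status[event["id"]] = "shipped"

    def get_status(self, order_id: str) -> str:
        return self.status.get(order_id, "unknown")

# In a real system the read model would be updated asynchronously and could
# live in a completely different store (e.g. an RDBMS fed from the event log).
write, read = OrderWriteModel(), OrderReadModel()
write.place_order("o-1", 49.99)
write.ship_order("o-1")
for event in write.events:
    read.apply(event)
print(read.get_status("o-1"))  # shipped
```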
CQRS:
https://learn.microsoft.com/en-us/azure/architecture/patterns/cqrs
https://martinfowler.com/bliki/CQRS.html
Related
I'm developing a parcel tracking system and thinking about how to improve its performance.
Right now we have one table in postgres named parcels containing things like id, last known position, etc.
Every day about 300,000 new parcels are added to this table. The parcel data is taken from an external API. We need to track all parcel positions as accurately as possible and reduce the time between API calls for a specific parcel.
Given these requirements, what would you suggest for the project architecture?
Right now the only solution I can think of is the producer-consumer pattern, e.g. having one process select all records from the parcels table in an infinite loop and then distribute the data-fetching tasks with something like Celery.
Major downsides of this solution are:
possible deadlocks, as fetching data about the same parcel can be executed at the same time on different machines;
the need to control the queue size.
This is a very broad topic, but I can give you a few pointers. Once you reach the limits of vertical scaling (scaling by picking more powerful machines) you have to scale horizontally (scaling by adding more machines to the same task). So to be able to design scalable architectures you have to learn about distributed systems. Here are some topics to look into:
Infrastructure & processes for hosting distributed systems, such as Kubernetes, Containers, CI/CD.
Scalable forms of persistence. For example different forms of distributed NoSQL like key-value stores, wide-column stores, in-memory databases and novel scalable SQL stores.
Scalable forms of data flows and processing. For example, event-driven architectures using distributed message/event queues.
For your specific problem with packages I would recommend considering a key-value store for your position data. Those can scale to billions of insertions and retrievals per day (when querying by key).
It also sounds like your data is somewhat temporary and could be kept in in-memory hot storage while the package has not been delivered yet (and archived afterwards). A distributed in-memory DB could scale even further in terms of insertions and queries.
Also, you probably want to decouple data extraction (through your API) from processing and persistence. For that you could consider introducing a stream-processing system.
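A rough sketch of that decoupling, assuming Celery for task distribution (which the question already considers) and Redis as an example key-value store; the API URL and response fields are placeholders, not a real tracking API:

```python
# tasks.py -- rough sketch: a Celery worker fetches one parcel's position from
# the external tracking API and stores it by key, decoupled from the scheduler.
# TRACKING_API_URL and the response fields are hypothetical placeholders.
import json
import requests
import redis
from celery import Celery

app = Celery("parcels", broker="redis://localhost:6379/0")
kv = redis.Redis(host="localhost", port=6379, db=1)

TRACKING_API_URL = "https://tracking.example.com/api/parcels/{parcel_id}"

@app.task(rate_limit="50/s")  # throttle calls to the external API
def fetch_position(parcel_id: str) -> None:
    resp = requests.get(TRACKING_API_URL.format(parcel_id=parcel_id), timeout=10)
    resp.raise_for_status()
    data = resp.json()
    # Write the latest known position by key; no table-wide scans, no row locks.
    kv.set(f"parcel:{parcel_id}:position",
           json.dumps({"lat": data["lat"], "lon": data["lon"], "ts": data["ts"]}))

# A separate scheduler process enqueues work instead of looping over the table:
#   for parcel_id in ids_due_for_refresh():
#       fetch_position.delay(parcel_id)
```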
I have 6 services talking to the same SQL Server 2016 database (Payments), where some services are doing write operations and some are doing read operations. The database server holds other databases as well as the Payments database. We do not have any archival job in place on the Payments database. We recently hit 99% CPU usage as well as memory issues on the database server.
Obvious steps I can take include:
Create archival jobs to migrate old data to an archive database.
Scale up the database server.
But I still want to explore other solutions. I have the questions below.
Can we use separate databases for read and write operations? If yes, how?
Can we migrate data on the fly from the RDBMS to a NoSQL database, since it is faster for read operations?
What is the best design for such applications where concurrent read and write operations happen?
Storage is all about trade-offs, so it is extremely tricky to find the correct "storage" solution without diving deep into different aspects such as latency, availability, concurrency, access patterns and security requirements. In this particular case, payments data is being stored, which should be confidential and immediately rules out some storage solutions. In general, you should:
Cache the read data, but if the same data is being modified constantly this will not work. Caching also doesn't work well when your reads are not public (i.e., cannot be reused across multiple read calls, preferably across multiple users), which is possible in this case as we are dealing with payments data.
A read/write master database with read-only slaves is also a common pattern to scale reads (see the sketch after this list). It doesn't scale the writes, though. It again depends on whether the application can tolerate the replication lag.
Sharding is the common pattern to scale writes. It comes with the additional burden of cross-node query aggregation, etc. (in some databases).
Finally, based on the data access pattern, refactor the schema and employ different databases. CQRS (Command Query Responsibility Segregation) is one way to achieve this, but it has its own pros and cons. For more details: https://learn.microsoft.com/en-us/azure/architecture/patterns/cqrs
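For question 1, a minimal sketch of the read/write split (assuming SQLAlchemy; the connection URLs and replica count are hypothetical, and reads will lag the primary by the replication delay):

```python
# A minimal sketch of read/write splitting: writes go to the primary,
# reads are spread round-robin over replicas. The URLs are hypothetical.
import itertools
from sqlalchemy import create_engine, text

primary = create_engine("mssql+pyodbc://payments-primary/Payments?driver=ODBC+Driver+17+for+SQL+Server")
replicas = [
    create_engine("mssql+pyodbc://payments-replica1/Payments?driver=ODBC+Driver+17+for+SQL+Server"),
    create_engine("mssql+pyodbc://payments-replica2/Payments?driver=ODBC+Driver+17+for+SQL+Server"),
]
_replica_cycle = itertools.cycle(replicas)

def execute_write(sql: str, params: dict | None = None):
    # Writes run in a transaction on the primary.
    with primary.begin() as conn:
        return conn.execute(text(sql), params or {})

def execute_read(sql: str, params: dict | None = None):
    # Reads hit a replica; results may lag the primary by the replication delay.
    with next(_replica_cycle).connect() as conn:
        return conn.execute(text(sql), params or {})
```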
A few years back I read this book, which helped me immensely in understanding these concepts: https://www.amazon.com/Scalability-Startup-Engineers-Artur-Ejsmont/dp/0071843655
Is sharing a database between multiple serverless functions good practice?
Like in a CRUD application, the Create, Update, Delete and Read operations normally share the same database. If we carry that idea over to serverless, is it still ideal to have all of those operations accessing the same database?
My hesitation comes from the idea of sharing databases between different microservices, since that increases coupling and makes things more fragile.
The answer to this question is dependent on the circumstances of both the database and the operations.
These concerns can be boiled down to the following characteristics of the database:
Can it handle concurrency? This is the number one reason that can stop a network of serverless functions from operating on the database. For example, S3 buckets cannot coordinate concurrent writes to the same object, so a workaround such as Firehose or an SQS queue would need to be implemented in order for multiple Lambdas to operate simultaneously (see the sketch after this list). DynamoDB, on the other hand, can handle concurrency with no problem.
Is it a transactional or analytical database? This limits how fast reads vs. writes take place, and if they're too slow, your Lambdas will get much slower. This means that if, for example, writes are slow, then they should be done in large batches, not in small increments from multiple instances.
What are its limits on operation frequency? There can be limits on both sides of the equation. Lambdas by default have a maximum concurrency of 1,000 per region. Databases often also have limits on how many operations can take place at the same time.
In most cases, the first two bullets are most important, since limitations normally are not reached except for a few rare cases.
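As an illustration of the SQS workaround mentioned in the first bullet (a rough sketch using boto3; the table name, queue wiring and message shape are hypothetical):

```python
# A rough sketch of the SQS workaround: many producers enqueue messages,
# and an SQS-triggered Lambda consumer writes them to DynamoDB in batches.
# The table name ("payments") and message fields are hypothetical.
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("payments")

def handler(event, context):
    # For an SQS trigger, event["Records"] is a batch of queued messages.
    with table.batch_writer() as batch:
        for record in event["Records"]:
            item = json.loads(record["body"])
            batch.put_item(Item=item)
    return {"written": len(event["Records"])}
```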
We are moving our database from Oracle to CouchDB. One of the use cases is to implement distributed transaction management.
For example: read the data from a JMS queue and update multiple documents; if anything fails, roll back and throw an exception back to the JMS queue.
As we know, CouchDB does not support distributed transaction management.
Can you please suggest an alternative strategy to implement this, or any other way out?
More than the technical aspects, I feel you might be interested in the bottom line of this.
As mentioned, distributed transactions are not possible - the notion doesn't even exist, because it is not necessary. Indeed, unlike in the relational world, 95% of the time when you feel that you need them it means that you are doing something wrong.
I'll be straightforward with you: dumping your relational data into CouchDB as-is will end up being a nightmare for both writes and reads. For the former you'll ask: how can I do transactions? For the latter: how can I do joins? Both are impossible and are concepts which do not even exist.
The convenient conclusion too many people reach is that "CouchDB is not enterprise-ready or ACID enough". But the truth is that you need to take the time to rethink your data structures.
You need to rethink your data structures and make them document-oriented, because if you don't you are off the intended usage of CouchDB - and as you know that is risky territory.
Read up on DDD and aggregate design, and turn your records into DDD entities and aggregates; there would then be an ETL layer into CouchDB. If you don't have the time to do that I'd recommend not using CouchDB - as much as I love it.
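As a concrete (hypothetical) illustration of that: rows that would span orders, order_lines and payments tables in Oracle become one aggregate document, so an update that needed a multi-row transaction becomes a single, atomic document write:

```python
# A hypothetical order aggregate: what used to be rows in orders, order_lines
# and payments tables becomes one document, updated atomically as a whole.
import requests

order_aggregate = {
    "_id": "order:2024-000123",
    "type": "order",
    "status": "paid",
    "lines": [
        {"sku": "ABC-1", "qty": 2, "unit_price": 9.99},
        {"sku": "XYZ-7", "qty": 1, "unit_price": 24.50},
    ],
    "payment": {"method": "card", "amount": 44.48, "captured": True},
}

# CouchDB guarantees atomicity per document, so this single PUT either
# fully succeeds or fully fails -- no cross-document transaction needed.
# The database name, host and credentials are placeholders.
resp = requests.put(
    "http://localhost:5984/orders/order:2024-000123",
    json=order_aggregate,
    auth=("admin", "admin"),
)
resp.raise_for_status()
```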
CouchDB doesn't have the properties necessary for distributed transactions, so it's impossible. All major distributed transaction algorithms (the two-phase commit protocol, RAMP and Percolator-style distributed transactions; you can find details in this answer) require linearizability at the record level. Unfortunately, CouchDB is an AP solution (in the CAP-theorem sense), so it can't even guarantee record-level consistency.
Of course you can disable replication to make CouchDB consistent, but then you'll lose fault tolerance. Another option is to use CouchDB as storage and build a consistent database on top of it, but that is overkill for your task and doesn't use any CouchDB-specific feature. The third option is to use CRDTs, but that works only if your transactions are commutative.
I need to design a system which has these basic components:
A Webserver which will be getting ~100 requests/sec. The webserver only needs to dump data into raw data repository.
Raw data repository which has a single table which gets 100 rows/s from the webserver.
A raw data processing unit (Simple processing, not much. Removing invalid raw data, inserting missing components into damaged raw data etc.)
Processed data repository
Does it make sense in such a system to have a service layer on which all components would be built? All inter-component interaction will go through the service layers. While this would make the system easily upgradeable and maintainable, would it not also have a significant performance impact since I have so much traffic to handle?
Here's what can happen unless you guard against it.
In the communication between layers, some format is chosen, like XML. Then you build it and run it and find out the performance is not satisfactory.
Then you mess around with profilers which leave you guessing what the problem is.
When I worked on a problem like this, I used the stackshot technique and quickly found the problem. You would have thought it was I/O. NOT. It was that converting data to XML, and parsing XML to recover data structure, was taking roughly 80% of the time. It wasn't too hard to find a better way to do that. Result - a 5x speedup.
What do you see as the costs of having a separate service layer?
How do those costs compare with the costs you must incur? In your case that seems to be at least
a network read for the request
a database write for raw data
a database read of raw data
a database write of processed data
Plus some data munging.
What sort of services do you have in mind? Perhaps:
saveRawData()
getNextRawData()
writeProcessedData()
Why is the overhead any more than a procedure call? A service does not need to imply "separate process" or "web service marshalling".
I contend that structure is always of value; separation of concerns in your application really matters. In comparison with database activities, a few procedure calls will rarely cost much.
In passing: the persisting of raw data might best be done via a queuing system. You can then get some natural scaling by having many queue readers on separate machines if you need them. In effect, the queuing system naturally introduces some service-like concepts.
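To make the "no more than a procedure call" point concrete, here is a rough sketch using the service names from the list above; the queue and the processed-data store are stand-ins for whatever infrastructure you actually pick:

```python
# A rough sketch: the "service layer" is just procedure calls; the raw data
# sits in a queue so extra consumer machines can be added later if needed.
import queue

raw_queue: "queue.Queue[dict]" = queue.Queue()
processed_store: list[dict] = []  # placeholder for the processed-data repository

def saveRawData(record: dict) -> None:
    """Called by the webserver: just enqueue, no heavy work on the request path."""
    raw_queue.put(record)

def getNextRawData() -> dict:
    """Called by the processing unit: blocks until a raw record is available."""
    return raw_queue.get()

def writeProcessedData(record: dict) -> None:
    """Persist the cleaned-up record; here just an in-memory placeholder."""
    processed_store.append(record)

def process_forever() -> None:
    # Processing loop (one consumer; run more consumers to scale out).
    while True:
        raw = getNextRawData()
        if raw.get("value") is not None:          # drop invalid raw data
            raw.setdefault("source", "unknown")   # fill in a missing component
            writeProcessedData(raw)
        raw_queue.task_done()
```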
I personally feel that you might be focusing too much on low-level implementation details when designing the system. Before looking at how to lay out the components, assemblies or services, you should be thinking about how to architect the system.
You could start with the following high-level points to build your system architecture around:
Confirm the technical skill set of the development team and the operations/support team.
Agree on an initial finite list of systems that will integrate to your service, the protocols they support and some SLAs.
Decide on the messaging strategy.
Understand how you will deploy your service/system.
Decide on the choice of middleware (ESBs, Message Brokers, etc), databases (SQL, Oracle, Memcache, DB2, etc) and 3rd party frameworks/tools.
Decide on your caching and data latency strategy.
Break your application into the various areas of business responsibility - This will allow you to split up the work and allow easier communication of milestones during development/testing and implementation.
Design each component as required to meet its areas of responsibility. The areas of responsibility should automatically lead you to decide how to design each component, assembly or service.
Obviously not all of the above will match your specific case but I would suggest that they should at least be given some thought.
Good luck.
Abstraction and tiering will introduce latency, but the real question is, what are you GAINING to make the cost(s) worthwhile? Loose coupling, governance, scalability, maintainability are worth real $.
Even the best-designed layered app will exhibit more latency than an app talking directly to a DB. Users who know the original system will feel the difference. They may not like it, so this can be a political issue as much as a technical one.