Guidance on patterns and recommendations for achieving database atomicity in a distributed (microservices) architecture

Folks, I am evaluating patterns and practices for a key challenge we are facing in a distributed (microservices) architecture: maintaining database atomicity across multiple tables.
Atomicity, reliability and scale are all critical for the business (this is probably common across businesses, I'm just putting it out there).
I have read a few articles about achieving this, but every approach comes at a significant cost and with trade-offs that I am not ready to make.
I have also read a couple of SO questions, and one concept, SAGA, seems interesting, but I don't think our legacy database is meant to handle it.
So here I am asking the experts for their opinions, guidance and past experience, so I can save the time and effort of trying out a bunch of options myself.
Appreciate your time and effort.

CAP theorem
The CAP theorem is key when it comes to distributed systems. Start there to decide whether you favor availability or consistency when a network partition occurs.
Distributed transactions
You are right: there are trade-offs involved and there is no single right answer, and distributed transactions are no different. In a microservices architecture, atomicity is not easy to achieve. Normally we design microservices with eventual consistency in mind. Strong consistency is very hard to achieve and there is no simple solution for it.
2PC: It is easy to achieve atomicity using two-phase commit, but that option is not a good fit for microservices. Your system can't scale, because if any of the microservices goes down your transaction will hang in an abnormal state, and locks are very common with this approach.
SAGA: This is the most widely accepted and scalable approach. You commit the local transaction (atomically); once that is done you publish an event, and all interested services consume the event and update their own local databases. If there is an exception, or a particular microservice can't accept the event data, it raises a compensating transaction, which means you have to reverse the actions taken by all microservices in response to that event.
I don't get the legacy DB part. What makes you think a legacy DB will have a problem? SAGA has nothing to do with legacy systems. It simply comes down to whether a service can accept the event or not. If yes, it saves it into its database. If not, it raises a compensating transaction so all the other services can undo their changes.
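To make the compensation idea concrete, here is a minimal sketch in Python. It is an assumption-laden illustration, not a production saga: it runs the steps in-process in a fixed order (orchestration style) instead of through published events, and the names `Saga`, `StepFailed` and `place_order` are hypothetical. The point it shows is the same, though: each step is a local transaction paired with an action that undoes it.

```python
from typing import Callable, List, Tuple


class StepFailed(Exception):
    """Raised when a participant cannot accept the work it was asked to do."""


class Saga:
    """Runs local transactions in order; on failure, compensates in reverse."""

    def __init__(self) -> None:
        self._steps: List[Tuple[Callable[[], None], Callable[[], None]]] = []

    def add_step(self, action: Callable[[], None],
                 compensation: Callable[[], None]) -> None:
        self._steps.append((action, compensation))

    def run(self) -> bool:
        compensations: List[Callable[[], None]] = []
        for action, compensation in self._steps:
            try:
                action()
            except StepFailed:
                # Undo every step that already committed, newest first.
                for undo in reversed(compensations):
                    undo()
                return False
            compensations.append(compensation)
        return True


def place_order(stock: dict, balance: dict, courier_available: bool) -> bool:
    """Hypothetical three-service flow: inventory, payment, shipping."""

    def reserve() -> None:
        if stock["widgets"] < 1:
            raise StepFailed("out of stock")
        stock["widgets"] -= 1

    def release() -> None:
        stock["widgets"] += 1

    def charge() -> None:
        if balance["alice"] < 10:
            raise StepFailed("insufficient funds")
        balance["alice"] -= 10

    def refund() -> None:
        balance["alice"] += 10

    def ship() -> None:
        if not courier_available:
            raise StepFailed("no courier available")

    saga = Saga()
    saga.add_step(reserve, release)
    saga.add_step(charge, refund)
    saga.add_step(ship, lambda: None)  # nothing to undo if shipping never started
    return saga.run()
```

Note that nothing here needs special database support: each service only needs ordinary local transactions plus a way to undo its own change, which is why a legacy database is usually not the blocker.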
What's the right approach?
Well, it ultimately depends on you. There are many patterns around for persisting transactions. Have a look at CQRS and the event sourcing pattern, which is used to save all the domain events. Since distributed transactions can be complex, CQRS helps with many problems, e.g. dealing with eventual consistency.
Hope that helps! Shoot me questions if you have any.

One possible option is Command Query Responsibility Segregation (CQRS) - maintain one or more materialized views that contain data from multiple services. The views are kept up to date by services that subscribe to the events that each service publishes when it updates its data. For example, the online store could implement a query that finds customers in a particular region and their recent orders by maintaining a view that joins customers and orders. The view is updated by a service that subscribes to customer and order events.
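A rough sketch of such a view in Python follows. The event shapes (`CustomerUpdated`, `OrderPlaced`) and the class name `CustomerOrdersView` are assumptions made up for illustration; in a real system the events would arrive from a message broker and the view would live in its own store.

```python
from collections import defaultdict


class CustomerOrdersView:
    """Materialized view joining customers and orders, fed by events."""

    def __init__(self):
        self.customers = {}               # customer_id -> {"name", "region"}
        self.orders = defaultdict(list)   # customer_id -> order ids, oldest first

    def apply(self, event):
        # Hypothetical event shapes; the owning services publish these.
        if event["type"] == "CustomerUpdated":
            self.customers[event["customer_id"]] = {
                "name": event["name"],
                "region": event["region"],
            }
        elif event["type"] == "OrderPlaced":
            self.orders[event["customer_id"]].append(event["order_id"])

    def recent_orders_in_region(self, region, limit=2):
        # The query reads only the view; it never calls the other services.
        return {
            customer["name"]: self.orders.get(customer_id, [])[-limit:]
            for customer_id, customer in self.customers.items()
            if customer["region"] == region
        }


def build_demo_view():
    view = CustomerOrdersView()
    for event in [
        {"type": "CustomerUpdated", "customer_id": 1, "name": "Ada", "region": "EU"},
        {"type": "CustomerUpdated", "customer_id": 2, "name": "Bob", "region": "US"},
        {"type": "OrderPlaced", "customer_id": 1, "order_id": "o1"},
        {"type": "OrderPlaced", "customer_id": 1, "order_id": "o2"},
        {"type": "OrderPlaced", "customer_id": 2, "order_id": "o3"},
        {"type": "OrderPlaced", "customer_id": 1, "order_id": "o4"},
    ]:
        view.apply(event)
    return view
```

The view is eventually consistent: it reflects whatever events it has processed so far, which is the trade-off this approach accepts.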


How to approach event sourcing with millions of records

We're looking at implementing event sourcing / CQRS and for 95% of our system I can reason about the events and it doesn't scare me.
On the other hand, we also have a requirement where customers can insert data for millions of records in one go. A large portion of them can be updated in one go as they move location etc or have batch level details updated. It also needs to be reversed if they change their mind moments after.
Each record relates to a physical entity in the real world and it's important that the read model is updated quickly and the audit trail preserved at all costs for each record.
I can't seem to find any advice on how to handle these volumes. Are you supposed to write an event for every single record and action and just accept that it's going to be computationally / Database expensive? Are there any case studies that have similar requirements?
Any guidance is appreciated.
Are you supposed to write an event for every single record and action and just accept that it's going to be computationally / Database expensive?
A potentially helpful heuristic -- how would you do it with a version control system? Would you create an empty document, and then introduce a million commits, or would you have a single Data Imported commit, and go from there?
An important consideration to notice is that the authority for the data is somewhere else. "Physical entities in the real world" are not subject to the rules of your domain model; what you have there is a big pile of reference data.
It can help to think in processes -- what you have is an import reference data process, that has a relatively small number of immediate steps, and independently some "do interesting things with each record" which may turn out to be millions of little processes with some small number of events.
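One way to picture the "single Data Imported commit" idea is the sketch below. It is only an illustration under stated assumptions: the event names (`DataImported`, `ImportReverted`, `RecordMoved`) and the functions `project` and `history` are hypothetical, and the imported rows are kept inline in the event, whereas with millions of rows you would more likely store them elsewhere and have the event point at them.

```python
def project(events):
    """Fold the event log into a read model: record_id -> location."""
    # An import that was later reverted contributes nothing to the read model.
    reverted = {e["batch_id"] for e in events if e["type"] == "ImportReverted"}
    records = {}
    for event in events:
        if event["type"] == "DataImported" and event["batch_id"] not in reverted:
            records.update(event["rows"])  # one event covers the whole batch
        elif event["type"] == "RecordMoved" and event["record_id"] in records:
            records[event["record_id"]] = event["location"]
    return records


def history(events, record_id):
    """Audit trail for one record: the batch events that touched it plus its own."""
    trail = []
    for event in events:
        if event["type"] == "DataImported" and record_id in event["rows"]:
            trail.append(event["type"])
            batch_id = event["batch_id"]
            # Reverting a batch is part of every contained record's history.
            trail.extend(
                e["type"] for e in events
                if e["type"] == "ImportReverted" and e["batch_id"] == batch_id
            )
        elif event["type"] == "RecordMoved" and event["record_id"] == record_id:
            trail.append(event["type"])
    return trail


EVENTS = [
    {"type": "DataImported", "batch_id": "b1", "rows": {"r1": "dock", "r2": "dock"}},
    {"type": "RecordMoved", "record_id": "r1", "location": "aisle 4"},
    {"type": "DataImported", "batch_id": "b2", "rows": {"r3": "yard"}},
    {"type": "ImportReverted", "batch_id": "b2"},
]
```

Here two imports and one revert cost three events in total regardless of batch size, and only the record that actually did something interesting (`r1`) gets an event of its own.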

Is a data warehouse a good solution for sharing customer data across technologies?

I am wanting to be able to share data across all areas of our business in a way that reduces the overall complexity of our infrastructure.
The Problem
Our problem is that we currently have 4 main applications that all connect to our CRM application (Microsoft Dynamics 2011):
The decision-makers at our firm are currently wanting to upgrade our CRM to the most current version and, then, stay up to date as new upgrades are released (every 2-3 years). Almost all of our applications are rigidly integrated with Microsoft Dynamics so each upgrade is very expensive and risky. I want to design another approach that will reduce this expense and risk.
In 2006, Roger Sessions wrote an article called A Better Path to Enterprise Architectures (here) which outlines ways to improve business IT systems. One of the central themes in his discussion is reducing complexity, and by arranging dice in different ways, he shows that you can exponentially reduce the complexity of systems by partitioning technologies into segments rather than letting any technology connect to any other technology. Jeanne Ross has a great presentation on this topic as well (here), and she talks about having a digitized platform for sharing core data and services between areas of the business in order to reduce complexity of the overall system and increase agility in responding to current and future business needs.
As I reflect on the lessons from Sessions and Ross, I am confident that we need to take Microsoft Dynamics out of the center of our architecture if we want to overhaul the technology every 2-3 years. We'll just need to replace it with something that will allow our core data (mostly customer data) to be shared across applications. I know that data warehouses are often used for aggregating data across the organization. Could this work?
I understand that data warehouses are mostly used for reporting, so I don't know if having direct connections to the data warehouse would be ideal. However, each application would not need the ability to update any data in the data warehouse. They just need the ability to grab their IDs to set up relationships between global, data warehouse entities (customers) and various unit-specific entities within each application's database.
Which of these three options would meet my needs: (1) a data warehouse to which all applications connect directly, (2) a data warehouse that feeds data to each application-specific database through overnight updates or (3) something else?
What you're after is a data integration architecture - that doesn't necessarily mean a data warehouse. The pattern you're describing is called "hub and spoke," and it's very common - I'd say you're definitely on the right track for resolving the integration problem you're describing.
This page goes into this problem and pattern in much more depth, and it also has a section on the differences between data warehousing and data integration. You've noted that you're aware data warehouses are commonly used for reporting - that's true, and they're also used heavily for analytics, as the link discusses. They're traditionally a data source for business intelligence efforts. This can mean they're not always focused on the kind of data you're interested in - i.e. operational data which your systems need to function, but which might not be of interest for reporting or analytical purposes. Or, they might not function in a way that's helpful for your needs - for instance, traditional overnight ETL loads might not be the best solution if you need your applications to be up-to-date more quickly.
All this is to say that data warehouses can definitely be used as a data hub - the EDW becomes your "master data" source, any data quality processes needed run on the EDW data, and ETL processes fire corrected data back out to the various sources - but you'll probably be better served by researching the topic of data integration than the topic of data warehousing, even if the two share a lot of similarities and can overlap.
If you create a data warehouse without any business intelligence requirements, it might not function very well as a data warehouse. A very suitable data integration/master data solution might not resolve all of the future requirements you have for a data warehouse. Equally, if you were to create a traditional data warehouse after researching data warehousing best practices, it might not fulfill your data integration requirements, or fulfill them in the best way. As the link suggests, separate the two ideas: resolve your data integration problem, and if you want a data warehouse in the future, you can use your data integration solution to help populate it.

MongoDB / Mongoose Schema for Threaded Messages (Efficiently)

I'm somewhat new to noSQL databases (I'm fairly good with relational databases though), and I'm wondering what the most efficient way to handle an inbox system with threaded messages would be.
Each 'message' will have a single sender and recipient. The number of received / sent messages will vary widely between users. This system should scale well to 1k+ users.
I've read up on fan out on write / read but I'm not sure how well this would work for threaded messages.
Since I'm new to MongoDB / NoSQL in general, I'm not really used to structuring data efficiently this way.
I'm guessing there's going to be nested objects in any sort of efficient way of handling this...but I can't settle on a design that seems both efficient and convenient for threaded conversations between 2 users.
I thought of storing data with an array of the 2 users, combined with an array of 'message' objects. But then there's the issue of the order of the 2 users' usernames. (ex. [UserA, UserB] and [UserB, UserA] are both possible and would be problematic, so that seemed like a bad idea).
I thought of doing the whole fan out on read / write thing, but that doesn't seem efficient for threaded messages (since if grabbing messages by recipient is convenient, grabbing messages by sender won't be and vice versa).
I'm leaning towards favoring grabbing messages by recipient (since the inbox loads multiple messages, and sending only involves one [albeit with a longer look-up time]). But I'd really like to grab a threaded conversation in one go, as well as the list of users that a user has threaded conversations with (for the list of threads).
If someone could give me an efficient schema for threaded conversations I'd be very grateful. I've been researching this and trying to settle on a design for hours, and I'm exhausted. I keep finding flaws in my designs and scrapping them and I'd really just like some input from someone more experienced with NoSQL databases / MongoDB so I can avoid making a huge design flaw and/or writing logic that could've been handled with a better database design.
Thanks in advance for any and all help.
On this particular topic you are in luck, there is a great post discussing the various approaches to the schema here (it's a slight twist on what you are looking at, but not much different):
Then, this topic was also covered in detail at MongoDB World 2014 in three parts by Darren Wood and Asya Kamsky:
Part 1 Outline and Video
Part 2 Outline and Video
Part 3 Outline and Video
Also at MongoDB World the guys at Dropbox talked about the lessons they learned when building their Mailbox:
And then, to round it off, there is a full reference architecture with code called Socialite on Github written by the aforementioned Darren Wood:

What is the design & architecture behind Facebook's status update mechanism?

I'm planning on creating a social network and I don't think I quite understand how the status update module of Facebook is designed. Hoping I can find some help here. At the algorithmic and data-structure level, what is the most efficient way to create a status update mechanism in a social network?
A full table scan for all friends and then sorting their updates is very naive and costly. Do we use some sort of mechanism based on hashing or something else? Please let me know.
P.S: I'm not talking about their EdgeRank algorithm but the basic status update. How do they find and fetch them from the database?
Thanks in advance for the help!
Here is a great presentation that answers your question. The specific answer comes up at around minute 55:40, but I suggest that you watch the entire presentation to understand how the solution fits into the entire architecture.
In short:
A particular server ("leaf") stores all feed items for a particular user. So data for each of your friends is stored entirely at a specific destination.
When you want to view your news feed, one of the aggregator servers sends a request to all the leaf servers for your friends and ranks the results. The aggregator knows which servers to send requests to based on the userid of each friend.
This is terribly simplified, of course. This only works because all of it is memcached, the system is designed to minimize latency, some ranking is done at the leaf server that contains the friend's feed items, etc.
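As a toy illustration of the leaf/aggregator split described above, here is a small in-memory sketch in Python. Everything specific in it is an assumption for the sake of the example: the class names, routing by `user_id % number_of_leaves`, and ranking purely by recency. The real system's routing and ranking are certainly more involved.

```python
import heapq


class LeafServer:
    """Holds all feed items for the users assigned to it."""

    def __init__(self):
        self.items = {}  # user_id -> list of (timestamp, text)

    def add(self, user_id, timestamp, text):
        self.items.setdefault(user_id, []).append((timestamp, text))

    def recent(self, user_id, limit):
        # Each leaf pre-selects its own best candidates for a given user.
        return heapq.nlargest(limit, self.items.get(user_id, []))


class Aggregator:
    """Fans a feed request out to the leaves that own each friend's items."""

    def __init__(self, num_leaves):
        self.leaves = [LeafServer() for _ in range(num_leaves)]

    def leaf_for(self, user_id):
        # Assumed routing rule: the user id alone decides the leaf.
        return self.leaves[user_id % len(self.leaves)]

    def post(self, user_id, timestamp, text):
        self.leaf_for(user_id).add(user_id, timestamp, text)

    def feed(self, friend_ids, limit=10):
        candidates = []
        for friend_id in friend_ids:
            leaf = self.leaf_for(friend_id)
            candidates.extend(
                (timestamp, friend_id, text)
                for timestamp, text in leaf.recent(friend_id, limit)
            )
        # Assumed ranking: newest first.
        return heapq.nlargest(limit, candidates)
```

There is no table scan and no join anywhere: finding a friend's items is a key lookup on one server, and the only sorting happens over a small set of candidates.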
You really don't want to be hitting the database for any of this to work at a reasonable speed. FB uses MySQL mostly as a key-value store; JOINing tables is just impossible at their scale. Then they put memcache servers in front of the databases and application servers.
Having said that, don't worry about scaling problems until you have them (unless, of course, you are worrying about them for the fun of it.) On day one, scaling is the least of your problems.

Can 'moving business logic to application layer' increase performance?

In my current project, the business logic is implemented in stored procedures (1000+ of them) and now they want to scale it up as the business is growing. Architects have decided to move the business logic to the application layer (.NET) to boost performance and scalability. But they are not redesigning/rewriting anything. In short, the same SQL queries which are fired from an SP will be fired from a .NET function using ADO.NET. How can this yield any performance gain?
To the best of my understanding, we need to move business logic to the application layer when we need DB independence or there is some business logic that can be better implemented in an OOP language than in an RDBMS engine (like traversing a hierarchy or some image processing, etc.). In the rest of the cases, if there is no complicated business logic to implement, I believe that it is better to keep the business logic in the DB itself; at least the network delays between the application layer and the DB can be avoided this way.
Please let me know your views. I am a developer looking at some architecture decisions with a little hesitation, pardon my ignorance in the subject.
If your business logic is still in SQL statements, the database will be doing as much work as before, and you will not get better performance. (It may even do more work if it is not able to cache query plans as effectively as when stored procedures were used.)
To get better performance you need to move some work to the application layer. For example, can you cache data on the application server, and do a lookup or a validation check without hitting the database?
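A minimal sketch of that kind of application-side cache, in Python for brevity (the same shape carries over to .NET). The class name `TtlCache` and the `loader` callback are hypothetical; `loader` stands in for whatever query currently goes to the database.

```python
import time


class TtlCache:
    """Caches lookups for a fixed time so repeat reads skip the database."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader    # stand-in for the real database query
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the expiry can be tested
        self._entries = {}       # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # fresh hit: no database round trip
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

This only pays off for data that changes slowly and tolerates being slightly stale, for example reference tables used in validation checks.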
Architectural arguments such as these often need to weigh many trade-offs. Considering performance in isolation, or indeed considering only one aspect of performance such as response time, tends to miss the larger picture.
There is clearly some trade-off between executing logic in the database layer and shipping the data back to the application layer and processing it there: data-shipping costs versus processing costs. As you indicate, the cost and complexity of the business logic will be a significant factor; the size of the data to be shipped would be another.
It is conceivable, if the DB layer is getting busy, that offloading processing to another layer may allow greater overall throughput even if the individual response times are increased. We could then scale the app tier in order to deal with some extra load. Would you now say that performance has been improved (greater overall throughput) or worsened (some increase in response time)?
Now consider whether the app tier might implement interesting caching strategies. Perhaps we get a very large performance win - no load on the DB at all for some requests!
I think those decisions should not be justified using architectural dogma. Data would make a great deal more sense.
Statements like "All business logic belongs in stored procedures" or "Everything should be on the middle tier" tend to be made by people whose knowledge is restricted to databases or objects, respectively. Better to combine both as your judgment dictates, and do it on the basis of measurements.
For example, if one of your procedures is crunching a lot of data and returning a handful of results, there's an argument that says it should remain on the database. There's little sense in bringing millions of rows into memory on the middle tier, crunching them, and then updating the database with another round trip.
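To see the data-shipping difference in miniature, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The `sales` table and the function names are made up for illustration; with a networked database the gap would also include network transfer time, which this sketch does not measure.

```python
import sqlite3


def make_db(n_rows=1000):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north" if i % 2 == 0 else "south", i) for i in range(n_rows)],
    )
    return conn


def totals_in_database(conn):
    """Let the database crunch the rows; only the summary is shipped."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)


def totals_in_app(conn):
    """Ship every row to the application and crunch it there."""
    rows = conn.execute("SELECT region, amount FROM sales").fetchall()
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0) + amount
    return totals, len(rows)
```

Both functions return the same totals, but the second one moves every row across the boundary to get there. The second value in each result is the number of rows shipped, which is the figure worth measuring before deciding where a given piece of logic belongs.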
Another consideration is whether or not the database is shared between apps. If so, the logic should stay in the database so all can use it.
Middle tiers tend to come and go, but data remains forever.
I'm an object guy myself, but I would tread lightly.
It's a complicated problem. I don't think that black and white statements will work in every case.
Well, as others have already said, it depends on many factors. But from your question it seems the architects are proposing moving the stored procedures from inside the DB to dynamic SQL inside the application. That sounds very dubious to me.
SQL is a set-oriented language, and business logic that requires massaging large numbers of records is better done in SQL. Think complicated search and reporting functions. On the other hand, line-item edits with corresponding business rule validation are much better done in a programming language. Caching of slowly changing data in the app tier is another advantage. This is even better if you have a dedicated middle-tier service that acts as a gateway to all the data. If data is shared directly among disparate applications, then stored procedures may be a good idea.
You also have to factor in the availability/experience of SQL talent vs programming talent in the organisation.
There is really no general answer to this question.
