I am thinking about how to solve a problem such as transferring money across banks.
Let's say I have two microservices, A and B, each with its own database. I have looked at 2PC and Saga, but my idea differs a bit, so let me explain. Each microservice operation would have a state to track it, in this case e.g. Created and Approved. I would like the two microservices to somehow acknowledge each other's state before changing it to Approved. Below is what I mean; I do not know what to do in step 5.
I would need an atomic operation in step 5: either both microservices end up in the Approved state or neither does. I suspect this expectation creates a cyclic dependency between the two microservices, right?
Imagine two banks transferring money; that is what I would like to achieve. I do not know whether my design and thinking are correct. Could you give me some advice on how to address this kind of problem, or point me to patterns that handle this issue more efficiently?
Please ask if my question is not clear.
Folks, I am evaluating options, patterns, and practices around a key challenge we are facing in a distributed (microservices) architecture: maintaining DB atomicity across multiple tables.
Atomicity, reliability, and scale are all critical for the business (this is probably common across businesses; just putting it out there).
I have read a few articles about achieving this, but it all comes at a significant cost and not without certain trade-offs, which I am not ready to make.
I have read a couple of SO questions, and one concept, SAGA, seems interesting, but I don't think our legacy database is meant to handle it.
So here I am, asking experts for their personal opinions, guidance, and past experience so I can save time and effort instead of trying out and learning a bunch of options myself.
Appreciate your time and effort.
CAP theorem
The CAP theorem is key when it comes to distributed systems. Start with it to decide whether you want availability or consistency.
Distributed transactions
You are right: there are trade-offs involved and no single right answer, and distributed transactions are no different. In a microservices architecture, atomicity is not easy to achieve. Normally we design microservices with eventual consistency in mind; strong consistency is very hard and has no simple solution.
SAGA vs 2PC
2PC: it is very easy to achieve atomicity with two-phase commit, but that option is not for microservices. Your system can't scale, since if any of the microservices goes down your transaction hangs in an abnormal state, and locks are very common with this approach.
SAGA is the most accepted and scalable approach. You commit a local transaction (atomically); once that is done, you publish an event, and all interested services consume the event and update their own local databases. If there is an exception, or a particular microservice can't accept the event data, it raises a compensating transaction, which means you have to reverse and undo the actions taken by all microservices for that event. This is a widely accepted pattern and it scales.
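As a rough illustration, here is a minimal sketch of a choreographed saga step in TypeScript. The function names (debitAccount, creditAccount, refundAccount, publishEvent), the event topics, and the event shape are all assumptions made up for this example, not part of any particular framework.

```typescript
// Hypothetical event shape for a money transfer saga.
interface TransferEvent {
  transferId: string;
  fromAccount: string;
  toAccount: string;
  amount: number;
}

// Stubs standing in for real persistence and messaging calls (assumptions).
async function debitAccount(account: string, amount: number): Promise<void> { /* local write in A's DB */ }
async function creditAccount(account: string, amount: number): Promise<void> { /* local write in B's DB */ }
async function refundAccount(account: string, amount: number): Promise<void> { /* compensating write in A's DB */ }
async function publishEvent(topic: string, payload: TransferEvent): Promise<void> { /* hand off to the broker */ }

// Service A: commit the local transaction first, then publish the event.
async function handleTransferRequest(evt: TransferEvent): Promise<void> {
  await debitAccount(evt.fromAccount, evt.amount);
  await publishEvent("transfer.debited", evt);
}

// Service B: consume the event; if it cannot accept it, raise a compensation event.
async function onTransferDebited(evt: TransferEvent): Promise<void> {
  try {
    await creditAccount(evt.toAccount, evt.amount);
    await publishEvent("transfer.completed", evt);
  } catch {
    await publishEvent("transfer.failed", evt); // ask the other services to undo
  }
}

// Service A again: the compensating transaction that reverses the original debit.
async function onTransferFailed(evt: TransferEvent): Promise<void> {
  await refundAccount(evt.fromAccount, evt.amount);
}
```

The key point is that each service only ever commits to its own database; consistency across the two is reached eventually via the events and the compensations.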
I don't get the legacy DB part. What makes you think a legacy DB will have a problem? SAGA has nothing to do with legacy systems. It simply means deciding whether you can accept the event or not: if yes, save it to the database; if not, raise a compensating transaction so all other services can undo.
What's the right approach?
Well, eventually it really depends on you. There are many patterns around when it comes to saving the transaction. Have a look at the CQRS and event sourcing patterns, which are used to save all the domain events. Since distributed transactions can be complex, CQRS solves many of the associated problems, e.g. eventual consistency.
Hope that helps! Shoot me questions if you have any.
One possible option is Command Query Responsibility Segregation (CQRS) - maintain one or more materialized views that contain data from multiple services. The views are kept up to date by services that subscribe to the events that each service publishes when it updates its data. For example, the online store could implement a query that finds customers in a particular region and their recent orders by maintaining a view that joins customers and orders. The view is updated by a service that subscribes to customer and order events.
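To sketch that idea in TypeScript: a query-side service consumes events and keeps a denormalized view it can answer the region query from. The event names, fields, and in-memory map are illustrative assumptions, not from any specific system.

```typescript
// Events published by the customer and order services (illustrative shapes).
interface CustomerUpdated { customerId: string; name: string; region: string; }
interface OrderCreated { orderId: string; customerId: string; total: number; }

interface CustomerOrdersView {
  customerId: string;
  name: string;
  region: string;
  recentOrders: { orderId: string; total: number }[];
}

// In-memory stand-in for the materialized view store.
const view = new Map<string, CustomerOrdersView>();

function onCustomerUpdated(evt: CustomerUpdated): void {
  const row = view.get(evt.customerId) ?? { ...evt, recentOrders: [] };
  view.set(evt.customerId, { ...row, name: evt.name, region: evt.region });
}

function onOrderCreated(evt: OrderCreated): void {
  const row = view.get(evt.customerId);
  if (!row) return; // or buffer the event until the customer event arrives
  row.recentOrders.unshift({ orderId: evt.orderId, total: evt.total });
  row.recentOrders = row.recentOrders.slice(0, 10); // keep only recent orders
}

// The query the text describes: customers in a region with their recent orders.
function customersInRegion(region: string): CustomerOrdersView[] {
  return [...view.values()].filter(v => v.region === region);
}
```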
I am currently implementing a kind of questionnaire with a chatbot and use LUIS to interpret the answers. This questionnaire is divided into segments of maybe 20 questions.
Now the following question came to mind: Which option is better?
I make one LUIS model per question. Since these questions can include implicit subquestions (e.g. "Do you want to pay all at once or in rates" could include a "How high should these rates be?"), I end up with maybe 3-5 intents per question (including None).
I can make one model per segment. Let's assume that this is possible and fits within the limit of 80 intents per model.
My first intuition was to think that the first alternative is better, since it should be way more robust. When there are only 5 intents to choose from, it may be easier to determine the right one. As far as I know, there is no restriction on how many models you can have (...is this correct?).
So here is my question for SO: What other benefits/drawbacks are there and is there maybe an option that is objectively the best?
You can have as many models as you want; there is no limit on this. But on to the rest of your question:
You intend to use LUIS to interpret every response? I'm curious as to the actual design of the questionnaire and why you need (or want) open ended responses and not multiple-choice questions. "Do you want to pay all at once or in rates" itself is a binary question. Branching off of this, users might respond with, "Yes I want to pay all at once", which could use LUIS. Or they could respond with, "rates" which could be one of two choices available to the user in a Prompt/FormFlow. "rates" is also much shorter than the first answer and thus a selection that would probably be typed more often than not.
Multiple-choice questions provide a standardized input which would reduce the amount of work you'd have to do in managing your data. It also would most likely reduce the amount of effort needed to maintain the models and questionnaire.
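To make the multiple-choice-first idea concrete, here is a toy TypeScript sketch that only falls back to LUIS for open-ended replies. The choice values, intent names, and the recognizeWithLuis placeholder are assumptions for illustration, not Bot Framework or LUIS SDK APIs.

```typescript
// Toy sketch of "multiple choice first, NLU as fallback".
const PAYMENT_CHOICES = ["all at once", "rates"];

async function recognizeWithLuis(text: string): Promise<string> {
  // Placeholder: call your LUIS endpoint here and return the top intent name.
  return "None";
}

async function interpretPaymentAnswer(text: string): Promise<string> {
  const normalized = text.trim().toLowerCase();
  // Cheap, deterministic path: the user picked one of the offered choices.
  const choice = PAYMENT_CHOICES.find(c => normalized.includes(c));
  if (choice) return choice === "rates" ? "PayInRates" : "PayAllAtOnce";
  // Open-ended answer: only now pay for an NLU round trip.
  return recognizeWithLuis(text);
}
```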
Objectively speaking, one model is more likely to be less work, but we can drill down a little further:
First option:
If your questionnaire segments include 20 questions and you have 2 segments, you have 40 individual models to maintain, which is a nightmare.
Additionally, you might experience latency issues depending on your recognizer order, because you have to wait for a response from 40 endpoints. That said, it IS possible to turn recognizers off, so you might only need to wait for one recognizer. But then you need to manually turn on the next recognizer and turn off the previous one. You should also be aware that handling multiple "None" intents is a nightmare, in case you wish to leave every recognizer active.
I'm assuming that you'll want assistance in managing your models after you realize the pain of handling forty of them by yourself. You can add collaborators, but then you need to add them to multiple models as well. One day you'll (probably) have to remove them from all of the models they have collaborator status on.
The first option IS more robust but also involves a rather extreme amount of work hours. You're partially right in that fewer intents is helpful because of fewer possibilities for the model to predict. But the predictions of your models become more accurate with more utterances and labeling, so any bonus gained by having 5 intents per model is most likely to be rendered moot.
Second option:
Using one model per segment, as mentioned above, is less work. It's less work to maintain, but what are some downsides? Until you train your model well enough, there may indeed be some unintended consequences due to false-positive predictions. That said, you could account for that in your questionnaire/bot/questionnaire-bot's code by specifically looking for the expected intents for the question and then using the highest-scoring intent from that subset if the highest-scoring intent overall doesn't match the question.
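A small TypeScript sketch of that guard; the result shape is a generic assumption, not a specific LUIS SDK type, and the threshold is purely illustrative.

```typescript
// If the overall top intent is not one the current question expects, fall back
// to the best-scoring intent from the expected subset, else "None".
interface IntentScore { intent: string; score: number; }

function pickIntentForQuestion(
  results: IntentScore[],
  expectedIntents: string[],
  minScore = 0.3, // illustrative confidence threshold
): string {
  const sorted = [...results].sort((a, b) => b.score - a.score);
  const top = sorted[0];
  if (top && expectedIntents.includes(top.intent) && top.score >= minScore) {
    return top.intent;
  }
  const fallback = sorted.find(r => expectedIntents.includes(r.intent) && r.score >= minScore);
  return fallback ? fallback.intent : "None";
}
```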
Another downside is that if it's one model and a collaborator makes a catastrophic mistake, it affects the entire segment. With multiple models, the mistake would only affect the one question/model, so that's a plus.
Aside from not having to deal with multiple None-intent handling, you can quickly label utterances that should belong to the None intent. What you label as an intent in a singular model essentially makes it stand out more against the other intents inside of the model. If you have multiple models, an answer that triggers a specific intent in one model needs to trigger the None intent in your other models, otherwise, you'll end up with multiple high scoring intents (and the relevant/expected intents might not be the highest scoring).
End:
I recommend the second option, simply because it's less work. Also, I'm not sure of the questionnaire's goals, but as a general rule, I question the need for putting AI where it's not needed. Here is a link that talks about factors that do not contribute to a bot's success (note that Natural Language is one of these factors).
I am working on a project that involves many clients connecting to a server (or servers, if need be) that contains a bunch of graph info (node attributes and edges). They will have the option to introduce a new node or edge anytime they want and then request some information from the graph as a whole (shortest distance between two nodes, graph coloring, etc.).
This is obviously quite easy to develop the naive algorithm for, but I am trying to learn how to scale it so that it can handle many users updating the graph at the same time, many users requesting information from the graph, and a very large number of nodes (500k+) and possibly a very large number of edges as well.
The challenges I can foresee:
with a constantly updating graph, I need to process the whole graph every time someone requests information...which will increase computation time and latency quite a bit
with a very large graph, the computation time and latency will obviously be a lot higher (I read that this was remedied by some companies by batch processing a ton of results and storing them with an index for later use...but then since my graph is being constantly updated and users want the most up to date info, this is not a viable solution)
a large number of users requesting information, which will be quite a load on the servers since they have to process the graph that many times
How do I start facing these challenges? I looked at Hadoop and Spark, but they seem to offer high-latency solutions (with batch processing) or solutions that address problems where the graph is not constantly changing.
I had the idea of maybe processing different parts of the graph and indexing them, then keeping track of where the graph is updated and re-processing that section of the graph (a kind of distributed dynamic programming approach), but I'm not sure how feasible that is.
Thanks!
How do I start facing these challenges?
I'm going to answer this question, because it's the important one. You've enumerated a number of valid concerns, all of which you'll need to deal with and none of which I'll address directly.
In order to start, you need to finish defining your semantics. You might think you're done, but you're not. When you say "users want the most up to date info", does "up to date" mean
"everything in the past", which leads to total serialization of each transaction to the graph, so that answers reflect every possible piece of information?
Or "everything transacted more than X seconds ago", which leads to partial serialization, which multiple database states in the present that are progressively serialized into the past?
If 1. is required, you may well have unavoidable hot spots in your code, depending on the application. You have immediate information about when to roll back a transaction because of inconsistency.
If 2. is acceptable, you have the possibility for much better performance. There are tradeoffs, though. You'll have situations where you have to roll back a transaction after initial acceptance.
Once you've answered this question, you've started facing your challenges and, I assume, will have further questions.
I don't know much about graphs, but I do understand a bit of networking.
One rule I try to keep in mind is... don't do work on the server side if you can get the client to do it.
All your server needs to do is maintain the raw data, serve raw data to clients, and notify connected clients when data changes.
The clients can have their own copy of raw data and then generate calculations/visualizations based on what they know and the updates they receive.
Clients only need to know if there are new records or if old records have changed.
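Here is a toy TypeScript sketch of that split, using the ws package for WebSockets: the server only stores raw deltas and notifies connected clients, and each client rebuilds its own copy of the graph and does the heavy computation locally. The message shapes and the port are assumptions for illustration.

```typescript
import { WebSocketServer } from "ws";

type GraphDelta =
  | { kind: "addNode"; id: string }
  | { kind: "addEdge"; from: string; to: string };

const deltas: GraphDelta[] = []; // the server's raw, append-only record
const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", socket => {
  // New client: send the full raw history so it can build its own local graph.
  socket.send(JSON.stringify({ kind: "snapshot", deltas }));

  socket.on("message", raw => {
    const delta = JSON.parse(raw.toString()) as GraphDelta;
    deltas.push(delta); // persist raw data only; no graph algorithms run here
    // Notify everyone else; shortest paths, coloring, etc. happen on the clients.
    for (const client of wss.clients) {
      if (client !== socket) client.send(JSON.stringify(delta));
    }
  });
});
```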
If, for some reason, you ABSOLUTELY have to process data server side and send it to the client (for example, client is 3rd party software, not something you have control over and it expects processed data, not raw data), THEN, you do have a bit of an issue, so get a bad ass server... or 3 or 30. In this case, I would have to know exactly what the data is and how it's being processed in order to make any kind of suggestions on scaled configuration.
I'm planning on creating a social network and I don't think I quite understand how the status update module of Facebook is designed. Hoping I can find some help here. At the algorithmic and data-structure level, what is the most efficient way to create a status update mechanism in a social network?
A full table scan for all friends and then sorting their updates is very naive and costly. Do we use some sort of mechanism based on hashing or something else? Please let me know.
P.S: I'm not talking about their EdgeRank algorithm but the basic status update. How do they find and fetch them from the database?
Thanks in advance for the help!
Here is a great presentation that answers your question. The specific answer comes up at around minute 55:40, but I suggest that you watch the entire presentation to understand how the solution fits into the entire architecture.
In short:
A particular server ("leaf") stores all feed items for a particular user. So data for each of your friends is stored entirely at a specific destination.
When you want to view your news feed, one of the aggregator servers sends requests to all the leaf servers for your friends and ranks the results. The aggregator knows which servers to send requests to based on the user id of each friend.
This is terribly simplified, of course. This only works because all of it is memcached, the system is designed to minimize latency, some ranking is done at the leaf server that contains the friend's feed items, etc.
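As a very simplified sketch of that fan-out, here is some TypeScript. The leaf URLs, the modulo sharding, and the FeedItem shape are invented for illustration; the real system batches requests per leaf, caches aggressively, and does part of the ranking on the leaves.

```typescript
interface FeedItem { authorId: number; text: string; rank: number; }

const LEAF_SERVERS = ["http://leaf-0", "http://leaf-1", "http://leaf-2"];

// Each user's feed items live entirely on one leaf, chosen from the user id.
function leafFor(userId: number): string {
  return LEAF_SERVERS[userId % LEAF_SERVERS.length];
}

// The aggregator fans out one request per friend, then merges and ranks.
async function newsFeed(friendIds: number[], limit = 50): Promise<FeedItem[]> {
  const responses = await Promise.all(
    friendIds.map(id =>
      fetch(`${leafFor(id)}/feed/${id}`).then(r => r.json() as Promise<FeedItem[]>)
    )
  );
  return responses
    .flat()
    .sort((a, b) => b.rank - a.rank)
    .slice(0, limit);
}
```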
You really don't want to be hitting the database for any of this to work at a reasonable speed. FB uses MySQL mostly as a key-value store; JOINing tables is just impossible at their scale. They then put memcache servers in front of the databases and application servers.
Having said that, don't worry about scaling problems until you have them (unless, of course, you are worrying about them for the fun of it.) On day one, scaling is the least of your problems.
To take the simplest possible example:
Start with an empty database.
Add a document
Add a design document with a validation function that rejects everything
Replicate that database.
To ask a concrete question to begin with, one with an answer that I hope can be given very quickly by pointing me to the right url: is the result of this replication defined by some rule, for example that the documents are always replicated in the order they were saved, or does the successful replication of the first document depend on whether the design document happened to arrive at the destination first? In the quick experiment I did, both documents did get successfully validated, but I'm trying to find out if that outcome is defined in a spec somewhere or it's implementation dependent.
To ask a follow-up question that's more handwavey and may not have a single answer: what else can happen, and what sorts of solutions have emerged to manage those problems? It's obviously possible for different servers to simultaneously (and I use that word hesitantly) have different versions of a validation function. I suppose the validators could be backwards compatible, where every new version adds a case to a switch statement that looks up, say, a schema_version attribute of the document. Then if a version 2 document arrives at a server where the version 3 validator is the gatekeeper, it'll be allowed in. If a version 3 document arrives at a version 2 validator, it's a bit more tricky; it presumably depends on whether strictness or leniency is an appropriate default for the application. But can either of those things even happen, or do the replication rules ensure that, even if servers are going up and down, updates and deletes are being done all over the place, and replication connections are intermittent and indirect, a document will never arrive on a given server before its appropriate validation function, and a validation function will never arrive too late to handle one of the documents it was supposed to check?
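For what it's worth, here is a toy TypeScript sketch of the versioned-validator idea I have in mind (CouchDB's real validate_doc_update functions are plain JavaScript inside a design document; the field names and the lenient default here are just assumptions):

```typescript
interface Doc { schema_version?: number; [key: string]: unknown; }

function validate(doc: Doc): void {
  const version = doc.schema_version ?? 1;
  switch (version) {
    case 1:
      if (typeof doc.title !== "string") throw new Error("v1: title is required");
      break;
    case 2:
      if (typeof doc.title !== "string") throw new Error("v2: title is required");
      if (typeof doc.author !== "string") throw new Error("v2: author is required");
      break;
    default:
      // A document newer than this validator: pick strictness or leniency.
      // Lenient here; a stricter application would throw instead.
      break;
  }
}
```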
I could well be overcomplicating this or missing out on some Zen insight, but painful experience has taught me that I'm not clever enough to predict what sorts of states concurrent systems can get themselves into.
EDIT:
As Marcello says in a comment, updates on individual servers have sequence numbers, and replication applies the updates in sequence number order. I had a vague idea that that was the case, but I'm still fuzzy on the details. I'm trying to find the simplest possible model that will give me an idea about what can and can't happen in a complex CouchDB system.
Suppose I take the state of server A that's started off empty and has three document writes made to it. So its state can be represented as the following string:
A1,A2,A3
Suppose server B also has three writes: B1,B2,B3
We replicate A to B, so the state of B is now: B1,B2,B3,A1,A2,A3. Although presumably the A updates have taken a sequence number on entering B, so the state is now: B1, B2, B3, B4(A1), B5(A2), B6(A3).
If I understand correctly, the replicator also makes a record of the fact that everything up to A3 has been replicated to B, and it happens to store this record as part of B's internal state, but I'm wondering if this is an implementation detail that can be disregarded in the simple model.
Under those rules, the A updates and the B updates would stay in order on any server they were replicated to. Perhaps the only way they could get out of order is if you did something like replicating A to B, deleting A1 on A and A2 on B, replicating A to C, then replicating B to C, leaving a state on C of: A2, A3, B1, B2, B3, B4(A1).
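If it helps, here is a toy TypeScript model of exactly that bookkeeping (not CouchDB's internals, and it ignores deletions and conflicts): each server is an ordered log, and replication appends every source update the target has not yet pulled, checkpointed per source.

```typescript
interface Server {
  name: string;
  log: string[];                      // e.g. ["A1", "A2", "A3"]
  checkpoints: Map<string, number>;   // how far we have pulled from each source
}

function makeServer(name: string, log: string[] = []): Server {
  return { name, log: [...log], checkpoints: new Map() };
}

function replicate(source: Server, target: Server): void {
  const since = target.checkpoints.get(source.name) ?? 0;
  for (const update of source.log.slice(since)) {
    target.log.push(update);          // takes a new local sequence number on the target
  }
  target.checkpoints.set(source.name, source.log.length);
}

// The example from above:
const A = makeServer("A", ["A1", "A2", "A3"]);
const B = makeServer("B", ["B1", "B2", "B3"]);
replicate(A, B);
console.log(B.log); // ["B1", "B2", "B3", "A1", "A2", "A3"]
```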
Is this making any sense at all? Maybe strings aren't the right way of visualising it; maybe it's better to think of, I don't know, a bunch of queues (servers) in an airport, airport staff (replicators) moving people from queue to queue according to certain rules, and put yourself into the mind of someone trying to skip the queue, i.e. somehow get into a queue before someone who's ahead of them in their current queue. That has the advantage of personalising the model, but we probably don't want to replicate people in airports.
Or maybe there's some way of explaining it as a Towers of Hanoi type game, although with FIFO queues instead of LIFO stacks.
It's a model I'm hoping to find - absolutely precise as far as behavior is concerned, all irrelevant implementation details stripped away, and using whatever metaphor or imagery makes it easiest to intuit.
The basic use case is simple. CouchDB uses sequence numbers to index database changes and to ask what changes need to be replicated. Order is implicit in this algorithm and what you fear should not happen. As a side note, the replication process only copies the last revision of a document, but this does not change anything about order.
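To see that index in action, here is a small TypeScript sketch against the _changes endpoint, which is essentially the question the replicator asks ("give me everything after sequence X"). The URL is a placeholder, and the exact seq format is a string or a number depending on the CouchDB version.

```typescript
interface ChangesRow { seq: string | number; id: string; changes: { rev: string }[]; }
interface ChangesResponse { results: ChangesRow[]; last_seq: string | number; }

async function changesSince(dbUrl: string, since: string | number): Promise<ChangesResponse> {
  const res = await fetch(`${dbUrl}/_changes?since=${encodeURIComponent(String(since))}`);
  if (!res.ok) throw new Error(`changes feed request failed: ${res.status}`);
  return (await res.json()) as ChangesResponse;
}

// Usage: pull everything after the last checkpoint, in sequence order.
// const batch = await changesSince("http://localhost:5984/mydb", 0);
// batch.results is ordered; batch.last_seq becomes the next checkpoint.
```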