If the Paxos algorithm is modified so that acceptors accept the first value, or the most recent value, does the approach fail?

I've tried to reason through whether the algorithm fails in these cases, but I can't seem to find an example where it would.
If it doesn't, then why isn't either of these approaches used?

Yes.
Don't forget that in later rounds, leaders may be proposing different values than in earlier rounds. Therefore the first message may have the wrong value.
Furthermore messages may arrive reordered. (Consider a node that goes offline, then comes back online to find messages coming in random order.) The most recent message may not be the most recently sent message.
And finally, don't forget that leaders change. The faster an acceptor can be convinced that it is following the wrong leader, the better.

Rather than asking whether the algorithm fails in such a scenario, consider this: if each node sees different messages lost, delayed, or reordered, is it correct for a node to just accept the first value it happens to receive? Clearly the answer is no.
The algorithm is designed to work when "first" cannot be decided by looking at the timestamp on a message, as clocks on different machines may be out of sync. It is designed to work when network paths, distances, and congestion differ between nodes. Nodes may crash and restart, or hang and resume, making things even more "hostile".
So in a five-node cluster, two nodes could each try to be leader, and the other three could each see a different, random ordering of which leader's message arrived "first". What's the right answer in that case? The algorithm has a "correct" answer based on its rules, which ensures a consistent outcome under all such "hostile" conditions.
In summary, the point of Paxos is that our mental model of "first" as programmers is based on an assumption of perfect clocks, machines, and networks. That doesn't exist in the real world. To see whether things break when you change the algorithm, you need to "attack" the message flow with all of the failures described above. You will likely find some way to "break" things under any change.
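To make "the correct answer based on its rules" concrete, here is a minimal sketch (names are illustrative, not from any particular implementation) of the acceptor-side rule that replaces "accept the first value you see": an acceptor tracks proposal numbers, promises, and previously accepted values instead of trusting arrival order.

```python
# Minimal sketch of the single-decree Paxos acceptor rule. The point: an acceptor
# never just takes the "first" or "most recent" value it sees; it obeys proposal numbers.

class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest proposal number promised so far
        self.accepted_n = -1        # proposal number of the value accepted so far
        self.accepted_value = None

    def on_prepare(self, n):
        """Phase 1b: promise not to accept proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            # Report any previously accepted value so the proposer must reuse it.
            return ("promise", n, self.accepted_n, self.accepted_value)
        return ("reject", self.promised_n)

    def on_accept(self, n, value):
        """Phase 2b: accept only if no higher-numbered promise has been made."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n, value)
        return ("reject", self.promised_n)

a = Acceptor()
print(a.on_prepare(1))       # ('promise', 1, -1, None)
print(a.on_accept(1, "x"))   # ('accepted', 1, 'x')
print(a.on_accept(0, "y"))   # ('reject', 1): an "earlier" or "first" value is refused
```

Accepting "the first value" would drop the proposal-number comparison entirely, and accepting "the most recent arrival" would trust network ordering, which is exactly what the algorithm refuses to do.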

Related

How do I process a graph that is constantly updating, with low latency?

I am working on a project that involves many clients connecting to a server (or servers if need be) that contains a bunch of graph info (node attributes and edges). They will have the option to introduce a new node or edge anytime they want and then request some information from the graph as a whole (shortest distance between two nodes, graph coloring, etc.).
It is obviously quite easy to develop the naive algorithm for this, but I am trying to learn how to scale it so that it can handle many users updating the graph at the same time, many users requesting information from the graph, and a very large number of nodes (500k+) and possibly a very large number of edges as well.
The challenges I can foresee:
with a constantly updating graph, I need to process the whole graph every time someone requests information...which will increase computation time and latency quite a bit
with a very large graph, the computation time and latency will obviously be a lot higher (I read that this was remedied by some companies by batch processing a ton of results and storing them with an index for later use...but then since my graph is being constantly updated and users want the most up to date info, this is not a viable solution)
a large number of users requesting information, which will be quite a load on the servers since they have to process the graph that many times
How do I start facing these challenges? I looked at Hadoop and Spark, but they seem to have high-latency solutions (with batch processing) or solutions that address problems where the graph is not constantly changing.
I had the idea of maybe processing different parts of the graph and indexing them, then keeping track of where the graph is updated and re-processing that section of the graph (a kind of distributed dynamic programming approach), but I'm not sure how feasible that is.
Thanks!
How do I start facing these challenges?
I'm going to answer this question, because it's the important one. You've enumerated a number of valid concerns, all of which you'll need to deal with and none of which I'll address directly.
In order to start, you need to finish defining your semantics. You might think you're done, but you're not. When you say "users want the most up to date info", does "up to date" mean
"everything in the past", which leads to total serialization of each transaction to the graph, so that answers reflect every possible piece of information?
Or "everything transacted more than X seconds ago", which leads to partial serialization, which multiple database states in the present that are progressively serialized into the past?
If 1. is required, you may well have unavoidable hot spots in your code, depending on the application. You do, however, have immediate information about when to roll back a transaction because of an inconsistency.
If 2. is acceptable, you have the possibility for much better performance. There are tradeoffs, though. You'll have situations where you have to roll back a transaction after initial acceptance.
Once you've answered this question, you've started facing your challenges and, I assume, will have further questions.
I don't know much about graphs, but I do understand a bit of networking.
One rule I try to keep in mind is... don't do work on the server side if you can get the client to do it.
All your server needs to do is maintain the raw data, serve raw data to clients, and notify connected clients when data changes.
The clients can have their own copy of raw data and then generate calculations/visualizations based on what they know and the updates they receive.
Clients only need to know if there are new records or if old records have changed.
If, for some reason, you ABSOLUTELY have to process data server side and send it to the client (for example, the client is third-party software you have no control over, and it expects processed data, not raw data), THEN you do have a bit of an issue, so get a badass server... or 3, or 30. In that case, I would have to know exactly what the data is and how it's being processed in order to make any kind of suggestion on a scaled configuration.
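If you do go the "serve raw data, push change notifications, let clients compute" route described above, a minimal sketch of its shape might look like the following (class and method names are made up; a real system would push changes over WebSockets or a message broker rather than in-process callbacks).

```python
# Sketch of a server that only stores raw graph data and notifies subscribers
# of changes; all shortest-path / coloring computation happens client-side.

from collections import defaultdict

class GraphServer:
    def __init__(self):
        self.edges = defaultdict(set)   # adjacency sets: node -> neighbors
        self.subscribers = []           # callables invoked on every change

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def add_edge(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)
        self._notify(("edge_added", a, b))

    def snapshot(self):
        """Raw data for a newly connected client to bootstrap from."""
        return {node: set(nbrs) for node, nbrs in self.edges.items()}

    def _notify(self, change):
        for cb in self.subscribers:
            cb(change)   # in practice: push over a socket / message queue

# A client keeps its own copy and recomputes only what it needs, when it needs it.
server = GraphServer()
local_copy = server.snapshot()
server.subscribe(lambda change: print("apply to local copy:", change))
server.add_edge("A", "B")
```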

How does Paxos handle packet loss and new node joining?

Recently I've been learning Paxos, and by now I have a basic understanding of how it works. But can anyone explain how Paxos handles packet loss and a new node joining? It would be even better if a simple example were provided.
The classical Paxos algorithm does not have a concept of "new nodes joining". Some Paxos variants do, such as "Vertical Paxos", but the classic algorithm requires that all nodes be statically defined before running the algorithm. With respect to packet loss, Paxos uses a very simple infinite loop: "try a round of the algorithm; if anything at all goes wrong, try another round". So if too many packets are lost in the first attempt at achieving resolution (which can be detected via a simple timeout while waiting for replies), a second round can be attempted. If the timeout for that round expires, try again, and so on.
Exactly how packet loss is to be detected and handled is something the Paxos algorithm leaves undefined. It's an implementation-specific detail. This is actually a good thing for production environments since how this is handled can have a pretty big performance impact on Paxos-based systems.
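As a rough sketch of that retry loop (the proposer methods here are hypothetical placeholders, not part of any real Paxos library, and the timeout/backoff values are arbitrary):

```python
import time

def run_paxos_instance(proposer, value, timeout=1.0, max_rounds=None):
    """Keep attempting rounds until a value is chosen; lost packets simply
    cause a timeout and a retry with a higher proposal number (sketch only)."""
    round_no = 0
    while max_rounds is None or round_no < max_rounds:
        round_no += 1
        n = proposer.next_proposal_number()            # must be unique and increasing
        try:
            promises = proposer.send_prepare(n, timeout=timeout)           # phase 1
            if not proposer.is_quorum(promises):
                continue                               # too few replies: retry
            chosen = proposer.pick_value(promises, value)                  # reuse any accepted value
            accepts = proposer.send_accept(n, chosen, timeout=timeout)     # phase 2
            if proposer.is_quorum(accepts):
                return chosen                          # consensus reached
        except TimeoutError:
            pass                                       # packets lost: just try another round
        time.sleep(0.05 * round_no)                    # simple backoff between rounds
```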
Regarding packet loss, Paxos makes the following assumption about the network:
Messages may be lost, reordered, or duplicated.
This is solved via quorums. At least X of all Acceptors must accept a value in order for the system to accept it. This also covers the case where a node is failing.
Regarding new nodes joining: Paxos does not focus on how a node detects other nodes. That is a problem solved by other algorithms.
Nodes are simply assumed to automagically know all the other nodes and each one's role.
For a production implementation, you can use ZooKeeper to solve this new-node detection.
As pointed out in other answers, message loss and message reordering are handled by the algorithm: it is designed exactly to handle those cases.
New nodes joining is a matter of "cluster membership changes". There is a common misconception that cluster membership changes are not covered by Paxos; yet they are described in the 2001 paper Paxos Made Simple in the last paragraph. In this blog post I discuss it. There is a question of how a new node gets a copy of all the state when it joins the cluster. That is discussed in this answer.

Database for brute force solving board games

A few years back, researchers announced that they had completed a brute-force comprehensive solution to checkers.
I have been interested in another similar game that should have fewer states, but is still quite impractical to run a complete solver on in any reasonable time frame. I would still like to make an attempt, as even a partial solution could give valuable information.
Conceptually I would like to have a database of game states that has every known position, as well as its succeeding positions. One or more clients can grab unexplored states from the database, calculate possible moves, and insert the new states into the database. Once an endgame state is found, all states leading up to it can be updated with the minimax information to build a decision tree. If intelligent decisions are made to pick probable branches to explore, I can build information for the most important branches, and then gradually build up to completion over time.
Ignoring the merits of this idea, or the feasibility of it, what is the best way to implement such a database? I made a quick prototype in SQL Server that stored a string representation of each state. It worked, but my solver client ran very, very slowly, as it pulled out one state at a time and calculated all moves. I feel like I need to do larger chunks in memory, but the search space is definitely too large to store it all in memory at once.
Is there a database system better suited to this kind of job? I will be doing many many inserts, a lot of reads (to check if states (or equivalent states) already exist), and very few updates.
Also, how can I parallelize it so that many clients can work on solving different branches without duplicating too much work. I'm thinking something along the lines of a program that checks out an assignment, generates a few million states, and submits it back to be integrated into the main database. I'm just not sure if something like that will work well, or if there is prior work on methods to do that kind of thing as well.
In order to solve a game, what you really need to know per state in your database is its game-theoretic value, i.e. whether it's a win for the player whose turn it is to move, a loss, or a forced draw. You need two bits per state to encode this information.
You then find as compact an encoding as possible for the set of game states for which you want to build your end-game database; let's say your encoding takes 20 bits. It's then enough to have an array of 2^21 bits on your hard disk, i.e. 2^18 bytes. When you analyze an end-game position, you first check whether the corresponding value is already set in the database; if not, calculate all its successors, calculate their game-theoretic values recursively, and then compute the game-theoretic value of the original node using min/max and store it in the database. (Note: if you store win/loss/draw data in two bits, you have one bit pattern left over to denote 'not known'; e.g. 00 = not known, 11 = draw, 10 = player to move wins, 01 = player to move loses.)
For example, consider tic-tac-toe. There are nine squares; each can be empty, "X", or "O". This naive analysis gives 3^9 states, and 3^9 = 2^14.26, i.e. 15 bits per state, so you would have an array of 2^16 bits.
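To make the two-bits-per-state table concrete for the tic-tac-toe example, here is a rough sketch (the packed array is indexed directly by the base-3 value of the board; a real solver would also fold symmetric positions together):

```python
# Sketch: two bits per tic-tac-toe state, indexed by the base-3 value of the board.
# Codes: 0 = not known, 1 = player to move loses, 2 = player to move wins, 3 = draw.
UNKNOWN, LOSS, WIN, DRAW = 0, 1, 2, 3
N_STATES = 3 ** 9                            # naive encoding: each square empty/X/O
table = bytearray((N_STATES * 2 + 7) // 8)   # packed array, 2 bits per state

def encode(board):
    """board: tuple of 9 cells, each 0 (empty), 1 (X) or 2 (O); returns a base-3 index."""
    idx = 0
    for cell in board:
        idx = idx * 3 + cell
    return idx

def get(idx):
    byte, shift = divmod(idx * 2, 8)
    return (table[byte] >> shift) & 0b11

def put(idx, code):
    byte, shift = divmod(idx * 2, 8)
    table[byte] = (table[byte] & ~(0b11 << shift) & 0xFF) | (code << shift)
```

A solver then checks get(encode(board)); if it is UNKNOWN, it recursively evaluates all successor boards and stores the min/max result with put().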
You undoubtedly want a task queue service of some sort, such as RabbitMQ, probably in conjunction with a database which can store the data once you've calculated it. Alternatively, you could use a hosted service like Amazon's SQS. The client would consume an item from the queue, generate the successors, and enqueue those, as well as recording the outcome of the item it just consumed in the database. If the state is an end state, it can propagate scoring information up to parent elements by consulting the database.
Two caveats to bear in mind:
The number of items in the queue will likely grow exponentially as you explore the tree, with each work item causing several more to be enqueued. Be prepared for a very long queue.
Depending on your game, it may be possible for there to be multiple paths to the same game state. You'll need to check for and eliminate duplicates, and your database will need to be structured so that it's a graph (possibly with cycles!), not a tree.
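For the duplicate-states caveat above, one lightweight approach is to derive a canonical key for each state and do an atomic insert-if-absent before enqueueing. In the sketch below, canonical_form, db.insert_if_absent, and queue.publish are hypothetical placeholders for the game-specific and infrastructure-specific pieces (e.g. RabbitMQ or SQS on the queue side):

```python
import hashlib

def canonical_form(state):
    """Placeholder: fold symmetric / transposed positions into one canonical
    string. Game-specific; here we just use the raw string representation."""
    return str(state)

def state_key(state):
    """Stable key for a game state, used to detect duplicates before enqueueing."""
    return hashlib.sha256(canonical_form(state).encode("utf-8")).hexdigest()

def maybe_enqueue(state, db, queue):
    """db.insert_if_absent and queue.publish are hypothetical stand-ins. The
    insert must be atomic so two workers expanding different parents cannot
    both claim the same position."""
    key = state_key(state)
    if db.insert_if_absent(key, state):
        queue.publish(key)
```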
The first thing that popped into my mind is the Linda style of a shared 'whiteboard', where different processes can consume 'problems' off the whiteboard, add new problems to the whiteboard, and add 'solutions' to the whiteboard.
Perhaps the Cassandra project is the more modern version of Linda.
There have been many attempts to parallelize problems across distributed computer systems; Folding@home provides a framework that executes binary blob 'cores' to solve protein folding problems. Distributed.net might have started the modern incarnation of distributed problem solving, and may have client code that you can start from.

How to elect a master node among the nodes running in a cluster?

I'm writing a managed cloud stack (on top of hardware-level cloud providers like EC2), and a problem I will face soon is:
How do several identical nodes decide which one of them becomes a master? (I.e. think of 5 servers running on EC2. One of them has to become a master, and the other ones have to become slaves.)
I read a description of the algorithm used by MongoDB, and it seems pretty complicated, and also depends on a concept of votes — i.e. two nodes left alone won't be able to decide anything. Also their approach has a significant delay before it produces the results.
I wonder if there are any less complicated, KISS-embracing approaches? Are they used widely, or are they risky to adopt?
Suppose we already have a list of servers. Then we can just elect the one that is up and has the numerically smallest IP address. What are the downsides of this approach?
Why is MongoDB's algorithm so complicated?
This is a duplicate of How to elect new Master in Cluster?, which gives fewer details and has not been answered in 6 months, so I feel it is appropriate to start a new question.
(The stack I'm working on is open-source, but it's on a very early stage of development so not giving a link here.)
UPDATE: based on the answers, I have designed a simple consensus algorithm, you can find a JavaScript (CoffeeScript) implementation on GitHub: majority.js.
Leader election algorithms typically treat split brain as a fault case they need to handle. If it's not the nodes that fail but the network, you may run into the case where all nodes are up but fail to talk to each other. Then you may end up with two masters.
If you can exclude "split brain" from your fault model (i.e. if you consider only node failures), your algorithm (leader is the one with the smallest address) is fine.
Use Apache ZooKeeper. It solves exactly this problem (and many more).
If your nodes also need to agree on things and their total order, you might want to consider Paxos. It's complicated, but nobody's come up with an easier solution for distributed consensus.
I like this algorithm:
Each node calculates the lowest known node id and sends a vote for leadership to this node
If a node receives sufficiently many votes and the node also voted for itself, then it takes on the role of leader and starts publishing cluster state.
The link below describes several more algorithms for electing a master node in a cluster:
https://www.elastic.co/blog/found-leader-election-in-general#the-zen-way
Also see the Raft algorithm: https://raft.github.io
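A minimal sketch of the vote-for-the-lowest-known-id rule described above (message transport and failure detection omitted; this deliberately ignores the split-brain caveat raised in an earlier answer):

```python
# Sketch of the "vote for the lowest known node id" election rule.

def my_vote(my_id, known_ids):
    """Each node votes for the smallest node id it currently knows about."""
    return min(set(known_ids) | {my_id})

def become_leader(my_id, known_ids, votes_received, cluster_size):
    """Take leadership only if this node voted for itself AND received a
    majority of the cluster's votes."""
    quorum = cluster_size // 2 + 1
    voted_for_self = my_vote(my_id, known_ids) == my_id
    votes_for_me = sum(1 for v in votes_received if v == my_id)
    return voted_for_self and votes_for_me >= quorum

# Example: in a five-node cluster, node 1 hears three votes for itself.
print(become_leader(1, [1, 2, 3, 4, 5], [1, 1, 1], cluster_size=5))  # True
```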

How to detect anomalous resource consumption reliably?

This question is about a whole class of similar problems, but I'll ask it as a concrete example.
I have a server with a file system whose contents fluctuate. I need to monitor the available space on this file system to ensure that it doesn't fill up. For the sake of argument, let's suppose that if it fills up, the server goes down.
It doesn't really matter what it is -- it might, for example, be a queue of "work".
During "normal" operation, the available space varies within "normal" limits, but there may be pathologies:
Some other (possibly external) component that adds work may run out of control
Some component that removes work seizes up, but remains undetected
The statistical characteristics of the process are basically unknown.
What I'm looking for is an algorithm that takes, as input, timed periodic measurements of the available space (alternative suggestions for input are welcome), and produces, as output, an alarm when things are "abnormal" and the file system is "likely to fill up". It is obviously important to avoid false negatives, but almost as important to avoid false positives, to avoid numbing the brain of the sysadmin who gets the alarm.
I appreciate that there are alternative solutions like throwing more storage space at the underlying problem, but I have actually experienced instances where 1000 times wasn't enough.
Algorithms which consider stored historical measurements are fine, although on-the-fly algorithms which minimise the amount of historic data are preferred.
I have accepted Frank's answer, and am now going back to the drawing-board to study his references in depth.
There are three cases, I think, of interest, not in order:
The "Harrods' Sale has just started" scenario: a peak of activity that at one-second resolution is "off the dial", but doesn't represent a real danger of resource depletion;
The "Global Warming" scenario: needing to plan for (relatively) stable growth; and
The "Google is sending me an unsolicited copy of The Index" scenario: this will deplete all my resources in relatively short order unless I do something to stop it.
It's the last one that's (I think) the most interesting, and most challenging, from a sysadmin's point of view.
If it is actually related to a queue of work, then queueing theory may be the best route to an answer.
For the general case you could perhaps attempt a (multiple?) linear regression on the historical data, to detect if there is a statistically significant rising trend in the resource usage that is likely to lead to problems if it continues (you may also be able to predict how long it must continue to lead to problems with this technique - just set a threshold for 'problem' and use the slope of the trend to determine how long it will take). You would have to play around with this and with the variables you collect though, to see if there is any statistically significant relationship that you can discover in the first place.
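As a rough sketch of that trend-extrapolation idea (assuming numpy is available and samples are timestamped; the r-squared cut-off here is an arbitrary placeholder for a real significance test):

```python
import numpy as np

def estimate_seconds_to_full(timestamps, free_bytes, min_r2=0.5):
    """Fit a straight line to recent (time, free space) samples; if there is a
    credible downward trend, return the projected number of seconds until free
    space hits zero, else None. A rough sketch, not a tuned detector."""
    t = np.asarray(timestamps, dtype=float)
    y = np.asarray(free_bytes, dtype=float)
    slope, intercept = np.polyfit(t, y, 1)

    # Crude significance check: how much variance does the trend explain?
    residuals = y - (slope * t + intercept)
    r2 = 1.0 - residuals.var() / y.var() if y.var() > 0 else 0.0
    if slope >= 0 or r2 < min_r2:
        return None                     # no convincing downward trend

    t_zero = -intercept / slope         # time at which the fitted line crosses zero
    return max(t_zero - t[-1], 0.0)

# Example usage: alarm if the projection says the disk fills within a day.
# remaining = estimate_seconds_to_full(sample_times, sample_free_bytes)
# if remaining is not None and remaining < 24 * 3600:
#     raise_alarm()   # hypothetical alerting hook
```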
Although it covers a completely different topic (global warming), I've found tamino's blog (tamino.wordpress.com) to be a very good resource on statistical analysis of data that is full of knowns and unknowns. For example, see this post.
edit: as per my comment I think the problem is somewhat analogous to the GW problem. You have short term bursts of activity which average out to zero, and long term trends superimposed that you are interested in. Also there is probably more than one long term trend, and it changes from time to time. Tamino describes a technique which may be suitable for this, but unfortunately I cannot find the post I'm thinking of. It involves sliding regressions along the data (imagine multiple lines fitted to noisy data), and letting the data pick the inflection points. If you could do this then you could perhaps identify a significant change in the trend. Unfortunately it may only be identifiable after the fact, as you may need to accumulate a lot of data to get significance. But it might still be in time to head off resource depletion. At least it may give you a robust way to determine what kind of safety margin and resources in reserve you need in future.
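In the spirit of that sliding-regression idea, here is a crude sketch that slides a linear fit along the series and flags points where the local slope jumps; the window size and threshold are arbitrary choices (not from the referenced post), and a proper change-point method would do better:

```python
import numpy as np

def detect_trend_breaks(values, window=60, slope_jump=5.0):
    """Slide a least-squares line over the series and report indices where the
    local slope changes much more than it usually does between adjacent
    windows. A crude stand-in for a real change-point method."""
    x = np.arange(window, dtype=float)
    slopes = []
    for i in range(len(values) - window + 1):
        y = np.asarray(values[i:i + window], dtype=float)
        slopes.append(np.polyfit(x, y, 1)[0])      # slope of the local fit
    slopes = np.asarray(slopes)
    if slopes.size < 3:
        return []
    diffs = np.abs(np.diff(slopes))
    typical = np.median(diffs) + 1e-12             # robust scale of normal wobble
    return [i + window for i, d in enumerate(diffs) if d > slope_jump * typical]
```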
