How do IaaS nodes communicate to form a cluster? - cluster-computing

How do nodes communicate with each other, or how do they become aware of each other (in a decentralized manner) in an IaaS environment? As an example: this article about Akka on Google's IaaS describes a decentralized cluster of 1500+ nodes intercommunicating randomly. What is the outline of this process?

It would be quite long to explain how Akka cluster works in detail, but I can try to give an overview.
The membership set in Akka is essentially a highly specialized CRDT. Since talking about Vector Clocks itself would be a lengthy discussion, I will use the analogy of git-like repositories.
You can imagine every Akka node maintaining its own repository where HEAD points to the current state of the cluster (known by that node). When a node introduces a change, it branches off, and starts to propagate the change to other nodes (this part is what is more or less random).
There are certain changes which we call monotonic which in the git analogy would mean that the branch is trivially mergeable. Those changes are just merged by other nodes as they receive them and they will then propagate the merge commit to others and eventually everything stabilizes (HEAD points to the same content).
There are other kinds of changes that are not trivial to merge (non-monotonic). The process then is that a node first sends around a proposal: "I want to make this non-trivial change C". This is needed because the other nodes need to be aware of this pending "complex" change and prepare themselves. This is disseminated among the nodes until everyone receives it. Now we are at the state where "Everyone knows that someone proposed to make the change C", but this is not enough, since no one is actually aware that there is an agreement yet.
Therefore there is another "round", where nodes start to propagate the information "I, node Y, am aware of the fact that change C has been proposed". Eventually one or more nodes become aware that there is an agreement (this is more or less a distributed acknowledgement protocol). So the state now is "At least one node knows that every node knows that the change C has been proposed". This is (partly) what we refer to as convergence. At this point the node (or nodes) that are aware of the agreement will make the merge and propagate it.
Please note that I highly simplified the explanation here; obviously the devil (and scaling) is in the details :)
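To make the acknowledgement round a bit more concrete, here is a minimal Python sketch. It is not Akka's implementation: the function name gossip_proposal, the push-only random gossip, and the per-node "seen" sets are assumptions made for illustration. Each node keeps a grow-only set of nodes it knows have seen proposal C, merging is plain set union (a trivially mergeable, monotonic change), and a node detects agreement once its set contains every member.

```python
import random

def gossip_proposal(members, proposer, rng, max_rounds=10_000):
    """Simulate random gossip of 'who has seen proposal C'.

    Returns (round_number, nodes_that_detected_agreement).
    Hypothetical toy model, not Akka's actual protocol.
    """
    # seen[n] = nodes that n knows have seen the proposal (only ever grows)
    seen = {n: set() for n in members}
    seen[proposer] = {proposer}
    everyone = set(members)

    for round_no in range(1, max_rounds + 1):
        sender, receiver = rng.sample(members, 2)
        if seen[sender]:
            # Monotonic merge: union of both views, plus the receiver itself,
            # since it has now seen the proposal too.
            seen[receiver] |= seen[sender] | {receiver}
        # "At least one node knows that every node knows about C"
        aware = [n for n in members if seen[n] == everyone]
        if aware:
            return round_no, aware
    raise RuntimeError("no agreement detected within max_rounds")

members = [f"node{i}" for i in range(5)]
rounds, aware = gossip_proposal(members, "node0", random.Random(7))
print(f"agreement detected after {rounds} gossip rounds by {aware}")
```

Because set union is commutative, associative and idempotent, the order in which gossip messages arrive does not change the final result, which is the property the git analogy calls "trivially mergeable".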

Related

Meaning of merge, phi, effectphi and dead in v8 terminology

I’m trying to read the v8 source code (in particular the compiler part of it) to understand better the optimisation and reduction procedures (in order to look for bugs).
I ran into a few terms that are used in the comments but seem to be unexplained. The comment is this:
// Check if this is a merge that belongs to an unused diamond, which means
// that:
//
// a) the {Merge} has no {Phi} or {EffectPhi} uses, and
// b) the {Merge} has two inputs, one {IfTrue} and one {IfFalse}, which are
// both owned by the Merge, and
// c) and the {IfTrue} and {IfFalse} nodes point to the same {Branch}.
What do the terms Merge, Phi and EffectPhi mean? Also, does marking a node as “dead” mean that it will be treated as redundant?
Thanks in advance.
The link to the above code is this:
https://chromium.googlesource.com/v8/v8.git/+/refs/heads/master/src/compiler/common-operator-reducer.cc
V8 developer here. As background knowledge, it helps to know that V8's "Turbofan" compiler uses the "SSA" (static single assignment) and "sea of nodes" concepts. There are various compiler textbooks and research papers that explain these in great detail. To answer your questions in short:
A "Merge" node merges two control nodes, i.e. two branches of control flow. You can think of it as the "opposite" of a Branch, or the equivalent of a Phi for control nodes. Control nodes are the mechanism that Turbofan's sea-of-nodes design uses to make sure nodes aren't reordered across certain control flow boundaries.
A "Phi" node merges the two (or more) possibilities for a value that have been computed by different branches. See https://en.wikipedia.org/wiki/Static_single_assignment_form for more.
An "EffectPhi" is a special version of a Phi node that's used for nodes on the "effect chain". The effect chain is the mechanism Turbofan uses to make sure that nodes' external effects (like memory loads and stores) aren't visibly reordered.
A "dead" node is one that's unreachable and can be eliminated. So it's "redundant" in the sense of "superfluous/unnecessary", but not in the sense of "the same as another node".
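As a rough illustration of the three conditions in the quoted comment, here is a toy Python sketch. It is not V8's code or API: the Node class, the make helper and is_unused_diamond are hypothetical names, and real TurboFan nodes carry more inputs (a Branch also has a condition input, a Phi also has value inputs).

```python
class Node:
    """Toy graph node: an operator name, its inputs, and the nodes that use it."""
    def __init__(self, op, inputs):
        self.op = op
        self.inputs = list(inputs)
        self.uses = []

def make(op, *inputs):
    """Create a node and register it as a use of each of its inputs."""
    node = Node(op, inputs)
    for inp in inputs:
        inp.uses.append(node)
    return node

def is_unused_diamond(merge):
    if merge.op != "Merge" or len(merge.inputs) != 2:
        return False
    # a) the Merge has no Phi or EffectPhi uses
    if any(use.op in ("Phi", "EffectPhi") for use in merge.uses):
        return False
    # b) one IfTrue and one IfFalse input, each used only by this Merge
    first, second = merge.inputs
    if {first.op, second.op} != {"IfTrue", "IfFalse"}:
        return False
    if first.uses != [merge] or second.uses != [merge]:
        return False
    # c) both projections hang off the same Branch
    return first.inputs[0] is second.inputs[0] and first.inputs[0].op == "Branch"

start = make("Start")
branch = make("Branch", start)
merge = make("Merge", make("IfTrue", branch), make("IfFalse", branch))
unused_before = is_unused_diamond(merge)   # nothing depends on the diamond yet
phi = make("Phi", merge)                   # a value now depends on which path ran
unused_after = is_unused_diamond(merge)
print(unused_before, unused_after)
```

Once a Phi hangs off the Merge, the two paths produce an observable difference, so the diamond can no longer be treated as unused.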

How does Paxos handle packet loss and new node joining?

I've recently been learning Paxos, and I now have a basic understanding of how it works. But can anyone explain how Paxos handles packet loss and a new node joining? A simple example would be helpful.
The classical Paxos algorithm does not have a concept of "new nodes joining". Some Paxos variants do, such as "Vertical Paxos", but the classic algorithm requires that all nodes be statically defined before running the algorithm. With respect to packet loss, Paxos uses a very simple infinite loop: "try a round of the algorithm, if anything at all goes wrong, try another round". So if too many packets are lost in the first attempt at achieving resolution (which can be detected via a simple timeout on waiting for replies), a second round can be attempted. If the timeout for that round expires, try again, and so on.
Exactly how packet loss is to be detected and handled is something the Paxos algorithm leaves undefined. It's an implementation-specific detail. This is actually a good thing for production environments since how this is handled can have a pretty big performance impact on Paxos-based systems.
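The retry loop described above can be sketched in a few lines of Python. This is only an outline under assumptions: run_until_decided, try_round and flaky_round are hypothetical names, a round is modelled as a function that returns the chosen value or None when it timed out, and the max_rounds cap exists only so the sketch terminates (the algorithm itself would simply keep trying).

```python
def run_until_decided(try_round, max_rounds=100):
    """Keep starting new rounds with increasing ballot numbers until one succeeds.

    try_round(ballot) is assumed to return the chosen value, or None if the
    round failed (for example, too few replies arrived before the timeout).
    """
    for ballot in range(1, max_rounds + 1):
        result = try_round(ballot)
        if result is not None:
            return ballot, result
    raise TimeoutError("no round succeeded within max_rounds")

def flaky_round(ballot):
    # Hypothetical stand-in for a real round: pretend the replies
    # for the first two ballots were lost on the network.
    return None if ballot < 3 else "value-A"

outcome = run_until_decided(flaky_round)
print(outcome)
```

How long to wait before declaring a round failed is exactly the implementation-specific detail mentioned above.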
About packet loss, Paxos makes the following assumption about the network:
Messages may be lost, reordered, or duplicated.
This is solved via quorums. At least X of all Acceptors must accept a value in order for the system to accept it. This also solves the issue of a node failing.
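A small Python sketch of the quorum arithmetic, assuming X is a simple majority of the acceptors (the usual choice; the function names are made up for illustration). The brute-force check shows why majorities are used: any two of them share at least one acceptor.

```python
from itertools import combinations

def majority(n):
    """Smallest number of acceptors that forms a majority of n."""
    return n // 2 + 1

def quorums_always_overlap(n):
    """Check by brute force that any two majority quorums share an acceptor."""
    acceptors = range(n)
    size = majority(n)
    return all(
        set(a) & set(b)
        for a in combinations(acceptors, size)
        for b in combinations(acceptors, size)
    )

n = 5
print(majority(n))                 # acceptors needed to accept a value
print(n - majority(n))             # acceptors that may fail or be unreachable
print(quorums_always_overlap(n))
```

With five acceptors, a value needs three acceptances, so the system keeps working while up to two acceptors are down or unreachable.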
As for new nodes joining, Paxos does not focus on how a node detects other nodes. That problem is solved by other algorithms.
They automagically know all the nodes and each one's role
If you want, for a production implementation, you can use ZooKeeper to solve this new node detection.
As pointed out in other answers, message loss or message reordering is handled by the algorithm: it is designed exactly to handle those cases.
New nodes joining is a matter of "cluster membership changes". There is a common misconception that cluster membership changes are not covered by Paxos; yet they are described in the 2001 paper Paxos Made Simple in the last paragraph. In this blog post I discuss it. There is a question of how a new node gets a copy of all the state when it joins the cluster. That is discussed in this answer.

If Paxos algorithm is modified such that the acceptors accept the first value, or the most recent value, does the approach fail?

I've tried to reason and understand if the algorithm fails in these cases but can't seem to find an example where they would.
If they don't then why isn't any of these followed?
Yes.
Don't forget that in later rounds, leaders may be proposing different values than in earlier rounds. Therefore the first message may have the wrong value.
Furthermore messages may arrive reordered. (Consider a node that goes offline, then comes back online to find messages coming in random order.) The most recent message may not be the most recently sent message.
And finally, don't forget that leaders change. The faster an acceptor can be convinced that it is on the wrong leader, the better.
Rather than asking whether the algorithm fails in such a scenario, consider this: if each node sees different messages lost, delayed, or reordered, is it correct for a node to just accept the first one it happens to receive? Clearly the answer is no.
The algorithm is designed to work when "first" cannot be decided by looking at the timestamp on a message, as clocks on different machines may be out of sync. The algorithm is designed to work when the network paths, distances and congestion may be different between nodes. Nodes may crash and restart, or hang and resume, making things even more "hostile".
So a five-node cluster could have two nodes try to be leader while the other three each see a random ordering of which leader's message is "first". So what's the right answer in that case? The algorithm has a "correct" answer based on its rules, which ensures a consistent outcome under all "hostile" conditions.
In summary, the point of Paxos is that our logical mental model of "first" as a programmer is based on an assumption of a perfect set of clocks, machines and networks. That doesn't exist in the real world. To see whether things break if you change the algorithm, you need to "attack" the message flow with all the things mentioned above. You will likely find some way to "break" things under any change.
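To make that "attack" concrete, here is a deliberately stripped-down Python model of the two modified acceptor rules. It is not Paxos (there are no ballot numbers and no prepare phase), and the names chosen, accept_first and accept_latest are hypothetical; it only shows one hostile delivery order for each variant.

```python
from collections import Counter

def chosen(accepted, n):
    """Return the value held by a majority of the n acceptors, or None."""
    counts = Counter(v for v in accepted.values() if v is not None)
    for value, count in counts.items():
        if count > n // 2:
            return value
    return None

def accept_first(accepted, acceptor, value):
    # Variant 1: keep the first value ever received, ignore the rest.
    if accepted[acceptor] is None:
        accepted[acceptor] = value

def accept_latest(accepted, acceptor, value):
    # Variant 2: always overwrite with the most recently received value.
    accepted[acceptor] = value

# Variant 1: three proposers each reach a different acceptor first.
state = {1: None, 2: None, 3: None}
for acceptor, value in [(1, "A"), (2, "B"), (3, "C"), (1, "B"), (2, "C"), (3, "A")]:
    accept_first(state, acceptor, value)
stuck = chosen(state, 3)  # no majority, and no acceptor will ever change its mind

# Variant 2: a value reaches a majority, then a later proposal overwrites it.
state = {1: None, 2: None, 3: None}
decisions = []
for acceptor, value in [(1, "A"), (2, "A"), (2, "B"), (3, "B")]:
    accept_latest(state, acceptor, value)
    decided = chosen(state, 3)
    if decided is not None and decided not in decisions:
        decisions.append(decided)

print(stuck, decisions)
```

In this toy model, accepting the first value can leave the cluster permanently without a majority, while accepting the most recent value lets two different values each be held by a majority at different times, so a learner that saw the first majority disagrees with one that saw the second.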

Distributed algorithm design

I've been reading Introduction to Algorithms and started to get a few ideas and questions popping up in my head. The one that's baffled me most is how you would approach designing an algorithm to schedule items/messages in a queue that is distributed.
My thoughts have led me to browsing Wikipedia on topics such as sorting, message queues, scheduling, and distributed hash tables, to name a few.
The scenario:
Say you wanted to have a system that queued messages (strings or some serialized object, for example). A key feature of this system is to avoid any single point of failure. The system has to be distributed across multiple nodes within some cluster and has to consistently (or as best as possible) even out the workload of each node within the cluster to avoid hotspots.
You want to avoid the use of a master/slave design for replication and scaling (no single point of failure). The system totally avoids writing to disc and maintains in memory data structures.
Since this is meant to be a queue of some sort, the system should be able to use varying scheduling algorithms (FIFO, earliest deadline, round robin, etc.) to determine which message should be returned on the next request, regardless of which node in the cluster the request is made to.
My initial thoughts
I can imagine how this would work on a single machine, but when I start thinking about how you'd distribute something like this, questions come up like:
How would I hash each message?
How would I know which node a message was sent to?
How would I schedule each item so that I can determine which message and from which node should be returned next?
I started reading about distributed hash tables and how projects like Apache Cassandra use some sort of consistent hashing to distribute data but then I thought, since the query won't supply a key I need to know where the next item is and just supply it...
This led into reading about peer-to-peer protocols and how they approach the synchronization problem across nodes.
So my question is, how would you approach a problem like the one described above, or is this too far fetched and is simply a stupid idea...?
Just an overview, pointers, different approaches, pitfalls and benefits of each. The technologies/concepts/design/theory that may be appropriate. Basically anything that could be of use in understanding how something like this may work.
And if you're wondering, no, I'm not intending to implement anything like this; it just popped into my head while reading (it happens, I get distracted by wild ideas when I read a good book).
UPDATE
Another interesting point that would become an issue is distributed deletes. I know systems like Cassandra have tackled this by implementing HintedHandoff, Read Repair and AntiEntropy, and it seems to work well, but are there any other (viable and efficient) means of tackling this?
Overview, as you wanted
There are some popular techniques for distributed algorithms, e.g. using clocks, waves or general purpose routing algorithms.
You can find these in the great distributed algorithm books Introduction to distributed algorithms by Tel and Distributed Algorithms by Lynch.
Reductions are particularly useful since general distributed algorithms can become quite complex. You might be able to use a reduction to a simpler, more specific case.
If, for instance, you want to avoid having a single point of failure, but a symmetric distributed algorithm is too complex, you can use the standard distributed algorithm of (leader) election and afterwards use a simpler asymmetric algorithm, i.e. one which can make use of a master.
Similarly, you can use synchronizers to run an algorithm designed for a synchronous network model on an asynchronous network.
You can use snapshots to be able to analyze offline instead of having to deal with varying online process states.

How to elect a master node among the nodes running in a cluster?

I'm writing a managed cloud stack (on top of hardware-level cloud providers like EC2), and a problem I will face soon is:
How do several identical nodes decide which one of them becomes a master? (I.e. think of 5 servers running on EC2. One of them has to become a master, and other ones have to become slaves.)
I read a description of the algorithm used by MongoDB, and it seems pretty complicated, and also depends on a concept of votes — i.e. two nodes left alone won't be able to decide anything. Also their approach has a significant delay before it produces the results.
I wonder if there are any less complicated, KISS-embracing approaches? Are they used widely, or are they risky to adopt?
Suppose we already have a list of servers. Then we can just elect the one that is up and has the numerically smallest IP address. What are the downsides of this approach?
Why is MongoDB's algorithm so complicated?
This is a duplicate of How to elect new Master in Cluster?, which gives fewer details and has not been answered for 6 months, so I feel it is appropriate to start a new question.
(The stack I'm working on is open-source, but it's at a very early stage of development, so I'm not giving a link here.)
UPDATE: based on the answers, I have designed a simple consensus algorithm, you can find a JavaScript (CoffeeScript) implementation on GitHub: majority.js.
Leader election algorithms typically treat split brain as a fault case they must handle. If you assume that it's not the nodes that fail but the networking, you may run into the case where all nodes are up, but fail to talk to each other. Then, you may end up with two masters.
If you can exclude "split brain" from your fault model (i.e. if you consider only node failures), your algorithm (leader is the one with the smallest address) is fine.
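A short Python sketch of the "smallest address wins" rule and of how a network partition breaks it. The addresses and the elect function are made up for illustration; the standard ipaddress module is used so that addresses compare numerically rather than as strings.

```python
import ipaddress

def elect(reachable):
    """Pick the node with the numerically smallest IP among those reachable."""
    return min(reachable, key=ipaddress.ip_address)

nodes = ["10.0.0.12", "10.0.0.3", "10.0.0.7", "10.0.0.21"]
leader = elect(nodes)

# Split brain: all nodes are up, but the network is cut in two,
# so each side runs the election over the nodes it can still see.
side_a = ["10.0.0.12", "10.0.0.3"]
side_b = ["10.0.0.7", "10.0.0.21"]
leaders = {elect(side_a), elect(side_b)}

print(leader, leaders)
```

With a healthy network every node computes the same leader, but after the partition each side elects its own, which is the two-masters case described above.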
Use Apache ZooKeeper. It solves exactly this problem (and many more).
If your nodes also need to agree on things and their total order, you might want to consider Paxos. It's complicated, but nobody's come up with an easier solution for distributed consensus.
I like this algorithm:
Each node calculates the lowest known node id and sends a vote for leadership to this node
If a node receives sufficiently many votes and the node also voted for itself, then it takes on the role of leader and starts publishing cluster state.
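The two steps above can be sketched in Python as follows. This is a simplified, single-shot model, assuming "sufficiently many" means a majority of the configured cluster size; elect_leader and the views mapping are hypothetical names, and the sketch is not the implementation from the linked article.

```python
def elect_leader(views, cluster_size):
    """views maps each node id to the set of node ids it can currently see
    (including itself). Returns the elected leader id, or None."""
    quorum = cluster_size // 2 + 1
    votes = {}
    for node, known in views.items():
        candidate = min(known)  # vote for the lowest known node id
        votes.setdefault(candidate, set()).add(node)
    for candidate, voters in votes.items():
        # The candidate needs a majority and must have voted for itself.
        if len(voters) >= quorum and candidate in voters:
            return candidate
    return None

everyone = {1, 2, 3, 4, 5}
healthy = elect_leader({n: everyone for n in everyone}, cluster_size=5)

# Partition: nodes 1 and 2 on one side, nodes 3, 4 and 5 on the other.
split = {1: {1, 2}, 2: {1, 2}, 3: {3, 4, 5}, 4: {3, 4, 5}, 5: {3, 4, 5}}
partitioned = elect_leader(split, cluster_size=5)

print(healthy, partitioned)
```

Because each node casts one vote and a majority is required, at most one side of a partition can elect a leader; the minority side ends up with none instead of a second master.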
The link below describes several algorithms for electing a master node in a cluster:
https://www.elastic.co/blog/found-leader-election-in-general#the-zen-way
See also the Raft algorithm: https://raft.github.io
