How to elect a master node among the nodes running in a cluster? - algorithm

I'm writing a managed cloud stack (on top of hardware-level cloud providers like EC2), and a problem I will face soon is:
How do several identical nodes decide which one of them becomes a master? (I.e. think of 5 servers running on EC2. One of them has to become a master, and other ones have to become slaves.)
I read a description of the algorithm used by MongoDB, and it seems pretty complicated and depends on a concept of votes, i.e. two nodes left alone won't be able to decide anything. Their approach also has a significant delay before it produces a result.
I wonder if there are any less complicated, KISS-embracing approaches? Are they used widely, or are they risky to adopt?
Suppose we already have a list of servers. Then we can just elect the one that is up and has the numerically smallest IP address. What are the downsides of this approach?
Why is MongoDB's algorithm so complicated?
This is a duplicate of How to elect new Master in Cluster?, which gives fewer details and has not been answered in 6 months, so I feel it is appropriate to start a new question.
(The stack I'm working on is open-source, but it's at a very early stage of development, so I'm not giving a link here.)
UPDATE: based on the answers, I have designed a simple consensus algorithm, you can find a JavaScript (CoffeeScript) implementation on GitHub: majority.js.

Leader election algorithms typically treat split brain as a fault case they must handle. If you assume that it's not the nodes that fail but the networking, you may run into the case where all nodes are up but fail to talk to each other. Then you may end up with two masters.
If you can exclude "split brain" from your fault model (i.e. if you consider only node failures), your algorithm (leader is the one with the smallest address) is fine.
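To make that concrete, here is a minimal sketch of "lowest address wins, but only with a quorum of reachable peers" (the `is_reachable` probe is a placeholder you would implement with pings or heartbeats; everything else is standard library):

```python
import ipaddress

# Minimal sketch: the node with the numerically smallest reachable address
# becomes leader, but only if this node can see a majority of the configured
# cluster. The quorum check is what guards against split brain.
# `is_reachable` is a placeholder for a real health check (ping/heartbeat).

def elect_leader(my_addr, all_addrs, is_reachable):
    reachable = {a for a in all_addrs if a == my_addr or is_reachable(a)}
    if len(reachable) <= len(all_addrs) // 2:
        return None  # no quorum: stay leaderless rather than risk two masters
    return min(reachable, key=ipaddress.ip_address)

# Example: this node is 10.0.0.3 and can reach everyone except 10.0.0.5.
addrs = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5"]
print(elect_leader("10.0.0.3", addrs, lambda a: a != "10.0.0.5"))  # -> 10.0.0.1
```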

Use Apache ZooKeeper. It solves exactly this problem (and many more).

If your nodes also need to agree on things and their total order, you might want to consider Paxos. It's complicated, but nobody's come up with an easier solution for distributed consensus.

I like this algorithm:
Each node calculates the lowest known node id and sends a vote for leadership to that node.
If a node receives sufficiently many votes, and the node also voted for itself, then it takes on the role of leader and starts publishing the cluster state.
A minimal sketch of these two rules follows the links below. This article describes several more master-election algorithms used in clusters:
https://www.elastic.co/blog/found-leader-election-in-general#the-zen-way
You can also look at the Raft algorithm: https://raft.github.io
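Here is that sketch, a toy single-process rendering of the two rules above (names like `known_node_ids` and `votes_received` are just for illustration; a real implementation would exchange vote messages over the network):

```python
# Rule 1: each node votes for the lowest node id it knows about.
# Rule 2: a node becomes leader once it holds a majority of votes and one of
#         those votes is its own.

def choose_candidate(known_node_ids):
    return min(known_node_ids)

def is_leader(my_id, votes_received, cluster_size):
    majority = cluster_size // 2 + 1
    return len(votes_received) >= majority and my_id in votes_received

# Example: a 5-node cluster where node 1 has collected votes from 1, 2 and 4.
print(choose_candidate({1, 2, 3, 4, 5}))                        # -> 1
print(is_leader(1, votes_received={1, 2, 4}, cluster_size=5))   # -> True
```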

Related

How does Paxos handle packet loss and new node joining?

I've recently been learning Paxos, and by now I have a basic understanding of how it works. But can anyone explain how Paxos handles packet loss and a new node joining? A simple example would be even better.
The classical Paxos algorithm does not have a concept of "new nodes joining". Some Paxos variants do, such as "Vertical Paxos", but the classic algorithm requires that all nodes be statically defined before running the algorithm. With respect to packet loss, Paxos uses a very simple infinite loop: "try a round of the algorithm; if anything at all goes wrong, try another round". So if too many packets are lost in the first attempt at achieving resolution (which can be detected via a simple timeout while waiting for replies), a second round can be attempted. If the timeout for that round expires, try again, and so on.
Exactly how packet loss is to be detected and handled is something the Paxos algorithm leaves undefined. It's an implementation-specific detail. This is actually a good thing for production environments since how this is handled can have a pretty big performance impact on Paxos-based systems.
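A rough sketch of that retry loop (here `run_round` stands in for one full prepare/accept exchange and is assumed to return the chosen value or raise `TimeoutError` when too few replies arrive in time; both the function and the backoff value are illustrative):

```python
import time

# "Try a round; if anything goes wrong, try another round."
# run_round(round_number, value) is a stand-in for one full prepare/accept
# exchange; it is assumed to raise TimeoutError if too few replies arrive.

def propose(value, run_round, backoff_seconds=0.05):
    round_number = 0
    while True:
        try:
            return run_round(round_number, value)
        except TimeoutError:
            round_number += 1            # choose a higher round and retry
            time.sleep(backoff_seconds)  # a small backoff reduces duelling proposers
```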
Regarding packet loss, Paxos makes the following assumption about the network:
Messages may be lost, reordered, or duplicated.
This is solved via quorums: at least X of all acceptors must accept a value in order for the system to accept it. This also covers the case where a node is failing.
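For instance, with 5 acceptors the quorum is 3, so losing (or never hearing from) 2 of them still lets a value be chosen. A tiny sketch of that arithmetic (`accepted_by` is hypothetical bookkeeping of which acceptors accepted the same proposal):

```python
# Quorum rule in miniature: a value is chosen once a majority of acceptors
# have accepted it for the same proposal. `accepted_by` is hypothetical
# bookkeeping kept by the proposer or a learner.

def quorum_size(num_acceptors):
    return num_acceptors // 2 + 1

def is_chosen(accepted_by, num_acceptors):
    return len(accepted_by) >= quorum_size(num_acceptors)

print(quorum_size(5))                    # -> 3
print(is_chosen({"a1", "a3", "a4"}, 5))  # -> True, even though a2 and a5 never replied
```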
Regarding a new node joining, Paxos does not concern itself with how a node detects the other nodes; that is a problem solved by other algorithms. The assumption is simply that:
They automagically know all the nodes and each one's role.
For a production implementation, you can use ZooKeeper to handle this new-node detection.
As pointed out in other answers, message loss and message reordering are handled by the algorithm: it is designed exactly to handle those cases.
New nodes joining is a matter of "cluster membership changes". There is a common misconception that cluster membership changes are not covered by Paxos; yet they are described in the 2001 paper Paxos Made Simple in the last paragraph. In this blog post I discuss it. There is a question of how a new node gets a copy of all the state when it joins the cluster. That is discussed in this answer.

If the Paxos algorithm is modified such that the acceptors accept the first value, or the most recent value, does the approach fail?

I've tried to reason about and understand whether the algorithm fails in these cases, but I can't seem to find an example where it would.
If it doesn't fail, then why isn't either of these simpler rules used?
Yes.
Don't forget that in later rounds, leaders may be proposing different values than in earlier rounds. Therefore the first message may have the wrong value.
Furthermore messages may arrive reordered. (Consider a node that goes offline, then comes back online to find messages coming in random order.) The most recent message may not be the most recently sent message.
And finally, don't forget that leaders change. The faster an acceptor can be convinced that it is on the wrong leader, the better.
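To see why "accept the first value you receive" breaks, compare it with what an acceptor actually does: it accepts a proposal only if its number is at least as high as anything it has already promised, regardless of arrival order. A simplified sketch of that rule:

```python
# Simplified acceptor: decisions are based on proposal numbers, not arrival
# order. Accepting "the first value received" would ignore `promised` and let
# stale or reordered messages win.

class Acceptor:
    def __init__(self):
        self.promised = -1        # highest proposal number promised so far
        self.accepted = None      # (number, value) most recently accepted, if any

    def on_prepare(self, n):
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)  # report any previously accepted value
        return ("reject", None)

    def on_accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "rejected"         # an old, reordered or delayed proposal is ignored

a = Acceptor()
a.on_prepare(2)                   # a newer leader has already been promised
print(a.on_accept(1, "stale"))    # -> "rejected", even though it arrived first
print(a.on_accept(2, "fresh"))    # -> "accepted"
```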
Rather than asking whether the algorithm fails in such a scenario, consider this: if each node sees a different set of messages lost, delayed, or reordered, is it correct for a node to just accept the first value it happens to receive? Clearly the answer is no.
The algorithm is designed to work when "first" cannot be decided by looking at the timestamp on a message, as clocks on different machines may be out of sync. The algorithm is designed to work when the network paths, distances, and congestion may be different between nodes. Nodes may crash and restart, or hang and resume, making things even more "hostile".
So a five-node cluster could have two nodes both trying to be leader, and the other three each seeing a random ordering of which leader's message arrived "first". So what's the right answer in that case? The algorithm has a "correct" answer based on its rules, which ensures a consistent outcome under all "hostile" conditions.
In summary, the point of Paxos is that our intuitive mental model of "first" as programmers is based on an assumption of a perfect set of clocks, machines, and networks. That doesn't exist in the real world. To see whether things break when you change the algorithm, you need to "attack" the message flow with all the failure modes described above. You will likely find some way to "break" things under any change.

Neo4j super node issue - fanning out pattern

I'm new to the graph database scene, looking into Neo4j and learning Cypher. We're trying to model a graph database, and it's a fairly simple one: we have users and movies; users can VIEW movies, RATE movies, and create playlists; and playlists can HAVE movies.
The question is regarding the Super Node performance issue. And I will quote something from a very good book I am currently reading - Learning Neo4j by Rik Van Bruggen, so here it is:
A very interesting problem then occurs in datasets where some parts of the graph are all connected to the same node. This node, also referred to as a dense node or a supernode, becomes a real problem for graph traversals because the graph database management system will have to evaluate all of the connected relationships to that node in order to determine what the next step will be in the graph traversal.
The solution to this problem proposed in the book is to have a Meta node with 100 connections to it, and the 101st connection to be linked to a new Meta node that is linked to the previous Meta node.
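To illustrate what that write path could look like (not the book's code), here is a sketch using the official Neo4j Python driver (5.x API) with Cypher in strings. The labels, the `HAS_META`/`NEXT_META`/`VIEWED` relationship types, and the limit of 100 are all illustrative choices:

```python
from neo4j import GraphDatabase

# Illustrative fan-out write path: users attach to a chain of Meta nodes
# instead of directly to the dense Movie node. Labels, relationship types,
# and the 100-link limit are arbitrary choices for this sketch.

FAN_OUT_LIMIT = 100

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def record_view(tx, user_id, movie_id):
    # Find (or create) the movie's current head Meta node and count its links.
    links = tx.run(
        """
        MERGE (m:Movie {id: $movie_id})
        MERGE (m)-[:HAS_META]->(head:Meta)
        WITH head
        OPTIONAL MATCH (head)<-[r:VIEWED]-()
        RETURN count(r) AS links
        """,
        movie_id=movie_id,
    ).single()["links"]

    if links >= FAN_OUT_LIMIT:
        # The head is full: insert a fresh Meta node at the front of the chain.
        tx.run(
            """
            MATCH (m:Movie {id: $movie_id})-[old:HAS_META]->(head:Meta)
            DELETE old
            CREATE (m)-[:HAS_META]->(new:Meta)-[:NEXT_META]->(head)
            """,
            movie_id=movie_id,
        )

    # Attach the user to whichever Meta node now heads the chain.
    tx.run(
        """
        MATCH (m:Movie {id: $movie_id})-[:HAS_META]->(head:Meta)
        MERGE (u:User {id: $user_id})
        MERGE (u)-[:VIEWED]->(head)
        """,
        user_id=user_id,
        movie_id=movie_id,
    )

with driver.session() as session:
    session.execute_write(record_view, user_id=42, movie_id="the-matrix")
```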
I have seen a blog post on the official Neo4j blog saying that they will fix this problem in the near future (the blog post is from January 2013): http://neo4j.com/blog/2013-whats-coming-next-in-neo4j/
More exactly they say:
Another project we have planned around “bigger data” is to add some specific optimizations to handle traversals across densely-connected nodes, having very large numbers (millions) of relationships. (This problem is sometimes referred to as the “supernodes” problem.)
What are your opinions on this issue? Should we go with the Meta node fanning-out pattern or go with the basic relationship that every tutorial seems to be using? Any other suggestions?
UPDATE - October 2020. This article is the best source on this topic, covering all aspects of super nodes
(my original answer below)
It's a good question. This isn't really an answer, but why shouldn't we be able to discuss this here? Technically I think I'm supposed to flag your question as "primarily opinion based" since you're explicitly soliciting opinions, but I think it's worth the discussion.
The boring but honest answer is that it always depends on your query patterns. Without knowing what kinds of queries you're going to issue against this data structure, there's really no way to know the "best" approach.
Supernodes are problems in other areas as well. Graph databases sometimes are very difficult to scale in some ways, because the data in them is hard to partition. If this were a relational database, we could partition vertically or horizontally. In a graph DB when you have supernodes, everything is "close" to everything else. (An Alaskan farmer likes Lady Gaga, so does a New York banker). Moreso than just graph traversal speed, supernodes are a big problem for all sorts of scalability.
Rik's suggestion boils down to encouraging you to create "sub-clusters" or "partitions" of the super-node. For certain query patterns, this might be a good idea, and I'm not knocking the idea, but I think hidden in here is the notion of a clustering strategy. How many meta nodes do you assign? How many max links per meta-node? How did you go about assigning this user to this meta node (and not some other)? Depending on your queries, those questions are going to be very hard to answer, hard to implement correctly, or both.
A different (but conceptually very similar) approach is to clone Lady Gaga about a thousand times, and duplicate her data and keep it in sync between nodes, then assert a bunch of "same as" relationships between the clones. This isn't that different than the "meta" approach, but it has the advantage that it copies Lady Gaga's data to the clone, and the "Meta" node isn't just a dumb placeholder for navigation. Most of the same problems apply though.
Here's a different suggestion though: you have a large-scale many-to-many mapping problem here. It's possible that if this is a really huge problem for you, you'd be better off breaking this out into a single relational table with two columns (from_id, to_id), each referencing a neo4j node ID. You then might have a hybrid system that's mostly graph (but with some exceptions). Lots of tradeoffs here; of course you couldn't traverse that rel in cypher at all, but it would scale and partition much better, and querying for a particular rel would probably be much faster.
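As a sketch of what that hybrid could look like, a plain two-column edge table keyed by Neo4j node ids (SQLite here purely for illustration; any relational store would do, and the table and column names are made up):

```python
import sqlite3

# Illustrative hybrid: the one super-dense relationship lives in a plain
# two-column relational table of Neo4j node ids, while the rest of the model
# stays in the graph.

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE likes (
        from_id INTEGER NOT NULL,   -- Neo4j node id of the user
        to_id   INTEGER NOT NULL,   -- Neo4j node id of the celebrity/movie
        PRIMARY KEY (from_id, to_id)
    )
    """
)
conn.execute("CREATE INDEX idx_likes_to ON likes (to_id)")

# Record "user 7 likes node 12345" without touching the dense graph node.
conn.execute("INSERT INTO likes (from_id, to_id) VALUES (?, ?)", (7, 12345))

# "Who likes node 12345?" becomes an indexed lookup instead of a traversal
# over millions of relationships hanging off one supernode.
fans = [row[0] for row in conn.execute("SELECT from_id FROM likes WHERE to_id = ?", (12345,))]
print(fans)  # -> [7]
```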
One general observation here: whether we're talking about relational, graph, documents, K/V databases, or whatever -- when the databases get really big, and the performance requirements get really intense, it's almost inevitable that people end up with some kind of a hybrid solution with more than one kind of DBMS. This is because of the inescapable reality that all databases are good at some things, and not good at others. So if you need a system that's good at most everything, you're going to have to use more than one kind of database. :)
There is probably quite a bit Neo4j can do to optimize in these cases, but it would seem to me that the system would need some kind of hints on access patterns in order to do a really good job at that. Of the 2,000,000 relations present, how do the endpoints best cluster? Are older relationships more important than newer, or vice versa?
Re. the Neo4j blog, dense node support should be enhanced in Neo4j 2.1 (and above), see also http://neo4j.com/blog/neo4j-2-1-graph-etl/
(disclaimer: not an answer, but some discussion)
The 2013 Neo4j blog post you mentioned links to this GitHub commit, where the intended problem scope and its solution are discussed. To summarize, it does not address the general supernode issue. Instead, it alleviates the issue when, among the multiple relationship types (and directions) that a supernode has, some of the types (or directions) happen to have disproportionately fewer edges than the others. The engine is then able to filter based on types and directions.
A more generic solution is the vertex-centric approach from Titan (https://stackoverflow.com/a/21385213/1311956), which sorts the edges by one property or a composite of properties, resulting in O(log(E)) search performance, where E is the number of edges in/out of the supernode.
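The idea behind a vertex-centric index, in miniature: keep each supernode's edges sorted by the property you filter on, so locating a range is a binary search rather than a scan of every edge. A toy in-memory sketch (not Titan's actual implementation):

```python
import bisect

# Toy vertex-centric index: the supernode's edges are kept sorted by a
# property (a timestamp here), so finding the edges in a range costs
# O(log E) to locate the boundaries instead of scanning all E edges.

edges = sorted([
    (1588000000, "user-17"),
    (1588000300, "user-42"),
    (1588000900, "user-3"),
    (1588001200, "user-99"),
])  # (timestamp, neighbour) pairs attached to one dense node
timestamps = [t for t, _ in edges]

def edges_between(t_start, t_end):
    lo = bisect.bisect_left(timestamps, t_start)
    hi = bisect.bisect_right(timestamps, t_end)
    return edges[lo:hi]

print(edges_between(1588000300, 1588000900))
# -> [(1588000300, 'user-42'), (1588000900, 'user-3')]
```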
Neo4j has the concept of an index on relationships. Unlike Titan's vertex-centric approach, the index is global. However, relationship indexes are a legacy feature in Neo4j; this is discussed in another Stack Overflow thread.
Another issue with supernodes is storage: the huge number of relationships attached to one node translates into a large store footprint and high I/O cost.

How do IaaS nodes communicate to form a cluster?

How do nodes communicate with each other, or how do they become aware of each other (in a decentralized manner), in an IaaS environment? As an example: this article about Akka on Google's IaaS describes a decentralized cluster of 1500+ nodes intercommunicating at random. What is the outline of this process?
It would be quite long to explain how Akka cluster works in detail, but I can try to give an overview.
The membership set in Akka is essentially a highly specialized CRDT. Since talking about vector clocks itself would be a lengthy discussion, I will use the analogy of git-like repositories.
You can imagine every Akka node maintaining its own repository where HEAD points to the current state of the cluster (known by that node). When a node introduces a change, it branches off, and starts to propagate the change to other nodes (this part is what is more or less random).
There are certain changes, which we call monotonic, which in the git analogy mean that the branch is trivially mergeable. Those changes are just merged by other nodes as they receive them, and they will then propagate the merge commit to others, and eventually everything stabilizes (HEAD points to the same content).
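A tiny illustration of why monotonic state is trivially mergeable, using a grow-only set (the simplest CRDT, not Akka's actual gossip state): merging is plain set union, which is commutative, associative, and idempotent, so replicas can merge in any order and still converge:

```python
# "Monotonic" state in miniature: a grow-only set. Merging two replicas is set
# union, so the order of merges doesn't matter and re-merging changes nothing.

replica_a = {"node-1 joined", "node-2 joined"}
replica_b = {"node-2 joined", "node-3 joined"}

merged_ab = replica_a | replica_b
merged_ba = replica_b | replica_a
print(merged_ab == merged_ba)                # -> True: commutative
print((merged_ab | merged_ab) == merged_ab)  # -> True: idempotent
```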
There are other kinds of changes that are not trivial to merge (non-monotonic). The process then is that a node first sends around a proposal: "I want to make this non-trivial change C". This is needed because the other nodes need to be aware of this pending "complex" change and prepare themselves. This is disseminated among the nodes until everyone receives it. Now we are at the state where "everyone knows that someone proposed to make the change C", but this is not enough, since no one is actually aware that there is an agreement yet.
Therefore there is another "round", where nodes start to propagate the information "I, node Y, am aware of the fact that change C has been proposed". Eventually one or more nodes become aware that there is an agreement (this is more or less a distributed acknowledgement protocol). So the state now is "at least one node knows that every node knows that the change C has been proposed". This is (partly) what we refer to as convergence. At this point the node (or nodes) that are aware of the agreement will make the merge and propagate it.
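A loose sketch of that convergence detection, modeled on the idea of a "seen" set gossiped along with the state (this is a simplification, not Akka's actual protocol): each node stamps the pending change with its own id once it has observed it, and any node can declare convergence when the seen set covers the whole membership:

```python
# Convergence detection via a gossiped "seen" set: each node adds its own id
# once it has observed the pending change C; a node that sees the full
# membership in the seen set knows that everyone knows about C.

members = {"A", "B", "C", "D"}

def merge_seen(local_seen, remote_seen):
    return local_seen | remote_seen        # monotonic, so plain union

def converged(seen):
    return seen == members

seen = {"A"}                               # A proposed the change
seen = merge_seen(seen, {"B"})             # gossip from B
seen = merge_seen(seen, {"C", "D"})        # gossip from C, which had already heard from D
print(converged(seen))                     # -> True: this node now knows that everyone knows
```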
Please note that I highly simplified the explanation here, obviously the devil (and scaling) is in the details :)

Distributed algorithm design

I've been reading Introduction to Algorithms and started to get a few ideas and questions popping up in my head. The one that's baffled me most is how you would approach designing an algorithm to schedule items/messages in a queue that is distributed.
My thoughts have led me to browsing Wikipedia on topics such as sorting, message queues, scheduling, and distributed hash tables, to name a few.
The scenario:
Say you wanted to have a system that queued messages (strings or some serialized objects, for example). A key feature of this system is to avoid any single point of failure. The system has to be distributed across multiple nodes within some cluster and has to even out the workload across the nodes as consistently as possible to avoid hotspots.
You want to avoid the use of a master/slave design for replication and scaling (no single point of failure). The system totally avoids writing to disk and maintains in-memory data structures.
Since this is meant to be a queue of some sort, the system should be able to use varying scheduling algorithms (FIFO, earliest deadline first, round robin, etc.) to determine which message should be returned on the next request, regardless of which node in the cluster the request is made to.
My initial thoughts
I can imagine how this would work on a single machine, but when I start thinking about how you'd distribute something like this, questions come up like:
How would I hash each message?
How would I know which node a message was sent to?
How would I schedule each item so that I can determine which message and from which node should be returned next?
I started reading about distributed hash tables and how projects like Apache Cassandra use some sort of consistent hashing to distribute data, but then I thought: since the query won't supply a key, I need to know where the next item is and just supply it...
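For reference, the consistent-hashing idea in its simplest form: nodes and keys hash onto the same ring, and each key is owned by the first node clockwise from its position, so adding or removing a node only remaps the keys in one arc. A minimal sketch without virtual nodes or replication (which systems like Cassandra layer on top):

```python
import bisect
import hashlib

# Minimal consistent-hash ring: nodes and keys hash onto the same ring, and a
# key is owned by the first node at or after its position (wrapping around).

def ring_hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.ring]

    def owner(self, key):
        idx = bisect.bisect_right(self.positions, ring_hash(key))
        return self.ring[idx % len(self.ring)][1]   # wrap around the ring

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("message-123"))   # -> whichever of node-a/b/c owns that arc
```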
This led me into reading about peer-to-peer protocols and how they approach the synchronization problem across nodes.
So my question is, how would you approach a problem like the one described above, or is this too far-fetched and simply a stupid idea...?
Just an overview: pointers, different approaches, pitfalls, and benefits of each. The technologies/concepts/design/theory that may be appropriate. Basically anything that could be of use in understanding how something like this may work.
And if you're wondering, no, I'm not intending to implement anything like this; it just popped into my head while reading. (It happens; I get distracted by wild ideas when I read a good book.)
UPDATE
Another interesting point that would become an issue is distributed deletes. I know systems like Cassandra have tackled this by implementing hinted handoff, read repair, and anti-entropy, and it seems to work well, but are there any other (viable and efficient) means of tackling this?
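For context on the Cassandra mechanisms mentioned above: the delete itself is recorded as a tombstone, a timestamped marker that is replicated and merged like any other write and only purged after a grace period. A minimal last-write-wins sketch of that idea (the grace period value is arbitrary):

```python
import time

# Minimal last-write-wins register with tombstones: a delete is just a write
# of a "deleted" marker carrying a timestamp, so replicas can merge deletes
# the same way they merge updates. The grace period is an arbitrary value for
# this sketch.

TOMBSTONE = object()
GRACE_PERIOD = 10 * 24 * 3600           # keep tombstones around for 10 days

def merge(entry_a, entry_b):
    # Each entry is (timestamp, value); the newer write wins, deletes included.
    return max(entry_a, entry_b, key=lambda entry: entry[0])

def visible(entry):
    return None if entry[1] is TOMBSTONE else entry[1]

def purgeable(entry, now=None):
    now = time.time() if now is None else now
    return entry[1] is TOMBSTONE and now - entry[0] > GRACE_PERIOD

replica_a = (1700000000.0, "hello")     # an old write on one replica
replica_b = (1700000050.0, TOMBSTONE)   # a later delete seen by another replica
merged = merge(replica_a, replica_b)
print(visible(merged))                                         # -> None: the delete wins everywhere
print(purgeable(merged, now=1700000050.0 + GRACE_PERIOD + 1))  # -> True: safe to purge later
```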
Overview, as you wanted
There are some popular techniques for distributed algorithms, e.g. using clocks, waves or general purpose routing algorithms.
You can find these in the great distributed algorithms books Introduction to Distributed Algorithms by Tel and Distributed Algorithms by Lynch.
Reductions
are particularly useful since general distributed algorithms can become quite complex. You might be able to use a reduction to a simpler, more specific case.
If, for instance, you want to avoid having a single point of failure, but a symmetric distributed algorithm is too complex, you can use the standard distributed algorithm of (leader) election and afterwards use a simpler asymmetric algorithm, i.e. one which can make use of a master.
Similarly, you can use synchronizers to transform a synchronous network model to an asynchronous one.
You can use snapshots to be able to analyze offline instead of having to deal with varying online process states.
