Hashicorp Raft consensus deadlock state - consul

I am implementing a Raft service using the HashiCorp Raft library for distributed consensus.
https://github.com/hashicorp/raft
I have a simple layout with Raft: 1 follower and a leader.
I bootstrap the cluster with the leader and add 2 follower Raft nodes to the Raft cluster, and things look fine. When I knock one of the followers offline, the leader logs:
"failed to contact quorum of nodes, stepping down". The problem is that there are now 0 nodes in the leader state, and no one can be promoted to leader because a majority of votes is required from a quorum of nodes. Because the previous leader is now a follower, even my service discovery tools can't remove the old IP address from the leader, because doing so requires leader power.
My cluster enters an infinite loop (deadlock) of trying to connect to a node that is offline forever, and no one can get promoted to leader. Any ideas?
Edit: I now realize that I need a system with an odd number of nodes to reach quorum (i.e. 3 nodes: if 1 gets knocked offline, I can tell the new leader to remove the old IP address).

I don't have experience with that library, but:
"The problem with this is now there are 0 nodes in leader state and no one can promote to leader because majority of votes required from quorum of nodes".
With one of three nodes out, you still have a quorum/majority. Raft would promote one of the followers to leader.
The fact that your system stalled after you removed one of the two followers tells me that one of the followers was not added correctly in the first place.
You had one leader and one follower initially. Since that was working, it means that the follower and the leader can communicate; all good here.
You then added a second follower; how do you know that it was done correctly? Can you try it again and knock out this second follower? The system should keep working, since the leader and the first follower are OK.
So I conclude that the second follower did not join the cluster, and when you knocked out the first follower, the system stopped, as it should: a majority of correctly configured nodes was no longer available.
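To make the majority arithmetic concrete, here is a minimal sketch in Python. It is not part of the HashiCorp Raft library; the helper names `quorum` and `can_elect_leader` are made up for illustration, and the only assumption is that quorum means a strict majority of the configured voters.

```python
def quorum(voters: int) -> int:
    """Smallest strict majority of a cluster with `voters` voting members."""
    return voters // 2 + 1


def can_elect_leader(voters: int, reachable: int) -> bool:
    """True if the reachable voters (candidate included) form a majority."""
    return reachable >= quorum(voters)


# How many failures each cluster size can absorb while still electing a leader.
for voters in (1, 2, 3, 4, 5):
    tolerated = voters - quorum(voters)
    print(f"{voters} voters: quorum={quorum(voters)}, tolerated failures={tolerated}")
```

With 3 voters, 2 reachable nodes are enough, so losing one follower should not stall the cluster. With 2 voters the quorum is also 2, so losing either node leaves no majority, which matches the stall described in the question.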

Related

set node as raft leader and leaseholder

I read the following in the CockroachDB docs:
"we can optimize query performance by making the same node both Raft leader and the Leaseholder"
But how can you set a node to function both as raft leader and leaseholder (what commands)? Did I miss it in some manual?
Edit / extra background info:
I have a couple of nodes in one datacenter (low latency). But I would like to start a node in a different datacenter (for safety). I don't want that node to function as a leader...
CockroachDB automatically ensures that the raft leader and leaseholder are colocated. There isn't anything manual to be done.

Bootstrap expect=1 in consul results in weird behavior in cluster

Trying to launch a cluster of nodes one at a time, and I'm a bit confused about the bootstrap-expect value.
The way it is set up is that consul is launched with bootstrap-expect, and after it starts, consul join is run.
Currently, the deployment sets bootstrap-expect to the number of nodes in the cluster, and a leader is elected once that many nodes have joined.
However, when bootstrap-expect is set to 1 (the thought process being that we can have a cluster without waiting for all the nodes), something strange happens.
First, each node thinks it is the leader, which is expected since bootstrap-expect is set to 1. But after the nodes run consul join against each other, a new cluster leader isn't elected. Instead, each node in the cluster still thinks of itself as the cluster leader.
Why don't the nodes, when joining a cluster, elect a new leader? Or at least respect the preexisting leader?
This is a condition called Split Brain that you've "intentionally" created. Each server thinks it's the leader and has its own version of the log, and these versions are not reconcilable with each other. Split Brain is famously hard to recover from. Since the servers cannot agree on what the cluster state should be, they can't decide who the new leader should be, and they continue without a successful election. You can read up on Raft to learn more about why.
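A toy model may help show why the expect value matters. The sketch below is not Consul's implementation; the `Server` class and `launch_then_join` helper are hypothetical, and the only behavior assumed is that a server bootstraps its own cluster, once, as soon as it knows about at least `expect` servers.

```python
class Server:
    """Toy server: bootstraps its own cluster once it knows `expect` servers."""

    def __init__(self, name, expect):
        self.name = name
        self.expect = expect
        self.known = {name}   # servers seen so far, itself included
        self.cluster = None   # peer set frozen at bootstrap time
        self._maybe_bootstrap()

    def _maybe_bootstrap(self):
        # Bootstrapping happens at most once and is never undone by a later join.
        if self.cluster is None and len(self.known) >= self.expect:
            self.cluster = frozenset(self.known)

    def learn(self, names):
        self.known |= set(names)
        self._maybe_bootstrap()


def launch_then_join(names, expect):
    servers = [Server(n, expect) for n in names]  # each server starts alone
    for s in servers:                             # then everyone joins everyone
        s.learn(names)
    return {s.cluster for s in servers}           # distinct bootstrapped clusters


names = ["a", "b", "c"]
print(len(launch_then_join(names, expect=1)))  # separate one-node clusters
print(len(launch_then_join(names, expect=3)))  # one shared three-node cluster
```

In this model, expect=1 leaves three servers that each bootstrapped alone before the join, which is the split-brain situation described above, while expect=3 delays bootstrapping until all three know each other.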

What's the benefit of advanced master election algorithms over bully algorithm?

I read how current master election algorithms like Raft, Paxos or Zab elect a master in a cluster, and couldn't understand why they use sophisticated algorithms instead of the simple bully algorithm.
I'm developing a cluster library and use UDP Multicast for heartbeat messages. Each node joins a multicast address and also sends datagram packets periodically to that address. If the nodes find out there is a new node sending packets to this multicast address, the node is simply added to the cluster; similarly, when the nodes in the cluster don't get any packets from a node, they remove it from the cluster. When I need to choose a master node, I simply iterate over the nodes in the cluster and choose the oldest one.
I read some articles that imply this approach is not effective and that more sophisticated algorithms like Paxos should be used to elect a master or detect failures via heartbeat messages. I couldn't understand why Paxos is better for split-brain scenarios or other network failures than the traditional bully algorithm, because I can easily find out when a quorum of nodes leaves the cluster without using Raft. The only benefit I see is the number of packets that each server has to handle: only the master sends heartbeat messages in Raft, while in this case each node has to send heartbeat messages to every other node. However, I don't think this is a problem, since I can simply implement a similar heartbeat algorithm without changing my master election algorithm.
Can someone elaborate on that?
From a theoretical perspective, Raft, Paxos and Zab are not leader election algorithms. They solve a different problem called consensus.
In your concrete scenario, the difference is the following. With a leader election algorithm, you can only guarantee that eventually one node is a leader. That means that during a period of time, multiple nodes might think they are the leader and, consequently, might act like one. In contrast, with the consensus algorithms above, you can guarantee that there is at most one leader in a logical time instant.
The consequence is this. If the safety of the system depends on the presence of a single leader, you might get in trouble relying only on a leader election. Consider an example. Nodes receive messages from the UDP multicast and do A if the sender is the leader, but do B if the sender is not the leader. If two nodes check for the oldest node in the cluster at slightly different points in time, they might see different leaders. These two nodes might then receive a multicast message and process it in different ways, possibly violating some safety property of the system you'd like to hold (e.g., that all nodes either do A or B, but never one does A and another does B).
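The divergence described above can be reproduced in a few lines. This is only a sketch of the asker's "choose the oldest node" rule; the node names, join timestamps and the `oldest_node` helper are invented for illustration.

```python
def oldest_node(view):
    """Pick the leader as the node with the earliest join time in this view.

    `view` maps node name -> join timestamp, as seen locally by one node.
    Ties are broken by name so the choice is deterministic.
    """
    return min(view, key=lambda name: (view[name], name))


# Node X has not yet noticed that n1 stopped sending heartbeats; node Y has.
view_at_x = {"n1": 100, "n2": 140, "n3": 170}
view_at_y = {"n2": 140, "n3": 170}

leader_x = oldest_node(view_at_x)
leader_y = oldest_node(view_at_y)
print(leader_x, leader_y)  # the two nodes disagree about who leads
```

Both nodes apply the same rule correctly, yet for as long as their views differ, X treats n1 as leader while Y treats n2 as leader.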
With Raft, Paxos, and Zab, you can overcome that problem, since these algorithms create a sort of logical epoch, with at most one leader in each epoch.
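One way to picture these epochs is the Raft-style rule that a node grants at most one vote per term. The sketch below is a deliberately simplified illustration, not a Raft implementation: the `Voter` class and `run_election` helper are hypothetical, and real Raft additionally compares logs before granting a vote.

```python
class Voter:
    """Grants at most one vote per term, which bounds the leaders per term."""

    def __init__(self):
        self.current_term = 0
        self.voted_for = None

    def request_vote(self, term, candidate):
        if term < self.current_term:
            return False                  # stale candidate from an old epoch
        if term > self.current_term:
            self.current_term = term      # new epoch: the vote is available again
            self.voted_for = None
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False


def run_election(voters, term, candidate):
    """A candidate wins only with votes from a strict majority of voters."""
    votes = sum(v.request_vote(term, candidate) for v in voters)
    return votes >= len(voters) // 2 + 1


voters = [Voter() for _ in range(5)]
print(run_election(voters, term=1, candidate="a"))  # a collects every vote
print(run_election(voters, term=1, candidate="b"))  # same term: no votes left
print(run_election(voters, term=2, candidate="b"))  # later term: b can win
```

Because two majorities of the same voter set always overlap in at least one node, and that node votes only once per term, two candidates cannot both win the same term.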
Two notes here. First, the bully algorithm is defined for synchronous systems. If you really implement it as described in the paper by Garcia-Molina, I believe you might experience problems in your partially synchronous system. Second, the Zab algorithm relies on a sort of bully algorithm for asynchronous systems. The leader is elected by comparing the lengths of the nodes' histories (which minimizes the recovery time of the system). Once the leader is elected, it tries to start the Zab protocol, which in turn locks the epoch for the leader.

Leader Election Algorithm

I am exploring various architectures in cluster computing. Some of the popular ones are:
Master-Slave.
RPC
...
In Master-slave, the normal way is to set one machine as master & a bunch of machines as slaves controlled by master. One particular algo here got me interested. It's called Leader-Election Algo which has a certain randomness in selecting which of the machines will become master.
My question is - Why would anyone want to elect a master machine this way? What advantages does this approach have compared to manually selecting a machine as master?
There are some advantages to these algorithms:
The leader is selected dynamically, so for example you can select the node with the highest performance, and the arrival of new nodes may make a better choice possible.
Another benefit of selecting the leader dynamically is that if one of the nodes has a major fault (for example, the PC is shutting down), you have other choices and there is no need to change the leader manually.
If you select a node manually, you have to manually configure all the other nodes to use it, and also set their time manually, ... whereas these algorithms help you handle timing issues.
As a (not very relevant) example, why is DHCP used in most cases? Because a lot of configuration is handled by such algorithms.
The main idea of using such algorithms is to get rid of additional configuration and to add some flexibility and stability to the whole system. But usually (in HPC/MPI applications) the master node is selected manually.
Suppose your master selection algorithm is quite simple: get the list of available systems and select the one with the highest IP address. In this case you can start a new process on any of your nodes and it will automatically find the master node.
One nice example of such ideas is the WCCP protocol's "designated proxy" selection algorithm, where the number of proxies can be flexible and the master node is selected at runtime.
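The "highest IP address" rule mentioned above could look roughly like this in Python. It is only a sketch: the `pick_master` name and the sample addresses are made up, and it assumes all addresses belong to the same IP family.

```python
import ipaddress


def pick_master(addresses):
    """Return the numerically highest IP address among the available nodes."""
    # Parse first: comparing the raw strings would order them lexically.
    return str(max(ipaddress.ip_address(a) for a in addresses))


available = ["10.0.0.9", "10.0.0.12", "10.0.0.100"]
print(pick_master(available))
```

Parsing the addresses matters here, because a plain string comparison would rank "10.0.0.9" above "10.0.0.100". Every node that sees the same list picks the same master without any extra configuration.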
Consider a network of nodes where it is vital to have one leader node at all times. If the current leader dies, the network somehow has to choose another leader. Given this scenario and requirement, there are two possible ways to do it.
The central system approach, where a central node decides who will be the leader. If the current leader dies, this central node decides who should take over the leader role. But this is a single point of failure: if the central node responsible for deciding the leader goes down, there is no one to select a leader when the current leader dies.
The distributed approach, where in the same scenario all the nodes come to a consensus on who the leader should be. We do not need a central node to decide who the leader should be, which eliminates the single point of failure. When the leader node dies, there is a way to detect the node failure, and every node then starts a distributed leader selection algorithm so that the nodes mutually come to a consensus on electing a leader.
So, in short, when you have a system with no central control, probably because it is meant to be scalable without a single point of failure, leader election algorithms are used to choose some node.

What cluster node should be active?

There is a cluster and a Unix network daemon. The daemon is started on each cluster node, but only one instance can be active.
When the active daemon breaks (whether the program breaks or the node breaks), another node should become active.
I can think of a few possible algorithms, but I think there is existing research on this and some ready-to-go algorithms? Am I right? Can you point me to the answer?
Thanks.
JGroups is a Java network stack which includes DistributedLockManager type of support and cluster voting capabilities. These allow any number of Unix daemons to agree on who should be active. All of the nodes could be trying to obtain a lock (for example) and only one will succeed until the application or the node fails.
JGroups also has the concept of the coordinator of a specific communication channel. Only one node can be coordinator at one time, and when a node fails, another node becomes coordinator. It is simple to test whether you are the coordinator, in which case you would be active.
See: http://www.jgroups.org/javadoc/org/jgroups/blocks/DistributedLockManager.html
If you are going to implement this yourself, there are a number of things to keep in mind:
Each node needs to have a consistent view of the cluster.
All nodes will need to inform all of the other nodes that they are online, maybe with multicast.
Nodes that go offline (because of app or node failure) will need to be removed from all other nodes' "view".
You can then have the node with the lowest IP or something similar be the active node.
If this isn't appropriate then you will need to have some sort of voting exchange so the nodes can agree who is active. Something like: http://en.wikipedia.org/wiki/Two-phase_commit_protocol
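As a rough illustration of the do-it-yourself route, the sketch below keeps a local membership view from heartbeats and picks the lowest IP as the active node. Everything in it is hypothetical (the `ClusterView` class, the timeout value, the addresses), and time is passed in explicitly so the example is deterministic; a real daemon would read a clock and receive heartbeats over the network.

```python
import ipaddress


class ClusterView:
    """Local membership view built from heartbeats, with a removal timeout."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # ip string -> time of the last heartbeat

    def heartbeat(self, ip, now):
        self.last_seen[ip] = now

    def members(self, now):
        # A node stays in the view only while its heartbeats are recent enough.
        return [ip for ip, t in self.last_seen.items() if now - t <= self.timeout]

    def active_node(self, now):
        alive = self.members(now)
        if not alive:
            return None
        return str(min(ipaddress.ip_address(ip) for ip in alive))


view = ClusterView(timeout=3.0)
view.heartbeat("10.0.0.5", now=0.0)
view.heartbeat("10.0.0.7", now=0.0)
print(view.active_node(now=1.0))  # both alive: the lowest IP is active
view.heartbeat("10.0.0.7", now=4.0)
print(view.active_node(now=5.0))  # 10.0.0.5 timed out, so 10.0.0.7 takes over
```

Note that this only works as intended if every node ends up with the same view. While views differ, two nodes can both believe they are active, which is why a lock or voting exchange is suggested when that is not acceptable.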
