Bootstrap expect=1 in consul results in weird behavior in cluster - consul

Trying to launch a cluster of nodes one at a time, and I'm a bit confused about the bootstrap-expect value.
The way it is set up is that consul is launched with bootstrap-expect, then after it starts consul join is ran
Currently, the deployment sets bootstrap-expect have it set to the number of nodes in the cluster, and a leader is elected after that number.
However, when bootstrap-expect is set to 1 (thought process is so we can have a cluster without waiting for all the nodes), something strange happens.
So first, each node thinks it is the leader - which is expected since bootstrap-expect is set to 1. But after doing consul join to each other, a new cluster leader isn't elected - what happens is strange - each node in the cluster still thinks itself as a cluster leader.
Why don't the nodes, when joining a cluster, elect a new leader? Or at least respect the prexisting leader?

This is condition called Split Brain that you've "intentionally" created. Each server think's it's the leader and has it's own version of the log and each of these versions are not reconcilable with each other. Split Brain is famously hard to recover from. Since the Servers can not agree on what the Cluster State should be, they can't decide who the new leader should be, and they continue without a successful election. You can read up on Raft to learn more about why.

Related

Hashicorp Raft consensus deadlock state

I am implementing a Raft service using Hashicorp Raft library for distributed consensus.
https://github.com/hashicorp/raft
I have a simple layout with Raft, 1 followers and a leader.
I bootstrap the cluster with the leader and add 2 follower Raft nodes to the Raft cluster, things look fine. When I knock one of the followers offline, the leader gets:
failed to contact quorum of nodes, stepping down. The problem with this is now there are 0 nodes in leader state and no one can promote to leader because majority of votes required from quorum of nodes. Now because the previous leader is now a follower, even my service discovery tools can't remove the old ip address from the leader because it requires leader power to do so.
My cluster enters this infinite loop (deadlock) of trying to connect to a node that's offline forever and no one can get promoted to leader. Any ideas?
Edit: After realizing, I guess I need a system where there are an odd number of nodes to reach quorum. (ie 3 nodes, 1 gets knocked offline then I can tell the new leader to remove old IP address)
I don't have experience with that library, but:
"The problem with this is now there are 0 nodes in leader state and no one can promote to leader because majority of votes required from quorum of nodes".
With one of three nodes out, you still have quorum/majority. Raft would promote one of followers to leader.
The fact that your system stalled after you removed one of two followers tells me that one of followers was not added correctly in the first place.
You had one leader and one follower initially. Since that was working, it means that follower and the leader can communicate; all good here.
You have added second follower; how do you know that it was done correctly? Can you try to do it again, and knock out this second follower - the system should keep working as the leader and the first follower are ok.
So I can conclude, that the second follower did not join the cluster, and when you knocked out the first follower, the systems stopped, as it should - no more majority of correctly configured nodes are available.

Clustered elasticsearch setup (two master nodes)

We are currently setting up an environment with two elasticsearch instances (clustered servers).
Since it's clustered, we need to make sure that data (indexes) are synched between the two instances.
We do not have the possibility to setup an additional (3rd) server/instance to act as the 'master'.
Therefore we have configured both instances as master and data nodes. So instance 1 is master & node and instance 2 is also master & node.
The synchronization works fine when both instances are up and running. But when one instance is down, the other instance keeps trying to connect with the instance that is down, which obviously fails because the instance is down. Therefore the node that is up is also not functioning anymore, because it can not connect to his 'master' node (which is the node that is down), even though the instance itself is also a 'master'.
The following errors are logged in this case:
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
org.elasticsearch.transport.ConnectTransportException: [xxxxx-xxxxx-2][xx.xx.xx.xx:9300] connect_exception
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: no further information: xx.xx.xx.xx/xx.xx.xx.xx:9300
In short: two elasticsearch master instances in a clustered setup. When one is down, the other one does not function because it can not connect to the 'master' instance.
Desired result: If one of the master instances is down, the other should continue functioning (without throwing errors).
Any recommendations on how to solve this, without having to setup an additional server that is the 'master' and the other 2 the 'slaves'?
Thanks
To be able to vote, masters must be a minimum of 2.
That's why you must have a minimum of 3 master nodes if you want your cluster to resist to the loss of one node.
You can just add a specialized small master node by settings all other roles to false.
This node can have very few resources .
As describe in this post :
https://discuss.elastic.co/t/master-node-resource-requirement/84609
Dedicated master nodes need persistent storage, but not a lot of it. 1-2 CPU cores and 2-4GB RAM is often sufficient for smaller deployments. As dedicated master nodes do not store data you can also set the heap to a higher percentage (75%-80%) of total RAM that is recommended for data nodes.
If there are no options to increase 1 more node then you can set
minimum_master_nodes=1 . This will let your es cluster up even if 1 node is up. But it may lead to split brain issue as we restricted only 1 node to be visible to form cluster.
In that scenario you have to restart the cluster to resolve split brain.
I would suggest you to upgrade to elasticsearch 7.0 or above. There you can live with two nodes each master eligible and split brain issue will not come.
You should not have 2 master eligible nodes in the cluster as its a very risky thing and can lead to split brain issue.
Master nodes doesn't require much resources, but as you have just two data nodes, you can still live without having a dedicated master nodes(but please aware that it has downsides) to just save the cost.
So simply, remove master role from another node and you should be good to go.

How to deal with Split Brain with an cluster have the two number of nodes?

I am leaning some basic concept of cluster computing and I have some questions to ask.
According to this article:
If a cluster splits into two (or more) groups of nodes that can no longer communicate with each other (aka.partitions), quorum is used to prevent resources from starting on more nodes than desired, which would risk data corruption.
A cluster has quorum when more than half of all known nodes are online in the same partition, or for the mathematically inclined, whenever the following equation is true:
total_nodes < 2 * active_nodes
For example, if a 5-node cluster split into 3- and 2-node paritions, the 3-node partition would have quorum and could continue serving resources. If a 6-node cluster split into two 3-node partitions, neither partition would have quorum; pacemaker’s default behavior in such cases is to stop all resources, in order to prevent data corruption.
Two-node clusters are a special case.
By the above definition, a two-node cluster would only have quorum when both nodes are running. This would make the creation of a two-node cluster pointless
Questions:
From above,I came out with some confuse, why we can not stop all cluster resources like “6-node cluster”?What`s the special lies in the two node cluster?
You are correct that a two node cluster can only have quorum when they are in communication. Thus if the cluster was to split, using the default behavior, the resources would stop.
The solution is to not use the default behavior. Simply set Pacemaker to no-quorum-policy=ignore. This will instruct Pacemaker to continue to run resources even when quorum is lost.
...But wait, now what happens if the cluster communication is broke but both nodes are still operational. Will they not consider their peers dead and both become the active nodes? Now I have two primaries, and potentially diverging data, or conflicts on my network, right? This issue is addressed via STONITH. Properly configured STONITH will ensure that only one node is ever active at a given time and essentially prevent split-brains from even occurring.
An excellent article further explaining STONITH and it's importance was written by LMB back in 2010 here: http://advogato.org/person/lmb/diary/105.html

RabbitMQ cluster is not reconnecting after network failure

I have a RabbitMQ cluster with two nodes in production and the cluster is breaking with these error messages:
=ERROR REPORT==== 23-Dec-2011::04:21:34 ===
** Node rabbit#rabbitmq02 not responding **
** Removing (timedout) connection **
=INFO REPORT==== 23-Dec-2011::04:21:35 ===
node rabbit#rabbitmq02 lost 'rabbit'
=ERROR REPORT==== 23-Dec-2011::04:21:49 ===
Mnesia(rabbit#rabbitmq01): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit#rabbitmq02}
I tried to simulate the problem by killing the connection between the two nodes using "tcpkill". The cluster has disconnected, and surprisingly the two nodes are not trying to reconnect!
When the cluster breaks, HAProxy load balancer still marks both nodes as active and send requests to both of them, although they are not in a cluster.
My questions:
If the nodes are configured to work as a cluster, when I get a network failure, why aren't they trying to reconnect afterwards?
How can I identify broken cluster and shutdown one of the nodes? I have consistency problems when working with the two nodes separately.
RabbitMQ Clusters do not work well on unreliable networks (part of RabbitMQ documentation). So when the network failure happens (in a two node cluster) each node thinks that it is the master and the only node in the cluster. Two master nodes don't automatically reconnect, because their states are not automatically synchronized (even in case of a RabbitMQ slave - the actual message synchronization does not happen - the slave just "catches up" as messages get consumed from the queue and more messages get added).
To detect whether you have a broken cluster, run the command:
rabbitmqctl cluster_status
on each of the nodes that form part of the cluster. If the cluster is broken then you'll only see one node. Something like:
Cluster status of node rabbit#rabbitmq1 ...
[{nodes,[{disc,[rabbit#rabbitmq1]}]},{running_nodes,[rabbit#rabbitmq1]}]
...done.
In such cases, you'll need to run the following set of commands on one of the nodes that formed part of the original cluster (so that it joins the other master node (say rabbitmq1) in the cluster as a slave):
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit#rabbitmq1
rabbitmqctl start_app
Finally check the cluster status again .. this time you should see both the nodes.
Note: If you have the RabbitMQ nodes in an HA configuration using a Virtual IP (and the clients are connecting to RabbitMQ using this virtual IP), then the node that should be made the master should be the one that has the Virtual IP.
From RabbitMQ doc: Clustering and Network Partitions
RabbitMQ also three ways to deal with network partitions automatically: pause-minority mode, pause-if-all-down mode and autoheal mode. The default behaviour is referred to as ignore mode.
In pause-minority mode RabbitMQ will automatically pause cluster nodes which determine themselves to be in a minority (i.e. fewer or equal than half the total number of nodes) after seeing other nodes go down. It therefore chooses partition tolerance over availability from the CAP theorem. This ensures that in the event of a network partition, at most the nodes in a single partition will continue to run. The minority nodes will pause as soon as a partition starts, and will start again when the partition ends. This configuration prevents split-brain and is therefore able to automatically recover from network partitions without inconsistencies.
In pause-if-all-down mode, RabbitMQ will automatically pause cluster nodes which cannot reach any of the listed nodes. In other words, all the listed nodes must be down for RabbitMQ to pause a cluster node. This is close to the pause-minority mode, however, it allows an administrator to decide which nodes to prefer, instead of relying on the context. For instance, if the cluster is made of two nodes in rack A and two nodes in rack B, and the link between racks is lost, pause-minority mode will pause all nodes. In pause-if-all-down mode, if the administrator listed the two nodes in rack A, only nodes in rack B will pause. Note that it is possible the listed nodes get split across both sides of a partition: in this situation, no node will pause. That is why there is an additional ignore/autoheal argument to indicate how to recover from the partition.
In autoheal mode RabbitMQ will automatically decide on a winning partition if a partition is deemed to have occurred, and will restart all nodes that are not in the winning partition. Unlike pause_minority mode it therefore takes effect when a partition ends, rather than when one starts.
The winning partition is the one which has the most clients connected (or if this produces a draw, the one with the most nodes; and if that still produces a draw then one of the partitions is chosen in an unspecified way).
You can enable either mode by setting the configuration parameter cluster_partition_handling for the rabbit application in the configuration file to:
autoheal
pause_minority
pause_if_all_down
If using the pause_if_all_down mode, additional parameters are required:
nodes: nodes which should be unavailable to pause
recover: recover action, can be ignore or autoheal
...
Which Mode to Pick?
It's important to understand that allowing RabbitMQ to deal with network partitions automatically comes with trade offs.
As stated in the introduction, to connect RabbitMQ clusters over generally unreliable links, prefer Federation or the Shovel.
With that said, here are some guidelines to help the operator determine which mode may or may not be appropriate:
ignore: use when network reliability is the highest practically possible and node availability is of topmost importance. For example, all cluster nodes can be in the same a rack or equivalent, connected with a switch, and that switch is also the route to the outside world.
pause_minority: appropriate when clustering across racks or availability zones in a single region, and the probability of losing a majority of nodes (zones) at once is considered to be very low. This mode trades off some availability for the ability to automatically recover if/when the lost node(s) come back.
autoheal: appropriate when are more concerned with continuity of service than with data consistency across nodes.
One other way to recover from this kind of failure is to work with Mnesia which is the database that RabbitMQ uses as the persistence mechanism and for the synchronization of the RabbitMQ instances (and the master / slave status) are controlled by this. For all the details, refer to the following URL: http://www.erlang.org/doc/apps/mnesia/Mnesia_chap7.html
Adding the relevant section here:
There are several occasions when Mnesia may detect that the network
has been partitioned due to a communication failure.
One is when Mnesia already is up and running and the Erlang nodes gain
contact again. Then Mnesia will try to contact Mnesia on the other
node to see if it also thinks that the network has been partitioned
for a while. If Mnesia on both nodes has logged mnesia_down entries
from each other, Mnesia generates a system event, called
{inconsistent_database, running_partitioned_network, Node} which is
sent to Mnesia's event handler and other possible subscribers. The
default event handler reports an error to the error logger.
Another occasion when Mnesia may detect that the network has been
partitioned due to a communication failure, is at start-up. If Mnesia
detects that both the local node and another node received mnesia_down
from each other it generates a {inconsistent_database,
starting_partitioned_network, Node} system event and acts as described
above.
If the application detects that there has been a communication failure
which may have caused an inconsistent database, it may use the
function mnesia:set_master_nodes(Tab, Nodes) to pinpoint from which
nodes each table may be loaded.
At start-up Mnesia's normal table load algorithm will be bypassed and
the table will be loaded from one of the master nodes defined for the
table, regardless of potential mnesia_down entries in the log. The
Nodes may only contain nodes where the table has a replica and if it
is empty, the master node recovery mechanism for the particular table
will be reset and the normal load mechanism will be used when next
restarting.
The function mnesia:set_master_nodes(Nodes) sets master nodes for all
tables. For each table it will determine its replica nodes and invoke
mnesia:set_master_nodes(Tab, TabNodes) with those replica nodes that
are included in the Nodes list (i.e. TabNodes is the intersection of
Nodes and the replica nodes of the table). If the intersection is
empty the master node recovery mechanism for the particular table will
be reset and the normal load mechanism will be used at next restart.
The functions mnesia:system_info(master_node_tables) and
mnesia:table_info(Tab, master_nodes) may be used to obtain information
about the potential master nodes.
Determining which data to keep after communication failure is outside
the scope of Mnesia. One approach would be to determine which "island"
contains a majority of the nodes. Using the {majority,true} option for
critical tables can be a way of ensuring that nodes that are not part
of a "majority island" are not able to update those tables. Note that
this constitutes a reduction in service on the minority nodes. This
would be a tradeoff in favour of higher consistency guarantees.
The function mnesia:force_load_table(Tab) may be used to force load
the table regardless of which table load mechanism is activated.
This is a more lengthy and involved way of recovering from such failures .. but will give better granularity and control over data that should be available in the final master node (this can reduce the amount of data loss that might happen when "merging" RabbitMQ masters).

What cluster node should be active?

There is some cluster and there is some unix network daemon. This daemon is started on each cluster node, but only one can be active.
When active daemon breaks (whether program breaks of node breaks), other node should become active.
I could think of few possible algorithms, but I think there is some already done research on this and some ready-to-go algorithms? Am I right? Can you point me to the answer?
Thanks.
Jgroups is a Java network stack which includes DistributedLockManager type of support and cluster voting capabilities. These allow any number of unix daemons to agree on who should be active. All of the nodes could be trying to obtain a lock (for example) and only one will succeed until the application or the node fails.
Jgroups also have the concept of the coordinator of a specific communication channel. Only one node can be coordinator at one time and when a node fails, another node becomes coordinator. It is simple to test to see if you are the coordinator in which case you would be active.
See: http://www.jgroups.org/javadoc/org/jgroups/blocks/DistributedLockManager.html
If you are going to implement this yourself there is a bunch of stuff to keep in mind:
Each node needs to have a consistent view of the cluster.
All nodes will need to inform all of the rest of the nodes that they are online -- maybe with multicast.
Nodes that go offline (because of ap or node failure) will need to be removed from all other nodes' "view".
You can then have the node with the lowest IP or something be the active node.
If this isn't appropriate then you will need to have some sort of voting exchange so the nodes can agree who is active. Something like: http://en.wikipedia.org/wiki/Two-phase_commit_protocol

Resources