We're using a three-node Vertica cluster.
The network connection between the nodes sometimes fails for a short time (e.g., 10 seconds).
When this happens, all nodes quickly shut down as soon as they detect that the other nodes are unreachable (because k-safety can no longer be satisfied). For example, the following sequence is recorded in the Vertica log by node0003:
00:04:30.633 node v_feedback_node0001 left the cluster
...
00:04:30.670 Node left cluster, reassessing k-safety...
...
00:04:32.389 node v_feedback_node0002 left the cluster
...
00:04:32.414 Changing node v_feedback_node0003 startup state from UP to UNSAFE
...
00:04:33.425 Shutting down this node
...
00:04:38.547 node v_feedback_node0003 left the cluster
Is it possible to configure a delay during which each node will try to reconnect to the others before giving up and shutting down?
I got an answer from a Vertica employee on the Vertica forum:
This [reconnection delay] time is hard-coded to 8 seconds.
I think time is better spent making the network more reliable. 30 seconds of network failure is a lot (I mean really, really large; typical network RTT is in the microseconds). Even if you kept Vertica up by delaying the k-safety assessment, nothing could really connect to the database, or most likely all DB connections would reset.
Related
When I repeatedly switch data nodes in high-availability mode, the terminal keeps reporting that the transaction is occupied, and then it recovers after a few minutes. Why does this occur?
Regarding the questions about repeatedly switching data nodes and being told that transactions are occupied, the mechanism is as follows. After a data node goes down, it takes some time to synchronize the transaction status between it and the control node; the transaction timeout is about two minutes. In addition, some time is spent on data recovery, which depends on the data volume involved. If you switch back to this node again within a short time, you will be told that the transaction is occupied, and it will recover in about two minutes. In actual production, the probability that a node breaks down, and especially that several different nodes fail within a few minutes of each other, is extremely small. For high-availability tests it is therefore recommended to leave more than two minutes between simulated failures, and preferably more than five minutes before simulating the breakdown of the next node.
I want to use Consul for a 2-node cluster. The drawback is that there is no failure tolerance for two nodes:
https://www.consul.io/docs/internals/consensus.html
Is there a way in Consul to get a consistent leader election with only two nodes? Can Consul's Raft consensus algorithm be changed?
Thanks a lot.
It sounds like you're limited to 2 machines of this type, because they are expensive. Consider acquiring three or five cheaper machines to run your orchestration layer.
To answer the protocol question: no, there is no way to run a two-node cluster with failure tolerance in Raft. To be clear, you can safely run a two-node cluster just fine; it will be available and make progress like any other cluster. It's just that when one machine goes down, because your fault tolerance is zero, you will lose availability and no longer make progress. But safety is never compromised: your data is still persisted consistently on these machines.
Even outside Raft, there is no way to run a two-node cluster and guarantee progress upon a single failure. This is a fundamental limit. In general, if you want to support f failures (meaning remain safe and available), you need 2f + 1 nodes.
There are non-Raft ways to improve the situation. For example, Flexible Paxos shows that we can require both nodes for leader election (as Raft already does) but only a single node for replication. This would allow your cluster to continue working in some failure cases where Raft would have stopped. But the worst case is still the same: there are always failures that will cause any two-node cluster to become unavailable.
That said, I'm not aware of any practical Flexible Paxos implementations anyway.
Considering the expense of even trying to hack up a solution to this, your best bet is to either get a larger set of cheaper machines, or just run your two-node cluster and accept unavailability upon failure.
Talking about changing the protocol: there are impossibility results for consensus (such as FLP), and with fewer than 2f + 1 nodes you cannot tolerate f fail-stop failures. Safety is still provided, but progress (liveness) cannot be ensured.
I think the options suggested in the earlier post are the best.
The leader-election approach in Consul's own documentation itself requires 3 nodes. It relies on the health-check mechanism as well as on sessions. Sessions are essentially distributed locks that are automatically released by TTL or when the service crashes.
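For reference, the documented session-based election boils down to creating a TTL session and acquiring a key with it. Here is a rough sketch against Consul's HTTP API; the key name, service name, node value and TTL are illustrative assumptions:

import requests

CONSUL = "http://127.0.0.1:8500"  # assumed local agent address

# Create a session with a TTL; Consul releases any locks held by the session
# if it is not renewed in time (or if its associated health checks fail).
session_id = requests.put(
    f"{CONSUL}/v1/session/create",
    json={"Name": "my-service-election", "TTL": "15s"},
).json()["ID"]

# Try to acquire the leadership key with that session (a distributed lock);
# the call returns true if this node obtained the lock.
is_leader = requests.put(
    f"{CONSUL}/v1/kv/service/my-service/leader",
    params={"acquire": session_id},
    data=b"node-a",
).json()

# While leading, renew the session before its TTL expires to keep the lock.
requests.put(f"{CONSUL}/v1/session/renew/{session_id}")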
To build a 2-node Consul cluster we have to use another approach, sometimes called Leader Lease. Since we already have the Consul KV store with CAS support, we can simply write to it which machine is the leader until some expiration time. As long as the leader is alive and well, it can periodically extend its lease; if the leader dies, someone will replace it quickly. For this approach to work, it is enough to synchronize the time on the machines using ntpd and, whenever the leader performs any action, to verify that it has enough lease time left to complete that action.
A key is created in the KV store containing something like "node X is the leader until time Y", where Y is calculated as the current time plus some interval T. While it is the leader, node X updates the record every T/2 or T/3 units of time, thereby extending its leadership. If the leader fails or cannot reach the KV store, after the interval T its place is taken by whichever node first discovers that the leadership has been released.
CAS is needed to prevent a race condition if the two nodes try to become the leader simultaneously. CAS specifies that a Check-And-Set operation should be used; this is very useful as a building block for more complex synchronization primitives. If the index is 0, Consul will only put the key if it does not already exist. If the index is non-zero, the key is only set if the index matches the ModifyIndex of that key.
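Here is a minimal sketch of that Leader Lease loop against Consul's KV HTTP API. The key name, lease length and node identifier are assumptions chosen for illustration; they are not part of Consul itself:

import base64
import json
import time

import requests

CONSUL = "http://127.0.0.1:8500"
KEY = "service/my-app/leader-lease"   # hypothetical key name
LEASE_SECONDS = 15                    # T: length of the lease
NODE_ID = "node-a"                    # this machine's identity


def read_lease():
    """Return (modify_index, lease_dict), or (0, None) if the key is absent."""
    r = requests.get(f"{CONSUL}/v1/kv/{KEY}")
    if r.status_code == 404:
        return 0, None
    entry = r.json()[0]
    value = json.loads(base64.b64decode(entry["Value"]))
    return entry["ModifyIndex"], value


def try_acquire_or_renew():
    """One iteration of the lease loop; returns True if we are the leader."""
    index, lease = read_lease()
    now = time.time()
    if lease is not None and lease["holder"] != NODE_ID and lease["until"] > now:
        return False  # someone else holds an unexpired lease
    # Either the key does not exist (cas=0), the lease has expired, or we hold
    # it: write "NODE_ID is the leader until now + T", guarded by CAS.
    body = json.dumps({"holder": NODE_ID, "until": now + LEASE_SECONDS})
    r = requests.put(f"{CONSUL}/v1/kv/{KEY}", params={"cas": index}, data=body)
    return r.text.strip() == "true"


while True:
    if try_acquire_or_renew():
        pass  # do leader-only work here, checking the time left before each action
    time.sleep(LEASE_SECONDS / 3)  # renew at T/3 as suggested above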
I have a 4-node Elasticsearch cluster (one of the nodes is a client node, ES 1.3.5).
It works fine most of the time, but sometimes it runs out of resources during peaks.
Can I add a reserve node to the cluster that will be enabled only when peaks occur (1-2 days per month) and disabled the rest of the time? Does that make sense?
There's no notion of a backup/standby/reserve node. You monitor the cluster activity and start a new node when the peak happens. If you were on a newer version you could use Marvel (the monitoring part) and Watcher (the alerting part) to get notified about the peaks. At that point you could start the new node.
There are also examples of watches that alert you in case of high memory or CPU usage: https://www.elastic.co/guide/en/watcher/current/watching-marvel-data.html#watching-memory-usage
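If you can't use Watcher on 1.3.5, a rough equivalent of the "monitor and react" idea is to poll the standard node-stats API for heap usage and alert (or start the reserve node) when it crosses a threshold. A sketch, where the cluster address, threshold and notify() hook are assumptions:

import time

import requests

ES = "http://localhost:9200"   # assumed cluster address
HEAP_THRESHOLD = 85            # percent heap used that counts as a "peak"


def notify(message):
    # Placeholder: send an email, page someone, or trigger node provisioning.
    print(message)


while True:
    # GET /_nodes/stats/jvm returns per-node JVM stats, including heap usage.
    stats = requests.get(f"{ES}/_nodes/stats/jvm").json()
    for node_id, node in stats["nodes"].items():
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        if heap_pct >= HEAP_THRESHOLD:
            notify(f"Node {node['name']} at {heap_pct}% heap - consider "
                   f"starting the extra node")
    time.sleep(60)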
Our Hadoop cluster has 5 data nodes and 2 name nodes. The traffic is very high and a few nodes go down very often, but they come back after a while. Sometimes it takes a long time, more than half an hour, for them to come back alive.
There are a few DNs with more threads than the others. Is this a configuration problem?
The data is not write-intensive. MR jobs run every 20 minutes.
After running a health monitor for two days, sampled at half-hour intervals, we found that the nodes die during the disk verification that runs every 6 hours. So the nodes now die predictably. But why do they die during disk verification? Is there any way to prevent the nodes from dying during disk verification?
Cloudera's capacity-planning guide gives some insight into this. If you see "Bad connect ack with firstBadLink", "Bad connect ack", "No route to host", or "Could not obtain block" IO exceptions under heavy loads, chances are these are due to a bad network.
I'm currently rebuilding the servers that host our region servers and data nodes. When I take down a data node, after 10 minutes the blocks it held are re-replicated among the other data nodes, as they should be. We have 10 data nodes, so I see heavy network traffic as the blocks are re-replicated. However, that traffic is only about 500-600 Mbps per server (the machines all have gigabit interfaces), so it's definitely not network-bound. I'm trying to figure out what is limiting the speed at which the data nodes send and receive blocks. Each data node has six 7200 rpm SATA drives, and the IO usage is very low during this, only peaking at 20-30% per drive. Is there a limit built into HDFS that restricts the speed at which blocks are replicated?
The rate of replication work is throttled by HDFS to not interfere with cluster traffic when failures happen during regular cluster load.
The properties that control this are dfs.namenode.replication.work.multiplier.per.iteration (2), dfs.namenode.replication.max-streams (2) and dfs.namenode.replication.max-streams-hard-limit (4). The first controls the rate of work scheduled to a DN at every heartbeat, and the other two further limit the maximum number of parallel threaded network transfers done by a DataNode at a time. The values in parentheses are their defaults. Some description of this is available at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
You can perhaps try increasing these values to (10, 50, 100) respectively to spruce up the network usage (this requires a NameNode restart), but note that your DN memory usage may increase slightly as a result of more block information being propagated to it. A reasonable heap size for these values for the DN role would be about 4 GB.
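For illustration, that tuning might look like this in hdfs-site.xml on the NameNode; these are the suggested values from above, not defaults, and only take effect after a NameNode restart:

<!-- Suggested re-replication tuning (defaults are 2, 2 and 4). -->
<property>
  <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
  <value>10</value>
</property>
<property>
  <name>dfs.namenode.replication.max-streams</name>
  <value>50</value>
</property>
<property>
  <name>dfs.namenode.replication.max-streams-hard-limit</name>
  <value>100</value>
</property>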
P.S. I have not personally tried these values on production systems. You also don't want to max out the re-replication workload so much that it affects regular cluster work, as recovering 1 of 3 replicas may be a lower priority than missing job/query SLAs due to a lack of network resources (unless you have a really fast network that's always under-utilised even during loaded periods). Try to tune it until you're satisfied with the results.