Switching nodes in High Availability mode

When I repeatedly switch data nodes in high-availability mode, the terminal keeps reporting that the transaction is occupied, and then everything recovers after a few minutes. Why does this occur?

Regarding the prompts about occupied transactions when data nodes are switched repeatedly, the mechanism is as follows. After a data node goes down, it takes some time to synchronize transaction status between that node and the control node; the transaction timeout is about two minutes. Some additional time is also spent on data recovery, which depends on the data volume involved. If you switch back to the node again within a short time, you will be told that the transaction is occupied, and the situation resolves itself in roughly two minutes. In real production, the probability that a node fails, let alone that several different nodes fail within a few minutes of each other, is extremely small. For high-availability tests it is therefore recommended to leave more than two minutes between simulated failures; an interval of more than five minutes before bringing down the next node is even better.
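As an illustration only, a failover test script following that recommendation might space simulated failures like this (the node names and the simulate_failure/restore helpers are hypothetical placeholders for whatever your test harness provides):

    import time

    NODES = ["data-node-1", "data-node-2", "data-node-3"]   # hypothetical node names
    FAILURE_INTERVAL_SECONDS = 5 * 60    # stay well above the ~2-minute transaction timeout

    def simulate_failure(node):
        """Placeholder: kill or isolate the node under test."""
        print(f"taking {node} down")

    def restore(node):
        """Placeholder: bring the node back and wait for it to rejoin."""
        print(f"bringing {node} back")

    for node in NODES:
        simulate_failure(node)
        restore(node)
        # Give the cluster time to finish transaction-status sync and data
        # recovery before simulating the next failure.
        time.sleep(FAILURE_INTERVAL_SECONDS)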

Related

How to delay Vertica node shutdown when k-safety assessment fails?

We're using a 3-node Vertica cluster.
The network connection between the nodes sometimes fails for a short time (e.g. 10 seconds).
When this happens, all nodes quickly shut down as soon as they detect that other nodes are unreachable (because k-safety can no longer be satisfied). For example, the following sequence is recorded in the Vertica log by node0003:
00:04:30.633 node v_feedback_node0001 left the cluster
...
00:04:30.670 Node left cluster, reassessing k-safety...
...
00:04:32.389 node v_feedback_node0002 left the cluster
...
00:04:32.414 Changing node v_feedback_node0003 startup state from UP to UNSAFE
...
00:04:33.425 Shutting down this node
...
00:04:38.547 node v_feedback_node0003 left the cluster
Is it possible to configure a delay after which each node will try to reconnect to the others before giving up and shutting down?
Got an answer from a Vertica employee on the Vertica forum.
This [reconnection delay] time is hard-coded to 8 seconds.
I think the time is better spent making the network more reliable. 30 seconds of network failure is a lot (really, really large; typical network RTT is in the microseconds). Even if you kept Vertica up by delaying the k-safety assessment, nothing could really connect to the database, or most likely all DB connections would reset.

Recovery techniques for Spark Streaming scheduling delay

We have a Spark Streaming application that has essentially zero scheduling delay for hours, but then it suddenly jumps up to multiple minutes and spirals out of control. This happens after a while even if we double the batch interval.
We are not sure what causes the delay to happen (theories include garbage collection). The cluster has generally low CPU utilization regardless of whether we use 3, 5 or 10 slaves.
We are really reluctant to further increase the batch interval, since the delay is zero for such long periods. Are there any techniques to improve recovery time from a sudden spike in scheduling delay? We've tried seeing if it will recover on its own, but it takes hours if it even recovers at all.
Open the batch links and identify which stages are delayed. Is there any external access to other DBs/applications that is contributing to the delay?
Go into each job and look at the data/records processed by each executor; you can often spot problems there.
There may also be skew across data partitions. If the application reads data from Kafka and processes it, the data can be skewed across cores when the partitioning is not well defined. Tune the relevant parameters: the number of Kafka partitions, RDD partitions, executors, and executor cores.
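As a rough illustration in PySpark (assuming the pre-3.0 pyspark.streaming.kafka API; the topic name, broker address, batch interval, and repartition factor are placeholders), one common first step is to repartition the stream right after ingestion so records are spread evenly across executor cores:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils   # available in Spark releases before 3.0

    sc = SparkContext(appName="streaming-skew-demo")
    ssc = StreamingContext(sc, batchDuration=30)      # 30-second batches (placeholder)

    # Direct stream: by default there is one RDD partition per Kafka partition,
    # so skewed Kafka partitions become skewed Spark tasks.
    stream = KafkaUtils.createDirectStream(
        ssc, ["events"], {"metadata.broker.list": "broker1:9092"})

    # Spread the records across more partitions so each executor core gets a
    # comparable share of the work (at the cost of a shuffle).
    balanced = stream.repartition(sc.defaultParallelism * 3)

    balanced.map(lambda kv: kv[1]).count().pprint()

    ssc.start()
    ssc.awaitTermination()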

Determining expiry - distributed nodes - without syncing the clocks

I have the following problem:
A leader server creates objects which have a start time and end time. The start time and end time are set when an object gets created.
The start time of the object is set to the current time on the leader node, and the end time is set to start time + delta time.
A thread wakes up regularly and checks whether the end time of any object is earlier than the current time (meaning the object has expired); if so, the object needs to be deleted.
All this works fine as long as things are running smoothly on the leader node. If the leader node goes down, one of the follower nodes becomes the new leader. (There will be replication between the leader and follower nodes, via the RAFT algorithm.)
Now, on the new leader, the time could be very different from the time on the previous leader, so the computation in step 3 could be misleading.
One way to solve this problem, is to keep the clocks of nodes (leader and followers) in sync (as much as possible).
But I am wondering if there is any other way to resolve this problem of "expiry" with distributed nodes?
Further Information:
Will be using the RAFT protocol for message passing and state replication
There will be known bounds on message delay between processes
Leader and follower failures will be tolerated (as per the RAFT protocol)
Message loss is assumed not to occur (RAFT ensures this)
The operation on objects is to check whether they are alive. Objects will be enqueued by a client.
There will be strong consistency among processes (RAFT provides this)
I've seen expiry done in two different ways. Both of these methods guarantee that time will not regress, as can happen when synchronizing clocks via NTP or otherwise relying on the system clock. In particular, both methods use the chip clock for strictly increasing time (System.nanoTime in Javaland).
These methods are only for expiry: time does not regress, but it may appear to run slower.
First Method
The first method works because you are using a raft cluster (or a similar protocol). It works by broadcasting an ever-increasing clock from the leader to the replicas.
Each peer maintains what we'll call the cluster clock that runs at near real time. The leader periodically broadcasts the clock value via raft.
When a peer receives this clock value it records it, along with the current chip clock value. When the peer is elected leader it can determine the duration since the last clock value by comparing its current chip clock with the last recorded chip clock value.
Bonus 1: Instead of having a new transition type, the cluster clock value may be attached to every transition, and during quiet periods the leader makes no-op transitions just to move the clock forward. If you can, attach these to the raft heartbeat mechanism.
Bonus 2: I maintain some systems where time is increased for every transition, even within the same time quantum. In other words, every transition has a unique timestamp. For this to work without moving time forward too quickly your clock mechanism must have a granularity that can cover your expected transition rate. Milliseconds only allow for 1,000 tps, microseconds allow for 1,000,000 tps, etc.
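A minimal sketch of the bookkeeping each peer would do for this cluster clock (Python, using time.monotonic_ns() as the "chip clock"; the raft broadcast itself is left as a stub and the names are illustrative):

    import time

    class ClusterClock:
        """Tracks the leader-broadcast cluster clock against the local chip clock."""

        def __init__(self):
            self.last_cluster_time_ms = 0               # last value applied via raft
            self.last_local_ns = time.monotonic_ns()    # chip clock at that moment

        def on_clock_broadcast(self, cluster_time_ms):
            # Called when a raft transition carrying the cluster clock is applied.
            self.last_cluster_time_ms = cluster_time_ms
            self.last_local_ns = time.monotonic_ns()

        def now_ms(self):
            # Cluster time = last broadcast value + time elapsed on the local chip clock.
            elapsed_ms = (time.monotonic_ns() - self.last_local_ns) // 1_000_000
            return self.last_cluster_time_ms + elapsed_ms

    # On the leader, periodically replicate now_ms() through raft (or attach it to
    # each transition); a newly elected leader resumes from its own recorded values.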
Second Method
Each peer merely records its chip clock when it receives each object and stores it along with each object. This guarantees that peers will never expire an object before the leader, because the leader records the time stamp and then sends the object over a network. This creates a strict happens-before relationship.
This second method is susceptible, however, to server restarts. Many chips and processing environments (e.g. the JVM) will reset the chip-clock to a random value on startup. The first method does not have this problem, but is more expensive.
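A sketch of the second method under the same assumptions (the Entry type and TTL field are illustrative; time.monotonic_ns() again stands in for the chip clock):

    import time
    from dataclasses import dataclass

    @dataclass
    class Entry:
        payload: bytes
        received_ns: int     # local chip clock when this peer received the object
        ttl_ns: int          # delta after which the object expires

    def on_receive(payload, ttl_ns):
        # Each peer stamps the object with its own chip clock on arrival, so a
        # follower can never consider the object expired before the leader does.
        return Entry(payload, time.monotonic_ns(), ttl_ns)

    def is_expired(entry):
        return time.monotonic_ns() - entry.received_ns >= entry.ttl_ns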
If you know your nodes are synchronized to some absolute time, within some epsilon, the easy solution is probably to just bake the epsilon into your garbage collection scheme. Normally with NTP, the epsilon is somewhere around 1ms. With a protocol like PTP, it would be well below 1ms.
Absolute time doesn't really exist in distributed systems, though. Depending on it can become a bottleneck, since it implies that all the nodes need to communicate. One way of avoiding it, and synchronization in general, is to keep a relative sequence of events using a vector clock or an interval tree clock. This avoids the need to synchronize on absolute time as state. Since the sequences describe related events, the implication is that only nodes with related events need to communicate.
So, with garbage collection, objects could be marked stale using node sequence numbers. Then, instead of the garbage collector thread checking liveness, the object could either be collected as the sequence number increments, or just marked stale and collected asynchronously.
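For the epsilon variant, the change to the collector is small; a sketch, assuming you know the synchronization bound for your cluster (EPSILON_SECONDS is illustrative):

    import time

    EPSILON_SECONDS = 0.001   # e.g. roughly 1 ms for NTP, much less for PTP

    def safe_to_collect(end_time_epoch_seconds):
        # Only reclaim an object once it is expired even under the worst-case
        # clock disagreement between nodes.
        return time.time() >= end_time_epoch_seconds + EPSILON_SECONDS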

Hadoop data nodes die very often

Our Hadoop cluster has 5 data nodes and 2 name nodes. The traffic is very high and a few nodes go down quite often, but they come back after a while. Sometimes it takes a long time, more than half an hour, for a node to come back up.
There are a few DNs with more threads than the others. Is this a configuration problem?
The data is not write intensive. MR jobs run every 20 minutes.
After running a health monitor for two days, sampling at half-hour intervals, we found that the nodes die during the disk verification that runs every 6 hours. So now the nodes die predictably. But why do they die during disk verification, and is there any way to prevent it?
Cloudera's capacity planning guidance gives some insight into this. If you see "Bad connect ack with firstBadLink", "Bad connect ack", "No route to host", or "Could not obtain block" IO exceptions under heavy load, chances are they are due to a bad network.

Spreading/smoothing periodic tasks out over time

I have a database table with N records, each of which needs to be refreshed every 4 hours. The "refresh" operation is pretty resource-intensive. I'd like to write a scheduled task that runs occasionally and refreshes them, while smoothing out the spikes of load.
The simplest task I started with is this (pseudocode):
every 10 minutes:
    find all records that haven't been refreshed in 4 hours
    for each record:
        refresh it
        set its last refresh time to now
(Technical detail: "refresh it" above is asynchronous; it just queues a task for a worker thread pool to pick up and execute.)
What this causes is a huge resource (CPU/IO) usage spike every 4 hours, with the machine idling the rest of the time. Since the machine also does other stuff, this is bad.
I'm trying to figure out a way to get these refreshes more or less evenly spaced out -- that is, I'd want around N / (4 hours / 10 minutes) = N/24 of those records to be refreshed on every run. Of course, it doesn't need to be exact.
Notes:
I'm fine with the algorithm taking time to start working (so say, for the first 24 hours there will be spikes but those will smooth out over time), as I only rarely expect to take the scheduler offline.
Records are constantly being added and removed by other threads, so we can't assume anything about the value of N between iterations.
I'm fine with records being refreshed every 4 hours +/- 20 minutes.
Do a full refresh, to get all your timestamps in sync. From that point on, every 10 minutes, refresh the oldest N/24 records.
The load will be steady from the start, and after 24 runs (4 hours), all your records will be updating at 4-hour intervals (if N is fixed). Insertions will decrease refresh intervals; deletions may cause increases or decreases, depending on the deleted record's timestamp. But I suspect you'd need to be deleting quite a lot (like, 10% of your table at a time) before you start pushing anything outside your 40-minute window. To be on the safe side, you could do a few more than N/24 each run.
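A minimal sketch of that approach, assuming a hypothetical db data-access object (count_records, fetch_oldest, enqueue_refresh, and set_last_refresh_now are placeholder helpers, not an existing API):

    import math

    RUNS_PER_PERIOD = 24     # one 4-hour period divided into 10-minute runs
    SAFETY_FACTOR = 1.1      # refresh slightly more than N/24 to absorb churn

    def refresh_batch(db):
        """Run every 10 minutes; db is the placeholder data-access object."""
        n = db.count_records()
        batch_size = math.ceil(n / RUNS_PER_PERIOD * SAFETY_FACTOR)
        # Oldest-first: once steady state is reached, every record stays within
        # its 4-hour budget regardless of how the timestamps started out.
        for record in db.fetch_oldest(limit=batch_size):
            db.enqueue_refresh(record)        # hands off to the worker pool
            db.set_last_refresh_now(record)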
Each minute:
    take all records older than 4:10 and refresh them
    if the previous step did not find a lot of records:
        take some of the oldest records older than 3:40 and refresh them
This should eventually make the last-update times more evenly spaced out. What "a lot" and "some" mean is up to you to decide (possibly based on N).
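A sketch of this two-threshold idea, using the same placeholder data-access object as above (the thresholds and per-run target are illustrative):

    from datetime import timedelta

    HARD_LIMIT = timedelta(hours=4, minutes=10)   # must refresh now
    SOFT_LIMIT = timedelta(hours=3, minutes=40)   # may refresh early to spread load

    def refresh_minutely(db, target_per_run):
        overdue = db.fetch_older_than(HARD_LIMIT)
        for record in overdue:
            db.enqueue_refresh(record)
        # If this run was light, pull forward some of the oldest "soft" records so
        # the timestamps gradually spread out across the 4-hour window.
        shortfall = max(0, target_per_run - len(overdue))
        for record in db.fetch_older_than(SOFT_LIMIT, limit=shortfall, oldest_first=True):
            db.enqueue_refresh(record)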
Give each record its own refresh interval: a random duration between 3:40 and 4:20.
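For this per-record jitter idea, a one-line sketch of assigning the interval (assumed here to be picked once at record creation; the 220-260 minute bounds correspond to 3:40-4:20):

    import random
    from datetime import timedelta

    def pick_refresh_interval():
        # Anywhere between 3h40m and 4h20m, chosen once when the record is created.
        return timedelta(minutes=random.randint(220, 260))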
