Our Hadoop cluster has 5 DataNodes and 2 NameNodes. Traffic is very high and a few nodes go down quite often, but they come back after a while. Sometimes it takes a long time, more than half an hour, for a node to come back up.
A few DataNodes have more threads than the others. Is this a configuration problem?
The workload is not write-intensive; MR jobs run every 20 minutes.
After running a health monitor for two days, sampling at half-hour intervals, we found that the nodes die during the disk verification that runs every 6 hours. So now the nodes die predictably, but why do they die during disk verification? Is there any way to prevent the nodes from dying during disk verification?
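For reference, here is a minimal sketch of the kind of liveness monitor described above, assuming a simple HTTP check against each DataNode's web UI (port 50075 on older Hadoop 2.x releases, 9864 on Hadoop 3.x); the hostnames are placeholders, not the actual cluster's:

```python
# Hedged sketch of a DataNode liveness monitor polling each node's web UI
# every half hour and logging UP/DOWN transitions. Hostnames and the web UI
# port are assumptions; adjust for your cluster and Hadoop version.
import time
import urllib.request
from datetime import datetime

DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]  # hypothetical hostnames
WEB_UI_PORT = 50075                              # Hadoop 2.x default; 9864 on Hadoop 3.x
INTERVAL_SECONDS = 30 * 60                       # half-hour sampling interval, as in the question

def is_alive(host):
    try:
        # Any daemon that answers its /jmx endpoint is considered alive.
        urllib.request.urlopen(f"http://{host}:{WEB_UI_PORT}/jmx", timeout=10)
        return True
    except OSError:
        return False

while True:
    for host in DATANODES:
        status = "UP" if is_alive(host) else "DOWN"
        print(f"{datetime.now().isoformat()} {host} {status}")
    time.sleep(INTERVAL_SECONDS)
```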
Cloudera's capacity planning guidance gives some insight into this. If you see “Bad connect ack with firstBadLink”, “Bad connect ack”, “No route to host”, or “Could not obtain block” IO exceptions under heavy load, chances are they are due to a bad network.
Related
When I repeatedly switch data nodes in high-availability mode, the terminal keeps reporting that the transaction is occupied, and then recovers after a few minutes. Why does this happen?
Regarding repeatedly switching data nodes and the "transaction is occupied" prompts, the mechanism is as follows. After a data node goes down, it takes some time for it to synchronize transaction status with the control node; the transaction timeout is about two minutes. In addition, some time is spent on data recovery, which depends on the volume of data involved. If you switch back to this node again within a short time, you will be told that the transaction is occupied, and it will recover in about two minutes. In actual production, the probability that a node breaks down, and especially that several different nodes go down within a few minutes of each other, is extremely small. For high-availability tests, it is therefore recommended to leave more than two minutes between simulated failures, and preferably more than five minutes before simulating the failure of the next node.
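A hedged sketch of an HA failover test that follows this advice: take one node down at a time and wait well past the two-minute transaction timeout (five minutes or more) before failing the next node. The hostnames, ssh access, and the stop/start commands are placeholders for whatever your deployment actually uses.

```python
# Hypothetical failover test loop; substitute real hostnames and the real
# commands that stop/start a data node in your environment.
import subprocess
import time

DATA_NODES = ["datanode1", "datanode2", "datanode3"]  # hypothetical hosts
DOWNTIME = 60                                         # keep each node down for one minute
FAILURE_INTERVAL = 5 * 60                             # >= 5 minutes between simulated failures

def run_on(host, command):
    # Placeholder: run a shell command on the remote node via ssh.
    subprocess.run(["ssh", host, command], check=True)

for host in DATA_NODES:
    run_on(host, "sudo systemctl stop datanode.service")   # hypothetical service name
    time.sleep(DOWNTIME)
    run_on(host, "sudo systemctl start datanode.service")
    # Allow transaction-state sync and data recovery to finish before the next failure.
    time.sleep(FAILURE_INTERVAL)
```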
We're using a 3-node Vertica cluster.
The network connection between the nodes sometimes fails for a short time (e.g. 10 seconds).
When this happens, all nodes quickly shut down as soon as they detect that the other nodes are unreachable (because k-safety can no longer be satisfied). For example, the following sequence is recorded in the Vertica log by node0003:
00:04:30.633 node v_feedback_node0001 left the cluster
...
00:04:30.670 Node left cluster, reassessing k-safety...
...
00:04:32.389 node v_feedback_node0002 left the cluster
...
00:04:32.414 Changing node v_feedback_node0003 startup state from UP to UNSAFE
...
00:04:33.425 Shutting down this node
...
00:04:38.547 node v_feedback_node0003 left the cluster
Is it possible to configure a delay after which each node will try to reconnect to the others before giving up and shutting down?
Got an answer from a Vertica employee on the Vertica forum.
This [reconnection delay] time is hard coded to 8 seconds.
I think time is better spent making the network more reliable. 30 sec of network failure is a lot (I mean really, really large; typically network RTT is in the microseconds). Even if you kept Vertica up by delaying the k-safe assessment, nothing really can connect to the database, or most likely all DB connections may reset.
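Since the reconnection delay cannot be tuned, one practical workaround is to detect these short outages quickly from the log events shown above. A minimal sketch; the log path is an assumption, and the message patterns are taken from the excerpt:

```python
# Hedged sketch: tail vertica.log and flag "left the cluster" /
# "Shutting down this node" events so brief network failures are noticed
# immediately. Adjust LOG_PATH to your catalog directory.
import time

LOG_PATH = "/path/to/catalog/v_feedback_node0003_catalog/vertica.log"  # assumed path
PATTERNS = ("left the cluster", "Shutting down this node")

with open(LOG_PATH, "r") as log:
    log.seek(0, 2)  # start at the end of the file
    while True:
        line = log.readline()
        if not line:
            time.sleep(1)
            continue
        if any(p in line for p in PATTERNS):
            print("cluster event:", line.strip())  # hook real alerting in here
```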
We have a Spark Streaming application that has essentially zero scheduling delay for hours, but then it suddenly jumps up to multiple minutes and spirals out of control. This happens after a while even if we double the batch interval.
We are not sure what causes the delay to happen (theories include garbage collection). The cluster has generally low CPU utilization regardless of whether we use 3, 5 or 10 slaves.
We are really reluctant to further increase the batch interval, since the delay is zero for such long periods. Are there any techniques to improve recovery time from a sudden spike in scheduling delay? We've tried seeing if it will recover on its own, but it takes hours if it even recovers at all.
Open the batch links and identify which stages are delayed. Is there any external access to other DBs/applications that is contributing to the delay?
Go into each job and look at the data/records processed by each executor; you can often spot problems there.
There may also be skew across data partitions. If the application reads data from Kafka and processes it, there can be skew across cores when the partitioning is not well defined. Tune these parameters: the number of Kafka partitions, the number of RDD partitions, the number of executors, and the number of executor cores.
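As a rough illustration, here is a hedged PySpark sketch of rebalancing a Kafka direct stream so each batch's work is spread evenly across executor cores. The topic, broker address, batch interval and partition count are assumptions, and the pyspark.streaming.kafka module only exists in the older DStream-based Spark releases:

```python
# Hypothetical sketch: repartition each micro-batch so a skewed Kafka topic
# does not concentrate work on a few cores.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # older Spark DStream Kafka integration

sc = SparkContext(appName="DelaySpikeDemo")
ssc = StreamingContext(sc, 20)  # 20-second batch interval (assumed)

# Direct stream: one RDD partition per Kafka partition, so a skewed topic
# produces skewed RDD partitions.
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["events"],                                    # assumed topic name
    kafkaParams={"metadata.broker.list": "broker:9092"},  # assumed broker
)

# Repartition each batch to roughly num_executors * cores_per_executor
# partitions so the load spreads evenly.
balanced = stream.repartition(40)

balanced.map(lambda kv: kv[1]).count().pprint()

ssc.start()
ssc.awaitTermination()
```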
I've read in RethinkDB's docs that we can have anywhere from one to sixteen nodes, but I don't know whether that is just a manner of speaking or a real limit.
I launched 20 VirtualBox VMs to create a cluster and had trouble getting all the nodes online at the same time; 3 or 4 nodes lose connectivity. That would be consistent with a 16-node limit, but I haven't found similar limits for other NoSQL databases.
Is 16 a real limit on the maximum number of nodes per cluster in RethinkDB?
thanks!
Short answer is: There is no hard limit.
The docs say 16 machines because that is what we have tested so far.
Some tests have been run with 64 nodes and while it doesn't scale as much as it should, it still works.
RethinkDB is aiming for a smooth experience with 100 servers and 100,000 tables -- see https://github.com/rethinkdb/rethinkdb/issues/1861 to track progress.
Also, if you run 20 VMs on the same machine, the host may not have enough resources to run the cluster, which would explain the timeouts.
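If you want to see which servers the cluster currently considers connected, and any reported issues, RethinkDB's system tables can be queried from the Python driver. A hedged sketch; the host/port are assumptions and the import style varies across driver versions:

```python
# Hedged sketch: list connected servers and current cluster issues via the
# "rethinkdb" system database. Adjust host/port for your setup.
from rethinkdb import RethinkDB

r = RethinkDB()
conn = r.connect(host="localhost", port=28015)

# Each row in server_status corresponds to a server the cluster can reach.
servers = list(r.db("rethinkdb").table("server_status").pluck("name").run(conn))
print("connected servers:", sorted(s["name"] for s in servers))

# current_issues reports problems such as unreachable servers.
for issue in r.db("rethinkdb").table("current_issues").run(conn):
    print("issue:", issue.get("type"), issue.get("description"))
```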
I'm currently rebuilding the servers that host our region servers and DataNodes. When I take down a DataNode, after 10 minutes the blocks it held start being re-replicated among the other DataNodes, as they should. We have 10 DataNodes, so I see heavy network traffic as the blocks are re-replicated. However, that traffic is only about 500-600 Mbps per server (the machines all have gigabit interfaces), so it's definitely not network-bound. I'm trying to figure out what limits the speed at which the DataNodes send and receive blocks. Each DataNode has six 7,200 RPM SATA drives, and IO usage is very low during this, peaking at only 20-30% per drive. Is there a limit built into HDFS that restricts the speed at which blocks are replicated?
HDFS throttles the rate of re-replication work so that it does not interfere with regular cluster traffic when failures happen during normal cluster load.
The properties that control this are dfs.namenode.replication.work.multiplier.per.iteration (default 2), dfs.namenode.replication.max-streams (default 2) and dfs.namenode.replication.max-streams-hard-limit (default 4). The first controls the amount of replication work scheduled to a DataNode at every heartbeat, and the other two limit the maximum number of parallel threaded network transfers a DataNode performs at a time. Some description of these is available at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
You can try increasing these values to 10, 50 and 100 respectively to drive network usage higher (this requires a NameNode restart), but note that DataNode memory usage may increase slightly as a result of more block information being propagated to it. A reasonable DataNode heap size for these values would be about 4 GB.
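For illustration, the overrides might look roughly like this in hdfs-site.xml on the NameNode (the property names and the 10/50/100 values are the ones mentioned above, not universally recommended settings):

```xml
<!-- Sketch of hdfs-site.xml overrides to speed up re-replication; restart the
     NameNode after changing them. -->
<property>
  <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
  <value>10</value>
</property>
<property>
  <name>dfs.namenode.replication.max-streams</name>
  <value>50</value>
</property>
<property>
  <name>dfs.namenode.replication.max-streams-hard-limit</name>
  <value>100</value>
</property>
```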
P.S. I have not personally tried these values on production systems. You also do not want to max out the re-replication workload to the point that it affects regular cluster work, since recovering 1 of 3 replicas is likely a lower priority than avoiding missed job/query SLAs caused by a lack of network resources (unless you have a really fast network that is always under-utilised even during loaded periods). Tune it until you're satisfied with the results.