Namenode doesn't detect datanodes failure - hadoop

I have set up a Hadoop high-availability cluster with 3 master nodes (running 3 journal nodes, the active namenode, and the standby namenode; there is no secondary namenode) and 3 datanodes.
Using the commands
hadoop-daemon.sh start journalnode
hadoop-daemon.sh start namenode
hadoop-daemon.sh start zkfc
I start the namenode-side services, and with hadoop-daemon.sh start datanode I start the datanode services.
The problem: when I intentionally stop a datanode with hadoop-daemon.sh stop datanode, the namenode WebUI (on both the active and the standby namenode) still shows it as a live node even after several minutes, so it seems the namenodes don't detect the datanode's failure!
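As a quick check that does not depend on the WebUI, you can ask the namenode for its datanode report directly (a minimal sketch, assuming the client's configuration points at the active namenode):
hdfs dfsadmin -report
The report lists each datanode with its last contact time and whether it is live or dead, so you can watch whether the stopped node is eventually marked dead.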

For future readers, from here:
A datanode is considered stale when:
dfs.namenode.stale.datanode.interval < last contact < (2 * dfs.namenode.heartbeat.recheck-interval)
In the NameNode UI Datanodes tab, a stale datanode will stand out due to having a larger value for the Last contact among live datanodes (also available in JMX output). When a datanode is stale, it will be given lowest priority for reads and writes.
Using default values, the namenode will consider a datanode stale when its heartbeat is absent for 30 seconds. After another 10 minutes without a heartbeat (10.5 minutes total), a datanode is considered dead.
Relevant properties include:
dfs.heartbeat.interval - default: 3 seconds
dfs.namenode.stale.datanode.interval - default: 30 seconds
dfs.namenode.heartbeat.recheck-interval - default: 5 minutes
dfs.namenode.avoid.read.stale.datanode - default: true
dfs.namenode.avoid.write.stale.datanode - default: true
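With the defaults, the dead-node timeout works out to 2 * 300 s + 10 * 3 s = 630 s, i.e. the 10.5 minutes mentioned above. If you want a stopped datanode to be marked dead sooner, a minimal hdfs-site.xml sketch would be the following (the 45000 ms value is only an illustration, not a recommendation; the property is in milliseconds):
<property>
<name>dfs.namenode.heartbeat.recheck-interval</name>
<value>45000</value>
</property>
With this value the dead timeout becomes 2 * 45 s + 10 * 3 s = 120 s. Both namenodes need the change and a restart for it to take effect.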

Related

Hadoop 2.7.5 Yarn HA conflict and Bug

I have set up a Hadoop HA cluster with standby nodes for both the namenode and the resourcemanager, so that both the namenode and resourcemanager processes run on node A and on node B, with one node acting as standby for each role.
When I shut down the active node (A or B), the other becomes the active node (tested). But if I start only one node (A or B), the namenode process is fine while the resourcemanager process is not responsive. The logs show the following:
2018-01-27 17:01:43,371 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/192.168.1.210:8020. Already tried 68 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
Node A (192.168.1.210) is the node that was not started. The resourcemanager is stuck in a loop connecting to node A, while the namenode process is active on node B.
I set the following property to decrease the (500x2000 ms) retry (referenced here):
<property>
<name>yarn.resourcemanager.fs.state-store.retry-policy-spec</name>
<value>1000,10</value>
<!-- <value>2000,10</value> also checked -->
</property>
But it has no effect on the resourcemanager's behavior!
Is this a bug, or am I doing something wrong?
Since Hadoop 2.8 the property yarn.node-labels.fs-store.retry-policy-spec has been added to control that situation. Adding the following property solved the problem:
<property>
<name>yarn.node-labels.fs-store.retry-policy-spec</name>
<value>1000, 10</value>
</property>
Now it retries 10 times, sleeping 1000 ms between attempts, and after that it switches over to the other namenode.
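As a quick way to confirm the failover, you can query each ResourceManager's state with the rmadmin client (a sketch, assuming the RM ids configured in yarn-site.xml are rm1 and rm2):
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
Each call prints active or standby for the given ResourceManager id.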

jps command shows DFSAdmin process

I am using Apache Hadoop 2.7.1 on a CentOS 7 environment, and I have an HA cluster which consists of two namenodes (mn1 and mn2) and 6 datanodes.
Issuing jps on mn1 shows:
34734 DFSZKFailoverController
34245 NameNode
31529 DFSAdmin
34551 JournalNode
34822 Jps
3857 QuorumPeerMain
and issuing jps on mn2 shows
26272 JournalNode
26483 Jps
26110 NameNode
26388 DFSZKFailoverController
2259 QuorumPeerMain
What does the DFSAdmin process in mn1's jps output refer to?
I noticed that this DFSAdmin process appeared in the following scenario:
the number of failed journal nodes exceeded the number of failures the cluster can tolerate,
which is given by the formula (N-1)/2, where N is the number of journal nodes (so with N = 3 journal nodes, at most 1 may fail).
As a result the cluster stopped working and the active namenode shut down.
When I then started an acceptable number of journal nodes and the namenode again,
jps on the active namenode showed the DFSAdmin process,
and this problem was solved by restarting all cluster services again.
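Before restarting everything, you can also inspect what that DFSAdmin JVM actually is by looking at its full command line (a sketch, using the PID 31529 from the jps output above):
ps -fp 31529
DFSAdmin is the main class behind the hdfs dfsadmin client commands, so the command line will typically show a dfsadmin invocation that never returned and can simply be killed.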

Running Hadoop in full-distributed mode in a 5-machines cluster takes more time than in a single machine

I am running Hadoop on a cluster of 5 machines (1 master and 4 slaves). I am running a MapReduce algorithm for friends-in-common recommendation, and I am using a file with 49995 lines (i.e. 49995 people, each followed by their friends).
The problem is that it takes more time to execute the algorithm on the cluster than on one machine!
I don't know whether this is normal because the file is not big enough (so the run is slower due to latency between machines), or whether I must change something so the algorithm runs in parallel on the different nodes, though I think that is done automatically.
Typically, running the algorithm on one machine takes this:
real 3m10.044s
user 2m53.766s
sys 0m4.531s
While on the cluster it takes this time:
real 3m32.727s
user 3m10.229s
sys 0m5.545s
Here is the output when I execute the start-all.sh script on the master:
ubuntu#ip:/usr/local/hadoop-2.6.0$ sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [master]
master: starting namenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-namenode-ip-172-31-37-184.out
slave1: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave1.out
slave2: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave2.out
slave3: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave3.out
slave4: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave4.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-secondarynamenode-ip-172-31-37-184.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-resourcemanager-ip-172-31-37-184.out
slave4: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave4.out
slave1: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave1.out
slave3: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave3.out
slave2: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave2.out
And here is the output when I execute the stop-all.sh script:
ubuntu#ip:/usr/local/hadoop-2.6.0$ sbin/stop-all.sh
This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
Stopping namenodes on [master]
master: stopping namenode
slave4: no datanode to stop
slave3: stopping datanode
slave1: stopping datanode
slave2: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
stopping yarn daemons
stopping resourcemanager
slave2: no nodemanager to stop
slave3: no nodemanager to stop
slave4: no nodemanager to stop
slave1: no nodemanager to stop
no proxyserver to stop
Thank you in advance!
One possible reason is that your file is not uploaded to HDFS. In other words, it is stored on a single machine, and all the other running machines have to get their data from that machine.
Before you run your MapReduce program, you can do the following steps:
1- Make sure that HDFS is up and running. Open the link:
master:50070
where master is the IP of the node running the namenode, and check on that page that you have all the nodes live and running. So if you have 4 datanodes you should see: Datanodes (4 live).
2- Call:
hdfs dfs -put yourfile /someFolderOnHDFS/yourfile
That way you have uploaded your input file to HDFS and the data is now distributed among multiple nodes.
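Optionally, you can confirm that the blocks really ended up on several datanodes (a sketch, reusing the example path from step 2):
hdfs fsck /someFolderOnHDFS/yourfile -files -blocks -locations
The output lists every block of the file together with the datanodes holding its replicas.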
Try running your program now and see if it is faster
Best of luck

Hadoop: How to start secondary namenode on other node?

I'm trying to build a Hadoop cluster which consists of 1 namenode, 1 secondary namenode, and 3 datanodes in EC2.
So I wrote the address of the secondary namenode in the masters file and executed start-dfs.sh.
:~/hadoop/etc/hadoop$ cat masters
ec2-54-187-222-213.us-west-2.compute.amazonaws.com
But the secondary namenode didn't start at the address written in the masters file. It just started on the node where the start-dfs.sh script was executed.
:~/hadoop/etc/hadoop$ start-dfs.sh
...
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/ubuntu/hadoop/logs/hadoop-ubuntu-secondarynamenode-ip-172-31-26-190.out
I don't understand why the secondary namenode started at [0.0.0.0]. It should have started at ec2-54-187-222-213.us-west-2.compute.amazonaws.com.
Does anyone know the reason?
============================================================
Oh I solved this problem. I added
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>ec2-54-187-222-213.us-west-2.compute.amazonaws.com:50090</value>
</property>
to the hdfs-site.xml file and it works! The masters file is not used for this.
It is okay, as long as the node roles are configured correctly in the Hadoop configuration. You can use dfsadmin to check the IP address of the secondary namenode; if it is 172.31.26.190 then everything is fine. The secondary namenode serving at 0.0.0.0 means that it accepts incoming connections from localhost or from any node within the network.
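To double-check where start-dfs.sh will launch the secondary namenode after this change, you can ask the configuration tooling directly (a sketch; this reads the same dfs.namenode.secondary.http-address setting the startup script relies on):
hdfs getconf -secondaryNameNodes
It should now print ec2-54-187-222-213.us-west-2.compute.amazonaws.com instead of 0.0.0.0.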

Couldn't see RegionServer in Terminal-LINUX-HBASE

I installed Hadoop and my HBase is running on top of it. All my Hadoop daemons are up and running. After I started HBase, I could see the HMaster running when I issued the jps command.
I'm running my Hadoop in pseudo-distributed mode. When I checked my localhost, it shows the regionserver is running.
But why can't I see the HRegionServer running in my terminal in Linux?
It might be because hbase.cluster.distributed is not set, or is set to false, in hbase-site.xml.
According to http://hbase.apache.org/book/config.files.html :
hbase.cluster.distributed:
The mode the cluster will be in. Possible values are false for standalone mode and true for distributed mode. If false, startup will run all HBase and ZooKeeper daemons together in the one JVM. Default: false
So if you set it to true you'll see the distinct master, region server and ZooKeeper processes. E.g: a pseudo-distributed Hadoop/HBase process list would look like this:
jps
3991 HMaster
4209 HRegionServer
3140 DataNode
3464 TaskTracker
3246 JobTracker
2942 NameNode
3924 HQuorumPeer
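For completeness, a minimal hbase-site.xml sketch for pseudo-distributed mode on top of HDFS (the hbase.rootdir host and port below are an assumption, adjust them to your namenode address):
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
After restarting HBase, jps should show HMaster and HRegionServer as separate processes, similar to the listing above.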
