Hadoop slave cannot connect to master, even when service is running and ports are open - hadoop

I'm running Hadoop 2.5.1 and I'm having a problem when slaves connect to the master. My goal is to set up a Hadoop cluster. I hope someone can help; I've been pondering this for too long already! :)
This is what shows up in the slave's log file:
2014-10-18 22:14:07,368 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: master/192.168.0.104:8020
This is my core-site.xml file (same on master and slave):
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master/</value>
  </property>
</configuration>
This is my hosts file ((almost) the same on master and slave). I have hard-coded the addresses there without any success:
127.0.0.1 localhost
192.168.0.104 xubuntu: xubuntu
192.168.0.104 master
192.168.0.194 slave
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
Netstat output from master:
xubuntu@xubuntu:/usr/local/hadoop/logs$ netstat -atnp | grep 8020
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 192.168.0.104:8020 0.0.0.0:* LISTEN 26917/java
tcp 0 0 192.168.0.104:52114 192.168.0.104:8020 ESTABLISHED 27046/java
tcp 0 0 192.168.0.104:8020 192.168.0.104:52114 ESTABLISHED 26917/java
Nmap from master to master:
Starting Nmap 6.40 ( http://nmap.org ) at 2014-10-18 22:36 EEST
Nmap scan report for master (192.168.0.104)
Host is up (0.000072s latency).
rDNS record for 192.168.0.104: xubuntu:
PORT STATE SERVICE
8020/tcp open unknown
..and nmap from slave to master (even though the port is open, the slave doesn't connect to it):
ubuntu@ubuntu:/usr/local/hadoop/logs$ nmap master -p 8020
Starting Nmap 6.40 ( http://nmap.org ) at 2014-10-18 22:35 EEST
Nmap scan report for master (192.168.0.104)
Host is up (0.14s latency).
PORT STATE SERVICE
8020/tcp open unknown
What is this all about? The problem is not the firewall. I have also read every thread there is on this, without any success. I'm getting frustrated with this.. :(

At least one of your problems is that you are using the old configuration name for HDFS. For version 2.5.1 the configuration name should be fs.defaultFS instead of fs.default.name. I also suggest defining the port in the value, so the value would be hdfs://master:8020.
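For reference, a sketch of what the corrected core-site.xml (on both master and slave) would then look like:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:8020</value>
  </property>
</configuration>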
Sorry, I'm not a Linux guru, so I don't know about nmap, but does telnet'ing from the slave to the master on that port work?
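For example, running something like this from the slave would show whether the NameNode port is reachable at the TCP level:
telnet master 8020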

Related

Datanode denied communication with namenode because hostname cannot be resolved

I am running a Hadoop cluster in Kubernetes, with 4 journalnodes and 2 namenodes. Sometimes my datanodes cannot register with the namenodes.
17/06/08 07:45:32 INFO datanode.DataNode: Block pool BP-541956668-10.100.81.42-1496827795971 (Datanode Uuid null) service to hadoop-namenode-0.myhadoopcluster/10.100.81.42:8020 beginning handshake with NN
17/06/08 07:45:32 ERROR datanode.DataNode: Initialization failed for Block pool BP-541956668-10.100.81.42-1496827795971 (Datanode Uuid null) service to hadoop-namenode-0.myhadoopcluster/10.100.81.42:8020 Datanode denied communication with namenode because hostname cannot be resolved (ip=10.100.9.45, hostname=10.100.9.45): DatanodeRegistration(0.0.0.0:50010, datanodeUuid=b1babba6-9a6f-40dc-933b-08885cbd358e, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-bceaa23f-ba3d-4749-a542-74cda1e82e07;nsid=177502984;c=0)
at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.registerDatanode(DatanodeManager.java:863)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.registerDatanode(FSNamesystem.java:4529)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.registerDatanode(NameNodeRpcServer.java:1279)
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.registerDatanode(DatanodeProtocolServerSideTranslatorPB.java:95)
at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:28539)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
It says:
hadoop-namenode-0.myhadoopcluster/10.100.81.42:8020 Datanode denied communication with namenode because hostname cannot be resolved (ip=10.100.9.45, hostname=10.100.9.45)
However, I can ping hadoop-namenode-0.myhadoopcluster, 10.100.81.42, and 10.100.9.45 from both the datanode and the namenode.
/etc/hosts in datanode:
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
10.100.9.45 hadoop-datanode-0.myhadoopcluster.default.svc.cluster.local hadoop-datanode-0
/etc/hosts in namenode:
# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
10.100.81.42 hadoop-namenode-0.myhadoopcluster.default.svc.cluster.local hadoop-namenode-0
And I have already set dfs.namenode.datanode.registration.ip-hostname-check to false in hdfs-site.xml
I guess the problem may be related to DNS. In other similar questions, Hadoop is not deployed in Kubernetes or in a Docker container, so I posted this one. Please do not tag it as a duplicate...
In my situation, I included these three configuration properties on both the namenode and the datanode (sketched below):
dfs.namenode.datanode.registration.ip-hostname-check: false
dfs.client.use.datanode.hostname: false (default)
dfs.datanode.use.datanode.hostname: false (default)
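As a sketch (assuming these go into hdfs-site.xml on both the namenode and the datanode, as in the question), the entries would look roughly like:
<configuration>
  <property>
    <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.datanode.use.datanode.hostname</name>
    <value>false</value>
  </property>
</configuration>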
I hope you found a resolution to the issue by now.
I ran into a similar problem last week; my cluster is set up in a different environment, but the problem context is the same.
Essentially, reverse DNS lookup needs to be working to solve this issue. If the cluster uses a DNS resolver, this needs to be set up at the DNS server level; if the NameNodes look in the /etc/hosts file to find DataNodes, then there needs to be an entry for each DataNode there.
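In this question's setup, for example, that would mean the namenode's /etc/hosts gaining an entry for the datanode along these lines (reusing the address and name shown above):
10.100.9.45 hadoop-datanode-0.myhadoopcluster.default.svc.cluster.local hadoop-datanode-0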
I have updated an old question in the Hortonworks Community Forum; the link is below:
https://community.hortonworks.com/questions/24320/datanode-denied-communication-with-namenode.html?childToView=135321#answer-135321

Why suddenly the slave node lost connection to the master node in hadoop?

I have set up Hadoop 2.7.2 on a cluster with a master (Ubuntu 15.10) and two slaves (slave2, slave3) hosted on the master in VirtualBox.
I have run several examples like wordcount, and they all work fine. But when I try to run my own job, say Myjob, it runs well at first, but after a while it always gets interrupted by this error:
INFO ipc.Client: Retrying connect to server: slave3/xxx.216.227.176(the ip of slave):38046. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
Sometimes it is slave2, sometimes slave3. And my ssh connection to that slave shows that the connection was closed by the remote host.
But VirtualBox shows that the slave is running fine, and I can ssh to it again. However, all the Hadoop processes have been killed.
I should mention that my own job runs longer than the example jobs.
At first, I thought it might be some error caused by my config files, so I reinstalled Hadoop on the master and slaves. But the error persists.
So I thought that maybe it is caused by the network config on the slave nodes. So I changed the last field of the slave's IP, e.g. from xxx.xxx.xxx.183 to xxx.xxx.xxx.176, and reinstalled Hadoop.
I reran the job, and this time it ran longer than usual. But at the end, when the map stage was mostly finished (map 86% reduce 28%), it failed with the same error!
INFO ipc.Client: Retrying connect to server: slave3/125.xxx.227.xxx(the ip of slave):38046. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
There is also something logged in yarn-user-resourcemanager-Master.log:
java.net.ConnectException: Call From Master/xxx.216.227.186 to slave2:44592 failed on connection exception: java.net.ConnectException: refuse to connect; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
It seems that the longer the app runs, the bigger the chance it will fail.
Here is my hosts file:
127.0.0.1 localhost
#127.0.1.1 Master
xxx.216.227.186 Master
xxx.216.227.185 slave1 # slave1 has some problem and thus does not connect to the cluster
xxx.216.227.176 slave2
xxx.216.227.166 slave3
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
Why? How to fix it? Thanks!

Hadoop yarn node list shows slaves as localhost.localdomain:#somenumber. connection refuse exception

I get a connection refused exception from localhost.localdomain/127.0.0.1 to localhost.localdomain:55352 when trying to run the wordcount program.
yarn node -list gives
hduser@localhost:/usr/local/hadoop/etc/hadoop$ yarn node -list
15/05/27 07:23:54 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.111.72:8040
Total Nodes:2
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
localhost.localdomain:32991 RUNNING localhost.localdomain:8042 0
localhost.localdomain:55352 RUNNING localhost.localdomain:8042 0
master /etc/hosts:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
#127.0.1.1 ubuntu-Standard-PC-i440FX-PIIX-1996
192.168.111.72 master
192.168.111.65 slave1
192.168.111.66 slave2
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
slave /etc/hosts:
127.0.0.1 localhost.localdomain localhost
#127.0.1.1 ubuntu-Standard-PC-i440FX-PIIX-1996
192.168.111.72 master
#192.168.111.65 slave1
#192.168.111.66 slave2
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
What I understand is that the master is wrongly trying to connect to the slaves on localhost. Please help me resolve this. Any suggestion is appreciated. Thank you.
Here is the code for how NodeManager builds the NodeId:
private NodeId buildNodeId(InetSocketAddress connectAddress,
    String hostOverride) {
  if (hostOverride != null) {
    connectAddress = NetUtils.getConnectAddress(
        new InetSocketAddress(hostOverride, connectAddress.getPort()));
  }
  return NodeId.newInstance(
      connectAddress.getAddress().getCanonicalHostName(),
      connectAddress.getPort());
}
NodeManager tries to get the canonical hostname from the binding address, and for the address 127.0.0.1 it gets localhost.
So in your case, on the slave hosts, localhost.localdomain is the default host name for the address 127.0.0.1, and a possible solution might be changing the first line of /etc/hosts on your slaves respectively to:
127.0.0.1 slave1 localhost.localdomain localhost
and
127.0.0.1 slave2 localhost.localdomain localhost
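As a quick way to see which name the NodeManager will pick up, a small stand-alone check along these lines (my own sketch, not part of Hadoop) can be run on a slave:
import java.net.InetAddress;

public class CanonicalHostCheck {
    public static void main(String[] args) throws Exception {
        // With "127.0.0.1 localhost.localdomain localhost" as the matching
        // /etc/hosts entry, this typically prints "localhost.localdomain" --
        // the same name that ends up in the NodeId.
        InetAddress addr = InetAddress.getByName("127.0.0.1");
        System.out.println(addr.getCanonicalHostName());
    }
}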

HBase RegionServer is not able to communicate with HMaster

I am not able to set up HBase in distributed mode. It works fine when I set it up on one machine (standalone mode). My ZooKeeper, HMaster and region server all start properly.
But when I go to the HBase shell and check the status, it shows me 0 region servers. I am attaching the logs of my region server, plus the hosts files of my master (namenode) and slave (datanode). I have tried every permutation and combination given on Stack Overflow for changing the hosts file, but nothing worked for me.
2013-06-24 15:03:45,844 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server naresh-pc/192.168.0.108:2181. Will not attempt to authenticate using SASL (unknown error)
2013-06-24 15:03:45,845 WARN org.apache.zookeeper.ClientCnxn: Session 0x13f75807d960001 for server null, unexpected error, closing socket connection and attempting to reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:692)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
Slave /etc/hosts :
127.0.0.1 localhost
127.0.1.1 ubuntu-pc
#ip for hadoop
192.168.0.108 master
192.168.0.126 slave
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
Master /etc/hosts :
127.0.0.1 localhost
127.0.1.1 naresh-pc
#ip for hadoop
192.168.0.108 master
192.168.0.126 slave
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
hbase-site.xml :
<configuration>
<property>
<name>hbase.master</name>
<value>master:60000</value>
<description>The host and port that the HBase master runs at.
A value of 'local' runs the master and a regionserver
in a single process.
</description>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://master:54310/hbase</value>
<description>The directory shared by region servers.</description>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
<description>The mode the cluster will be in. Possible values are
false: standalone and pseudo-distributed setups with managed
Zookeeper true: fully-distributed with unmanaged Zookeeper
Quorum (see hbase-env.sh)
</description>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
<description>Property from ZooKeeper's config zoo.cfg.
The port at which the clients will connect.
</description>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>master</value>
<description>Comma separated list of servers in the ZooKeeper Quorum.
For example,
"host1.mydomain.com,host2.mydomain.com".
By default this is set to localhost for local and
pseudo-distributed modes of operation. For a
fully-distributed setup, this should be set to a full
list of ZooKeeper quorum servers. If
HBASE_MANAGES_ZK is set in hbase-env.sh
this is the list of servers which we will start/stop
ZooKeeper on.
</description>
</property>
</configuration>
Zookeeper log:
2013-06-28 18:22:26,781 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x13f8ac0b91b0002, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:722)
2013-06-28 18:22:26,858 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /192.168.0.108:57447 which had sessionid 0x13f8ac0b91b0002
2013-06-28 18:25:21,001 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x13f8ac0b91b0002, timeout of 180000ms exceeded
2013-06-28 18:25:21,002 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x13f8ac0b91b0002
Master Log:
2013-06-28 18:22:41,932 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 1502022 ms
2013-06-28 18:22:43,457 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 1503547 ms
Remove 127.0.1.1 from the hosts file and turn off IPv6. That should fix the problem.
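For example (one common approach on Ubuntu; adjust to your distribution), IPv6 can be disabled by adding these lines to /etc/sysctl.conf and then running sudo sysctl -p:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Alternatively, some setups simply make the JVM prefer IPv4 by adding -Djava.net.preferIPv4Stack=true to HBASE_OPTS in hbase-env.sh.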
Your RegionServer is looking for the HMaster at naresh-pc, but you do not have any such entry in your /etc/hosts file. Please make sure your configuration is correct.
Can you try all of this:
Make sure your /conf/regionservers file has just one entry: slave
Not sure what HBase version you are using, but instead of using port 54310 for the hbase.rootdir property in your hbase-site.xml, use port 9000.
Your /etc/hosts file, on BOTH master and slave, should only have these custom entries:
127.0.0.1 localhost
192.168.0.108 master
192.168.0.126 slave
I am concerned that your logs state Opening socket connection to server naresh-pc/192.168.0.108:2181.
Clearly the system thinks that ZooKeeper is on host naresh-pc, but in your config you are setting the ZooKeeper quorum at host master, which HBase will bind to. That's a problem right there. In my experience, HBase is EXTREMELY fussy about host names, so make sure they are all in sync in all your configs and in your /etc/hosts file.
Also, this may be a minor issue, but it wouldn't hurt to specify the ZooKeeper data directory in your .xml file, so that you have the minimum set of settings that should make the cluster work: hbase.zookeeper.property.dataDir
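A sketch of such an entry (the directory path here is just an example; point it at a directory that exists on the ZooKeeper host):
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/var/lib/zookeeper</value>
</property>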

how to establish the RegionServer of Hbase to master

Please tell me how to get the HBase RegionServers connected to the master.
I configured 5 region servers; however, only 2 servers are working properly.
hbase(main):001:0> status
2 servers, 0 dead, 1.5000 average load
The hostnames of these two servers are sm3-10 and sm3-12, according to http://hbase-master:60010.
But the other servers, such as sm3-8, do not work.
I'd like to know the troubleshooting steps and resolution.
sm3-10: slave, works well
[root@sm3-10 ~]# jps
2581 QuorumPeerMain
2761 SecondaryNameNode
2678 DataNode
19913 Jps
2551 HRegionServer
[root@sm3-10 ~]# lsof -i:54310
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
java 2678 hdfs 52r IPv6 27608 TCP sm3-10:33316->sm3-12:54310 (ESTABLISHED)
[root@sm3-10 ~]# lsof -i:3888
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
java 2581 zookeeper 19u IPv6 7239 TCP *:ciphire-serv (LISTEN)
java 2581 zookeeper 20u IPv6 7242 TCP sm3-10:ciphire-serv->sm3-11:53593 (ESTABLISHED)
java 2581 zookeeper 25u IPv6 27011 TCP sm3-10:ciphire-serv->sm3-12:40352 (ESTABLISHED)
java 2581 zookeeper 29u IPv6 25573 TCP sm3-10:ciphire-serv->sm3-8:44271 (ESTABLISHED)
sm3-8: slave, does not work properly; however, the status looks good
[root@sm3-8 ~]# jps
3489 Jps
2249 HRegionServer
2463 DataNode
2297 QuorumPeerMain
2686 SecondaryNameNode
[root@sm3-8 ~]# lsof -i:54310
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
java 2463 hdfs 51u IPv6 9919 TCP sm3-8.nos-seamicro.local:40776->sm3-12:54310 (ESTABLISHED)
[root@sm3-8 ~]# lsof -i:3888
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
java 2297 zookeeper 18u IPv6 5951 TCP *:ciphire-serv (LISTEN)
java 2297 zookeeper 19u IPv6 9839 TCP sm3-8.nos-seamicro.local:52886->sm3-12:ciphire-serv (ESTABLISHED)
java 2297 zookeeper 20u IPv6 5956 TCP sm3-8.nos-seamicro.local:44271->sm3-10:ciphire-serv (ESTABLISHED)
java 2297 zookeeper 24u IPv6 5959 TCP sm3-8.nos-seamicro.local:47922->sm3-11:ciphire-serv (ESTABLISHED)
Master: sm3-12
[root@sm3-12 ~]# jps
2760 QuorumPeerMain
3035 NameNode
3096 SecondaryNameNode
2612 HRegionServer
4330 Jps
2872 DataNode
3723 HMaster
[root@sm3-12 ~]# lsof -i:54310
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
java 2872 hdfs 51u IPv6 7824 TCP sm3-12:45482->sm3-12:54310 (ESTABLISHED)
java 3035 hdfs 54u IPv6 7783 TCP sm3-12:54310 (LISTEN)
java 3035 hdfs 70u IPv6 7873 TCP sm3-12:54310->sm3-8:40776 (ESTABLISHED)
java 3035 hdfs 71u IPv6 7874 TCP sm3-12:54310->sm3-11:54990 (ESTABLISHED)
java 3035 hdfs 72u IPv6 7875 TCP sm3-12:54310->sm3-10:33316 (ESTABLISHED)
java 3035 hdfs 74u IPv6 7877 TCP sm3-12:54310->sm3-12:45482 (ESTABLISHED)
[root@sm3-12 ~]#
[root@sm3-12 ~]# cat /etc/hbase/conf/hbase-site.xml
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://sm3-12:54310/hbase</value>
  <final>true</final>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>sm3-8,sm3-10,sm3-11,sm3-12,sm3-13</value>
  <final>true</final>
</property>
--- snip ---
[root@sm3-12 ~]# cat /etc/zookeeper/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper
clientPort=2181
server.1=sm3-10:2888:3888
server.2=sm3-11:2888:3888
server.3=sm3-12:2888:3888
server.4=sm3-8:2888:3888
[root@sm3-12 ~]#
Thanks in advance,
Hiromi
Check to make sure your DNS is configured properly on all of the hosts, and that each server can do a reverse lookup.
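For example, illustrative checks on each node (assuming the host utility is installed) might look like:
hostname -f            # the canonical name this host reports for itself
host sm3-8             # forward lookup of a region server name
host <ip-of-sm3-8>     # reverse lookup; substitute the node's real address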
