Why suddenly the slave node lost connection to the master node in hadoop? - hadoop

I have set up hadoop 2.7.2 upon a cluster with a master(ubuntu 15.10) and two slave(slave2,3) hosted in the master by virtualbox.
I have run several examples like wordcount,it all works ok. But when i try to run my own job,say Myjob, it runs well at first, but after a while, it will definitely be interrupted by this error:
INFO ipc.Client: Retrying connect to server: slave3/xxx.216.227.176(the ip of slave):38046. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
Sometimes it would be slave2 ,sometimes would be slave3. And my connection to that slave by ssh shows that the connection is closed by remote.
But the virtualbox shows that slave runs well, and i can ressh to that slave, However all the hadoop process are been killed.
Need to mention that my own job runs longer than the example job.
At first, i think it maybe some error caused by my config file, so, i reinstall hadoop at master and slaves. But the error stills.
So, i think that maybe it is caused by my network config in the slave nodes. So, i changed the last filed of the slave's ip like xxx.xxx.xxx.183 to xxx.xxx.xxx.176 and reinstall hadoop.
And I rerun the job, at this time the job runs longer than usual. But, at the last, when the map stage mostly finished (map 86% reduce 28%), it failed due to the same error!
INFO ipc.Client: Retrying connect to server: slave3/125.xxx.227.xxx(the ip of slave):38046. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
Also there is something log in yarn-user-resourcemanager-Master.log :
java.net.ConnectException: Call From Master/xxx.216.227.186 to slave2:44592 failed on connection exception: java.net.ConnectException: refuse to connect; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
It seems like that the longer time the app runs , the bigger chance it will fail.
Here is my hosts file:
127.0.0.1 localhost
#127.0.1.1 Master
xxx.216.227.186 Master
xxx.216.227.185 slave1# the slave1 has some problem thus do not connect to the cluster
xxx.216.227.176 slave2
xxx.216.227.166 slave3
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
Why? How to fix it? Thanks!

Related

Hadoop Multi-Node Cluster: exception: java.net.ConnectException: Connection refused

I have set up 4 nodes hadoop cluster using http://pingax.com/install-apache-hadoop-ubuntu-cluster-setup/:
Namenode: node04
Datanode: node01
Datanode: node02
Datanode: node03
I can see only two nodes(node01,node03) running in my cluster. Node02 has an log with error message as:
2015-12-11 10:15:18,698 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node04/127.17.0.224:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-11 10:15:19,699 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node04/127.17.0.224:9000. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-12-11 10:15:20,699 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node04/127.17.0.224:9000. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
Every nodes /etc/hosts contains following:
127.0.0.1 localhost
127.17.0.221 node01
127.17.0.222 node02
127.17.0.223 node03
127.17.0.224 node04
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
And /etc/hadoop/masters contains node04, /etc/hadoop/slaves contains node01 node02 and node03
Would you please help me understand how to get to it?
Thanks!
Perform these actions:
Go to node02 and run telnet node04 9000 and ping node04 commands
to confirm there is connectivity between node02 and node04
On all nodes check whether core-site.xml and hdfs-site.xml have the same contents
Check ssh and sshd
ssh connection between the nodes
Check port binding details (Hadoop Datanodes cannot find NameNode)
Also refer https://wiki.apache.org/hadoop/ServerNotAvailable

Error on starting HDFS daemons on hadoop Multinode cluster.Datanode not starting

I am trying to setup hadoop cluster and getting following error while connecting datanode.Namenode is up and running fine,however datanode is creating problem.
/etc/hosts file is available on both the nodes.
IP tables stopped(f/w).
ssh happening.
2015-05-20 20:54:05,008 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: nn1.cluster1.com/192.168.1.11:9000. Already tried 9
time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1
SECONDS) 2015-05-20 20:54:05,017 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException:
Call to nn1.cluster1.com/192.168.1.11:9000 failed on local exception:
java.net.NoRouteToHostException: No route to host
java.io.IOException: Call to nn1.cluster1.com/192.168.1.11:9000 failed on local exception: java.net.NoRouteToHostException: No route to host
This error occurs if you have firewall on namenode system. To disable firewall, type these commands in terminal.
service iptables save
service iptables stop
chkconfig iptables off
Now, stop and start the hadoop processess.

Pig keeps trying to connect to job history server (and fails)

I'm running a Pig job that fails to connect to the Hadoop job history server.
The task (usually any task with GROUP BY) runs for a while and then it starts with a message like:
2015-04-21 19:05:22,825 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-04-21 19:05:26,721 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-04-21 19:05:29,721 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
It then continues for a while retrying the connection. Sometimes it precedes further with the job. Othertimes it throws this exception:
2015-04-21 19:05:55,822 [main] WARN org.apache.pig.tools.pigstats.mapreduce.MRJobStats - Unable to get job counters
java.io.IOException: java.io.IOException: java.net.NoRouteToHostException: No Route to Host from cluster-01/10.10.10.11 to 0.0.0.0:10020 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost
at org.apache.pig.backend.hadoop.executionengine.shims.HadoopShims.getCounters(HadoopShims.java:132)
at org.apache.pig.tools.pigstats.mapreduce.MRJobStats.addCounters(MRJobStats.java:284)
at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.addSuccessJobStats(MRPigStatsUtil.java:235)
at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.accumulateStats(MRPigStatsUtil.java:165)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:360)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:280)
I found this question here but in my case the job history server is started. If I run netstat, I find:
tcp 0 0 0.0.0.0:10020 0.0.0.0:* LISTEN 12073/java off (0.00/0/0)
Where 12073 is ...
12073 pts/4 Sl 0:07 /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Dproc_historyserver -Xmx1000m -Djava.library.path=/data/hadoop/hadoop/lib -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/data/hadoop/hadoop-2.3.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/data/hadoop/hadoop-2.3.0 -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,console -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/data/hadoop/hadoop/logs -Dhadoop.log.file=mapred-hadoop-historyserver-cluster-01.log -Dhadoop.root.logger=INFO,RFA -Dmapred.jobsummary.logger=INFO,JSA -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer
I tried opening the port 10200 in case it was a firewall issue:
ACCEPT tcp -- anywhere anywhere tcp dpt:10020
... but no luck.
After a few minutes, some of the tasks just arbitrarily continue to the next part.
I'm using Hadoop 2.3 and Pig 0.14.
My question is:
1) What are the possible reasons why Pig cannot connect to the job history server (JHS) given that the JHS is running on the same port that Pig looks for it?
... or failing that ...
2) Is there any way to just tell Pig to stop trying to connect to the JHS and continue with the task?
It seems that most Hadoop installation/configuration guides neglect to mention configuring the Job History Server. It seems that Pig, in particular, relies on this server. It also seems like the default (local) settings for the JHS won't work in a multi-node cluster.
The solution was to add the hostname of the server into the configuration in mapred-site.xml to make sure it could be accesses from the other machines. (In my version of the file, the lines had to be added as "new" ... there were no previous settings.)
<property>
<name>mapreduce.jobhistory.address</name>
<value>cm:10020</value>
<description>Host and port for Job History Server (default 0.0.0.0:10020)</description>
</property>
Then restart the job history server:
mr-jobhistory-daemon.sh stop historyserver
mr-jobhistory-daemon.sh start historyserver
If you get a bind exception (port in use), it means the stop didn't work. Either
Use ps ax | grep -e JobHistory to get the process and kill it manually with kill -9 [pid]. Then call the start command above again. Or
Use a different port in the configuration
Pig should pick up the new settings automatically. Run a Pig script and hope for the best.
start history server in hadoop bin using the below command
bin$ ./mr-jobhistory-daemon.sh start historyserver
run pig using the below command
$pig
Config mapreduce.jobhistory.address in hadoop/etc/hadoop/mapred-site.xml,
then:
mapred --daemon start
The solution was the History server was not running:
[user#vm9 sbin]$ ./mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /home/user/hadoop-2.7.7/logs/mapred-user-historyserver-vm9.out
[user#vm9 sbin]$ jps
5683 NameNode
6309 NodeManager
5974 SecondaryNameNode
8075 RunJar
6204 ResourceManager
8509 JobHistoryServer
5821 DataNode
8542 Jps
[user#vm9 sbin]$
Now pig can run properly and it will connect to the job history server and the dump command is working fine.

Hadoop slave cannot connect to master, even when service is running and ports are open

I'm running hadoop 2.5.1 and I'm having a problem when slaves are connecting to master. My goal is to set-up a hadoop cluster. I hope someone can help, I'm been poundering with this too long already! :)
This is what comes up to the log file of slave:
2014-10-18 22:14:07,368 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: master/192.168.0.104:8020
This is my core-site.xml -file (same on master and slave):
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master/</value>
</property>
</configuration>
This is my hosts -file ((almost)same on master and slave).. I have hard coded addresses to there without any success:
127.0.0.1 localhost
192.168.0.104 xubuntu: xubuntu
192.168.0.104 master
192.168.0.194 slave
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
Netstats from master:
xubuntu#xubuntu:/usr/local/hadoop/logs$ netstat -atnp | grep 8020
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 192.168.0.104:8020 0.0.0.0:* LISTEN 26917/java
tcp 0 0 192.168.0.104:52114 192.168.0.104:8020 ESTABLISHED 27046/java
tcp 0 0 192.168.0.104:8020 192.168.0.104:52114 ESTABLISHED 26917/java
Nmap from master to master:
Starting Nmap 6.40 ( http://nmap.org ) at 2014-10-18 22:36 EEST
Nmap scan report for master (192.168.0.104)
Host is up (0.000072s latency).
rDNS record for 192.168.0.104: xubuntu:
PORT STATE SERVICE
8020/tcp open unknown
..and nmap from slave to master (even when the port is open, the slave doesn't connect to it..):
ubuntu#ubuntu:/usr/local/hadoop/logs$ nmap master -p 8020
Starting Nmap 6.40 ( http://nmap.org ) at 2014-10-18 22:35 EEST
Nmap scan report for master (192.168.0.104)
Host is up (0.14s latency).
PORT STATE SERVICE
8020/tcp open unknown
What is this all about? The problem is not about firewall.. I have also read every thread there is to to this without any success. I'm frustrated to this.. :(
At least one of your problems is that you are using old configuration name for the HDFS. For version 2.5.1 the configuration name should be fs.defaultFS instead of fs.default.name. I also suggest defining the port in the value, so the value would be hdfs://master:8020.
Sorry, I'm not linux guru, so I don't know about nmap, but does telnet'ing work from slave to master to the port?

Hbase Regions server is not able to communicate with HMaster

I am not able to setup the hbase in distributed mode. It works fine when i setup it on one machine(standalone mode). My Zookeeper, hmaster and region server starts properly.
But when i go to hbase shell and look for the status. It shows me 0 region server. I am attaching my logs of regions server. Plus the host files of my master(namenode) and slave(datanode). I have tried every P&C which are given on stackoverflow for changing the host file, but didn't work for me.
2013-06-24 15:03:45,844 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server naresh-pc/192.168.0.108:2181. Will not attempt to authenticate using SASL (unknown error)
2013-06-24 15:03:45,845 WARN org.apache.zookeeper.ClientCnxn: Session 0x13f75807d960001 for server null, unexpected error, closing socket connection and attempting to reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:692)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
Slave /etc/hosts :
127.0.0.1 localhost
127.0.1.1 ubuntu-pc
#ip for hadoop
192.168.0.108 master
192.168.0.126 slave
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
Master /etc/hosts :
127.0.0.1 localhost
127.0.1.1 naresh-pc
#ip for hadoop
192.168.0.108 master
192.168.0.126 slave
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
hbase-site.xml :
<configuration>
<property>
<name>hbase.master</name>
<value>master:60000</value>
<description>The host and port that the HBase master runs at.
A value of 'local' runs the master and a regionserver
in a single process.
</description>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://master:54310/hbase</value>
<description>The directory shared by region servers.</description>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
<description>The mode the cluster will be in. Possible values are
false: standalone and pseudo-distributed setups with managed
Zookeeper true: fully-distributed with unmanaged Zookeeper
Quorum (see hbase-env.sh)
</description>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
<description>Property from ZooKeeper's config zoo.cfg.
The port at which the clients will connect.
</description>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>master</value>
<description>Comma separated list of servers in the ZooKeeper Quorum.
For example,
"host1.mydomain.com,host2.mydomain.com".
By default this is set to localhost for local and
pseudo-distributed modes of operation. For a
fully-distributed setup, this should be set to a full
list of ZooKeeper quorum servers. If
HBASE_MANAGES_ZK is set in hbase-env.sh
this is the list of servers which we will start/stop
ZooKeeper on.
</description>
</property>
</configuration>
Zookeeper log:
2013-06-28 18:22:26,781 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x13f8ac0b91b0002, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:722)
2013-06-28 18:22:26,858 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /192.168.0.108:57447 which had sessionid 0x13f8ac0b91b0002
2013-06-28 18:25:21,001 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x13f8ac0b91b0002, timeout of 180000ms exceeded
2013-06-28 18:25:21,002 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x13f8ac0b91b0002
Master Log:
2013-06-28 18:22:41,932 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 1502022 ms
2013-06-28 18:22:43,457 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 1503547 ms
Remove 127.0.1.1 from hosts file and turn of IPV6. That should fix the problem.
Your Regionserver is looking for HMaster at naresh-pc but you do not have any such entry in your /etc/hosts file. Please make sure your configuration is proper.
Can you try all of this:
Make sure your /conf/regionservers file has just one entry: slave
Not sure what HBase version you are using, but instead of using port 54310 for hbase.rootdir property in your hbase-site.xml use port 9000
Your /etc/hosts file, on BOTH master and slave should should only have these custom entries:
127.0.0.1 localhost
192.168.0.108 master
192.168.0.126 slave
I am concerned that your logs state Opening socket connection to server naresh-pc/192.168.0.108:2181
Clearly the system thinks that zookeeper is on host naresh-pc, but in your config you are setting zookeeper quorum at host master, which HBase will bind to. That's a problem right there. In my experience, HBase is EXTREMELY fussy about host names, so make sure they are all in synch in all your configs and in your /etc/hosts file.
Also, this may be a minor issue, but wouldn't hurt to specify the zookeper data directory in your .xml file to have a minimum set of settings that should make the cluster work: hbase.zookeeper.property.dataDir

Resources