DataNode can't talk to NameNode in Hadoop 2.2

I'm setting up a Hadoop 2.2 cluster. I have successfully configured a master and a slave. When I run start-dfs.sh and start-yarn.sh on the master, all the daemons start correctly.
To be specific, on the master the following are running:
DataNode
NodeManager
NameNode
ResourceManager
SecondaryNameNode
On the slave, the following are running:
DataNode
NodeManager
When I open http://master-host:50070 I see that there is only 1 "Live Node" and it is referring to the datanode on the master.
The DataNode on the slave starts, but it cannot report to the master that it is up. This is the only error I can find:
From /logs/hadoop-hduser-datanode.log on the slave:
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: ec2-xx-xxx-xx-xx.compute-1.amazonaws.com/xx.xxx.xx.xxx:9001
Things I have checked/verified:
9001 is open
both nodes can ssh into each other
both nodes can ping each other
Any suggestions are greatly appreciated.

My issue was in the hosts file:
The hosts file on the slave and master needed to be (they're identical):
127.0.0.1 localhost
<master internal ip> master
<slave internal ip> slave
For AWS you need to use the internal IP, which looks like xx.xxx.xxx.xxx (not the external IP from the ec2-xx-xx-xxx.aws.com hostname, and not the ip-xx-xx-xxx internal hostname).
Also, core-site.xml should refer to the location of HDFS as hdfs://master:9000 (the hdfs:// scheme, not http://).
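For reference, a minimal core-site.xml sketch of that setting (assuming the Hadoop 2.x property name fs.defaultFS; older releases call it fs.default.name):
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
With the hosts entries above, "master" resolves to the internal IP on both machines, so the slave's DataNode registers against an address it can actually reach.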

Related

How does the master node start all the processes in a Hadoop cluster?

I have set up a Hadoop cluster of 5 virtual machines, using plain vanilla Hadoop. The cluster details are below:
192.168.1.100 - Configured to Run NameNode and SNN daemons
192.168.1.101 - Configured to Run ResourceManager daemon.
192.168.1.102 - Configured to Run DataNode and NodeManager daemons.
192.168.1.103 - Configured to Run DataNode and NodeManager daemons.
192.168.1.104 - Configured to Run DataNode and NodeManager daemons.
I have kept masters and slaves files on each virtual server.
masters:
192.168.1.100
192.168.1.101
slaves file:
192.168.1.102
192.168.1.103
192.168.1.104
Now when I run the start-all.sh command from the NameNode machine, how is it able to start all the daemons? There are no agents installed (or none that I am aware of); there are only plain Hadoop jars present on all the machines, so how is the NameNode machine able to start all the daemons on all the machines (virtual servers)?
Can anyone help me understand this?
The namenode connects to the slaves via SSH and starts the slave services there.
That is why the public SSH key must be in ~/.ssh/authorized_keys on the slaves, with the matching private key held by the user running Hadoop on the namenode.
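Conceptually, the start scripts just loop over the slaves file and launch each daemon over SSH. A simplified sketch of what sbin/slaves.sh does (not the verbatim script; HADOOP_CONF_DIR and HADOOP_HOME assumed to be set):
for host in $(cat "$HADOOP_CONF_DIR/slaves"); do
  ssh "$host" "$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode" &
done
wait
So no agent is needed on the slaves: a password-less SSH login plus the Hadoop jars already present on each machine is enough.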

Hadoop Multinode Cluster, slave permission denied

I'm trying to set up a multinode cluster (actually with 2 nodes: 1 master and 1 slave) on Hadoop, following the instructions in Multinode Cluster for Hadoop 2.x.
When I execute the command:
./sbin/start-all.sh
I got the error message for my slave node:
slave: Permission denied (publickey)
I have already modified the .ssh/authorized_keys files on both master and slave, adding the key from .ssh/id_rsa.pub of each node.
Finally I restarted ssh with sudo service ssh restart on both nodes (master and slave).
When executing ./sbin/start-all.sh I have no problem with the master node, but the slave node gives me back the permission denied error.
Does anybody have any idea why I cannot see the slave node?
Running jps currently gives the following result:
master
18339 Jps
17717 SecondaryNameNode
18022 NodeManager
17370 NameNode
17886 ResourceManager
slave
2317 Jps
I think the master is OK, but I have trouble with the slave.
After running ssh-keygen on the master, copy id_rsa.pub into authorized_keys using cat id_rsa.pub >> authorized_keys on all the slaves. Test the password-less ssh using:
ssh <slave_node_IP>
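For completeness, a minimal sketch of the whole key exchange (the hduser account and the slave hostname are illustrative; run as the user that starts Hadoop):
ssh-keygen -t rsa -P ""       # on the master; accept the default key location
ssh-copy-id hduser@slave      # appends the master's public key to the slave's authorized_keys
ssh hduser@slave              # must now log in without prompting for a password
If ssh still prompts or is refused, check that ~/.ssh is mode 700 and authorized_keys is mode 600 on the slave; sshd rejects keys with looser permissions.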
If you have copied the whole hadoop folder from the master to the slave nodes (for easy replication), make sure that the slave node's hadoop folder is owned by the correct user on the slave system:
chown -R <slave's username> </path/to/hadoop>
I ran this command on my slave system and it solved my problem.

Hadoop: Multinode cluster only recognizes 2 live nodes out of 3 data nodes

I have set up multinode Hadoop with 3 datanodes and 1 namenode using VirtualBox on Ubuntu. My host system serves as the NameNode (also a datanode) and two VMs serve as DataNodes. My systems are:
192.168.1.5: NameNode (also datanode)
192.168.1.10: DataNode2
192.168.1.11: DataNode3
I am able to SSH into all systems from each system. My hadoop/etc/hadoop/slaves file on all systems has the entries:
192.168.1.5
192.168.1.10
192.168.1.11
hadoop/etc/hadoop/master on all systems has the entry: 192.168.1.5
All of core-site.xml, yarn-site.xml, hdfs-site.xml, mapred-site.xml, and hadoop-env.sh are the same on all machines, except that the dfs.namenode.name.dir entry in hdfs-site.xml is absent on both DataNodes.
When I execute start-yarn.sh and start-dfs.sh from the NameNode, all works fine, and through jps I am able to see all the required services on all machines.
Jps on NameNode:
5840 NameNode
5996 DataNode
7065 Jps
6564 NodeManager
6189 SecondaryNameNode
6354 ResourceManager
Jps on DataNodes:
3070 DataNode
3213 NodeManager
3349 Jps
However when I check namenode/dfshealth.html#tab-datanode and namenode:50070/dfshealth.html#tab-overview, both indicate only 2 datanodes.
tab-datanode shows NameNode and DataNode2 as active datanodes. DataNode3 is not displayed at all.
I checked all configuration files (the mentioned xml, sh and slaves/master files) multiple times to make sure nothing is different on the two datanodes.
The /etc/hosts file also contains all the nodes' entries on all systems:
127.0.0.1 localhost
#127.0.1.1 smishra-VM2
192.168.1.11 DataNode3
192.168.1.10 DataNode2
192.168.1.5 NameNode
One thing I'd like to mention is that I configured 1 VM first and then made a clone of it, so both VMs have the same configuration. That makes it even more confusing that one datanode is shown but not the other.
Take a look at http://blog.cloudera.com/blog/2014/01/how-to-create-a-simple-hadoop-cluster-with-virtualbox/
I'll bet that your problems come from the network configuration of your VirtualBox VMs. The post above has a lot of detail on how to ensure that the internal network between the VMs is set up correctly, with forward and reverse name resolution working, no duplicate MAC addresses, etc., which is critical for a Hadoop cluster to work correctly. Since you cloned one VM from the other, also keep in mind that a clone can carry over the DataNode's storage identity; two DataNodes reporting the same ID are treated as a single node by the NameNode.
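Since the VMs were cloned, the duplicate-ID theory is quick to check (the data directory path below is illustrative; use your dfs.datanode.data.dir value):
cat /usr/local/hadoop/hdfs/data/current/VERSION    # run on each DataNode VM
If the storageID/datanodeUuid lines are identical on DataNode2 and DataNode3, stop the DataNode on the clone, delete its data directory, and start it again so it generates a fresh identity.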

Hadoop two node cluster environment, NameNode’s web UI displays the number of live nodes as one and Dead nodes as zero

I have properly configured a two-node cluster environment for Hadoop, and the master is also configured as a datanode.
So currently I have two datanodes, and I am able to start all the services on the master without any issue.
The slave datanode can also be started and stopped from the master node.
But when I check the health using the url http://<IP>:50070/dfshealth.jsp, the Live Nodes count always shows only one, not two.
Master Process:
~/hadoop-1.2.0$ jps
9112 TaskTracker
8805 SecondaryNameNode
9182 Jps
8579 DataNode
8887 JobTracker
8358 NameNode
Slave Process:
~/hadoop-1.2.0$ jps
18130 DataNode
18380 Jps
18319 TaskTracker
Please help me understand what I am doing wrong.
The second DataNode is running but not connecting to the NameNode. Chances are you re-formatted the NameNode and now have a namespaceID mismatch between the NameNode and the DataNode.
A fix is to manually delete the directory where the DataNode keeps its data (dfs.data.dir in Hadoop 1.x) and then reformat the NameNode. A less extreme fix is to manually edit the VERSION file, but for study purposes you can just axe the whole directory.
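A sketch of that heavy-handed fix for a Hadoop 1.x setup like this one (the data path is illustrative; take the real one from dfs.data.dir in hdfs-site.xml, and note this wipes all HDFS data):
stop-all.sh
rm -rf /app/hadoop/tmp/dfs/data    # the DataNode storage directory, on every datanode
hadoop namenode -format
start-all.sh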
Finally I got the solution.
After Charles's input I checked the DataNode logs and found the error below:
org.apache.hadoop.ipc.Client: Retrying connect to server: masternode/192.168.157.132:8020. Already tried 9 time(s);
I was able to ssh, but there was an issue with telnet from the datanode to the master on port 8020:
>telnet 192.168.157.132 8020
Trying 192.168.157.132...
telnet: connect to address 192.168.157.132: No route to host
I just added an iptables rule to allow port 8020 using the command below, restarted the Hadoop services, and everything worked fine.
iptables -I INPUT 5 -p tcp --dport 8020 -j ACCEPT
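Two follow-up commands worth running here (the persistence command assumes an RHEL/CentOS-style init; adjust for your distro):
iptables -L INPUT -n --line-numbers    # confirm the ACCEPT rule sits above any REJECT/DROP rule
service iptables save                  # persist the rule across reboots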
It was a firewall issue only.
Thanks to all for valuable inputs.

TaskTracker on Hadoop slave cluster won't start. Can't connect to master

I set up a 2-node Hadoop cluster on AWS with the namenode and the jobtracker running on the master, and a tasktracker and datanode running on both the master and the slave. When I start the dfs, it tells me that it starts the namenode, the datanode on both nodes, and the secondary namenode. When I start map reduce it also tells me that the jobtracker was started, as well as the tasktracker on both nodes. I started to run an example to make sure it was working, but the namenode web interface said that only one tasktracker was being used. I checked the logs, and both the datanode and tasktracker logs on the slave had something along the lines of:
2013-08-08 21:31:04,196 INFO org.apache.hadoop.ipc.RPC: Server at ip-10-xxx-xxx-xxx/10.xxx.xxx.xxx:9000 not available yet, Zzzzz...
2013-08-08 21:31:06,202 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ip-10-xxx-xxx-xxx/10.xxx.xxx.xxx:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
The namenode is running on port 9000; this was in the datanode log. The tasktracker log had the same thing, except with port 9001, where the jobtracker was running. I was able to find something on the Apache wiki about this error: http://wiki.apache.org/hadoop/ServerNotAvailable
but I couldn't find any of the possible problems they stated. Since I'm running both nodes on AWS, I also made sure that permissions were granted for both ports.
In summary:
The tasktracker and datanode on the slave node won't connect to the master
I know the IP addresses are right; I've checked multiple times
I can ssh without a passphrase from both instances into each other and into themselves
The ports are granted permission on AWS
Based on the logs, both the namenode and the jobtracker are running fine
I put the IPs of the master and slave in the config files rather than hostnames, because when I used hostnames and edited /etc/hosts accordingly, it couldn't resolve them
Does anybody know of any other possible reasons?
Per the original poster:
OK, so apparently, it's because the namenode is listening on 127.0.0.1:9000 rather than ip-10.x.x.IpOfMaster:9000. See Hadoop Datanodes cannot find NameNode.
I just replaced localhost:9000 in the config files with ip-10.x.x.x:9000 and it worked.
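For reference, these are the two Hadoop 1.x properties involved (a sketch; the ip-10.x.x.x placeholder is the poster's, substitute the master's internal hostname or IP):
In core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://ip-10.x.x.x:9000</value>
</property>
In mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>ip-10.x.x.x:9001</value>
</property>
Binding these to a routable address rather than localhost is what lets the slave's datanode and tasktracker reach ports 9000 and 9001.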
