I am trying to set up a 3-worker, 1-master Hadoop cluster using 2.7.1. When I start the cluster, the master has the following daemons running:
2792 NameNode
3611 NodeManager
4362 Jps
3346 ResourceManager
2962 DataNode
3169 SecondaryNameNode
And on the three worker nodes:
2163 NodeManager
2030 DataNode
2303 Jps
The problem is that when I look at the web UI, the cluster does not recognize the 3 workers. It says 1 live data node, and that is the master itself. Please have a look here:
http://master:50070/dfshealth.html#tab-overview
The question is: what daemon processes are supposed to be running on the worker nodes? I looked at the log files but didn't find anything helpful; they only contain entries related to the running daemons, with no errors or fatal errors.
I thought the secondary namenode should be running on the workers and that blocked ports were preventing it from communicating, so I tried to open ports 9000 and 9001 on the master with
sudo iptables -I INPUT -p tcp --dport 9000 -j ACCEPT
sudo iptables -I INPUT -p tcp --dport 9001 -j ACCEPT
iptables-save
but this didn't help much; I'm still facing the same problem. The log files on the workers are not helpful either.
I'd appreciate your help in fixing this.
Edit 1:
The following is my core-site.xml configuration:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9001</value> <!-- on slave1, 2 & 3 the slave's hostname is in place of master -->
  </property>
</configuration>
This is my /etc/hosts file:
127.0.0.1 localhost math2
127.0.1.1 math2
192.168.1.2 master
192.168.1.3 worker1
192.168.1.7 worker5
192.168.1.8 worker6
This is my configuration in /etc/network/interfaces:
# interfaces(5) file used by ifup(8) and ifdown(8)
auto lo
iface lo inet loopback
address 192.168.1.2 (3,5,6 instead of 2 for slaves)
netmask 255.255.255.0
gateway 192.168.1.1
broadcast 192.168.1.255
Here is the log output for one of the datanodes:
2016-02-05 17:54:12,655 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in secureMain
java.net.BindException: Problem binding to [0.0.0.0:50010] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
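I take the "Address already in use" to mean something on that worker is already bound to 50010; I haven't tracked down what yet, but a check along these lines should show the process holding the port:
sudo netstat -plnt | grep 50010    # or: sudo lsof -i :50010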
Did you put the IP addresses of all the nodes in /etc/hosts? The /etc/hosts file on all nodes (master and slaves) should contain the IP addresses of every node in your cluster.
For example, if we have three datanodes and a master node, the /etc/hosts file should be:
192.168.0.1 master
192.168.0.2 datanode1
192.168.0.3 datanode2
192.168.0.4 datanode3
On a datanode you should see the processes below when you run the jps command:
19728 DataNode
19819 Jps
and when you run ps -aef | grep -i datanode on a datanode, it should show two processes, one owned by the root user and another by the HDFS user.
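If it helps, you can also confirm from the master which datanodes have actually registered with the NameNode; assuming the hadoop binaries are on your PATH, something like this lists them:
hdfs dfsadmin -report | grep -E 'Live datanodes|Name:'
Each worker's hostname/IP should show up under the live datanodes.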
Related
I installed Hadoop (HDP 2.5.3) on 4 VMs with Ambari (1 Ambari Server and 3 Ambari Clients; with the DNS entries server, node0, node1, node2) with HDFS, YARN, MapReduce and Zookeeper.
However, YARN doesn't want to start. When starting the Resource Manager on node1 I get the following error:
resource_management.core.exceptions.ExecutionFailed: Execution of 'curl -sS -L -w '%{http_code}' -X GET 'http://node0:50070/webhdfs/v1/ats/done/?op=GETFILESTATUS&user.name=hdfs' 1>/tmp/tmpgsiRLj 2>/tmp/tmpMENUFa' returned 7. curl: (7) Failed to connect to node0 port 50070: connection refused 000
App Timeline Server and History Server on node1 don't want to start either. Zookeeper, NameNode, DataNode and NodeManager on node0 are up. The nodes can reach each other (tried with ping), so that shouldn't be the problem.
Hopefully someone can help me. I'm really new to this topic and not really familiar with the system.
You should check the hosts file (/etc/hosts): verify the hostname and FQDN, and check whether there are any duplicate names or IP addresses.
Could you also confirm the firewall status with:
sudo ufw status
Also check the ports in iptables (or allow the ports through the firewall: UDP, TCP).
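For example, if ufw turns out to be active on node0, something along these lines would open the port from the curl error plus the NameNode RPC port (8020 is just the usual default; adjust it to whatever fs.defaultFS uses):
sudo ufw allow 50070/tcp   # NameNode web UI / WebHDFS, the port curl could not reach
sudo ufw allow 8020/tcp    # NameNode RPC (default port; adjust if yours differs)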
I installed Spark on a cluster of machines without public DNS (just machines created on a cloud).
Hadoop seems to be installed and working correctly, but Spark listens on ports 7077 and 6066 on 127.0.0.1 instead of the public IP, so the worker nodes can't connect to it.
What is wrong?
My /etc/hosts on the master node looks like:
127.0.1.1 namenode namenode
127.0.0.1 localhost
XX.XX.XX.XX namenode-public
YY.YY.YY.YY hadoop-2
ZZ.ZZ.ZZ.ZZ hadoop-1
My $SPARK_HOME/conf/spark-env.sh looks like:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export SPARK_PUBLIC_DNS=namenode-public
export SPARK_WORKER_CORES=6
export SPARK_LOCAL_IP=XX.XX.XX.XX
sudo netstat -pan|grep 7077
tcp 0 0 127.0.1.1:7077 0.0.0.0:* LISTEN 6670/java
You should specify SPARK_MASTER_HOST in spark-env.sh (it must be the address of your machine that is visible to the slave nodes). Moreover, you may need to add rules for ports 7077 and 6066 in iptables.
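A minimal sketch of what that could look like, reusing the XX.XX.XX.XX placeholder from your hosts file for the master's address:
# in $SPARK_HOME/conf/spark-env.sh
export SPARK_MASTER_HOST=XX.XX.XX.XX
# open the master and REST ports if iptables is filtering them
sudo iptables -I INPUT -p tcp --dport 7077 -j ACCEPT
sudo iptables -I INPUT -p tcp --dport 6066 -j ACCEPT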
I have set up multinode Hadoop with 3 datanodes and 1 namenode using VirtualBox on Ubuntu. My host system serves as the NameNode (also a datanode) and two VMs serve as DataNodes. My systems are:
192.168.1.5: NameNode (also datanode)
192.168.1.10: DataNode2
192.168.1.11: DataNode3
I am able to SSH to all systems from each system. My hadoop/etc/hadoop/slaves file on all systems has the entries:
192.168.1.5
192.168.1.10
192.168.1.11
hadoop/etc/hadoop/master on all systems has the single entry: 192.168.1.5
All of core-site.xml, yarn-site.xml, hdfs-site.xml, mapred-site.xml and hadoop-env.sh are the same on all machines, except that the dfs.namenode.name.dir entry in hdfs-site.xml is missing on both DataNodes.
When I execute start-yarn.sh and start-dfs.sh from NameNode, all work fine and through JPS I am able to see all required services on all machines.
Jps on NameNode:
5840 NameNode
5996 DataNode
7065 Jps
6564 NodeManager
6189 SecondaryNameNode
6354 ResourceManager
Jps on DataNodes:
3070 DataNode
3213 NodeManager
3349 Jps
However, when I check namenode/dfshealth.html#tab-datanode and namenode:50070/dfshealth.html#tab-overview, both indicate only 2 datanodes.
tab-datanode shows the NameNode and DataNode2 as active datanodes; DataNode3 is not displayed at all.
I checked all the configuration files (the xml, sh, and slaves/master files mentioned above) multiple times to make sure nothing is different on the two datanodes.
The /etc/hosts file on all systems also contains entries for every node:
127.0.0.1 localhost
#127.0.1.1 smishra-VM2
192.168.1.11 DataNode3
192.168.1.10 DataNode2
192.168.1.5 NameNode
One thing I'd like to mention is that I configured one VM first and then cloned it, so both VMs have the same configuration, which makes it even more confusing that one datanode is shown but not the other.
Take a look at http://blog.cloudera.com/blog/2014/01/how-to-create-a-simple-hadoop-cluster-with-virtualbox/
I'll bet that your problems come from the network configuration of your VirtualBox VMs. The post above has a lot of detail on how to ensure that the internal network between the VMs is set up correctly, with forward and reverse name resolution working, no duplicate MAC addresses, etc., which is critical for a Hadoop cluster to work correctly.
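As a quick sanity check on each node (hostnames taken from the /etc/hosts shown above), forward and reverse resolution should come back consistent:
hostname -f                  # should print the node's own name, not localhost
getent hosts DataNode3       # forward lookup of the missing datanode
getent hosts 192.168.1.11    # should map back to DataNode3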
I have properly configured a two-node cluster environment for Hadoop, and the master is also configured as a datanode.
So currently I have two datanodes, and I am able to start all the services on the master without any issue.
The slave datanode can also be started and stopped from the master node.
But when I check the health using the URL http://<IP>:50070/dfshealth.jsp, the live node count always shows only one, not two.
Master Process:
~/hadoop-1.2.0$ jps
9112 TaskTracker
8805 SecondaryNameNode
9182 Jps
8579 DataNode
8887 JobTracker
8358 NameNode
Slave Process:
~/hadoop-1.2.0$ jps
18130 DataNode
18380 Jps
18319 TaskTracker
Please help me figure out what I am doing wrong.
The second DataNode is running but not connecting to the NameNode. Chances are you re-formatted the NameNode and now have different version numbers in the NameNode and DataNode.
A fix is to manually delete the directory where the DataNode keeps its data (dfs.datanode.data.dir) and then reformat the NameNode. A less extreme fix is to manually edit the VERSION file so the IDs match again, but for study purposes you can just axe the whole directory.
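Roughly, on Hadoop 1.2 that would be something like the following; /path/to/dfs/data is only a placeholder for whatever your dfs.data.dir (dfs.datanode.data.dir in newer releases) points to, and the format step wipes HDFS, so only do this on a throwaway cluster:
stop-all.sh
rm -rf /path/to/dfs/data      # placeholder: the DataNode's data directory on the slave
hadoop namenode -format       # re-formats HDFS; all existing data is lost
start-all.sh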
Finally I got the solution.
After Charles' input I checked the datanode logs and got the error below:
org.apache.hadoop.ipc.Client: Retrying connect to server: masternode/192.168.157.132:8020. Already tried 9 time(s);
I was able to SSH, but there was an issue with telnet from the datanode to the master on port 8020:
>telnet 192.168.157.132 8020
Trying 192.168.157.132...
telnet: connect to address 192.168.157.132: No route to host
I just added an iptables rule to allow port 8020 using the command below, restarted the Hadoop services, and everything worked fine.
iptables -I INPUT 5 -p tcp --dport 8020 -j ACCEPT
It was just a firewall issue.
Thanks to all for the valuable input.
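One follow-up note: depending on the distro, a rule inserted with iptables -I is lost on reboot; on RHEL/CentOS it can usually be persisted with something like:
sudo service iptables save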
I'm setting up a Hadoop 2.2 cluster. I have successfully configured a master and a slave. When I enter start-dfs.sh and start-yarn.sh on the master, all the daemons start correctly.
To be specific, on the master the following are running:
DataNode
NodeManager
NameNode
ResourceManager
SecondaryNameNode
On the slave, the following are running:
DataNode
NodeManager
When I open http://master-host:50070 I see that there is only 1 "Live Node" and it is referring to the datanode on the master.
The datanode on the slave is started, but it is not able to tell the master that it has started. This is the only error I can find:
From /logs/hadoop-hduser-datanode.log on the slave:
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: ec2-xx-xxx-xx-xx.compute-1.amazonaws.com/xx.xxx.xx.xxx:9001
Things I have checked/verified:
9001 is open (see the connectivity check sketched right after this list)
both nodes can ssh into each other
both nodes can ping each other
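A simple way to double-check that port from the slave, using the same hostname that appears in the datanode log, would be:
telnet ec2-xx-xxx-xx-xx.compute-1.amazonaws.com 9001
If that is refused or times out, it points at a firewall/security-group or bind-address problem rather than SSH.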
Any suggestions are greatly appreciated.
My issue was in the hosts file:
The hosts file on the slave and master needed to be (they're identical):
127.0.0.1 localhost
<master internal ip> master
<slave internal ip> slave
For AWS you need to use the internal IP, which is something like xx.xxx.xxx.xxx (not the external IP in the ec2-xx-xx-xxx.aws.com hostname and not the ip-xx-xx-xxx one).
Also, core-site.xml should refer to the location of HDFS as hdfs://master:9000.
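In other words, the fs.defaultFS property in core-site.xml would look something like this (assuming port 9000 is kept):
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:9000</value>
</property>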