How to run HDFS cluster without DNS - hadoop

I'm building a local HDFS dev environment (actually hadoop + mesos + zk + kafka) to ease development of Spark jobs and facilitate local integration testing.
All other components are working fine but I'm having issues with HDFS. When the Data Node tries to connect to the name node, I get a DisallowedDataNodeException:
org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException: Datanode denied communication with namenode
Most questions related to the same issue boil down to name resolution of the data node at the name node, either statically through the /etc/hosts file or via DNS. Static resolution is not an option with Docker, as I don't know the data nodes when the name node container is created. I would like to avoid creating and maintaining an additional DNS service. Ideally, I would like to wire everything together using Docker's --link feature.
Is there a way to configure HDFS in such a way that it only uses IP addresses to work?
I found this property and set it to false, but it didn't do the trick:
dfs.namenode.datanode.registration.ip-hostname-check (default: true)
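For concreteness, that flag ends up in the name node's hdfs-site.xml roughly like this; the heredoc and the $HADOOP_CONF_DIR path are just illustrative of a Docker-image build, not the exact files involved:

# Minimal illustrative hdfs-site.xml for the namenode image; the property
# below relaxes the reverse-DNS check so datanodes may register by IP.
# $HADOOP_CONF_DIR is assumed to point at the Hadoop config directory.
cat > "$HADOOP_CONF_DIR/hdfs-site.xml" <<'EOF'
<configuration>
  <property>
    <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
    <value>false</value>
  </property>
</configuration>
EOF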
Is there a way to have a multi-node local HDFS cluster working only using IP addresses and without using DNS?

I would look at reconfiguring your Docker image to use a different hosts file [1]. In particular:
In the Dockerfile(s), do the switch-a-roo [1]
Bring up the master node
Bring up the data nodes, linked
Before starting the datanode, copy /etc/hosts over to the new location, /tmp/hosts
Append the master node's name and IP to the new hosts file (see the sketch after these steps)
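A rough datanode entrypoint for the last two steps might look like the following; the MASTER_IP/MASTER_HOSTNAME variables and the start command are assumptions to adapt to your images, not part of the linked workaround:

#!/bin/sh
# Illustrative datanode entrypoint: build a writable hosts file to work
# around the container-managed /etc/hosts, then start the datanode.
# MASTER_IP and MASTER_HOSTNAME are assumed to be injected (e.g. via
# --link environment variables or -e flags).
cp /etc/hosts /tmp/hosts
echo "${MASTER_IP} ${MASTER_HOSTNAME}" >> /tmp/hosts

# Start the datanode as usual (exact command depends on your Hadoop layout).
exec "$HADOOP_HOME/bin/hdfs" datanode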
Hope this works for you!
[1] https://github.com/dotcloud/docker/issues/2267#issuecomment-40364340

Related

Behavior of HDFS High Availability

How does distributed copy (distcp) work between two clusters when the NameNode (NN) fails in a High Availability (HA) configuration?
Will that job fail due to the different IP addresses of the name node and the standby node?
Depending on how your HDFS HA is configured and whether Automatic Failover is implemented, it might work (I personally haven't tested this specific command during a failover).
Another important point is to use names for the services and make sure DNS is properly set up and configured for all involved nodes (you should never use direct IP addresses).
Yashwanth,
In an HA Hadoop cluster, it is not recommended to reference the active name node directly in distcp commands. A simple answer to your question is yes, the job can fail if you hardcode a NameNode IP or DNS name in the distcp command. In an HA Hadoop cluster you need to use the cluster (nameservice) name instead of an IP in the distcp command.
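For illustration, with HA you would address both clusters by their logical nameservice IDs (as defined in dfs.nameservices) rather than by any namenode host; cluster1 and cluster2 below are made-up names:

# Use the nameservice IDs from hdfs-site.xml, not an active namenode's IP,
# so the client resolves whichever namenode is currently active.
hadoop distcp hdfs://cluster1/user/data hdfs://cluster2/user/data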

Where can I find the include file in Hadoop-1.2.0 single node cluster on ubuntu?

I am trying to set up the include file on my Hadoop-1.2.0 single node cluster so that no nodes other than those in the include file can talk to the namenode. I'm not sure how to go about this.
I have set up a single node cluster and am currently new to Hadoop. I have an openssh-server installed and have set up the namenode directory and you can view my hdfs-site.xml here.
If you have done the setup manually, then you also have to set up the SSH connections to the other hosts manually. So, if you have not set up those SSH connections, you are good.
For a UI-based installation, if you have not specified any other hosts to add to the cluster, then there are no other task nodes/data nodes connecting.
If you are trying to restrict users, you have to look at the ACLs and allow connections only from your subnet. For users, a unique user id and password is good enough as a restriction.
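For what it's worth, in Hadoop 1.x the include file itself is just a plain text file that the namenode is pointed at via the dfs.hosts property; the paths below are only an illustration:

# Create an include file listing the only hosts allowed to register as
# datanodes (one hostname or IP per line); the path is illustrative.
echo "localhost" > /home/hadoop/conf/include

# Then add a dfs.hosts property inside the <configuration> element of
# hdfs-site.xml pointing at that file:
#   <property>
#     <name>dfs.hosts</name>
#     <value>/home/hadoop/conf/include</value>
#   </property>
# and tell the namenode to re-read the list:
hadoop dfsadmin -refreshNodes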

Changing FQDN of nodes in hadoop cluster

I would like to change the FQDN of the nodes in my hadoop cluster.
For example, the FQDN of a node in my cluster is hadoop1.dev.com and I would like to change it to hadoop1.abc.xyz.
Could someone suggest the process to change it without affecting my cluster data?
Update your /etc/hosts file as shown below, then restart the network service for the change to take effect:
x.x.x.x hadoop1.dev.com hadoop1.abc.xyz
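Roughly, on each node that would look like the following; replace x.x.x.x with the node's actual IP, and note that the exact hostname and network commands are distro-dependent:

# Illustrative only: keep both the old and the new FQDN mapped to the
# node's IP while you migrate, then apply the new hostname.
echo "x.x.x.x hadoop1.dev.com hadoop1.abc.xyz" >> /etc/hosts
hostname hadoop1.abc.xyz      # persist via /etc/hostname or /etc/sysconfig/network
service network restart       # distro-dependent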

Dynamic IPs for Hadoop cluster

I need to set up a multi-node Hadoop cluster. So far, I have done installations using static IP addresses for each of the cluster nodes. However, in my latest cluster, I need to work with DHCP-assigned nodes. So I am wondering how I should get the cluster working and surviving restarts, etc.
Is it mandatory to have static IP address for the cluster nodes or can we get it working with dynamic IPs as well?
Any expert guidance please.
For standalone and pseudo-distributed modes, you can get going with a dynamic IP address, since everything runs on a single node.
For fully distributed mode, the nodes are identified with the masters and slaves files located in 'HADOOP_HOME/conf'. These names are hosts which have to be described in '/etc/hosts'. So, when the IP of any node changes, Hadoop cannot identify the machines or nodes or hosts (or, even if something is identified, that node has no Hadoop configured). Thus, you cannot achieve a fully distributed Hadoop setup.
Get your DHCP configured on a router if you can. Else install DHCP on all of the nodes. And get going!!
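As a sketch, the idea is to reference hosts only by name in the Hadoop files and keep the name-to-IP mapping in one place, so that a DHCP-assigned address change means updating only /etc/hosts (or a DHCP reservation); all names and addresses below are made up:

# HADOOP_HOME/conf/masters and slaves list hostnames, never IPs.
echo "master" > "$HADOOP_HOME/conf/masters"
printf "slave1\nslave2\n" > "$HADOOP_HOME/conf/slaves"

# /etc/hosts (kept identical on every node) is the only place IPs appear,
# so a lease change means editing one line per node here:
#   192.168.1.10 master
#   192.168.1.11 slave1
#   192.168.1.12 slave2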

How can I force Hadoop to use my hostnames instead of IP-XX-XX-XX-XX

So, I'm configuring a 10-node cluster with Hadoop 2.5.2, and so far it's working, but the only issue I have is that when trying to communicate with the nodes, Hadoop guesses their hostnames based on their IPs instead of using the ones I've configured.
Let me be more specific: this happens when starting a job, but when I start up YARN (for instance), the slave node names are used correctly. The scheme Hadoop uses to auto-generate the node names is IP-XX-XX-XX-XX, so a node with IP 179.30.0.1 would be named IP-179-30-0-1.
This forces me to edit the /etc/hosts file on every node so that its 127.0.0.1 entry is named the way Hadoop expects.
Can I make Hadoop use the names I gave those hosts? Or am I forced to do this extra configuration step?
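For context, the manual workaround described above amounts to something like this on every node; the generated name is invented for the example:

# Give the loopback entry the auto-generated name Hadoop expects
# (the IP-XX-XX-XX-XX form); address and name here are illustrative.
echo "127.0.0.1 IP-179-30-0-1" >> /etc/hosts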
