I need to set up a multi-node Hadoop cluster. So far, I have done installations using static IP addresses for each of the cluster nodes. However, in my latest cluster, I need to work with DHCP-assigned nodes. So I am wondering: how can I get the cluster working so that it survives restarts and the like?
Is it mandatory to have static IP addresses for the cluster nodes, or can we get it working with dynamic IPs as well?
Any expert guidance please.
For standalone and pseudo-distributed modes, you can get going with a dynamic IP address, since everything runs on a single node.
For fully distributed mode, the nodes are identified by the masters and slaves files located in 'HADOOP_HOME/conf'. The entries in those files are hostnames, which are resolved via '/etc/hosts'. So, when the IP of any node changes, Hadoop can no longer find the machines it expects at those addresses (and if a stale address now belongs to some other machine, that machine has no Hadoop configured on it). Thus, you cannot achieve a fully distributed Hadoop setup that way.
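As a rough illustration (the hostnames and addresses below are made up), the wiring looks like this; if slave1's DHCP lease later hands it a new address, the mapping goes stale:

    # /etc/hosts -- identical on every node
    192.168.1.10   master
    192.168.1.11   slave1
    192.168.1.12   slave2

    # HADOOP_HOME/conf/masters (on the master)
    master

    # HADOOP_HOME/conf/slaves (on the master)
    slave1
    slave2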
Get DHCP configured on a router if you can; otherwise run your own DHCP server for the nodes. Either way, reserve a fixed address for each node so that its IP survives restarts. And get going!!
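If you run your own DHCP server, one way to keep addresses stable is per-host reservations. For example, with the ISC DHCP server (the MAC addresses and IPs here are made up):

    # /etc/dhcp/dhcpd.conf -- pin each Hadoop node to a fixed address
    host master {
      hardware ethernet 08:00:27:aa:bb:01;
      fixed-address 192.168.1.10;
    }
    host slave1 {
      hardware ethernet 08:00:27:aa:bb:02;
      fixed-address 192.168.1.11;
    }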
Related
How does distributed copy (distcp) work between two clusters when the NameNode (NN) fails in a High Availability (HA) configuration?
Will that job fail because the active NameNode and the standby node have different IP addresses?
Depending on how your HDFS HA is configured and whether automatic failover is implemented, it might work (I personally haven't tested this specific command during a failover).
Another important part is that you use names for the services and that DNS is properly set up and configured for all the involved nodes (you should never use direct IP addresses).
Yashwanth,
In an HA Hadoop cluster, it is not recommended to point distcp commands at the active NameNode directly. The simple answer to your question is yes, the job can fail if you hardcode a NameNode IP or DNS name in the distcp command. In an HA Hadoop cluster you should use the cluster name (the HA nameservice) instead of an IP in the distcp command.
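For example, assuming both clusters' HA nameservices (here called clusterA and clusterB, which are made-up names) are defined in the client's hdfs-site.xml, the copy is addressed by nameservice rather than by any particular NameNode host:

    # distcp between two HA clusters using nameservice IDs, not NameNode IPs
    hadoop distcp hdfs://clusterA/data/input hdfs://clusterB/data/input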
I am experienced in Java and wanted to get my hands dirty with Hadoop. I have gone through the basics and am now preparing for the practical things.
I have started with the tutorials at https://developer.yahoo.com/hadoop/tutorial/ to set up and run Hadoop on a virtual machine.
So, to create a cluster I need multiple virtual machines running in parallel, right? And I need to add the IP addresses of all of them in hadoop-site.xml. Or can I do it with a single virtual machine?
No, you cannot create a cluster with a single VM. A cluster, by definition, is a group of machines.
If your host machine has a good configuration, you can run any number of guest OSes on top of it. That is how you can create a Hadoop cluster (e.g. 1 NN, 1 SNN, 1 DN).
If you want, you can install Hadoop in pseudo-distributed mode (all services run on one machine), which works well as a testing setup.
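As a rough sketch (for Hadoop 2.x; the Yahoo tutorial targets an older release that uses hadoop-site.xml with fs.default.name instead), a pseudo-distributed configuration boils down to something like:

    <!-- core-site.xml: point clients at a NameNode on the local machine -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- hdfs-site.xml: only one DataNode, so keep a single replica -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>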
You can set up a multi-node cluster using any virtualization tool such as Oracle VirtualBox. Create 5 nodes (1 NN, 1 SNN, 3 DNs). Assign each node its IP address and set up all the configuration files on all the nodes. There are two files, masters and slaves: on the NN node, put the address of the SNN in the masters file and the addresses of all 3 DNs in the slaves file. Also set up SSH connectivity between all the nodes using public keys.
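A rough sketch of those two files and the SSH key setup, with made-up hostnames and user name:

    # On the NN: HADOOP_HOME/conf/masters (where the SNN runs)
    snn-host

    # On the NN: HADOOP_HOME/conf/slaves (the DataNodes)
    dn1-host
    dn2-host
    dn3-host

    # Passwordless SSH from the NN to every node
    ssh-keygen -t rsa
    ssh-copy-id hadoop@snn-host
    ssh-copy-id hadoop@dn1-host
    ssh-copy-id hadoop@dn2-host
    ssh-copy-id hadoop@dn3-host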
So, I'm configuring a 10-node cluster with Hadoop 2.5.2, and so far it's working, but the one issue I have is that when trying to communicate with the nodes, Hadoop guesses their hostnames from their IPs instead of using the ones I've configured.
Let me be more specific: this happens when starting a job, but when I start up YARN (for instance), the slave nodes' names are used correctly. The scheme Hadoop uses to auto-generate the node names is IP-XX-XX-XX-XX, so a node with IP 179.30.0.1 would be named IP-179-30-0-1.
This is forcing me to edit the /etc/hosts file on every node so that its 127.0.0.1 entry carries the name Hadoop expects.
Can I make Hadoop use the names I have given those hosts? Or am I normally forced to do this extra configuration step?
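To make the workaround above concrete, the edit being described looks roughly like this on each node (the generated name is just the example from the question):

    # /etc/hosts on the node whose IP is 179.30.0.1
    127.0.0.1   IP-179-30-0-1   localhost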
I'm building a local HDFS dev environment (actually hadoop + mesos + zk + kafka) to ease development of Spark jobs and facilitate local integrated testing.
All the other components are working fine, but I'm having issues with HDFS. When the DataNode tries to connect to the NameNode, I get a DisallowedDatanodeException:
org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException: Datanode denied communication with namenode
Most questions related to this issue boil down to name resolution of the DataNode at the NameNode, either statically through the /etc/hosts file or using DNS. Static resolution is not an option with Docker, as I don't know the DataNodes when the NameNode container is created. I would like to avoid creating and maintaining an additional DNS service. Ideally, I would like to wire everything together using Docker's --link feature.
Is there a way to configure HDFS in such a way that it only uses IP addresses to work?
I found this property and set it to false, but it didn't do the trick:
dfs.namenode.datanode.registration.ip-hostname-check (default: true)
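For reference, that setting lives in hdfs-site.xml on the NameNode, roughly like this:

    <!-- hdfs-site.xml on the NameNode: do not require the DataNode's IP to resolve to a hostname -->
    <property>
      <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
      <value>false</value>
    </property>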
Is there a way to have a multi-node local HDFS cluster working only using IP addresses and without using DNS?
I would look at reconfiguring your Docker image to use a different hosts file [1]. In particular:
In the Dockerfile(s), do the switcheroo described in [1]
Bring up the master node
Bring up the data nodes, linked
Before starting the DataNode, copy /etc/hosts over to the new location, /tmp/hosts
Append the master node's name and IP to the new hosts file (see the sketch below)
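A rough sketch of what the last two steps might look like in the DataNode container's startup script; the script name, the link alias ('master'), the exposed port (8020), and the resulting MASTER_PORT_8020_TCP_ADDR environment variable are all assumptions, not something from the original answer:

    # datanode-entrypoint.sh (sketch; assumes the NameNode container was linked with alias "master")
    cp /etc/hosts /tmp/hosts                                    # copy hosts to the writable location
    echo "${MASTER_PORT_8020_TCP_ADDR} master" >> /tmp/hosts    # append master name and IP
    exec hdfs datanode                                          # then start the DataNode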
Hope this works for you!
[1] https://github.com/dotcloud/docker/issues/2267#issuecomment-40364340
I want to install Hadoop in a round-robin DNS environment. I have a bunch of machines that share a common user environment and a common name; the machines are interchangeable. The round-robin DNS runs across this group of machines. Each machine has its own IP address and hostname, but they all answer to the shared name. These are our school's machines. When I log in, my terminal shows which machine I am on.
The problem is that when I make a change on one machine, the change applies to all the other machines.
I am following Michael Noll's multi-node Hadoop tutorial, which requires configuring a master node.
But whatever I do to the master node also applies to the slave nodes, so I cannot differentiate the master from the slave nodes.
So, can I install Hadoop in such an environment?
I'm not quite sure why you would want to install Hadoop using a round-robin DNS, but no, you cannot do this with Hadoop. Every single node needs to have a unique host name.