Configuring a slave's hostname using internal IP - Multiple NICs - hadoop

In my Hadoop environment, I need to configure my slave nodes so that when they communicate in the middle of a map/reduce job they use the internal IP instead of the external IP that Hadoop picks up from the hostname.
Is there any way to set up my Hadoop config files to specify that the nodes should communicate using the internal IPs instead of the external IPs? I've already used the internal IPs in my core-site.xml, masters, and slaves files.
I've done some research and I've seen people mention the "slave.host.name" parameter, but which config file would I place this parameter in? Are there any other solutions to this problem?
Thanks!

The IP routing tables have to be changed so that the network between the Hadoop nodes uses a particular gateway. I don't think Hadoop has any setting to change which gateway to use.

You can configure slave.host.name in mapred-site.xml on each slave node.
Also remember to use that host name (instead of the IP) consistently in all other configuration (core-site.xml, hdfs-site.xml, mapred-site.xml, masters, slaves) and also in the /etc/hosts file.
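For example, a minimal mapred-site.xml entry on one slave might look like the sketch below; the value is just a placeholder for that node's internal hostname or IP, and it has to be set separately on each node.
<!-- mapred-site.xml on a slave node: a minimal sketch.
     Replace the value with that node's internal hostname or internal IP. -->
<configuration>
  <property>
    <name>slave.host.name</name>
    <value>10.0.0.12</value>
  </property>
</configuration>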

Related

How to configure HDFS to listen to 0.0.0.0

I have an HDFS cluster listening on 192.168.50.1:9000, which means it only accepts connections via that IP. I would like it to listen at 0.0.0.0:9000. When I enter 127.0.0.1 localhost master in /etc/hosts, it starts at 127.0.0.1:9000, which prevents all other nodes from connecting.
This question is similar to How to make Hadoop servers listen on all IPs, but for HDFS, not YARN.
Is there an equivalent setting for core-site.xml like yarn.resourcemanager.bind-host or any other way to configure this? If not, then what's the reasoning behind this? Is it a security feature?
For the NameNode you need to set these to 0.0.0.0 in your hdfs-site.xml:
dfs.namenode.rpc-bind-host
dfs.namenode.servicerpc-bind-host
dfs.namenode.lifeline.rpc-bind-host
dfs.namenode.http-bind-host
dfs.namenode.https-bind-host
The DataNodes use 0.0.0.0 by default.
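As a sketch, the hdfs-site.xml entries for two of the properties above would look roughly like this; the same pattern applies to the other bind-host properties, and only the services you actually run need entries.
<!-- hdfs-site.xml: bind the NameNode RPC and HTTP servers to all interfaces.
     The corresponding *-address properties still advertise the real hostname. -->
<configuration>
  <property>
    <name>dfs.namenode.rpc-bind-host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>dfs.namenode.http-bind-host</name>
    <value>0.0.0.0</value>
  </property>
</configuration>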
If you ever need to find a config variable for HDFS, refer to hdfs-default.xml.
Also very useful: on any page of the official Hadoop docs, the bottom-left corner links to the default values for the various XML files.
So you can go to Apache Hadoop 2.8.0 or your specific version and find the settings you're looking for.
Well, the question is quite old already; however, you usually do not need to configure the bind address 0.0.0.0 because it is the default value. More likely you have an entry in the /etc/hosts file such as 127.0.0.1 hostname, which Hadoop resolves to 127.0.0.1.
Remove that entry and Hadoop will bind to all interfaces (0.0.0.0) without any additional entries in the config files.

How to configure ports for hostname and localhost?

I am running a browser on a single-node Hortonworks Hadoop cluster (HDP 2.3.4) on CentOS 6.7:
With localhost:8000 and <hostname>:8000, I can access Hue. The same works for Ambari at 8080.
However, several other ports I can only access with the hostname. With e.g. <hostname>:50070, I can access the NameNode service; if I use localhost:50070, I cannot establish a connection. So I assume localhost is blocked, but the NameNode is not.
How can I set things up so that localhost and <hostname> have the same port configuration?
This likely indicates that the NameNode HTTP server socket is bound to a single network interface, but not the loopback interface. The NameNode HTTP server address is controlled by configuration property dfs.namenode.http-address in hdfs-site.xml. Typically this specifies a host name or IP address, and this maps to a single network interface. You can tell it to bind to all network interfaces by setting property dfs.namenode.http-bind-host to 0.0.0.0 (the wildcard address, matching all network interfaces). The NameNode must be restarted for this change to take effect.
There are similar properties for other Hadoop daemons. For example, YARN has a property named yarn.resourcemanager.bind-host for controlling how the ResourceManager binds to a network interface for its RPC server.
More details are in the Apache Hadoop documentation for hdfs-default.xml and yarn-default.xml. There is also full coverage of multi-homed deployments in HDFS Support for Multihomed Networks.
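For the YARN case mentioned above, a minimal yarn-site.xml sketch on the ResourceManager host would be:
<!-- yarn-site.xml: let the ResourceManager servers bind to all interfaces,
     while yarn.resourcemanager.hostname still advertises the real hostname. -->
<configuration>
  <property>
    <name>yarn.resourcemanager.bind-host</name>
    <value>0.0.0.0</value>
  </property>
</configuration>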

Hadoop Cluster distributed in different sub-networks (Docker + Flannel)

I want to run Hadoop 2.3.0 on a multi-node bare-metal cluster using Docker. I have a master container and a slave container (in this first setup). When the Master and Slave containers are on the same host (and therefore in the same Flannel subnet), Hadoop works perfectly. However, if the Master and Slave are on different bare-metal nodes (hence, different Flannel subnets), it simply does not work (I get a connection refused error). Both containers can ping and ssh to one another, so there is no connectivity problem. For some reason, it seems that Hadoop needs all the nodes in the cluster to be in the same subnet. Is there a way to circumvent this?
Thanks
I think having the nodes in separate Flannel subnets introduces some NAT-related rules which cause such issues.
See the link below, which seems to address a similar issue:
Re: Networking Problem in creating HDFS cluster.
Hadoop uses a number of other ports for communication between the nodes; the above assumes these ports are unblocked.
ssh and ping are not enough. If you have iptables or any other firewall, you need to either disable it or open up the required ports. You can set up the cluster as long as the hosts can communicate with each other and the ports are open. Run telnet <namenode> <port> to ensure the hosts are communicating on the desired ports.

How can I connect apache Nutch 2.x to a remote HBase cluster?

I have two machines. One machine runs HBase 0.92.2 in pseudo-distributed mode, while the other one runs the Nutch 2.x crawler. How can I configure these two machines so that the machine with HBase 0.92.2 acts as the back-end storage and the one with Nutch 2.x acts as the crawler?
I finally did it. It was easy to do.
I am sharing my experience here; maybe it can help someone.
1- Change the hbase-site.xml configuration file for pseudo-distributed mode (a minimal sketch follows these steps).
2- MOST IMPORTANT THING: on the HBase machine, replace the localhost IP in /etc/hosts with your real network IP, like this:
10.11.22.189 master localhost
(the HBase machine's IP is 10.11.22.189)
(note: if you don't change the HBase machine's localhost IP, the remote Nutch crawler won't be able to connect to it)
3- Copy/symlink hbase-site.xml into $NUTCH_HOME/conf
4- Start your crawler and see it working
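To make step 1 concrete, here is a minimal client-side hbase-site.xml sketch; it assumes ZooKeeper runs on the HBase machine from step 2 (10.11.22.189) and listens on the default client port 2181. The same file is then copied or symlinked into $NUTCH_HOME/conf in step 3.
<!-- hbase-site.xml: minimal client-side sketch, assuming ZooKeeper runs on the
     HBase machine (10.11.22.189) and listens on the default port 2181. -->
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>10.11.22.189</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>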

Hadoop Dedoop Application unable to contact Hadoop Namenode : Getting "Unable to contact Namenode" error

I'm trying to use the Dedoop application, which runs using Hadoop and HDFS on Amazon EC2. The Hadoop cluster is set up, and the NameNode, JobTracker, and all other daemons are running without error.
But the Dedoop.war application is not able to connect to the Hadoop NameNode after being deployed on Tomcat.
I have also checked to see if the ports are open in EC2.
Any help is appreciated.
If you're using Amazon AWS, I highly recommend using Amazon Elastic MapReduce. Amazon takes care of setting up and provisioning the Hadoop cluster for you, including things like setting up IP addresses, the NameNode, etc.
If you're setting up your own cluster on EC2, you have to be careful with public/private IP addresses. Most likely, you are pointing to the external IP addresses - can you replace them with the internal IP addresses and see if that works?
Can you post some lines of the stack trace from Tomcat's log files?
Dedoop must establish a SOCKS proxy server (similar to ssh -D port username@host) to pass connections to the Hadoop nodes on EC2. This is mainly because Hadoop resolves public IPs to EC2-internal IPs, which breaks MR job submission and HDFS management.
To this end, Tomcat must be configured to establish SSH connections. The setup procedure is described here.
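For reference, the usual client-side way to push Hadoop RPC traffic through such a SOCKS proxy is via core-site.xml; the sketch below assumes the tunnel listens on localhost:6666, and whether Dedoop configures this itself or expects it from the deployment is not covered here.
<!-- core-site.xml on the client side: route Hadoop RPC through a SOCKS proxy.
     Assumes a dynamic SSH tunnel (e.g. ssh -D 6666 user@ec2-host) is running. -->
<configuration>
  <property>
    <name>hadoop.rpc.socket.factory.class.default</name>
    <value>org.apache.hadoop.net.SocksSocketFactory</value>
  </property>
  <property>
    <name>hadoop.socks.server</name>
    <value>localhost:6666</value>
  </property>
</configuration>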
