access hdfs outside docker swarm - hadoop

I have created a Docker swarm configuration which consists of a namenode, datanode, resource manager, and YARN workers. These all work well together and I can run hdfs dfs commands from any container in the swarm. I have also exposed port 9000 using the ports section of the stack YAML. In my core-site.xml I use the hostname of the namenode from the swarm configuration.
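For reference, the relevant pieces look roughly like this (the image name and host address are placeholders, not my real values):

services:
  namenode:
    image: my-hadoop-image          # placeholder image
    hostname: namenode
    ports:
      - "9000:9000"                 # publish the NameNode RPC port on the swarm host

and the external client's core-site.xml points at the swarm host instead of the overlay hostname:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- address of the machine hosting the swarm (placeholder) -->
    <value>hdfs://swarm-host.example.com:9000</value>
  </property>
</configuration>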
I am unable to get a client outside of the swarm to access the cluster using the hdfs dfs commands. I have a different core-site.xml which has the address of the host machine for the swarm. When I run commands I get a java.io.EOFException.
Is there any way to get an external client to connect to the hadoop cluster running in docker swarm?

Turns out this is resolved by following the multihomed network instructions. It appears to the namenode as though it has two network interfaces to work with: one for the overlay network across the swarm and another from the exposed port.
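Concretely, the relevant settings from that guide go in hdfs-site.xml on the namenode. This is a sketch; the exact set of properties you need may differ:

<configuration>
  <!-- Bind the NameNode RPC and HTTP servers to all interfaces,
       not only the overlay network address -->
  <property>
    <name>dfs.namenode.rpc-bind-host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>dfs.namenode.servicerpc-bind-host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>dfs.namenode.http-bind-host</name>
    <value>0.0.0.0</value>
  </property>
  <!-- Have clients and datanodes address datanodes by hostname rather than overlay IP -->
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.datanode.use.datanode.hostname</name>
    <value>true</value>
  </property>
</configuration>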

Related

Hadoop cluster with docker swarm

I'm trying to set up a Hadoop cluster inside a Docker swarm with multiple hosts, with a datanode on each Docker node using a mounted volume. I made some tests and it works fine, but the problem comes when a datanode dies and then returns.
I restarted 2 hosts at the same time, and when the containers run again they get a new IP. The problem is that the namenode gives an error because it thinks the returning datanode is a different one.
ERROR org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.getDatanode: Data node 10.0.0.13:50010 is attempting to report storage ID 3a7b556f-7364-460e-beac-173132d77503. Node 10.0.0.9:50010 is expected to serve this storage.
Is it possible to prevent Docker from assigning a new IP, and instead keep the same IP after a restart?
Or is there any Hadoop configuration option to fix this?
Static IP addresses for containers attached to an overlay network are officially not supported for the time being, as noted here: https://github.com/moby/moby/issues/31860.
I hope that Docker will provide a solution for this soon.

How can I setup Docker container with individual IP address on EC2 instance?

I'm currently trying to set up a Hadoop cluster with Docker on EC2. Namely, I have several EC2 instances, and on each instance there is a Docker container running a Hadoop program. But the connection between containers is tricky in Docker, so I want to assign an individual IP address to every container. How can I do that?
What can I do if I want to assign an individual IP to each Docker container on EC2?
When you run a docker container, you can specify the network. --net="host" means that the container shares the network stack of the host computer.
You can assign multiple IPs to your EC2 instance and then run the Hadoop containers like this:
docker run --net="host" hadoop
This will give the container access to all network interfaces and you can decide in the container which one to bind to.
However please note the following warning:
--net="host" gives the container full access to local system services such as D-bus and is therefore considered insecure.

How to configure ports for hostname and localhost?

I am running a browser on a single-node Hortonworks Hadoop cluster (HDP 2.3.4) on CentOS 6.7.
With localhost:8000 and <hostname>:8000 I can access Hue. The same works for Ambari on 8080.
However, several other ports I can only access with the hostname. For example, with <hostname>:50070 I can access the namenode service, but with localhost:50070 I cannot establish a connection. So I assume localhost is blocked, but the namenode is not.
How can I set things up so that localhost and <hostname> have the same port configuration?
This likely indicates that the NameNode HTTP server socket is bound to a single network interface, but not the loopback interface. The NameNode HTTP server address is controlled by configuration property dfs.namenode.http-address in hdfs-site.xml. Typically this specifies a host name or IP address, and this maps to a single network interface. You can tell it to bind to all network interfaces by setting property dfs.namenode.http-bind-host to 0.0.0.0 (the wildcard address, matching all network interfaces). The NameNode must be restarted for this change to take effect.
There are similar properties for other Hadoop daemons. For example, YARN has a property named yarn.resourcemanager.bind-host for controlling how the ResourceManager binds to a network interface for its RPC server.
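For example, a minimal sketch (put each property in the file the corresponding daemon reads, then restart it):

hdfs-site.xml:

<property>
  <name>dfs.namenode.http-bind-host</name>
  <value>0.0.0.0</value>
</property>

yarn-site.xml:

<property>
  <name>yarn.resourcemanager.bind-host</name>
  <value>0.0.0.0</value>
</property>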
More details are in the Apache Hadoop documentation for hdfs-default.xml and yarn-default.xml. There is also full coverage of multi-homed deployments in HDFS Support for Multihomed Networks.

Hadoop Cluster distributed in different sub-networks (Docker + Flannel)

I want to run Hadoop 2.3.0 on a multi-host bare-metal cluster using Docker. I have a master container and a slave container (in this first setup). When the master and slave containers are on the same host (and therefore in the same Flannel subnet), Hadoop works perfectly. However, if the master and slave are on different bare-metal nodes (hence different Flannel subnets), it simply does not work (I get a connection refused error). Both containers can ping and ssh one another, so there is no connectivity problem. For some reason, it seems that Hadoop needs all the nodes in the cluster to be in the same subnet. Is there a way to circumvent this?
Thanks
I think having the nodes in separate flannel subnets introduces some NAT-related rules which cause such issues.
See the link below, which seems to have addressed a similar issue:
Re: Networking Problem in creating HDFS cluster.
Hadoop uses a number of other ports for communication between the nodes; the above assumes these ports are unblocked.
ssh and ping are not enough. If you have iptables or any other firewalls, you need to either disable them or open up the relevant ports. You can set up the cluster as long as the hosts can communicate with each other and the ports are open. Run telnet <namenode> <port> to ensure the hosts are communicating on the desired ports.
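For example, assuming the default Hadoop 2.x ports (the hostnames are placeholders and your ports may differ):

telnet namenode 8020            # NameNode RPC (9000 in some setups)
telnet namenode 50070           # NameNode web UI
telnet datanode 50010           # DataNode data transfer
telnet resourcemanager 8032     # ResourceManager client RPC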

Hadoop Dedoop Application unable to contact Hadoop Namenode: Getting "Unable to contact Namenode" error

I'm trying to use the Dedoop application, which runs on Hadoop and HDFS on Amazon EC2. The Hadoop cluster is set up, and the NameNode, JobTracker, and all the other daemons are running without error.
But the Dedoop.war application is not able to connect to the Hadoop NameNode after being deployed on Tomcat.
I have also checked to see if the ports are open in EC2.
Any help is appreciated.
If you're using Amazon AWS, I highly recommend using Amazon Elastic MapReduce. Amazon takes care of setting up and provisioning the Hadoop cluster for you, including things like setting up IP addresses, the NameNode, etc.
If you're setting up your own cluster on EC2, you have to be careful with public/private IP addresses. Most likely, you are pointing to the external IP addresses - can you replace them with the internal IP addresses and see if that works?
Can you post some lines of the stack trace from Tomcat's log files?
Dedoop must establish a SOCKS proxy server (similar to ssh -D port username@host) to pass connections to the Hadoop nodes on EC2. This is mainly because Hadoop resolves public IPs to EC2-internal IPs, which breaks MR job submission and HDFS management.
To this end, Tomcat must be configured to establish SSH connections. The setup procedure is described here.
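The client side of that setup looks roughly like this (the port and host below are placeholders; the linked procedure has the exact details):

# Open a SOCKS tunnel to the EC2 master node
ssh -D 6666 username@ec2-master-public-dns

Then Hadoop's RPC layer is routed through the tunnel in core-site.xml:

<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>
<property>
  <name>hadoop.socks.server</name>
  <value>localhost:6666</value>
</property>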
