Connecting to Hadoop HDFS within a Cloudera virtual machine via IP - hadoop

I'm looking to connect to a Hadoop instance on a Cloudera virtual machine via the server address of HDFS. Would anyone know how to find this IP address?
If not, how could I connect locally to HDFS within the virtual machine, given that both are running on the same computer?
I need this for a Pentaho Kettle connection.

If you're trying to configure Pentaho Data Integration (Kettle) to use HDFS as an input data source, then first you'll need to get the hostname/IP address and port number of the HDFS NameNode service, which you will then enter into (Pentaho) Spoon (the GUI to Kettle).
Getting HDFS NameNode IP/port number
The default port of the Hadoop HDFS NameNode service is 8020 in both CDH4 and 5 (source).
If for some reason you're not using the defaults, then the hostname/port of the HDFS NameNode service can be found in Cloudera Manager (which should be installed if you're using the Cloudera Quickstart VM, for example):
Click on the HDFS service on the main Cloudera Manager page
Click on Configuration - View and Edit
Click on NameNode - Ports and Addresses.
"NameNode Port" is the one you want, not "NameNode Web UI Port", because PDI needs the NameNode port.
Browse HDFS files in PDI to confirm
To test, open Pentaho Data Integration (Spoon) and create, for example, a "Hadoop Copy Files" step. Then enter your HDFS details in the "browse files" area and check whether a directory listing shows up.
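Before opening Spoon, it can help to confirm that the NameNode port is reachable at all from the machine running PDI. The following is a minimal sketch and not part of the original answer: the function name `can_connect` is an assumption for illustration, and you would pass the hostname or IP of your VM together with the port shown in Cloudera Manager.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Usage (substitute your VM's address; 8020 is the default NameNode port):
#   can_connect("<vm-ip-or-hostname>", 8020)
```

A `True` result only shows that something accepts TCP connections on that port; the directory listing in Spoon remains the real test.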

Related

Airflow conn_id with multiple server

I am using WebHDFSSensor, which requires a NameNode host. However, the active and standby NameNodes can switch roles, so I can't just put the current active NameNode host in webhdfs_conn_id. I need the connection to cover both hosts. I tried providing the host as an array, but it didn't work.
So my question is: suppose I need a connection named webhdfs_default that covers two hosts, w.x.y.z and a.b.c.d. How do I create that?
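One way to reason about the problem, offered as a hedged sketch rather than an Airflow feature: probe each candidate NameNode over WebHDFS and use the first one that answers successfully. The helper names `probe_webhdfs` and `first_active`, the port 50070, and the assumption that a standby NameNode rejects the request are all illustrative and should be checked against your cluster. The sketch does not show how to register the result as webhdfs_default.

```python
import http.client
from typing import Callable, Iterable, Optional
from urllib.request import urlopen


def probe_webhdfs(host: str, port: int = 50070, timeout: float = 3.0) -> bool:
    """Return True if a WebHDFS GETFILESTATUS call on "/" returns HTTP 200.

    Assumption: a standby or unreachable NameNode answers with an error
    status or not at all, which surfaces here as an exception.
    """
    url = f"http://{host}:{port}/webhdfs/v1/?op=GETFILESTATUS"
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are subclasses of OSError.
        return False


def first_active(hosts: Iterable[str],
                 is_active: Callable[[str], bool] = probe_webhdfs) -> Optional[str]:
    """Return the first host for which is_active(host) is True, else None."""
    for host in hosts:
        if is_active(host):
            return host
    return None


# Usage with the two placeholder hosts from the question:
#   first_active(["w.x.y.z", "a.b.c.d"])
```

Passing the probe in as a parameter keeps the selection logic testable without a live cluster.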

Access HDFS from outside Docker Swarm

I have created a docker swarm configuration which consists of a namenode, datanode, resource manager, and yarn workers. These all work well together and I can run hdfs dfs commands from any container in the swarm. I have also exposed the port 9000 using the ports section of the yaml. In my core-site.xml I use the hostname of the namenode from the swarm configuration.
I am unable to get a client outside of the swarm to access the cluster using the hdfs dfs commands. I have a different core-site.xml which has the address of the host machine for the swarm. When I run commands I get a java.io.EOFException.
Is there any way to get an external client to connect to the hadoop cluster running in docker swarm?
It turns out this is resolved by following the multihomed network instructions. The NameNode appears to see two network interfaces: one for the overlay network across the swarm and another created by exposing the port.
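The multihomed setup in the Hadoop documentation is driven by a handful of hdfs-site.xml properties. As a rough sketch (the post does not say which ones were actually set, so the selection and values below are assumptions), this script renders the commonly cited ones as `<property>` elements that could be merged into hdfs-site.xml:

```python
import xml.etree.ElementTree as ET

# Property names come from the HDFS multihomed-network guide; the values
# are assumptions, and the original post does not say which were used.
MULTIHOMED_PROPERTIES = {
    "dfs.namenode.rpc-bind-host": "0.0.0.0",
    "dfs.namenode.servicerpc-bind-host": "0.0.0.0",
    "dfs.namenode.http-bind-host": "0.0.0.0",
    "dfs.client.use.datanode.hostname": "true",
    "dfs.datanode.use.datanode.hostname": "true",
}


def render_properties(properties: dict) -> str:
    """Render name/value pairs as a Hadoop <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


print(render_properties(MULTIHOMED_PROPERTIES))
```

Binding the NameNode to 0.0.0.0 lets it accept connections on both the overlay interface and the exposed port, while the hostname settings make clients reach DataNodes by name instead of by an internal IP.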

Downloading Hadoop Data from other PC

I have Hadoop v2.6 installed on one PC running Ubuntu 14.04. I have added a lot of unstructured data to HDFS using the Hadoop -put command.
Can someone tell me how to download this data from another PC, which is not in the Hadoop cluster, using the web user interface provided by Hadoop?
I can access the data from the other PC by typing the following into the browser's address bar: (the IP address of the HDFS server):(port number)
Like this: 192.168.x.x:50070
The problem is that I am not able to download the data; the browser gives the error "Webpage Not Available". I also tried other browsers, but still no luck.
Port 50070 is the default NameNode web UI port. Try port 14000, which is the default HttpFS port. If it still doesn't work, try the example from the manual:
http://192.168.x.x:14000?user.name=babu&op=homedir
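HttpFS exposes the WebHDFS REST API, in which a file is read with the OPEN operation under the /webhdfs/v1 prefix. The sketch below only builds such a URL. The function name `httpfs_open_url`, the sample path, and the user name are placeholders, and it assumes HttpFS is actually running on port 14000.

```python
from urllib.parse import quote, urlencode


def httpfs_open_url(host: str, hdfs_path: str, user: str, port: int = 14000) -> str:
    """Build a WebHDFS-style URL that reads hdfs_path through HttpFS."""
    # Percent-encode the path but keep "/" separators intact.
    encoded_path = quote(hdfs_path.lstrip("/"), safe="/")
    query = urlencode({"op": "OPEN", "user.name": user})
    return f"http://{host}:{port}/webhdfs/v1/{encoded_path}?{query}"


# Hypothetical path and user; the host placeholder is the one from the question.
print(httpfs_open_url("192.168.x.x", "/user/babu/data.txt", "babu"))
```

Opening the resulting URL in a browser, or fetching it with any HTTP client, should return the file contents if the service and permissions are set up.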

Hadoop Dedoop Application unable to contact Hadoop Namenode : Getting "Unable to contact Namenode" error

I'm trying to use the Dedoop application, which runs on Hadoop and HDFS on Amazon EC2. The Hadoop cluster is set up, and the NameNode, JobTracker, and all other daemons are running without error.
But the Dedoop.war application is not able to connect to the Hadoop NameNode after I deploy it on Tomcat.
I have also checked to see if the ports are open in EC2.
Any help is appreciated.
If you're using Amazon AWS, I highly recommend using Amazon Elastic MapReduce. Amazon takes care of setting up and provisioning the Hadoop cluster for you, including things like setting up IP addresses, the NameNode, etc.
If you're setting up your own cluster on EC2, you have to be careful with public/private IP addresses. Most likely, you are pointing to the external IP addresses - can you replace them with the internal IP addresses and see if that works?
Can you post some lines of the stack trace from Tomcat's log files?
Dedoop must establish a SOCKS proxy server (similar to ssh -D port username@host) to pass connections to the Hadoop nodes on EC2. This is mainly because Hadoop resolves public IPs to EC2-internal IPs, which breaks MR job submission and HDFS management.
To this end, Tomcat must be configured to establish SSH connections. The setup procedure is described here.
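For reference, the kind of tunnel mentioned above can be opened by hand with OpenSSH's dynamic port forwarding. The sketch below just assembles that command line. The function name, port 1080, and the user and host values are placeholders, and it does not reproduce Dedoop's own setup procedure.

```python
import shlex
from typing import List


def socks_tunnel_command(user: str, host: str, local_port: int = 1080) -> List[str]:
    """Return an ssh argument list that opens a local SOCKS proxy.

    -D enables dynamic (SOCKS) forwarding on local_port;
    -N tells ssh not to run a remote command, so it only forwards.
    """
    return ["ssh", "-N", "-D", str(local_port), f"{user}@{host}"]


# Placeholder user and host; pass the list to subprocess.run to start the tunnel.
print(shlex.join(socks_tunnel_command("username", "host")))
```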

Hadoop namenode web UI not opening in CDH4

I recently installed the CDH distribution of Cloudera to create a 2 node cluster. From the Cloudera Manager UI, all services are running fine.
All the command-line tools (hive, etc.) are also working fine, and I am able to read and write data to HDFS.
However, the NameNode (and DataNode) web UI alone is not opening. According to netstat -a | grep LISTEN, the processes are listening on the assigned ports, and there are no firewall rules blocking the connections (I have already disabled iptables).
I initially thought it could be a DNS issue, but even the IP address does not work. Cloudera Manager, installed on the same machine on another port, opens fine.
Any pointers on how to debug this problem?
I faced the same issue.
At first it was because the NameNode was in safe mode.
After that, it was because of two IP addresses: I have two NICs configured on the CDH cluster, one for internal connectivity between the servers (10.0.0.1) and the other for connecting to the servers from the Internet (192.168.0.1).
When I try to open the NameNode GUI from any server connected to the cluster on the 10.0.0.1 network, the GUI opens and works fine, but from any other machine connected to the servers via the 192.168.0.1 network it fails.
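The two-NIC case can be spotted in the netstat output the question already mentions: a service bound to one interface's address is only reachable through that interface, whereas one bound to 0.0.0.0 accepts connections on all of them. A small sketch, assuming Linux-style `netstat -tln` lines; the sample line and port 50070 are illustrative and not taken from the poster's cluster:

```python
from typing import List


def listen_addresses(netstat_output: str, port: int) -> List[str]:
    """Return the local addresses on which `port` is in LISTEN state."""
    addresses = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto, recv-q, send-q, local, foreign, state
        if len(fields) < 6 or fields[5] != "LISTEN":
            continue
        address, _, local_port = fields[3].rpartition(":")
        if local_port == str(port):
            addresses.append(address)
    return addresses


def reachable_via(bind_address: str, interface_ip: str) -> bool:
    """True if a socket bound to bind_address accepts traffic sent to interface_ip."""
    return bind_address in ("0.0.0.0", "::") or bind_address == interface_ip


# Illustrative sample: the web UI is bound to the internal NIC only.
sample = "tcp        0      0 10.0.0.1:50070          0.0.0.0:*               LISTEN"
for address in listen_addresses(sample, 50070):
    print(address, reachable_via(address, "192.168.0.1"))
```

If the UI turns out to be bound to the internal address only, that would match the behaviour described: it opens from the 10.0.0.1 network and fails from the 192.168.0.1 side.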

Resources