I'm facing an issue with WebHDFS access on my Amazon EC2 machine, on which I have installed Hortonworks HDP 2.3.
I can retrieve the file status from my local machine in the browser (Chrome) with the following HTTP request:
http://<serverip>:50070/webhdfs/v1/user/admin/file.csv?op=GETFILESTATUS
This works fine, but if I try to open the file with ?op=OPEN, it redirects me to the private DNS name of the machine, which I cannot access:
http://<privatedns>:50075/webhdfs/v1/user/admin/file.csv?op=OPEN&namenoderpcaddress=<privatedns>:8020&offset=0
I also tried to get access to WebHDFS from the AWS machine itself with this command:
[ec2-user@<ip> conf]$ curl -i http://localhost:50070/webhdfs/v1/user/admin/file.csv?op=GETFILESTATUS
curl: (7) couldn't connect to host
Does anyone know why I cannot connect to localhost, or why the OPEN from my local machine does not work?
Unfortunately I couldn't find any tutorial on configuring WebHDFS for an Amazon machine.
Thanks in Advance
What happens is that the namenode redirects you to the datanode. Seems like you installed a single-node cluster, but conceptually the namenode and datanode(s) are distinct, and in your configuration the datanode(s) live/listen on the private side of your EC2 VPC.
You could reconfigure your cluster to host the datanodes on the public IP/DNS (see HDFS Support for Multihomed Networks), but I would not go that way. I think the proper solution is to add a Knox gateway, which is a specialized component for accessing a private cluster through a public API. Specifically, you will have to configure the datanode URLs; see Chapter 5, Mapping the Internal Nodes to External URLs. The example there seems spot on for your case (a hedged request sketch follows the quoted steps below):
For example, when uploading a file with WebHDFS service:
The external client sends a request to the gateway WebHDFS service.
The gateway proxies the request to WebHDFS using the service URL.
WebHDFS determines which DataNodes to create the file on and returns
the path for the upload as a Location header in a HTTP redirect, which
contains the datanode host information.
The gateway augments the routing policy based on the datanode hostname
in the redirect by mapping it to the externally resolvable hostname.
The external client continues to upload the file through the gateway.
The gateway proxies the request to the datanode by using the augmented
routing policy.
The datanode returns the status of the upload and the gateway again
translates the information without exposing any internal cluster
details.
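For illustration, a minimal sketch of what the client side looks like once Knox is in place (the gateway host, port 8443, topology name "default", and the credentials are assumptions, not values from your cluster):

$ curl -i -k -u admin:admin-password \
    "https://<gateway-host>:8443/gateway/default/webhdfs/v1/user/admin/file.csv?op=OPEN"

The gateway rewrites the datanode redirect to point back at itself, so the private DNS name never reaches your client.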
Related
I am using Hadoop 2.7.1 on CentOS 7.
When high availability is enabled on the Hadoop cluster and the active NameNode fails, it becomes standby.
But WebHDFS doesn't support high availability, does it?
What should be the alternative for sending GET and PUT requests to the other (now active) NameNode when the original active NameNode fails?
Yes, WebHDFS is not High Availability aware. This issue is still open; refer to HDFS-6371.
Instead, you can opt for HttpFS. It is interoperable with the WebHDFS REST API and is HA aware.
Or, write your own custom implementation to redirect requests to the active NameNode.
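If you go the custom route, a minimal sketch of the idea (nn1/nn2 below are placeholders for your two NameNode hosts): try one NameNode and fall back to the other, since the standby rejects WebHDFS operations.

for nn in nn1.example.com nn2.example.com; do
  # -f makes curl fail on the standby's error response, so the loop tries the next host
  if curl -sf "http://$nn:50070/webhdfs/v1/user/admin/file.csv?op=GETFILESTATUS"; then
    break
  fi
done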
The WebHDFS server runs in the same process as the NameNode, so you need to run a WebHDFS-compatible proxy server that hides NameNode failover from clients:
HttpFS - part of Hadoop
Apache Knox - part of the HDP distribution.
They are both WebHDFS compatible, so you don't need to change any REST API calls.
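As a rough illustration of how little the client changes (the HttpFS host and its default port 14000 are assumptions; adjust to your setup), the URL scheme stays webhdfs/v1 and only the host and port point at the proxy instead of a NameNode:

$ curl -i "http://<httpfs-host>:14000/webhdfs/v1/user/admin/file.csv?op=GETFILESTATUS&user.name=admin"

HttpFS forwards the request to whichever NameNode is currently active, so a failover is invisible to the client.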
I am running a browser on a single-node Hortonworks Hadoop cluster (HDP 2.3.4) on CentOS 6.7:
With localhost:8000 and <hostname>:8000, I can access Hue. The same works for Ambari on port 8080.
However, several other ports I can only access via the hostname. For example, with <hostname>:50070 I can reach the NameNode service, but with localhost:50070 I cannot establish a connection. So I assume localhost is blocked, but the hostname is not.
How can I set it up so that localhost and <hostname> have the same port configuration?
This likely indicates that the NameNode HTTP server socket is bound to a single network interface, but not the loopback interface. The NameNode HTTP server address is controlled by configuration property dfs.namenode.http-address in hdfs-site.xml. Typically this specifies a host name or IP address, and this maps to a single network interface. You can tell it to bind to all network interfaces by setting property dfs.namenode.http-bind-host to 0.0.0.0 (the wildcard address, matching all network interfaces). The NameNode must be restarted for this change to take effect.
There are similar properties for other Hadoop daemons. For example, YARN has a property named yarn.resourcemanager.bind-host for controlling how the ResourceManager binds to a network interface for its RPC server.
More details are in the Apache Hadoop documentation for hdfs-default.xml and yarn-default.xml. There is also full coverage of multi-homed deployments in HDFS Support for Multihomed Networks.
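A minimal sketch of the change, assuming you edit hdfs-site.xml by hand (on an Ambari-managed HDP cluster you would set the same property through Ambari instead):

dfs.namenode.http-bind-host : 0.0.0.0

$ netstat -tlnp | grep 50070    # after the restart: should show 0.0.0.0:50070 rather than <hostname-ip>:50070
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/    # expect 200 from localhost now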
I am using the following setup for Hadoop's NameNode web UI access:
dfs.namenode.http-address : 127.0.0.1:50070
With this I am able to access the NameNode web UI only from the local machine, as:
http://127.0.0.1:50070
Is there any way to make it accessible from outside as well? For example:
http://<Machine-IP>:50070
Thanks in Advance !!
In dfs.namenode.http-address, use the machine's hostname or IP address instead of localhost/127.0.0.1.
Make sure you can ping the hostname or IP from the remote machine. If you can ping it, then you should be able to access the web UI.
To check this:
Open a cmd/terminal window
Type the command below on the remote machine:
ping hostname/ip
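A small sketch of the change and the check (0.0.0.0 is an alternative if you want the UI reachable on every interface; if the machine runs on EC2, TCP 50070 must also be open in its security group):

dfs.namenode.http-address : <Machine-IP>:50070    # or 0.0.0.0:50070

$ ping <Machine-IP>                     # from the remote machine
$ curl -I http://<Machine-IP>:50070/    # expect HTTP response headers if the port is reachable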
From http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html
The following table lists web interfaces that you can view on the core
and task nodes. These Hadoop interfaces are available on all clusters.
To access the following interfaces, replace slave-public-dns-name in
the URI with the public DNS name of the node. For more information
about retrieving the public DNS name of a core or task node instance,
see Connecting to Your Linux/Unix Instances Using SSH in the Amazon
EC2 User Guide for Linux Instances. In addition to retrieving the
public DNS name of the core or task node, you must also edit the
ElasticMapReduce-slave security group to allow SSH access over TCP
port 22. For more information about modifying security group rules,
see Adding Rules to a Security Group in the Amazon EC2 User Guide for
Linux Instances.
YARN ResourceManager
YARN NodeManager
Hadoop HDFS NameNode
Hadoop HDFS DataNode
Spark HistoryServer
Because there are several application-specific interfaces available on
the master node that are not available on the core and task nodes, the
instructions in this document are specific to the Amazon EMR master
node. Accessing the web interfaces on the core and task nodes can be
done in the same manner as you would access the web interfaces on the
master node.
There are several ways you can access the web interfaces on the master
node. The easiest and quickest method is to use SSH to connect to the
master node and use the text-based browser, Lynx, to view the web
sites in your SSH client. However, Lynx is a text-based browser with a
limited user interface that cannot display graphics. The following
example shows how to open the Hadoop ResourceManager interface using
Lynx (Lynx URLs are also provided when you log into the master node
using SSH).
lynx http://ip-###-##-##-###.us-west-2.compute.internal:8088/
There are two remaining options for accessing web interfaces on the
master node that provide full browser functionality. Choose one of the
following:
Option 1 (recommended for more technical users): Use an SSH client to connect to the master node, configure SSH tunneling with local port
forwarding, and use an Internet browser to open web interfaces hosted
on the master node. This method allows you to configure web interface
access without using a SOCKS proxy.
To do this, use the command:
$ ssh -gnNT -L 9002:localhost:8088 user@example.com
where user@example.com is your username and the master node's address. Note the use of -g to open access to external IP addresses (beware: this is a security risk).
You can check that it is running using:
nmap localhost
To close this SSH tunnel when done, use:
ps aux | grep 9002
to find the PID of your running ssh process, and kill it.
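A hedged alternative if you do not need other machines to use the tunnel (the local port 8157 is arbitrary): drop -g so the forwarded port only listens on your own loopback interface.

$ ssh -nNT -L 8157:localhost:8088 user@example.com
# then open http://localhost:8157 in your browser; Ctrl-C the ssh process to close the tunnel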
Option 2 (recommended for new users): Use an SSH client to connect to the master node, configure SSH tunneling with dynamic port
forwarding, and configure your Internet browser to use an add-on such
as FoxyProxy or SwitchySharp to manage your SOCKS proxy settings. This
method allows you to automatically filter URLs based on text patterns
and to limit the proxy settings to domains that match the form of the
master node's DNS name. The browser add-on automatically handles
turning the proxy on and off when you switch between viewing websites
hosted on the master node, and those on the Internet. For more
information about how to configure FoxyProxy for Firefox and Google
Chrome, see Option 2, Part 2: Configure Proxy Settings to View
Websites Hosted on the Master Node.
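For completeness, a sketch of the Option 2 command (the key file, the hadoop user, and local SOCKS port 8157 are assumptions; substitute your own values):

$ ssh -i ~/mykey.pem -N -D 8157 hadoop@<master-public-dns>
# then point FoxyProxy/SwitchySharp at a SOCKS proxy on localhost:8157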
This seems like insanity to me, but I have been unable to find how to configure access in core-site.xml to override the web interface for the ResourceManager, which by default is available at localhost:8088/. If Amazon think this is the way, then I tend to go along with it.
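If it helps, a hedged pointer (this is not from the quoted AWS text): the ResourceManager web UI address is normally controlled in yarn-site.xml rather than core-site.xml, along the lines of:

yarn.resourcemanager.webapp.address : <hostname>:8088
yarn.resourcemanager.bind-host : 0.0.0.0    # bind the RM's servers to all interfaces

Both properties are documented in yarn-default.xml; restart the ResourceManager after changing them.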
I have configured a single-node cluster on an Amazon EC2 instance (ubuntu-trusty-14.04-amd64-server-20140927 (ami-3d50120d)). Once I start the Hadoop cluster, I visit the NameNode web interface (http://ec2-xx-xx-xx-xx.us-west-2.compute.amazonaws.com:50070/dfshealth.jsp), which works fine. But when navigating to the link that says "Browse the fileSystem", the link is broken and points to http://ip-xxx-xx-xx-xxx.us-west-2.compute.internal:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=/&nnaddr=172.31.25.124:9000 - which is this instance's private IP. The same occurs when visiting the datanode links under "Live Nodes".
Somehow, these links are being resolved to the private IP address of my instance. If I replace the URL with the public DNS of my instance, these pages load correctly. Has anyone seen, or better yet solved, this issue?
Try using the fully qualified host names in Hadoop's configs. I think you need to change core-site.xml and hdfs-site.xml to use your public DNS names.
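A minimal sketch of what that might look like (the DNS name is the one from your question, standing in for your own, and port 9000 matches the nnaddr shown in your broken link):

fs.defaultFS : hdfs://ec2-xx-xx-xx-xx.us-west-2.compute.amazonaws.com:9000            # core-site.xml
dfs.namenode.http-address : ec2-xx-xx-xx-xx.us-west-2.compute.amazonaws.com:50070     # hdfs-site.xml

Restart HDFS afterwards so the NameNode advertises the new addresses in its web UI links.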
Similar issue
Use a SOCKS proxy together with a proxy configuration tool. The instructions for EMR should work the same for an EC2 Hadoop deployment.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-ssh-tunnel.html
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-connect-master-node-proxy.html
I'm trying to use the Dedoop application that runs using Hadoop and HDFS on Amazon EC2. The Hadoop cluster is set up, and the NameNode, JobTracker, and all other daemons are running without error.
But the Dedoop.war application is not able to connect to the Hadoop NameNode after deploying it on Tomcat.
I have also checked to see if the ports are open in EC2.
Any help is appreciated.
If you're using Amazon AWS, I highly recommend using Amazon Elastic Map Reduce. Amazon takes care of setting up and provisioning the Hadoop cluster for you, including things like setting up IP addresses, NameNode, etc.
If you're setting up your own cluster on EC2, you have to be careful with public/private IP addresses. Most likely, you are pointing to the external IP addresses - can you replace them with the internal IP addresses and see if that works?
Can you post some lines of the stack trace from Tomcat's log files?
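In the meantime, a quick hedged check (the config path and port 9000 below are placeholders; substitute whatever your fs.default.name / fs.defaultFS is set to): confirm which address the NameNode advertises and whether Tomcat's host can reach it.

$ grep -A1 fs.default /path/to/hadoop/conf/core-site.xml    # on the cluster: which NameNode URI do clients get?
$ nc -zv <namenode-host> 9000                               # from the Tomcat host: is the NameNode RPC port reachable?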
Dedoop must establish a SOCKS proxy server (similar to ssh -D port username@host) to pass connections to the Hadoop nodes on EC2. This is mainly because Hadoop resolves public IPs to EC2-internal IPs, which breaks MR job submission and HDFS management.
To this end, Tomcat must be configured to establish SSH connections. The setup procedure is described here.