I am using Hadoop 2.7.1 on CentOS 7.
When high availability is enabled on the Hadoop cluster and the active NameNode fails, it becomes the standby.
But WebHDFS doesn't support high availability, does it?
What should be the alternative for sending GET and PUT requests to the other, now active, NameNode when the master NameNode fails?
Yes, WebHDFS is not High Availability aware. This issue is still open; refer to HDFS-6371.
Instead, you can opt for HttpFS. It is interoperable with the WebHDFS REST API and HA aware.
Or, write your own custom implementation that redirects requests to the active NameNode.
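If you go the custom route, one minimal sketch of the idea (assuming two NameNodes reachable as nn1 and nn2 on the default HTTP port 50070) is to probe each NameNode's JMX endpoint and send the WebHDFS request to whichever reports itself as active:

# Probe both NameNodes and keep the one whose NameNodeStatus bean reports "active";
# nn1/nn2 and port 50070 are placeholders for your own hosts.
ACTIVE_NN=""
for NN in nn1:50070 nn2:50070; do
  STATE=$(curl -s "http://$NN/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus" \
          | grep -o '"State" *: *"[a-z]*"' | grep -o 'active\|standby')
  if [ "$STATE" = "active" ]; then
    ACTIVE_NN=$NN
    break
  fi
done

# Issue the WebHDFS call against whichever NameNode is currently active.
curl -i "http://$ACTIVE_NN/webhdfs/v1/user/admin/file.csv?op=GETFILESTATUS"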
The WebHDFS server runs in the same process as the NameNode, so you need to run a WebHDFS-compatible proxy server that shields you from the NameNode failover:
HttpFS - shipped as part of Hadoop
Apache Knox - shipped as part of the HDP distribution
They are both WebHDFS compatible, so you don't need to change any REST API calls.
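For illustration, a call through HttpFS looks exactly like a call to the NameNode; only the host and port change (here httpfs-host is a placeholder and 14000 is the default HttpFS port):

# Same WebHDFS REST call, sent to the HttpFS proxy instead of the NameNode
curl -i "http://httpfs-host:14000/webhdfs/v1/user/admin/file.csv?op=GETFILESTATUS&user.name=admin"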
Is it possible to use multiple Hive servers in the JDBC URL?
jdbc:hive2://ip1:10000,ip2:10000/;transportMode=http;
Basically, I want an active-passive kind of setup: if the first server is not available, I want to use the second one. I don't want to go through the ZooKeeper setup, as load balancing is not required.
I am using Hive over a SOCKS proxy.
ZooKeeper is for failover too, not just load balancing, and it is how you get a highly available HiveServer2 connection.
To provide high availability or load balancing for HiveServer2, Hive provides a function called dynamic service discovery, where multiple HiveServer2 instances can register themselves with ZooKeeper.
https://www.ibm.com/support/knowledgecenter/en/SSCRJT_5.0.1/com.ibm.swg.im.bigsql.admin.doc/doc/admin_HA_HiveS2.html
You should already be using ZooKeeper for a highly available NameNode anyway.
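For reference, with dynamic service discovery the client points at the ZooKeeper quorum instead of a specific HiveServer2 instance; the host names below are placeholders, and the zooKeeperNamespace must match hive.server2.zookeeper.namespace on the servers:

# Connect through the ZooKeeper ensemble; it resolves to a live HiveServer2 instance
beeline -u "jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"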
I am following this guide on Hadoop/FIWARE-Cosmos and I have a question about the Hive part.
I can access the old cluster’s (cosmos.lab.fiware.org) headnode through SSH, but I cannot do so for the new cluster. I tried both storage.cosmos.lab.fiware.org and computing.cosmos.lab.fiware.org and failed to connect.
My intention in trying to connect via SSH was to test Hive queries on our data through the Hive CLI. After failing to do so, I checked and was able to connect to port 10000 of computing.cosmos.lab.fiware.org with telnet. I guess Hive is served through that port. Is this the only way we can use Hive in the new cluster?
The new pair of clusters does not have SSH access enabled. This is because users tend to install a lot of stuff (even stuff not related to Big Data) in the “old” cluster, which had SSH access enabled, as you mention. So the new pair of clusters is intended to be used only through the exposed APIs: WebHDFS for data I/O and Tidoop for MapReduce.
That being said, a Hive server is running as well, and it should be exposing a remote service on port 10000, as you also mention. I say “it should be” because it runs an experimental authenticator module based on OAuth2, as WebHDFS and Tidoop do. Theoretically, connecting to that port from a Hive client is as easy as using your Cosmos username and a valid token (the same one you are using for WebHDFS and/or Tidoop).
And what about a remote Hive client? Well, that is something your application should implement. Anyway, I have uploaded some implementation examples to the Cosmos repo. For instance:
https://github.com/telefonicaid/fiware-cosmos/tree/develop/resources/java/hiveserver2-client
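If you just want a quick test from the command line rather than a full client, something along these lines should work, assuming the experimental authenticator accepts the Cosmos username and token as ordinary credentials (that is an assumption; the repo examples above are the authoritative reference):

# <cosmos-username> and <oauth2-token> are placeholders for your own credentials
beeline -u "jdbc:hive2://computing.cosmos.lab.fiware.org:10000/default" -n <cosmos-username> -p <oauth2-token>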
I'm facing an issue with WebHDFS access on my Amazon EC2 machine. I have installed Hortonworks HDP 2.3, by the way.
I can retrieve the file status from my local machine in the browser (Chrome) with the following HTTP request:
http://<serverip>:50070/webhdfs/v1/user/admin/file.csv?op=GETFILESTATUS
This works fine, but if I try to open the file with ?op=OPEN, it redirects me to the private DNS of the machine, which I cannot access:
http://<privatedns>:50075/webhdfs/v1/user/admin/file.csv?op=OPEN&namenoderpcaddress=<privatedns>:8020&offset=0
I also tried to access WebHDFS from the AWS machine itself with this command:
[ec2-user@<ip> conf]$ curl -i http://localhost:50070/webhdfs/v1/user/admin/file.csv?op=GETFILESTATUS
curl: (7) couldn't connect to host
Does anyone know why I cannot connect to localhost or why the OPEN on my local machine does not work?
Unfortunately, I couldn't find any tutorial on configuring WebHDFS for an Amazon machine.
Thanks in advance.
What happens is that the NameNode redirects you to the DataNode. It seems like you installed a single-node cluster, but conceptually the NameNode and DataNode(s) are distinct, and in your configuration the DataNode(s) live/listen on the private side of your EC2 VPC.
You could reconfigure your cluster to host the DataNodes on the public IP/DNS (see HDFS Support for Multihomed Networks), but I would not go that way. I think the proper solution is to add a Knox gateway, which is a specialized component for accessing a private cluster through a public API. Specifically, you will have to configure the DataNode URLs; see Chapter 5, Mapping the Internal Nodes to External URLs. The example there seems spot on for your case:
For example, when uploading a file with WebHDFS service:
The external client sends a request to the gateway WebHDFS service.
The gateway proxies the request to WebHDFS using the service URL.
WebHDFS determines which DataNodes to create the file on and returns the path for the upload as a Location header in a HTTP redirect, which contains the datanode host information.
The gateway augments the routing policy based on the datanode hostname in the redirect by mapping it to the externally resolvable hostname.
The external client continues to upload the file through the gateway.
The gateway proxies the request to the datanode by using the augmented routing policy.
The datanode returns the status of the upload and the gateway again translates the information without exposing any internal cluster details.
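Once the gateway is in place, the client side stays plain WebHDFS, just pointed at Knox. A rough, hedged example for the read case in your question (the host name, the default gateway port 8443, and the topology name "default" all depend on your Knox setup):

# Read the file through Knox instead of hitting the NameNode/DataNode directly;
# -L follows the rewritten redirect, -k skips TLS verification for Knox's default self-signed certificate
curl -iku admin:admin-password -L "https://<knox-host>:8443/gateway/default/webhdfs/v1/user/admin/file.csv?op=OPEN"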
Newbie w/ etcd/zookeeper type services ...
I'm not quite sure how to handle cluster installation for etcd. Should the service be installed on each client or on a group of independent servers? I ask because if I'm on a client, how would I query the cluster? Every tutorial I've read shows a curl command running against localhost.
For an etcd cluster installation, you can install the service on independent servers and form a cluster. The cluster information can be queried by logging onto one of the machines and running curl, or remotely by specifying the IP address of one of the cluster member nodes.
For more information on how to set it up, follow this article.
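For example, a client machine can query any member of the cluster remotely over etcd's HTTP API; the member IP below is a placeholder, and 2379 is the default client port:

# Check the version of the member you reached
curl http://<member-ip>:2379/version

# Write and read a key through the v2 keys API
curl http://<member-ip>:2379/v2/keys/foo -XPUT -d value="bar"
curl http://<member-ip>:2379/v2/keys/foo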
I'm trying to use the Dedoop application, which runs on Hadoop and HDFS on Amazon EC2. The Hadoop cluster is set up, and the NameNode, JobTracker, and all other daemons are running without error.
But the Dedoop.war application is not able to connect to the Hadoop NameNode after being deployed on Tomcat.
I have also checked that the ports are open in EC2.
Any help is appreciated.
If you're using Amazon AWS, I highly recommend using Amazon Elastic MapReduce. Amazon takes care of setting up and provisioning the Hadoop cluster for you, including things like setting up IP addresses, the NameNode, etc.
If you're setting up your own cluster on EC2, you have to be careful with public/private IP addresses. Most likely, you are pointing to the external IP addresses - can you replace them with the internal IP addresses and see if that works?
Can you post some lines of the stack trace from Tomcat's log files?
Dedoop must establish a SOCKS proxy server (similar to ssh -D port username@host) to pass connections to the Hadoop nodes on EC2. This is mainly because Hadoop resolves public IPs to EC2-internal IPs, which breaks MR job submission and HDFS management.
To this end, Tomcat must be configured to establish SSH connections. The setup procedure is described here.
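As a rough sketch of what that tunnel looks like outside of Tomcat (the key file, user, host, and local port 6666 are placeholders), the idea is to open a dynamic SSH tunnel and point the Hadoop client at it through the SOCKS socket factory:

# Open a dynamic (SOCKS) tunnel to the EC2 master node and keep the session running
ssh -i my-key.pem -D 6666 ec2-user@<ec2-public-dns>

# Then, in the client's core-site.xml, route Hadoop RPC through the tunnel:
#   hadoop.rpc.socket.factory.class.default = org.apache.hadoop.net.SocksSocketFactory
#   hadoop.socks.server                     = localhost:6666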