hadoop api configuration on the client machine - hadoop

ultra-noob. I have a server machine with cdh3u1 pseudo-distrib, and a client machine with a java application using the cdh3u1 API.
How do I configure the client to talk to the server? I've been googling for hours and couldn't find where is the "client configuration" file. The "hdfs-default", "core-default" and "mapred-default" and their "-site" counterparts all look like server (namenode and datanode) config to me.
Is it just "multipurpose client server" config and I should cherry-pick the attributes in these files that are appropriate to the client? which are they? probably missing something big here...
Thanks, Ido

make sure that the client machine can access the hadoop server machine ip. If you use a virtualbox for the hadoop server (cdh3 vm), then add a "host-only" network interface (see details here: host-only networking with virtualbox. I'm assuming that your static ip for the hadoop server is 192.168.56.101 and that you're able to ping it from your client.
configure a hostname for your hadoop server machine in both the server and client machine. If you want to name your hadoop server "local-elephant", add the following line to /etc/hosts in both machines: 192.168.56.101 local-elephant.
in the server machine goto /etc/hadoop/conf change the values of the following properties from "localhost" to "local-elephant": in core-site.xml the value of fs.default.name and in mapred-site.xml the value of mapred.job.tracker.
in the client machine, create core-site.xml and mapred-site.xml in the classpath of your java application. In those files put only the fs.default.name and mapred.job.tracker properties.

Related

Hue UI is not accessible from a remote host

I'am trying to use Hue as a file browser for HDFS. So for that I have clone the hue repository and build the app with the following commands given in README.md of the hue repository.
git clone https://github.com/cloudera/hue.git
cd hue
make apps
build/env/bin/hue runserver
Hue UI is accessible in local machine using default port using the url http://localhost:8000 and everything works fine. But when I use my machine ip address http://x.x.x.x:8000 and try to access the Hue UI it keeps on processing and waiting.
Other observations -:
I can ping from remote machine to the host machine.
There is no firewall blocking the ports. (checked with nmap port scanner)
Machines are in same network.
I can access other ports for Hadoop NameNodes UI and DataNodes.
Changing the http_host in hue.ini doesn't affect the result
The ideal setup for Hue is configuring a reverse proxy (Nginx or Apache HTTP, for example)
However, you should refer to the Configuration documentation to externally run the server outside of 127.0.0.1
[desktop]
# Webserver listens on this address and port
http_host=0.0.0.0
http_port=8888
I was able to find a solution to the issue.. First hue run on a CherryPy web server so starting server by command build/env/bin/hue runserver will start the development server where hue.ini configuration is neglected.
So the correct command to start the production server after setting up correct configuration in hue.ini file is build/env/bin/hue runcpserver. Then I was able to access it using remote host without any problem. You also can use supervisor to start the production server. More information about that can be found here

cant ping hadoop hostname

I am trying to configure Hadoop cluster installed using Cent OS 6.x virtual machine,i have configured single node Hadoop cluster first to replicate and form the cluster from that later,but confused on configuration of static ip address for my virtual Hadoop cluster, my ifcfg-eth0 currently look like below,
DEVICE=eth0
TYPE=Ethernetle
UUID=892c57f5-17db-486d-b1b9-97efa8799bf0
ONBOOT=yes
NM_CONTROLLED=yes
BOOTPROTO=dhcp
HWADDR=00:0C:29:5C:04:D0
DEFROUTE=yes
PEERDNS=yes
PEERROUTES=yes
IPV4_FAILURE_FATAL=yes
IPV6INIT=no
NAME="System eth0"
can anyone help me to configure static address for my virtual Hadoop cluster, and also i am not able to ping any other host name other than localhost but can able to ping host address,anyone please help me to resolve this ping and static address issues.
A bridge network will help. In virtualbox select machine, settings>network
Also following is an example of the file you are editing.
/etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
ONBOOT=yes
PROTO=static
IPADDR=10.0.1.200
NETMASK=255.255.255.0
There are other requirements. Attached link might help
http://blog.cloudera.com/blog/2014/01/how-to-create-a-simple-hadoop-cluster-with-virtualbox/
For internet access files listed below needs to be modified.
/etc/resolv.conf
#Generated by NetworkManager
nameserver 8.8.8.8
nameserver 8.8.4.4
Also the file which you modified for static IP /etc/sysconfig/network-scripts/ifcfg-eth0, need following additions
DNS1=8.8.8.8
DNS2=8.8.4.4
ONBOOT=yes

Submit YARN jobs to remote Hadoop cluster via socks proxy

I am trying to access a firewalled Hadoop cluster running YARN via a SOCKS proxy. The cluster itself is not using proxied connections -- only my client running on a local machine (e.g. a laptop) is connected via ssh -D 9999 user#gateway-host to a machine that can see the Hadoop cluster.
In the Hadoop configuration core-site.xml (on my laptop) I have the following lines:
<property>
<name>hadoop.socks.server</name>
<value>localhost:9999</value>
</property>
<property>
<name>hadoop.rpc.socket.factory.class.default</name>
<value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>
Accessing HDFS this way works great. However, when I try to submit a YARN job, it fails and I can see in the logs that the nodes are not able to talk to each other:
java.io.IOException: Failed on local exception: java.net.SocketException: Connection refused; Host Details : local host is: "host1"; destination host is: "host2":8030;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
where host1 and host2 are both parts of the hadoop cluster.
I guess what is happening is that the hadoop nodes are trying to communicate via a socks proxy as well and this is obviously failing since no proxy server exists on each host. Is there a way to fix this apart from setting up a dedicated proxy server?
You are right, the Hadoop nodes must not use the SOCKS proxy for the communication. You can achieve that by marking the SocketFactory setting on the cluster side final.
In core-site.xml on the cluster, add the final tag to the default SocketFactory property:
<property>
<name>hadoop.rpc.socket.factory.class.default</name>
<value>org.apache.hadoop.net.StandardSocketFactory</value>
<final>true</final>
</property>
Obviously, you must restart cluster services.

CDH WebHDFS request redirects to local address on EC2

I am trying to setup an enviroment where I run some of my backend locally, and send requests to an EC2 instance from my local computer. I have CDH 4.5 setup, and it works OK. When I run the following request
curl --negotiate -i -L -u:hdfs http://ec2-xx-xx-xx-xx.eu-west-1.compute.amazonaws.com:50070/webhdfs/v1/tmp/test.txt?op=OPEN
This works from any EC2 instance in that region but does not work outside that. If I try locally it would return the following error
curl: (6) Could not resolve host: ip-xx-xx-xx-xx.eu-west-1.compute.internal
Not sure where I can set this not to redirect the call this way?
Many thanks
The easiest and fastest way to solve this problem is to configure your client hosts file to map the internal address to the external address.
WebHDFS uses the host name configured in hdfs-site.xml which is configured automatically by the Cloudera agent on that datanode. I don't know of a way to override the configured hostname for each datanode in CDH.

Weird DNS server causes Hadoop and HBase to malfunction

I have a network with some weird (as I understand) DNS server which causes Hadoop or HBase to malfunction.
It resolves my hostname to some address my machine doesn't know about (i.e. there is no such interface).
Hadoop does work if I have following entries in /etc/hosts:
127.0.0.1 localhost
127.0.1.1 myhostname
If entry "127.0.1.1 myhostname" is not present uploading file to HDFS fails and complains that it can replicate the file only to 0 datanodes instead of 1.
But in this case HBase does not work: creating a table from HBase shell causes NotAllMetaRegionsOnlineException (caused actually by HMaster trying to bind to wrong address returned by DNS server for myhostname).
In other network, I am using following /etc/hosts:
127.0.0.1 localhost
192.168.1.1 myhostname
And both Hadoop and HBase work.
The problem is that in second network the address is dynamic and I can't list it into /etc/hosts to override result returned by weird DNS.
Hadoop is run in pseudo-distributed mode. HBase also runs on single node.
Changing behavior of DNS server is not an option.
Changing "localhost" to 127.0.0.1 in hbase/conf/regionservers doesn't change anything.
Can somebody suggest a way how can I override its behavior while retaining internet connection (I actually work at client's machine through Teamviewer). Or some way to configure HBase (or Zookeeper it is managing) not to use hostname to determine address to bind?
Luckily, I've found the workaround to this DNS server problem.
DNS server returned invalid address when queried by local hostname.
HBase by default does reverse DNS lookup on local hostname to determine where to bind.
Because the address returned by DNS server was invalid, HMaster wasn't able to bind.
Workaround:
In hbase/conf/hbase-site.xml explicitly specify interfaces that will be used for master and regionserver:
<configuration>
<property>
<name>hbase.master.dns.interface</name>
<value>lo</value>
</property>
<property>
<name>hbase.regionserver.dns.interface</name>
<value>lo</value>
</property>
</configuration>
In this case, I specified loopback interface (lo) to be used for both master and regionserver.
a simple tool I wrote to check for DNS issues:
https://github.com/sujee/hadoop-dns-checker

Resources