How to connect mac to hadoop/hdfs cluster - hadoop

I have CDH for running in a cluster and I have ssh access to the machine. I need to connect my Mac to Cluster, so if I do hadoop fs -ls , it should show me the content of the cluster.
I have configured HADOOP_CONF to point to the configuration of the cluster. I am running CDH4 in my cluster. Am I missing something here , Is it possible to connect ?
Is there some ssh key setup that I need to do ?

There are a few of things you will need to ensure to do this:
You need to set your HADOOP_CONF_DIR environment variable to point to a directory that carries config XMLs that point to your cluster.
Your Mac should be able to directly access the hosts that form your cluster (all of them). This can be done via VPN, for example - if the cluster is secured from external networks.
Your Mac should carry the same version of Hadoop that the cluster runs.

Related

How to connect Apache Spark with Yarn from the SparkContext?

I have developed a Spark application in Java using Eclipse.
So far, I am using the standalone mode by configuring the master's address to 'local[*]'.
Now I want to deploy this application on a Yarn cluster.
The only official documentation I found is http://spark.apache.org/docs/latest/running-on-yarn.html
Unlike the documentation for deploying on a mesos cluster or in standalone (http://spark.apache.org/docs/latest/running-on-mesos.html), there is not any URL to use within SparkContext for the master's adress.
Apparently, I have to use line commands to deploy spark on Yarn.
Do you know if there is a way to configure the master's adress in the SparkContext like the standalone and mesos mode?
There actually is a URL.
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager
You should have at least hdfs-site.xml, yarn-site.xml, and core-site.xml files that specify all the settings and URLs for the Hadoop cluster you connect to.
Some properties from yarn-site.xml include yarn.nodemanager.hostname and yarn.nodemanager.address.
Since the address has a default of ${yarn.nodemanager.hostname}:0, you may only need to set the hostname.

Hadoop client node installation

I have 12 node cluster. Its Hardware information are :
NameNode : CPU Core i3 2.7 Ghz | 8GB RAM | 500 GB HDD
DataNode : CPU Core i3 2.7 Ghz | 2GB RAM | 500 GB HDD
I have installed the hadoop 2.7.2. I am using normal hadoop installation process on ubuntu and it work fine. But I want to add client machine.and I have no such clue that how to add client machine.
Question :
Installing process of Client machine. ?
How to run any script of pig/hive on that client machine ?
Client should have same copy of Hadoop Distribution and configuration which is present at Namenode then Only Client will come to know on which node Job tracker/Resourcemanager is running, and IP of Namenode to access HDFS data.
Also you need to update /etc/hosts of client machine with IP addresses and hostnames of namenode and datanode.
Note that, you shouldn’t start any hadoop service on client machine.
Steps to follow on client machine:
create an user account on the cluster, say user1
create an account on client machine with the same name: user1
configure client machine to access the cluster machines (ssh w\out passphrase i.e, password less login)
copy/get a hadoop distribution same as cluster to client machine and extract it to /home/user1/hadoop-2.x.x
copy(or Edit) the hadoop configuration files (*-site.xml) from Namenode of the cluster - from this client will know where the Namenode/resourcemanager is running.
Set environment variables: JAVA_HOME, HADOOP_HOME (/home/user1/hadoop-2.x.x)
Set hadoop bin to your path: export PATH=$HADOOP_HOME/bin:$PATH
test it out: hadoop fs -ls / which should list the root directory of the cluster hdfs.
you may face some issues like privileges, may need to set JAVA_HOME places like conf/hadoop-env.sh on client machine. update/comment any error you get.
Answers to more questions from comments:
How to load data from client node to hdfs ? - Just run hadoop fs commands from client machine: hadoop fs -put /home/user1/data/* /user/user1/data - you can also write shell-scripts that would run these command(s) if you need to run them many times.
Why I am installing hadoop on the client if we only use ssh to connect remotely to the master node ?
because client need to communicate with cluster, and need to know
where cluster nodes are.
client will be running hadoop jobs
like hadoop fs commands, hive queries, hadoop jar commnads, spark
jobs, developing mapreduce jobs etc for which client will need
hadoop binaries on client node.
Basically you are not only using the ssh to
connect, but you are performing some operations on hadoop cluster from
client node so you would need hadoop binaries.
ssh is used by
hadoop binaries on client node, when you run such operations like hadoop fs
-ls/ from client node to cluster. (remember adding $HADOOP_HOME/bin to PATH as part of installation process above)
when you are saying "we only use ssh" - that sounds to me like when you want to make changes/access hadoop configuration files from cluster you are connecting using ssh to cluster nodes - you do this as part of administrative work but when you need to run hadoop commands/jobs against cluster from client node you dont need to ssh manually - hadoop installation on client node will take care of it.
with out hadoop instalations how can you run hadoop commands/jobs/queries from client node to cluster?
3. should user name 'user1' must be same ? what if it is different ? - it will work. you can install hadoop on client node under group user say: qa or dev, and all users on client node as sudo under that group. than when user1 on client node need to run any hadoop job on cluster: user1 should be able to sudo -i -u qa and then run hadoop command from it.

hadoop 2 node cluster setup

Can anyone please tell how to create a hadoop 2 node cluster ?i have 2 Virual machines both have hadoop single node setup installed properly,please tell how to form 2 node cluster from that ,and mention proper network setup for that.Thanks in Advance.i am using hadoop -1.0.3 version.
Hi can you do one thing for network configuration of vms.
Make sure of the network connect of virtual machine setting of hostonly.
And add two hostname and ips in /etc/hosts file.
add two hostnames in $HADOOP_HOME/conf/slaves file.
Make sure of secondary namenode hostname in $HADOOP_HOME/conf/masters file.
Then make sure same configuration on two machine.
hadoop namenode -format
start-all.sh
Best of luck.

Hadoop fuse on multinode

I need to use hadoop fuse to mount HDFS on a multi-node cluster. How can I achieve that?
I have successfully deployed fuse on a single-node cluster, but I doubt it would work on multi-node. Can anyone please throw light over this ?
It doesn't matter, whether your cluster is single node or multinode. If you want to mount HDFS on a remote machine, make sure that particular machine has access to cluster network. Setup a hadoop client(with the same hadoop version in cluster) in the node in which you are planning to mount HDFS using FUSE.
The difference while mounting is namenode url.
(dfs://NAMENODEHOST:NN-IPC-PORT/)
In case of single node namenode url would be localhost(0.0.0.0/127.0.0.1/0), but in multinode cluster you have to give namenode Hostname/Ip address instead of localhost. It's possible to mount hdfs in any linux machines which can access hadoop cluster.
Trying to use Fuse to mount HDFS. Can't compile libhdfs

Multi Node Cluster Hadoop Setup

Pseudo-Distributed single node cluster implementation
I am using window 7 with CYGWIN and installed hadoop-1.0.3 successfully. I still start services job tracker,task tracker and namenode on port (localhost:50030,localhost:50060 and localhost:50070).I have completed single node implementation.
Now I want to implement Pseudo-Distributed multiple node cluster . I don't understand how to divide in master and slave system through network ips?
For your ssh problem just follow the link of single node cluster :
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
and yes, you need to specify the ip's of master and slave in conf file
for that you can refer this url :
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
I hope this helps.
Try to create number of VM you want to add in your cluster.Make sure that those VM are having the same hadoop version .
Figure out the IPs of each VM.
you will find files named master and slaves in $HADOOP_HOME/conf mention the IP of VM to conf/master file which you want to treat as master and and do the same with conf/slaves
with slave nodes IP.
Make sure these nodes are having Passwordless-ssh connection.
Format your namenode and then run start-all.sh.
Thanks,

Resources