How to use a remote Hadoop cluster

I have a Hadoop cluster deployed, and the client MapReduce program is running on another machine. How can I use that cluster?

If your jars are on a client machine, install the hadoop-client packages on that machine and put the cluster's configuration files in its conf folder. You can then submit jobs from the client machine to the remote cluster.
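As a rough illustration of what those configuration details amount to, the sketch below uses Python to render a minimal core-site.xml. The helper name build_site_xml and the NameNode host and port are placeholders, not values from the question. In practice you would normally copy the real files from the cluster rather than generate them.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Render a Hadoop *-site.xml document from a dict of property names to values."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical NameNode address; replace it with your cluster's real value.
core_site = build_site_xml({"fs.defaultFS": "hdfs://namenode.example.com:8020"})
print(core_site)
```

Saving that text as core-site.xml in the client's conf folder is what tells the client which HDFS to talk to.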

Related

Configure hadoop-client to connect to hadoop in other machine/server

On server A I have Hadoop and Python scripts that perform tasks on Hadoop.
On server B I have Hive/Hadoop.
Is it possible to configure hadoop-client on server A to connect to Hadoop on server B?
It's not clear which Python library you are using, but assuming PySpark, you can copy the cluster's configuration to your client machine and point HADOOP_CONF_DIR at it. The client can then communicate with any external Hadoop system.
At the very least, you'll need a core-site.xml to communicate with HDFS and a hive-site.xml to communicate with Hive.
If you are using the PyHive library, you connect directly to HiveServer2, for example user@hiveserver2:10000 (10000 is the default HiveServer2 port).

Hadoop client node installation

I have a 12-node cluster with the following hardware:
NameNode: CPU Core i3 2.7 GHz | 8 GB RAM | 500 GB HDD
DataNode: CPU Core i3 2.7 GHz | 2 GB RAM | 500 GB HDD
I have installed Hadoop 2.7.2 using the normal installation process on Ubuntu, and it works fine. Now I want to add a client machine, but I have no idea how to do that.
Questions:
What is the installation process for a client machine?
How do I run Pig/Hive scripts on that client machine?
The client should have the same Hadoop distribution and configuration as the NameNode. Only then will the client know which node the JobTracker/ResourceManager is running on and the address of the NameNode for accessing HDFS data.
You also need to update /etc/hosts on the client machine with the IP addresses and hostnames of the NameNode and DataNodes.
Note that you shouldn't start any Hadoop service on the client machine.
Steps to follow on the client machine:
1. Create a user account on the cluster, say user1.
2. Create an account on the client machine with the same name: user1.
3. Configure the client machine to access the cluster machines (ssh without a passphrase, i.e. passwordless login).
4. Copy the same Hadoop distribution that the cluster uses to the client machine and extract it to /home/user1/hadoop-2.x.x.
5. Copy (or edit) the Hadoop configuration files (*-site.xml) from the NameNode of the cluster. This is how the client knows where the NameNode/ResourceManager is running.
6. Set the environment variables JAVA_HOME and HADOOP_HOME (/home/user1/hadoop-2.x.x).
7. Add the Hadoop bin directory to your path: export PATH=$HADOOP_HOME/bin:$PATH
8. Test it: hadoop fs -ls / should list the root directory of the cluster's HDFS.
You may run into issues such as permission errors, or you may need to set JAVA_HOME in places like conf/hadoop-env.sh on the client machine. Comment with any error you get.
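The environment setup and the final test from the steps above can be sketched in Python, in case you want to drive the client from a script. This is only an illustration: the function names client_env and list_hdfs_root are made up here, and the Java path in the usage note is an assumption.

```python
import os
import subprocess


def client_env(hadoop_home, java_home, base_env=None):
    """Return a copy of the environment with the Hadoop client variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["JAVA_HOME"] = java_home
    env["HADOOP_HOME"] = hadoop_home
    # Put $HADOOP_HOME/bin first, mirroring: export PATH=$HADOOP_HOME/bin:$PATH
    hadoop_bin = os.path.join(hadoop_home, "bin")
    env["PATH"] = hadoop_bin + os.pathsep + env.get("PATH", "")
    return env


def list_hdfs_root(env):
    """Run `hadoop fs -ls /` and return its output; raises if the command fails."""
    hadoop = os.path.join(env["HADOOP_HOME"], "bin", "hadoop")
    result = subprocess.run(
        [hadoop, "fs", "-ls", "/"],
        env=env, check=True, capture_output=True, text=True,
    )
    return result.stdout
```

For example, list_hdfs_root(client_env("/home/user1/hadoop-2.x.x", "/usr/lib/jvm/default-java")) should return the same listing as running the command by hand, provided the *-site.xml files are already in place.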
Answers to more questions from comments:
1. How to load data from the client node into HDFS? - Just run hadoop fs commands from the client machine: hadoop fs -put /home/user1/data/* /user/user1/data. You can also write shell scripts that run these commands if you need to run them many times.
2. Why am I installing Hadoop on the client if we only use ssh to connect remotely to the master node?
Because the client needs to communicate with the cluster and needs to know where the cluster nodes are. The client will be running Hadoop jobs such as hadoop fs commands, Hive queries, hadoop jar commands, Spark jobs, and MapReduce jobs under development, all of which need the Hadoop binaries on the client node.
Basically, you are not only using ssh to connect; you are performing operations on the Hadoop cluster from the client node, so you need the Hadoop binaries there.
When you run an operation such as hadoop fs -ls / from the client node, the Hadoop binaries open the connections to the cluster themselves, over Hadoop's own RPC protocols rather than ssh, using the addresses in the configuration files. (Remember to add $HADOOP_HOME/bin to PATH as part of the installation process above.)
When you say "we only use ssh", that sounds to me like connecting to the cluster nodes over ssh to change or inspect the Hadoop configuration files. You do that as part of administrative work. When you need to run Hadoop commands or jobs against the cluster from the client node, you don't need to ssh manually; the Hadoop installation on the client node takes care of the connection.
Without a Hadoop installation, how could you run Hadoop commands, jobs, or queries from the client node against the cluster?
3. Must the user name 'user1' be the same? What if it is different? - It will still work. You can install Hadoop on the client node under a group account, say qa or dev, and give all users on the client node sudo access to that account. Then, when user1 on the client node needs to run a Hadoop job on the cluster, user1 can run sudo -i -u qa and then run the hadoop command from that session.
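For the repeated uploads mentioned in the first answer above, a small Python wrapper is one alternative to a shell script. This is a sketch only: build_put_command and upload are hypothetical helper names, and it assumes hadoop is already on the PATH of the client machine.

```python
import glob
import subprocess


def build_put_command(local_pattern, hdfs_dir):
    """Expand a local glob and build the matching `hadoop fs -put` argument list."""
    local_files = sorted(glob.glob(local_pattern))
    if not local_files:
        raise FileNotFoundError(f"no local files match {local_pattern!r}")
    # hadoop fs -put accepts several local sources followed by one destination.
    return ["hadoop", "fs", "-put", *local_files, hdfs_dir]


def upload(local_pattern, hdfs_dir):
    """Run the upload; check=True raises CalledProcessError if hadoop exits non-zero."""
    subprocess.run(build_put_command(local_pattern, hdfs_dir), check=True)
```

Calling upload("/home/user1/data/*", "/user/user1/data") then corresponds to the hadoop fs -put command shown in that answer, with the glob expanded in Python instead of by the shell.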

How to run hadoop balancer from client node?

How can I run the Hadoop balancer? I've tried running the hadoop balancer command on the NameNode, but it had no effect at all (my new DataNode is still empty). I also read that the balancer is run not on the NameNode but on a client node. So what is the client node, how can I configure it, and how can the client node access the Hadoop file system?
Thanks all, any suggestions are appreciated.
A client node is also known as an edge node. Usually not all developers in an organization have access to every node in the cluster, so a client node is set up for developers to access the cluster. You need to install the hadoop-client packages on the client node. If you are using a Cloudera RPM-based installation, you can use the command below.
sudo yum install hadoop-client
After installing the client node, update your configuration files such as core-site.xml, hdfs-site.xml, and any other required files. Now when you execute Hadoop CLI commands, they will be executed against the cluster.
The balancer can be run from any node in the cluster, whether that is a client machine or any other node.
sudo -u hdfs hdfs balancer
Regarding the newly added DataNode, check in the NameNode web UI whether your node has been added. If you can see it there, check the logs.

submit hadoop job on cloudera

I am wondering if we can set up a Cloudera cluster on Amazon and kick off a Hadoop job from my local Linux machine without ssh-ing into one of Amazon's nodes.
Is there anything like a client to do this communication?
The tips from the following tutorial really work. You should be able to put up a working Hadoop cluster in under 20 minutes, from cold iron to production ready, using just its guidance:
Hadoop Quickstart: Build a Cluster In The Cloud In 20 Minutes
It is really worth checking out.
You can install a Hadoop client on your local Linux machine and use the "hadoop jar" command with your own jar. Specify the option mapred.job.tracker on the command line, and the client will push your jar to the JobTracker and copy it to all the TaskTrackers that will be used for this job.
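A sketch of what such a command line could look like, built in Python so it can be reused from a script. The jar name, main class, host, port, and paths are all made up for illustration. Note also that -D generic options are only honoured when the job's main class parses them (for example via ToolRunner), and that mapred.job.tracker applies to JobTracker-based (MRv1) clusters.

```python
def build_job_command(jar_path, main_class, job_tracker, job_args):
    """Build a `hadoop jar` argument list that points the client at a remote JobTracker."""
    return [
        "hadoop", "jar", jar_path, main_class,
        # Generic -D options must come before the job's own arguments.
        "-D", f"mapred.job.tracker={job_tracker}",
        *job_args,
    ]


# All values below are hypothetical placeholders.
cmd = build_job_command(
    "wordcount.jar",
    "com.example.WordCount",
    "jobtracker.example.com:8021",
    ["/input", "/output"],
)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where the Hadoop client is installed.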

How to connect mac to hadoop/hdfs cluster

I have CDH running in a cluster and I have ssh access to the machine. I need to connect my Mac to the cluster, so that running hadoop fs -ls shows me the contents of the cluster.
I have configured HADOOP_CONF to point to the configuration of the cluster. I am running CDH4 in my cluster. Am I missing something here? Is it possible to connect?
Is there some ssh key setup that I need to do?
There are a few things you need to ensure to do this:
You need to set your HADOOP_CONF_DIR environment variable to point to a directory that carries config XMLs that point to your cluster.
Your Mac should be able to directly access the hosts that form your cluster (all of them). This can be done via VPN, for example - if the cluster is secured from external networks.
Your Mac should carry the same version of Hadoop that the cluster runs.
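One quick way to sanity-check the first point is to read back which filesystem the config directory actually names. The Python sketch below does that. The function name read_default_fs is hypothetical, and it only looks at core-site.xml, accepting both fs.defaultFS and the older fs.default.name spelling.

```python
import os
import xml.etree.ElementTree as ET


def read_default_fs(conf_dir):
    """Return the filesystem URI configured in conf_dir/core-site.xml, or None."""
    tree = ET.parse(os.path.join(conf_dir, "core-site.xml"))
    for prop in tree.getroot().iter("property"):
        name = (prop.findtext("name") or "").strip()
        # fs.default.name is the older, deprecated spelling of fs.defaultFS.
        if name in ("fs.defaultFS", "fs.default.name"):
            return (prop.findtext("value") or "").strip()
    return None
```

Running read_default_fs(os.environ["HADOOP_CONF_DIR"]) on the Mac should return the cluster's NameNode URI. If it returns None or raises, the variable is probably not pointing at the right directory.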
