Hadoop Distributed Cluster Machines - hadoop

Assume that there is a Hadoop Cluster that has 20 machines.
Out of those 20 machines 18 machines are slaves and machine 19 is for NameNode and machine 20 is for JobTracker.
My question is: In which machine out of these 20 machines do I need to install hadoop software?
Do I need to install Hadoop in all those 20 machines?

You need to install hadoop in all machines , u just need to make suitable changes in configuration files of namenode and datanode. You can refer Michall nolls Multi-node installation for the same.

Related

Hadoop client node installation

I have 12 node cluster. Its Hardware information are :
NameNode : CPU Core i3 2.7 Ghz | 8GB RAM | 500 GB HDD
DataNode : CPU Core i3 2.7 Ghz | 2GB RAM | 500 GB HDD
I have installed the hadoop 2.7.2. I am using normal hadoop installation process on ubuntu and it work fine. But I want to add client machine.and I have no such clue that how to add client machine.
Question :
Installing process of Client machine. ?
How to run any script of pig/hive on that client machine ?
Client should have same copy of Hadoop Distribution and configuration which is present at Namenode then Only Client will come to know on which node Job tracker/Resourcemanager is running, and IP of Namenode to access HDFS data.
Also you need to update /etc/hosts of client machine with IP addresses and hostnames of namenode and datanode.
Note that, you shouldn’t start any hadoop service on client machine.
Steps to follow on client machine:
create an user account on the cluster, say user1
create an account on client machine with the same name: user1
configure client machine to access the cluster machines (ssh w\out passphrase i.e, password less login)
copy/get a hadoop distribution same as cluster to client machine and extract it to /home/user1/hadoop-2.x.x
copy(or Edit) the hadoop configuration files (*-site.xml) from Namenode of the cluster - from this client will know where the Namenode/resourcemanager is running.
Set environment variables: JAVA_HOME, HADOOP_HOME (/home/user1/hadoop-2.x.x)
Set hadoop bin to your path: export PATH=$HADOOP_HOME/bin:$PATH
test it out: hadoop fs -ls / which should list the root directory of the cluster hdfs.
you may face some issues like privileges, may need to set JAVA_HOME places like conf/hadoop-env.sh on client machine. update/comment any error you get.
Answers to more questions from comments:
How to load data from client node to hdfs ? - Just run hadoop fs commands from client machine: hadoop fs -put /home/user1/data/* /user/user1/data - you can also write shell-scripts that would run these command(s) if you need to run them many times.
Why I am installing hadoop on the client if we only use ssh to connect remotely to the master node ?
because client need to communicate with cluster, and need to know
where cluster nodes are.
client will be running hadoop jobs
like hadoop fs commands, hive queries, hadoop jar commnads, spark
jobs, developing mapreduce jobs etc for which client will need
hadoop binaries on client node.
Basically you are not only using the ssh to
connect, but you are performing some operations on hadoop cluster from
client node so you would need hadoop binaries.
ssh is used by
hadoop binaries on client node, when you run such operations like hadoop fs
-ls/ from client node to cluster. (remember adding $HADOOP_HOME/bin to PATH as part of installation process above)
when you are saying "we only use ssh" - that sounds to me like when you want to make changes/access hadoop configuration files from cluster you are connecting using ssh to cluster nodes - you do this as part of administrative work but when you need to run hadoop commands/jobs against cluster from client node you dont need to ssh manually - hadoop installation on client node will take care of it.
with out hadoop instalations how can you run hadoop commands/jobs/queries from client node to cluster?
3. should user name 'user1' must be same ? what if it is different ? - it will work. you can install hadoop on client node under group user say: qa or dev, and all users on client node as sudo under that group. than when user1 on client node need to run any hadoop job on cluster: user1 should be able to sudo -i -u qa and then run hadoop command from it.

How to install impala on an already running hadoop cluster

I have an already up and running Hadoop, 5-node cluster. I want to install Impala on the HDFS cluster without the Cloudera Manager. Can anyone supposedly guide me through the process or a link of the same.
Thanks.

Differences : Single-node and Multi-node

I'm trying to install Hadoop in a virtual machine, I found a tutorial explaining how to do that in a multi-node cluster .
So my question is what's the difference between a single-node and a multi-node cluster ?
Thanks in advance :)
Single node cluster : By default, Hadoop is configured to run in a non-distributed or standalone mode, as a single Java process. There are no daemons running and everything runs in a single JVM instance. HDFS is not used.
Pseudo-distributed or multi-node cluster: The Hadoop daemons run on a local machine, thus simulating a cluster on a small scale. Different Hadoop daemons run in different JVM instances, but on a single machine. HDFS is used instead of local FS
And you can have your work environment set up as follow
Step 1 - Download VMware player latest version and install on your laptop/desktop. You can also install VMware tools, which will be very useful for your working with guest OS.
Step 2 - Once you Step 1 is completed then Download Cloudera Quick Start VM from
http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=F6mO278Rvo
Step 3 - Open VMPlayer program and click on “Open a virtual Machine”. and go to directory “cloudera-quickstart-vm-4.4.0-1-vmware”. Select cloudera-quickstart-vm-4.4.0-1-vmware. This will create a virtual machine instance in VM Player.
Step 4 - Start the Cloudera VM by Clicking on power on to start the Cloudera Demo VM.
You are good to go
Good luck

hadoop: different datanodes configuration in shared directory

I try to run hadoop in clustered server machines.
But, problem is server machines uses shared directories, but file directories are not physically in one disk. So, I guess if I configure different datanode direcotry in each machine (slave), I can run hadoop without disk/storage bottleneck.
How do I configure datanode differently in each slave or
How do I configure setup for master node to find hadoop that are installed in different directory in slave node when starting namenode and datanodes using "start-dfs.sh" ?
Or, is there some fancy way for this environment?
Thanks!

Hadoop 2 node cluster

Is it necessary to have 2 linux machine for configuring two node hadoop cluster(ie 1 master 1 slave).Can it be done on single machine
Yes, you can run in local mode or pseudo-distributed mode, you don't necessarily need an actual cluster of machines.
See: http://hadoop.apache.org/docs/stable/single_node_setup.html

Resources