How to administer a Hadoop cluster - hadoop

I have a running 4-node Hadoop cluster, and I am asking whether there is any way to administer that cluster remotely,
for example,
administering the cluster from my laptop for:
executing MapReduce tasks
disabling or enabling data nodes
Is there any way to do that remotely?

If you're using the Cloudera distribution, the Cloudera Manager webapp would let you do that.
Other distributions may have similar control apps. That would give you per-node control.
For executing MR tasks, you would normally submit the job from an external node anyway, pointing it at the correct JobTracker and NameNode. So I'm not sure what else you're asking for there.
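For illustration, here is a minimal sketch of such an external submission in Java, assuming an MR1-style cluster with a JobTracker (the hostnames, ports and paths are placeholders, not part of the original answer):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoteJobSubmit {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the remote cluster. These hostnames and ports
        // are placeholders -- substitute your own NameNode and JobTracker
        // addresses (this assumes an MR1/JobTracker cluster).
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        conf.set("mapred.job.tracker", "jobtracker.example.com:8021");

        // With no mapper/reducer set, this runs as an identity pass-through
        // job; plug in your own classes via job.setMapperClass(...) etc.
        Job job = Job.getInstance(conf, "remote-test-job");
        job.setJarByClass(RemoteJobSubmit.class);
        FileInputFormat.addInputPath(job, new Path("/user/me/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/me/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```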

Related

What is the best way to test Hadoop?

I have completed a Hadoop cluster setup with 3 journal nodes for QJM, 4 datanodes, 2 namenodes and 3 ZooKeeper servers, but I need to confirm whether connectivity has been established between them, so I am searching for a tool which can perform the following tasks:
1) Check which namenode is currently in the active state
2) Check whether both namenodes are communicating with each other successfully
3) Check whether all journal nodes are communicating with each other successfully
4) Check whether all ZooKeeper servers are communicating with each other successfully
5) Check which ZooKeeper server is currently playing the leader role
Is there any tool or command available to check the above?
Can anyone please help me solve this?
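As a partial answer without extra tooling: item 1) can be checked with the stock hdfs haadmin -getServiceState <serviceId> command on an HA cluster, and items 4) and 5) can be probed with ZooKeeper's built-in four-letter commands. A minimal Java sketch of the latter, assuming the default client port 2181 (the hostnames are placeholders):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Socket;

// Probes each ZooKeeper server with the built-in four-letter commands:
// "ruok" answers "imok" if the server is up, and "stat" reports
// "Mode: leader" or "Mode: follower". Hostnames are placeholders.
public class ZkQuorumCheck {
    static String fourLetter(String host, int port, String cmd) throws Exception {
        try (Socket s = new Socket(host, port)) {
            s.getOutputStream().write(cmd.getBytes("US-ASCII"));
            s.getOutputStream().flush();
            BufferedReader in =
                new BufferedReader(new InputStreamReader(s.getInputStream()));
            StringBuilder sb = new StringBuilder();
            for (String line; (line = in.readLine()) != null; )
                sb.append(line).append('\n');
            return sb.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        String[] hosts = {"zk1.example.com", "zk2.example.com", "zk3.example.com"};
        for (String h : hosts) {
            System.out.println(h + ": " + fourLetter(h, 2181, "ruok").trim());
            for (String line : fourLetter(h, 2181, "stat").split("\n"))
                if (line.startsWith("Mode:"))
                    System.out.println(h + ": " + line);
        }
    }
}
```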
Using Ambari you can monitor the complete cluster's performance and health.
Also, if you want to validate your Hadoop jobs programmatically,
you can use the concept of Counters in Hadoop, as sketched below.
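For illustration, a minimal sketch of that approach: a mapper counts the records it considers bad, and the driver reads the aggregated counter after the job finishes to decide whether the run is valid (the group and counter names here are just examples):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ValidatingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().trim().isEmpty()) {
            // Custom counters are aggregated across all tasks by the framework.
            context.getCounter("Validation", "BAD_RECORDS").increment(1);
            return;
        }
        context.write(new Text(value.toString()), new LongWritable(1));
    }

    // In the driver, after job.waitForCompletion(true), read the aggregated
    // counter back to decide whether the run should be trusted.
    public static boolean jobLooksValid(Job job) throws IOException {
        long bad = job.getCounters()
                      .findCounter("Validation", "BAD_RECORDS").getValue();
        return bad == 0;
    }
}
```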

Can the Hortonworks sandbox be used as a single-node Hadoop cluster?

I would like to study Hadoop multi-node setup and installation. From the tutorial below, I understand that a single-node cluster environment can be used as a node of the multi-node cluster:
http://bigdatahandler.com/hadoop-hdfs/hadoop-multi-node-cluster-setup/
Currently I am learning Hadoop using the Hortonworks sandbox. Can we use a sandbox system as a single-node environment?
If not, what is the difference between the sandbox and a traditional Hadoop cluster installation?
The sandbox images (from Hortonworks and Cloudera) provide the user with a pre-configured development environment with all the usual tools already available and installed (Pig, Hive, etc.). Since the image is a single "system", it is set up such that the Hadoop cluster is single-node: i.e. everything - HDFS, Hadoop MapReduce, etc. - is local to that image. That is a massive benefit, as anyone who has set up a Hadoop cluster will tell you! It allows you to get up and running with very little operational overhead.
What these sandboxes do not provide, however, is realistic cluster behaviour, as you have only one node. But there are other possibilities - tools such as Vagrant and Docker - that would allow you to do this (I have not tried it myself).
The Big Data Handler link you shared seems to be about combining several of these standalone, inherently single-node "clusters" so that you have something more realistic. But I would guess that setting this up so that YARN, ZooKeeper and other services are not duplicated comes with a not-insignificant challenge.

Cloudera installation Doubts?

I am new to Cloudera. I installed Cloudera on my system successfully, and I have two doubts:
1. Consider a machine with some nodes already using Hadoop with some data. Can we install Cloudera to use the existing Hadoop without making any changes or modifications to the data stored on the existing Hadoop?
2. I installed Cloudera on my machine, and I have another three machines to add to the cluster. Do I need to install Cloudera on those three machines before adding them to the cluster, or can we add a node to the cluster without installing Cloudera on that particular node?
Thanks in advance. Can anyone please give some information about the above questions?
Answers to your questions:
1. If you want to migrate to CDH from an existing Apache distribution, you can follow this link.
Excerpt:
Overview
The migration process does require a moderate understanding of Linux
system administration. You should make a plan before you start. You
will be restarting some critical services such as the name node and
job tracker, so some downtime is necessary. Given the value of the
data on your cluster, you’ll also want to be careful to take recent
back ups of any mission-critical data sets as well as the name node
meta-data.
Backing up your data is most important if you’re upgrading from a
version of Hadoop based on an Apache Software Foundation release
earlier than 0.20.
2. The CDH binaries need to be installed and configured on all the nodes to have a CDH-based cluster up and running.
From the Cloudera Manual
You can migrate the data from a CDH3 (or any Apache Hadoop) cluster to a CDH4 cluster by
using a tool that copies out data in parallel, such as the DistCp tool
offered in CDH4.
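For reference, a hedged sketch of driving that parallel copy from Java by shelling out to the distcp command (the NameNode addresses and paths are placeholders; if the two clusters' HDFS versions are not wire-compatible, the source is typically read over hftp:// instead):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Shells out to the DistCp tool to copy /data from the old cluster to the
// new one in parallel. NameNode addresses and paths are placeholders.
public class DistCpRunner {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
                "hadoop", "distcp",
                "hdfs://old-namenode.example.com:8020/data",
                "hdfs://new-namenode.example.com:8020/data");
        pb.redirectErrorStream(true);
        Process p = pb.start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            for (String line; (line = r.readLine()) != null; )
                System.out.println(line);
        }
        System.exit(p.waitFor());
    }
}
```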
Regarding your second question, again from the manual page:
As a general rule:
The NameNode and JobTracker run on the same "master" host unless
the cluster is large (more than a few tens of nodes), and the master
host (or hosts) should not
run the Secondary NameNode (if used), DataNode or TaskTracker
services. In a large cluster, it is especially important that the
Secondary NameNode (if used) runs on a separate machine from the
NameNode. Each node in the cluster except the master host(s) should
run the DataNode and TaskTracker services.
Additionally, if you use Cloudera Manager, it will automatically do all the setup necessary, i.e. install the selected components on the nodes in the cluster.
Off-topic: I had a bad habit of not consulting the manual properly. Have a good look at it; it answers all of these questions.
Answer to your second question:
You can add them directly, after installing a few prerequisites such as openssh-clients and Java, and opening up the firewall.
These machines (the existing node and the three new nodes) should accept the same username and password, or you should set up passwordless SSH to these hosts.
You should be connected to the internet while adding the nodes.
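As a quick pre-flight check for those prerequisites, a small sketch that verifies the SSH port on each new host is reachable through the firewall (the hostnames are placeholders):

```java
import java.net.InetSocketAddress;
import java.net.Socket;

// Pre-flight check before adding nodes: verifies that the SSH port on each
// new host is reachable through the firewall. Hostnames are placeholders.
public class SshReachability {
    public static void main(String[] args) {
        String[] hosts = {"node1.example.com", "node2.example.com", "node3.example.com"};
        for (String host : hosts) {
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress(host, 22), 3000); // 3 s timeout
                System.out.println(host + ": SSH port reachable");
            } catch (Exception e) {
                System.out.println(host + ": NOT reachable (" + e.getMessage() + ")");
            }
        }
    }
}
```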
I hope this helps you :)

Is it possible to add a node automatically while a Hadoop application is running?

I'm a beginner programmer and a Hadoop learner.
I'm testing Hadoop's fully distributed mode using 5 PCs (each with a dual-core CPU and 2 GB of RAM).
Before starting the map tasks and HDFS, I knew that I had to configure the files (etc/hosts with the IPs and hostnames, and the masters and slaves files under hadoop folder/conf), so I finished configuring those files.
Then, during a debate at a seminar in my company, my boss and chief insisted that even while a Hadoop application is in the running state, if Hadoop needs more nodes, it will add them automatically.
Is that possible? When I studied Hadoop clustering, many Hadoop books and community sites insisted that after configuration, once the application is running, we can't add more nodes to the cluster.
But my boss told me that Amazon says adding nodes to a running application is possible.
Is that really true?
Hadoop master users of the Stack Overflow community, please tell me the truth in detail.
Yes, it indeed is possible.
Here is the explanation in Hadoop's wiki.
Also, Amazon's EMR enables one to add hundreds of nodes on the fly to an already running cluster, and as soon as the machines are up, they are delegated tasks (unstarted mapper and/or reducer tasks) by the master.
So, yes, it is very much possible, is in use, and is not just theory.
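As a small illustration of what that looks like from the client side, here is a hedged Java sketch that asks the NameNode for its current list of live datanodes, so you can confirm that a node added while the cluster was running has registered (the NameNode address is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

// Asks the NameNode for its current list of live datanodes, so you can
// confirm that a node started while the cluster was running has registered.
public class ListDataNodes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address -- substitute your own NameNode's.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(conf);
        for (DatanodeInfo dn : dfs.getDataNodeStats()) {
            System.out.println(dn.getHostName()
                + " remaining=" + dn.getRemaining() + " bytes");
        }
    }
}
```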

Cloudera cluster node roles

I need to run a simple benchmark test on my Cloudera CDH4 cluster setup.
My Cloudera cluster setup (CDH4) has 4 nodes: A, B, C and D.
I am using the Cloudera Manager FREE edition to manage the Cloudera services.
Each node is configured to perform multiple roles, as stated below:
A : NameNode, JobTrackerNode, regionserver, SecondaryNameNode, DataNode, TaskTrackerNode
B : DataNode, TaskTrackerNode
C : DataNode, TaskTrackerNode
D : DataNode, TaskTrackerNode
My first question is: can one node be both a NameNode and a DataNode?
Is this setup all right?
My second question is: on the Cloudera Manager UI, I can see many services running, but I am not sure whether I need all of these services or not.
The services running on my setup are:
hbase1
hdfs1
mapreduce1
hue1
oozie1
zookeeper1
Do I need only the hdfs1 and mapreduce1 services? If yes, how can I remove the other services?
The cloud and Hadoop concepts are new to me, so pardon me if some of my assumptions are illogical or wrong.
The answer to your first question is yes. But you would never do that in production, as the NameNode needs a sufficient amount of RAM. People usually run only the NameNode and JobTracker on their master node. It is also better to run the SecondaryNameNode on a different machine.
Coming to your second question: Cloudera Manager is not only Hadoop. It's a complete package that includes several Hadoop sub-projects, such as HBase (a NoSQL DB), Oozie (a workflow engine), etc., and these are the processes which you see on the UI.
If you want to play with just Hadoop, HDFS and MapReduce are sufficient. You can stop the rest of the processes easily from the UI itself; it won't do any harm to your Hadoop cluster.
HTH
