How to configure and use multiple master nodes in a Hadoop cluster? - hadoop

Could anyone please tell us how to configure and use multiple master nodes in a Hadoop cluster?

If you are looking for multiple NameNodes, check HDFS high availability and HDFS federation. Both are available in the Hadoop 2.x releases.
The other master in the Hadoop 1.x releases is the JobTracker, and there can be only one JobTracker in a cluster. Note that the JobTracker's functionality has been split (into the ResourceManager and per-job ApplicationMasters) in the Hadoop 2.x releases; see the YARN documentation for more details.
There might be other options as well, but it depends on the requirement behind having multiple masters. Is it availability, scalability, or something else?
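For illustration, here is a minimal hdfs-site.xml sketch of what HDFS high availability looks like in 2.x. The nameservice name "mycluster", the hostnames, and the ports are placeholders; the full set of required properties (fencing, the client failover proxy provider, JournalNode directories) is described in the Apache "HDFS High Availability Using the Quorum Journal Manager" guide.

    <!-- hdfs-site.xml: two NameNodes behind one logical nameservice -->
    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn1</name>
      <value>master1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn2</name>
      <value>master2.example.com:8020</value>
    </property>
    <property>
      <!-- shared edit log, here via a JournalNode quorum -->
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
    </property>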

Related

Differences between HDFS and ZooKeeper?

While reading ZooKeeper's documentation, it seems to me that HDFS relies on pretty much the same distribution/replication mechanisms (broadly speaking) as ZooKeeper. I hear echoes of one in the other, but I still can't distinguish the two clearly and strictly.
I understand ZooKeeper is a cluster management/synchronization tool, while HDFS is a distributed file system, but could ZooKeeper be needed on an HDFS cluster, for example?
Yes. The main factor is distributed processing with high availability on a Hadoop cluster, which relies on a ZooKeeper quorum.
For example, the Hadoop NameNode failover process.
Hadoop high availability is designed around an active NameNode and a standby NameNode for failover. At any point in time you should not have two masters (two active NameNodes) at the same time.
ZooKeeper keeps track of which NameNode is active, so the cluster address resolves to the active NameNode.
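As a sketch of how that is wired up in Hadoop 2.x (property names from the Apache HA guide; the ZooKeeper hostnames below are placeholders): automatic failover adds a ZKFailoverController process on each NameNode host plus two extra settings.

    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>

    <!-- core-site.xml -->
    <property>
      <name>ha.zookeeper.quorum</name>
      <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
    </property>

The ZKFailoverControllers use ZooKeeper for leader election: whichever NameNode holds the lock is active, and the lock disappearing is what triggers failover.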

Do I need to use Spark with YARN to achieve NODE LOCAL data locality with HDFS?

Do I need to use Spark with YARN to achieve NODE LOCAL data locality with HDFS?
If I use the Spark standalone cluster manager and have my data distributed across an HDFS cluster, how will Spark know that the data is located locally on the nodes?
YARN is a resource manager. It deals with memory and processes, and not with the workings of HDFS or data-locality.
Since Spark can read from HDFS sources, and the NameNodes and DataNodes take care of all of the HDFS block data management outside of YARN, I believe the answer is no, you don't need YARN. But you already have HDFS, which means you have Hadoop, so why not take advantage of integrating Spark with YARN?
Standalone mode has its own cluster manager/resource manager. The block locations come from the NameNode, and the client/driver places tasks based on those results, as sketched below.
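A short sketch of the mechanics: Spark builds its HDFS partitions from the block locations reported by the NameNode and uses those hosts as preferred locations when scheduling tasks, so a standalone worker running on the same machine as the DataNode that holds a block can be given a NODE_LOCAL task. For that, Spark only needs the cluster's client configuration on its classpath (typically via HADOOP_CONF_DIR); the hostname and port here are placeholders.

    <!-- core-site.xml visible to Spark -->
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://namenode.example.com:8020</value>
    </property>

In other words, locality comes from co-locating Spark workers with DataNodes, not from the choice of cluster manager.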

Does Hadoop itself contain fault-tolerance/failover functionality?

I just installed a new version of Hadoop 2. I would like to know: if I configure a Hadoop cluster and bring it up, how can I tell that a data transmission has failed and that a failover is needed?
Do I have to install other components like ZooKeeper to track/enable HA events?
Thanks!
High Availability is not enabled by default. I would highly encourage you to read the Apache Hadoop documentation (http://hadoop.apache.org/); it gives an overview of the architecture and the services that run on a Hadoop cluster.
ZooKeeper is required for many Hadoop services to coordinate their actions across the entire Hadoop cluster, regardless of whether the cluster is HA or not. More information can be found in the Apache ZooKeeper documentation (http://zookeeper.apache.org/).
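To separate the two concerns: fault tolerance for the data itself (block replication, and re-replication when a DataNode fails or a block transfer is corrupted) is built into HDFS and works without any extra components; it is governed by settings such as the one sketched below. NameNode failover is the part that is not automatic by default and needs the HA configuration plus a ZooKeeper quorum.

    <!-- hdfs-site.xml: each block is kept on 3 DataNodes (the default) -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>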

When will HDFS be unavailable?

The NameNode is the single point of failure for HDFS. Is this correct?
Then what about the JobTracker? If the JobTracker fails, is HDFS still available?
HDFS is completely independent of the JobTracker. As long as at least the NameNode is up, HDFS is nominally usable, with overall degradation dependent on the number of DataNodes that are down.
As Ambar mentioned, HDFS (the file system itself) does not depend on the JobTracker. The currently released version of Hadoop does not support NameNode high availability out of the box, but you can work around it (e.g. deploy the NameNode using a traditional active/passive clustering solution with shared storage).
The next release (2.0/0.23) does fix the NameNode availability issue.
You can read more about it in Aaron Myers' blog post "High Availability for the Hadoop Distributed File System (HDFS)".
If the JobTracker is not available, you cannot execute MapReduce jobs.

Topologies for a Hadoop cluster?

In what topologies can we build a Hadoop cluster? Currently I am only running a tree structure: a single master and multiple slaves. I am looking for more variants, like multiple masters, etc.
Your question is interesting. I looked for a tool or framework that provides multiple masters in Hadoop, but I didn't find one.
Coming to Hadoop, the topology is one NameNode (master) and multiple DataNodes (slaves), so from the perspective of Hadoop's design this is fair enough.
There is, however, another addition alongside the NameNode, called the SecondaryNameNode; don't think of it as another master, since it only performs checkpointing for the NameNode. See http://www.cloudera.com/blog/2009/02/multi-host-secondarynamenode-configuration/
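If you do want to move the SecondaryNameNode off the NameNode host, a minimal Hadoop 2.x-style sketch looks like the following (the hostname is a placeholder; 50090 is the default SecondaryNameNode HTTP port):

    <!-- hdfs-site.xml: run the checkpointing SecondaryNameNode on its own host -->
    <property>
      <name>dfs.namenode.secondary.http-address</name>
      <value>snn.example.com:50090</value>
    </property>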
