We know that servers used for big data processing should be tolerant of hardware failure. That is, if we had three servers (A, B, C) and server B suddenly went down, A and C could take over its role. But Hadoop uses a NameNode and DataNodes, and when the NameNode goes down we can't process data anymore, which seems to lack tolerance for hardware failure.
Is there a reason for this design choice in Hadoop's architecture?
The issue you have mentioned is known as a single point of failure, and it exists in older Hadoop versions.
Try a newer version of Hadoop, such as the 2.x line. From version 2.0.0 onwards, Hadoop avoids this single point of failure by running two NameNodes: an active and a standby. When the active NameNode fails due to hardware or power issues, the standby NameNode takes over as the active one.
Check this link: Hadoop High Availability for further details.
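As a rough illustration, an HA NameNode pair is declared in `hdfs-site.xml` roughly along these lines (property names follow the Hadoop 2.x HA documentation; the nameservice ID `mycluster`, the NameNode IDs `nn1`/`nn2`, and the host names are placeholders):

```xml
<!-- Sketch only: "mycluster", nn1/nn2 and the hosts are placeholder names. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>machine1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>machine2.example.com:8020</value>
</property>
```

Clients address the nameservice ID (`mycluster`) rather than a specific host, which is what lets the standby take over transparently.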
I am a bit confused about Hadoop NameNode HA using QJM versus HDFS Federation. Both use multiple NameNodes and both are described as providing high availability. I can't decide which architecture to use for NameNode high availability, since they look almost identical except for the QJM part.
Please pardon me if this is not the type of question to be discussed here.
The main difference between HDFS High Availability and HDFS Federation is that the NameNodes in a federation aren't related to each other.
In HDFS Federation, the NameNodes share the pool of DataNodes for block storage, but each NameNode manages its own independent namespace and metadata. This gives isolation rather than fault tolerance: if one NameNode in a federation fails, it doesn't affect the namespaces served by the other NameNodes, but its own namespace becomes unavailable.
So, Federation = Multiple namenodes and no correlation.
In HDFS HA, by contrast, there are two NameNodes: the active NN and the standby NN.
The active NN does all the work, while the standby NN mostly sits there and periodically updates its metadata from the active NameNode, which is what makes them related.
When the active NN gets tired of the usual grind (i.e. it fails), the standby NameNode takes over with the most recent metadata it has.
For an HA architecture, you need at least two separate machines configured as NameNodes, of which only one should be in the active state at any time.
More details here: HDFS High Availability
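To make the contrast concrete, here is a hedged sketch of how the two setups differ in `hdfs-site.xml` (the nameservice IDs and hosts are invented): Federation declares several independent nameservices side by side, while HA declares one nameservice with two NameNode IDs behind it.

```xml
<!-- Federation: two unrelated namespaces (ns1, ns2 are placeholder IDs). -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn-host1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn-host2.example.com:8020</value>
</property>

<!-- HA, for comparison: one namespace served by an active/standby pair. -->
<!--
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
-->
```

In the Federation layout each `rpc-address` belongs to a different namespace; in the HA layout both NameNode IDs serve the same namespace.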
While reading ZooKeeper's documentation, it seems to me that HDFS relies on much the same distribution/replication mechanisms (broadly speaking) as ZooKeeper. I hear echoes of one in the other, but I still can't distinguish the two clearly and strictly.
I understand ZooKeeper is a cluster management/synchronization tool, while HDFS is a distributed file system, but could ZK be needed on an HDFS cluster, for example?
Yes. One reason is high availability on a Hadoop cluster, which relies on a ZooKeeper quorum.
For example, the Hadoop NameNode failover process.
Hadoop high availability is designed around an active NameNode and a standby NameNode for failover. At no point in time should you have two masters (two active NameNodes) at the same time.
ZooKeeper is used to elect the active NameNode and to detect when it fails, so that failover to the standby can be triggered and clients are directed to the currently active NameNode.
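The election mechanism can be sketched in a few lines. This is an illustrative simulation, not the real ZKFC implementation: a shared "lock znode" (which ZooKeeper would implement as an ephemeral node) can be held by at most one NameNode at a time; whoever holds it is active, the other waits in standby, and a crashed node's lock disappears so the standby can take over.

```python
class LockZnode:
    """Stands in for an ephemeral ZooKeeper znode used as a lock."""
    def __init__(self):
        self.owner = None

    def try_acquire(self, name):
        # Only succeeds if nobody currently holds the lock.
        if self.owner is None:
            self.owner = name
            return True
        return False

    def release(self, name):
        if self.owner == name:
            self.owner = None


class NameNode:
    def __init__(self, name, lock):
        self.name, self.lock = name, lock
        self.state = "standby"

    def elect(self):
        # Each node races for the lock; exactly one wins and becomes active.
        if self.lock.try_acquire(self.name):
            self.state = "active"

    def fail(self):
        # A crashed node's ephemeral znode vanishes, freeing the lock.
        self.lock.release(self.name)
        self.state = "down"


lock = LockZnode()
nn1, nn2 = NameNode("nn1", lock), NameNode("nn2", lock)
nn1.elect(); nn2.elect()
print(nn1.state, nn2.state)   # active standby

nn1.fail()
nn2.elect()                   # standby notices the lock is free, takes over
print(nn1.state, nn2.state)   # down active
```

Note how the invariant "never two active NameNodes" falls out of the lock being single-owner: the real system adds fencing on top of this to guard against a node that only *appears* dead.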
What happens to a Hadoop cluster when the Secondary NameNode fails?
A Hadoop cluster is said to have a single point of failure because all metadata is stored by the NameNode. What about the Secondary NameNode: if it fails, will the cluster fail or keep running?
"Secondary NameNode" is a somewhat confusing name. The Hadoop cluster will keep running when it crashes. You can even run a Hadoop cluster without it; it is not used for high availability, it only performs periodic checkpoints of the NameNode's metadata. (I am talking about Hadoop versions < 2.)
More info: http://wiki.apache.org/hadoop/FAQ#What_is_the_purpose_of_the_secondary_name-node.3F
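A toy sketch of what the Secondary NameNode actually does may make this clearer (this is a simulation of the idea, not Hadoop code): it periodically merges the NameNode's edit log into the fsimage, a process called checkpointing. It is not a hot standby, so if it dies the cluster keeps serving requests; the edit log just grows unchecked until the next checkpoint.

```python
class NameNodeMeta:
    """Simplified stand-in for the NameNode's persistent metadata."""
    def __init__(self):
        self.fsimage = {}   # checkpointed snapshot of the namespace
        self.edits = []     # journal of changes since the last checkpoint

    def mkdir(self, path):
        # Every namespace change is first recorded in the edit log.
        self.edits.append(("mkdir", path))


def checkpoint(nn):
    """What the Secondary NameNode does on its schedule: fold the
    edit log into the fsimage, then truncate the log."""
    for op, path in nn.edits:
        if op == "mkdir":
            nn.fsimage[path] = {}
    nn.edits = []


nn = NameNodeMeta()
nn.mkdir("/user")
nn.mkdir("/tmp")

checkpoint(nn)                        # Secondary alive: edits merged
print(sorted(nn.fsimage), nn.edits)   # ['/tmp', '/user'] []

nn.mkdir("/data")                     # Secondary "dead": cluster still
print(len(nn.edits))                  # works, the edit log just grows -> 1
```

The practical consequence of losing the checkpointer is only a longer NameNode restart later (it must replay a larger edit log), not an outage.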
I have a question about NameNode high availability. The NameNode is critical because it stores all the metadata; if it goes down, the whole Hadoop cluster goes down as well. So is there a good way to approach NameNode high availability, for example a backup NameNode that can take over when the primary NameNode fails?
(now I use Hadoop 1.1.2)
For ASF Hadoop 1.1.2, there are no solid NameNode HA options. These were released in 2.0 and are included in popular distributions like Cloudera's CDH4.
The options for NameNode HA include running a primary NameNode and a hot standby NameNode. They share an edits log, either on an NFS mount or through quorum journal mode in HDFS itself. The former gives you the benefit of an external location for your HDFS metadata, while the latter gives you the benefit of having no dependencies external to Hadoop.
Personally, I like the NFS option, as you can easily snapshot/back up the data resident on the file server. The disadvantage of this approach is potentially inconsistent performance in terms of latency.
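The two shared-edits options differ only in the `dfs.namenode.shared.edits.dir` setting. A hedged sketch (the JournalNode hosts and the NFS path are placeholders; the property name and URI formats follow the Hadoop HA docs):

```xml
<!-- Quorum journal mode: edits replicated across JournalNodes. -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>

<!-- NFS alternative: both NameNodes write to the same shared mount. -->
<!--
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>file:///mnt/shared/hdfs/edits</value>
</property>
-->
```

With QJM, a write is acknowledged once a majority of JournalNodes have it, which is why an odd number (typically three or five) is used.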
For more detail, check out the following articles:
http://www.slideshare.net/hortonworks/nn-ha-hadoop-worldfinal-10173419
http://blog.cloudera.com/blog/2012/03/high-availability-for-the-hadoop-distributed-file-system-hdfs/
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
The NameNode is the single point of failure for HDFS. Is this correct?
Then what about the JobTracker? If the JobTracker fails, is HDFS still available?
HDFS is completely independent of the Jobtracker. As long as at least the NN is up, HDFS is nominally usable, with overall degradation dependent on the number of Datanodes that are down.
As Ambar mentioned, HDFS (the file system itself) does not depend on the JobTracker. The currently released version of Hadoop does not support NameNode high availability out of the box, but you can work around it (e.g. deploy the NameNode with a traditional active/passive clustering solution using shared storage).
The next release (2.0/0.23) does fix the NameNode availability issue.
You can read more about it in a blog post by Aaron Myers, "High Availability for the Hadoop Distributed File System (HDFS)".
If the JobTracker is not available, you cannot execute MapReduce jobs.