I have been studying Hadoop for a while now but I am confused about JobTracker and TaskTracker systems.
I am not sure how to phrase my question but here it is:
Do JobTracker and TaskTracker fall under the HDFS or the MapReduce category?
or a more appropriate question could be:
Can physical machines fall under HDFS or MapReduce category?
The JobTracker and TaskTracker services are part of MapReduce, which you can read more about here: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Physical machines can host both HDFS (NameNode, DataNode) and MapReduce (JobTracker, TaskTracker) services. In general, it's recommended to place DataNodes and TaskTrackers together on the same physical slave nodes for performance reasons. The TaskTrackers can read/write to the local DataNode.
Note that Hadoop 2 introduced YARN (http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html), which replaces the JobTracker/TaskTracker architecture; MapReduce still exists, but it runs as an application on top of YARN.
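The split also shows up in the configuration files: the JobTracker address lives in mapred-site.xml (the MapReduce layer), while the NameNode address lives in core-site.xml (the HDFS layer). A minimal Hadoop 1.x sketch, with placeholder hostnames:

```xml
<!-- mapred-site.xml: MapReduce layer; TaskTrackers register with this JobTracker -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:8021</value>
  </property>
</configuration>

<!-- core-site.xml: HDFS layer; clients and DataNodes contact this NameNode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
```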
Related
While reading ZooKeeper's documentation, it seems to me that HDFS relies on much the same mechanisms of distribution/replication (broadly speaking) as ZooKeeper. I hear some echo from one to the other, but I still can't distinguish them clearly and strictly.
I understand ZooKeeper is a Cluster Management / Sync tool, while HDFS is a Distributed File Management System, but could ZK be needed on an HDFS cluster for example?
Yes. One case where ZooKeeper is needed on a Hadoop cluster is high availability, which relies on a ZooKeeper quorum.
For example, consider the Hadoop NameNode failover process.
Hadoop high availability is designed around an active NameNode and a standby NameNode for failover. At any point in time you must not have two masters (two active NameNodes) at once.
ZooKeeper tracks which NameNode is currently active and coordinates failover, so that requests are directed to the active NameNode.
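As a sketch, automatic NameNode failover in Hadoop 2 HA is wired to ZooKeeper roughly like this (hostnames are placeholders; see the HDFS HA documentation for the full set of required properties):

```xml
<!-- hdfs-site.xml: enable automatic failover for the HA nameservice -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

<!-- core-site.xml: the ZooKeeper quorum used for active-NameNode election -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```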
Is there any possible solution to achieve Namenode HA in Hadoop 1.x ?
Hadoop 1.x is known for its single point of failure: there is a single master node that runs the Hadoop NameNode and the Hadoop JobTracker. The NameNode keeps a lookup table mapping every file (and its blocks) to locations on the cluster. The NameNode manages the Hadoop Distributed File System and acts as the HDFS master.
The Secondary NameNode is not a hot standby, despite its name. It periodically merges the NameNode's edit log into the fsimage (checkpointing); that checkpoint can help with manual recovery after a crash, but it cannot take over automatically.
What happens to a Hadoop cluster when the Secondary NameNode fails?
A Hadoop cluster is said to have a single point of failure because all metadata is stored by the NameNode. What about the Secondary NameNode: if it fails, will the cluster fail or keep running?
"Secondary NameNode" is a somewhat confusing name. The Hadoop cluster will keep running when it crashes; you can even run a Hadoop cluster without it. It is not used for high availability; it only performs periodic checkpoints of the NameNode metadata. (I am talking about Hadoop versions before 2.)
More info: http://wiki.apache.org/hadoop/FAQ#What_is_the_purpose_of_the_secondary_name-node.3F
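To illustrate what the Secondary NameNode actually does, its checkpointing behaviour is controlled by properties like these in core-site.xml (Hadoop 1.x; the period shown is the usual default, and the directory is a placeholder):

```xml
<!-- core-site.xml: how often the Secondary NameNode merges the edit log
     into a new fsimage checkpoint (seconds) -->
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>
</property>

<!-- where the Secondary NameNode stores its checkpoint images -->
<property>
  <name>fs.checkpoint.dir</name>
  <value>/var/hadoop/dfs/namesecondary</value>
</property>
```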
I am running a 12-node cluster with separate NameNode and JobTracker machines. I can execute MapReduce jobs from the JobTracker machine, but I want to submit jobs to the JobTracker from any of my 10 DataNodes. Is this possible, and if yes, how do I do it?
Yes, as long as hadoop is on the path (on each node), and the configuration for the cluster has been properly distributed to each data node.
In fact, you don't strictly need the configuration to be distributed; you can pass the JobTracker and HDFS URLs on the command line instead (look at the GenericOptionsParser -jt and -fs options).
See this page for more information on generic options: http://hadoop.apache.org/docs/r1.0.4/commands_manual.html#Generic+Options
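A sketch of such a submission from a data node (hostnames, paths, and jar/class names are placeholders; this assumes the job's driver uses ToolRunner so the generic options are actually parsed):

```sh
# Run from any node that has the hadoop binary on its PATH.
# -fs points at the NameNode, -jt at the JobTracker.
hadoop jar my-job.jar com.example.MyDriver \
  -fs hdfs://namenode.example.com:8020 \
  -jt jobtracker.example.com:8021 \
  /user/me/input /user/me/output
```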
Name node is the single point of failure for HDFS. Is this correct?
Then what about Jobtracker? If Jobtracker fails, is HDFS available?
HDFS is completely independent of the Jobtracker. As long as at least the NN is up, HDFS is nominally usable, with overall degradation dependent on the number of Datanodes that are down.
As Ambar mentioned, HDFS (the file system itself) does not depend on the JobTracker. The current released version of Hadoop does not support NameNode high availability out of the box, but you can work around it (e.g., deploy the NameNode with a traditional active/passive clustering solution using shared storage).
The next release (2.0/0.23) does fix the namenode availability issue.
You can read more about it in a blog post by Aaron Myers "High Availability for the Hadoop Distributed File System (HDFS)"
If the JobTracker is not available, you cannot execute map/reduce jobs.