What happens to a Hadoop job when the NameNode is down?

In Hadoop 1.2.1, I would like to understand a few basics:
1) Who receives a Hadoop job, the NameNode or the JobTracker?
2) What happens if somebody submits a Hadoop job while the NameNode is down? Does the job fail, or is it put on hold?
3) What happens if somebody submits a Hadoop job while the JobTracker is down? Does the job fail, or is it put on hold?

By Hadoop job, you probably mean a MapReduce job. If your NN is down and you don't have a spare one (in an HA setup), your HDFS will not be working, and every component that depends on that HDFS namespace will be either stuck or crashed.
1) The JobTracker (the YARN ResourceManager in Hadoop 2.x).
2) I am not completely sure, but most likely the submission will be accepted and the job will fail afterwards.
3) You cannot submit a job to a stopped JobTracker.

The client contacts the NameNode, which looks up the data requested by the client and returns the block information. The JobTracker is responsible for seeing the job through to completion and for allocating resources to it.
In cases 2 and 3, the job fails.
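To make the flow above concrete, here is a hedged sketch of a job submission on Hadoop 1.x. The jar name and paths are illustrative, not from the question:

```shell
# Minimal sketch of a MapReduce job submission (Hadoop 1.2.1).
# Jar name and HDFS paths are illustrative.
hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount \
  /user/alice/input /user/alice/output
# If the NameNode is down, the client typically fails while staging the
# job resources to HDFS (e.g. with a java.net.ConnectException); the job
# never reaches the JobTracker and is not queued or put on hold.
```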

Related

Does the Namenode know about a job submitted by a client in an HDFS multi-node system?

When a client submits a job, the TaskTracker receives it. Can the Namenode see the code of this submitted job?
The TaskTracker doesn't exist in Hadoop 2 / YARN, but no, the code is not run within the Namenode process.
The short answer is No.
The long answer is that the namenode does not execute the MapReduce program, so it has nothing to do with the MapReduce code. The MapReduce jar is physically uploaded to each node responsible for executing the map/reduce tasks, so only those nodes refer to the jar. The only role of the namenode is to make sure the jars are written to the nodes responsible for executing the map/reduce tasks.
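As a hedged illustration of where the jar actually lives: in Hadoop 1.x the client copies the job jar into a staging directory in HDFS, and the TaskTrackers running the tasks pull it into their local work directories. The exact staging path varies by version and configuration; the one below is illustrative:

```shell
# Sketch: inspecting the job staging area in HDFS (path is illustrative
# and version-dependent; see mapred.system.dir / the user staging dir).
hadoop fs -ls /tmp/hadoop/mapred/staging/alice/.staging
# The NameNode only tracks metadata for the blocks of these jars;
# it never loads or executes the MapReduce code itself.
```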

Writing to hadoop cluster while it is busy running map reducer jobs

I know that Hadoop has the Fair Scheduler, where we can assign a job to a priority group and cluster resources are allocated to the job based on priority. What I am not sure about, and what I am asking, is how a non-MapReduce program is prioritized by the Hadoop cluster. Specifically, how would writes to Hadoop from external clients (say, a standalone program that directly opens an HDFS file and streams data to it) be prioritized when the cluster is busy running MapReduce jobs?
The ResourceManager can only prioritize jobs submitted to it (such as MapReduce applications, Spark jobs, etc.).
Other than distcp, HDFS operations interact only with the NameNode and DataNodes, not the ResourceManager, so they are handled by the NameNode in the order they're received.
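To illustrate the point above, here is a hedged sketch of such an external write; the file and target path are illustrative. The write talks only to the NameNode (for namespace updates and block allocation) and the DataNodes (for streaming the block data), so the scheduler never sees it:

```shell
# Sketch: an external client write goes straight to HDFS, bypassing the
# Fair Scheduler entirely (file and path are illustrative).
hdfs dfs -put big-dataset.csv /data/incoming/
# The NameNode handles the namespace update and the DataNodes stream the
# blocks; YARN/JobTracker queues and priorities play no part here.
```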

YARN and Hadoop

I have a couple of questions about job submission to HDFS and the YARN architecture in Hadoop:
In the Hadoop ecosystem you have one NameNode per cluster, which can have any number of DataNodes storing your data. When you submit a job to Hadoop, the JobTracker on the NameNode picks up the job and assigns it to a TaskTracker on the DataNode where the file is present.
So my question is: how do the components of YARN work together with HDFS?
YARN consists of the NodeManager and the ResourceManager. Out of these two components: does the NodeManager run on every DataNode, and does the ResourceManager run on the NameNode of each cluster? So when a TaskTracker (on a DataNode) is assigned a task from the JobTracker (on the NameNode), the NodeManager on that DataNode will create a container, which will request resources from the ResourceManager on the NameNode. So do the ResourceManager and NodeManager only come into play when a TaskTracker on a DataNode gets a job from the JobTracker on the NameNode, at which point the NodeManager asks the ResourceManager for resources for the job to be executed? Is this correct?
You are partially correct. YARN was introduced to relieve the JobTracker, which did both scheduling and monitoring. So with YARN you don't have a JobTracker or TaskTrackers. The work done by the JobTracker is now done by the ResourceManager, which has two main components: the Scheduler (allocating resources to applications) and the ApplicationsManager (accepting job submissions and restarting the ApplicationMaster in case of any failure). Each application has an ApplicationMaster, which negotiates containers (where the job runs) from the Scheduler for the running application.
The NodeManager runs on every slave/data node. The ResourceManager may or may not be installed on the machine where the namenode is present. For a large cluster we usually separate the masters, so that the load doesn't fall on a single machine.
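A hedged sketch of how you might confirm this daemon placement on a running YARN cluster (hostnames are whatever your cluster uses):

```shell
# Sketch: checking where YARN and HDFS daemons run.
yarn node -list         # lists NodeManagers, typically one per slave/DataNode
hdfs dfsadmin -report   # lists DataNodes, as seen by the NameNode
# Where the ResourceManager runs is decided by yarn.resourcemanager.hostname
# in yarn-site.xml; on large clusters it is usually a separate master host.
```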

Is JobTracker a single point of failure too (besides NameNode) in Hadoop?

I am new to Hadoop. I know that when the NameNode fails, the entire Hadoop framework goes down, so it is a single point of failure in Hadoop. Is the same true of the JobTracker? Because if the JobTracker goes down, there would be no daemon to contact the Namenode after a job submission and also no point in running the TaskTrackers. How is this handled exactly?
Yes, the JobTracker is a single point of failure in MRv1. In case of JobTracker failure, all running jobs are halted (http://wiki.apache.org/hadoop/JobTracker).
In YARN, the ResourceManager is not a single point of failure.
If you need MRv1, you can use the MapR distribution, which provides JobTracker high availability (http://www.mapr.com/resources/videos/demo-hadoop-jobtracker-failing-and-recovering-mapr-cluster).
JobTracker HA (high availability, using active and standby nodes) can be configured in the Cloudera Hadoop distribution. See the following link; this feature is available from CDH 4.2.1 onwards:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-High-Availability-Guide/cdh4hag_topic_3_1.html
The same can be configured in the Hortonworks distribution as well:
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_hdp1-system-admin-guide/content/sysadminguides_ha_chap2_5_5.html
In MR2, the master service is the ResourceManager, which is not a single point of failure.
Yes, the JobTracker is a single point of failure. In case of NameNode failure, a standby NameNode takes charge and acts as the NameNode. In MRv2 the ResourceManager concept was introduced. YARN can run multiple ResourceManagers; if one fails, another takes charge. One ResourceManager is active and the others are in standby mode.
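If ResourceManager HA is enabled in Hadoop 2.x, you can verify the active/standby roles from the command line. A hedged sketch, assuming the common IDs rm1/rm2 from yarn.resourcemanager.ha.rm-ids:

```shell
# Sketch: checking ResourceManager HA state (rm1/rm2 are the IDs
# configured in yarn.resourcemanager.ha.rm-ids; yours may differ).
yarn rmadmin -getServiceState rm1   # prints "active" or "standby"
yarn rmadmin -getServiceState rm2
```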
No, on a NameNode failure the Hadoop framework as such doesn't go down; the framework and a NameNode failure are different things. The Hadoop framework is a layer across all the nodes. But if the NameNode goes down, the framework doesn't know where data should be stored, or where free space is available, so it is not possible to store actual data.
The JobTracker coordinates with the NameNode to get the data to be processed, so when the NameNode fails, the JobTracker cannot work properly either. The NameNode must be working first; in Hadoop this is called the NameNode single point of failure.
The JobTracker is responsible for scheduling jobs and processing the data. If the JobTracker is not working and a client submits a job request, the client doesn't know where the job should be submitted or where it should be processed. So on a JobTracker failure, it is not possible to process data or schedule jobs.
This was one of the biggest problems in big data analysis.
Hadoop 2.x resolves these two problems: with HDFS HA and YARN there is no longer a single point of failure at the NameNode or ResourceManager level.

How to submit MapReduce job from DataNode to JobTracker?

I am running a 12-node cluster with separate NameNode and JobTracker machines. I can execute a MapReduce job from the JobTracker machine, but I want to submit jobs to the JobTracker from any of my 10 DataNodes. Is this possible, and if yes, how do I do it?
Yes, as long as hadoop is on the path (on each node) and the configuration for the cluster has been properly distributed to each data node.
In fact, you don't necessarily need the configuration to be distributed; you just need to point the client at the JobTracker and HDFS URLs explicitly (look at the GenericOptionsParser -jt and -fs options).
See this page for more information on generic options: http://hadoop.apache.org/docs/r1.0.4/commands_manual.html#Generic+Options
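Putting the two answers above together, a hedged sketch of submitting from an arbitrary node; the jar name, driver class, hostnames, and ports are illustrative (8020/8021 were common defaults for the NameNode and JobTracker RPC ports):

```shell
# Sketch: submit from any node by naming the cluster endpoints explicitly.
# -fs and -jt are GenericOptionsParser options; for them to be picked up,
# the driver class should be run via ToolRunner (i.e. implement Tool).
hadoop jar my-job.jar MyDriver \
  -fs hdfs://namenode-host:8020 \
  -jt jobtracker-host:8021 \
  /user/alice/input /user/alice/output
```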
