Does the NameNode know about a job submitted by a client in an HDFS multi-node system?

When a client submits a job, the TaskTracker receives it. Can the NameNode see the code of this submitted job?

The TaskTracker doesn't exist in Hadoop 2 / YARN, but no, the code does not run within the NameNode process.

The short answer is no.
The long answer is that the NameNode does not execute the MapReduce program, so it has nothing to do with the MapReduce code. The MapReduce jar is physically shipped to each node responsible for executing the map/reduce tasks, so only the nodes that run those tasks ever refer to the jar. The NameNode's only role is to make sure the jar is written to HDFS so that the nodes responsible for executing the map/reduce tasks can retrieve it.
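To make the division of labour concrete, here is a minimal, hedged driver sketch (the class name SubmitExample and the input/output path arguments are made up for illustration). Submitting the job from the client copies the job jar into an HDFS staging directory, and the NameNode only records block metadata for that file; the classes inside the jar are loaded and run on the worker nodes, never by the NameNode.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SubmitExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "submit-example");
            // Tells the submission client which local jar to ship; the client copies
            // it to an HDFS staging directory, and the NameNode only records that
            // file's block metadata -- it never loads or runs the classes inside.
            job.setJarByClass(SubmitExample.class);
            // The default (identity) mapper and reducer are used, so no extra
            // classes are needed for this sketch.
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }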

Related

Hadoop - Can the name node execute tasks?

Is it possible for the name node to execute tasks? By default, the tasks execute on the data nodes of the cluster.
Assuming you are asking about MapReduce...
With YARN, MapReduce tasks execute in containers that the per-job ApplicationMaster requests from the ResourceManager; they run neither in the NameNode nor within the DataNode process.
The NodeManager service that hosts those containers is commonly installed alongside the DataNode. You can install one on the NameNode host as well, but you really shouldn't in a production environment.
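As a quick way to see this from the cluster's side, here is a hedged sketch (it assumes a reachable ResourceManager and default client configuration on the classpath) that uses the YarnClient API to list the hosts running a NodeManager. Those are the only hosts on which YARN can place MapReduce containers, so if the NameNode host is not in the list, no task can run there.

    import java.util.List;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListNodeManagers {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration());
            yarn.start();
            try {
                // Each report corresponds to a live NodeManager, i.e. a host that
                // can actually be assigned map/reduce containers.
                List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
                for (NodeReport node : nodes) {
                    System.out.println(node.getNodeId().getHost()
                            + " containers=" + node.getNumContainers());
                }
            } finally {
                yarn.stop();
            }
        }
    }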

What happens to hadoop job when the NameNode is down?

In Hadoop 1.2.1, I would like some basic understanding of the questions below:
Who receives the Hadoop job? Is it the NameNode or the JobTracker?
What happens if somebody submits a Hadoop job when the NameNode is down? Does the job fail, or does it go on hold?
What happens if somebody submits a Hadoop job when the JobTracker is down? Does the job fail, or does it go on hold?
By Hadoop job, you probably mean a MapReduce job. If your NameNode is down and you don't have a standby (in an HA setup), HDFS will not be working, and every component that depends on that HDFS namespace will either hang or crash.
1) The JobTracker (the YARN ResourceManager in Hadoop 2.x) receives the job.
2) I am not completely sure, but the job will probably be accepted for submission and then fail.
3) You cannot submit a job to a stopped JobTracker.
The client submits the job to the JobTracker and contacts the NameNode for the block information of the input data it needs.
The JobTracker is responsible for seeing the job through to completion and for allocating resources to it.
In cases 2 and 3, the job fails.
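A small, hedged sketch of why case 2 behaves this way: job submission has to write the job jar and split metadata into an HDFS staging directory, and any HDFS call fails with an IOException when the NameNode is unreachable, so the submission errors out rather than being queued or put on hold. The existence check on "/" below is just an illustrative stand-in for that first NameNode RPC.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CheckNameNode {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            try {
                FileSystem fs = FileSystem.get(conf);
                fs.exists(new Path("/"));   // an RPC that must reach the NameNode
                System.out.println("NameNode reachable; job submission can proceed.");
            } catch (IOException e) {
                // With the NameNode down, a real submission would fail at this
                // point instead of being queued or put on hold.
                System.err.println("NameNode unreachable: " + e.getMessage());
            }
        }
    }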

Map on data node is run by whom?

I faced this tricky question in one of my interviews.
The question was: who runs the map on a data node?
The answer is neither the JobTracker nor the TaskTracker.
Could anybody help me, please?
DataNodes do not run any tasks; they are part of HDFS and take care of storing data.
So "map on data node" makes no sense at all.
If Hadoop 1.x is installed on the system and a TaskTracker is running on the same machine as the DataNode, then the TaskTracker daemon is the one that runs the map task after getting instructions from the JobTracker.
If no TaskTracker is running on that machine, no map task can run on that node; the DataNode handles the storage side and has nothing to do with map processing.
In Hadoop 2.x, the ApplicationMaster is the entity that arranges this, by coordinating with the NodeManager and the ResourceManager.
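To see this directly, here is a hedged sketch of an illustrative, made-up mapper that logs, in setup(), the JVM and host it actually runs in. On Hadoop 2.x the output shows a separate child JVM (the YarnChild container launched by a NodeManager), never the DataNode or NameNode process; on Hadoop 1.x the equivalent child JVM is spawned by the TaskTracker.

    import java.io.IOException;
    import java.lang.management.ManagementFactory;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WhereAmIMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Typically prints something like "12345@worker-node-3": the task runs
            // in its own child JVM on a worker node, not inside any HDFS daemon.
            System.err.println("Map task running in JVM "
                    + ManagementFactory.getRuntimeMXBean().getName());
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);   // identity map, just for illustration
        }
    }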

Is JobTracker a single point of failure too (besides NameNode) in Hadoop?

I am new to Hadoop. I know that when the NameNode fails, the entire Hadoop framework goes down, so it's a single point of failure in Hadoop. Is it the same for the JobTracker? If the JobTracker goes down, there would be no daemon to contact the NameNode after a job submission, and also no point in running the TaskTrackers. How is this handled exactly?
Yes, the JobTracker is a single point of failure in MRv1. In case of JobTracker failure, all running jobs are halted (http://wiki.apache.org/hadoop/JobTracker).
In YARN, the ResourceManager is not a single point of failure.
If you need MRv1, you can use the MapR distribution, which provides JobTracker high availability (http://www.mapr.com/resources/videos/demo-hadoop-jobtracker-failing-and-recovering-mapr-cluster).
JobTracker HA (high availability with an active and a standby instance) can be configured in the Cloudera Hadoop distribution. This feature is available from CDH 4.2.1 onwards; see the following link:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-High-Availability-Guide/cdh4hag_topic_3_1.html
The same can be configured in the Hortonworks distribution as well:
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_hdp1-system-admin-guide/content/sysadminguides_ha_chap2_5_5.html
In MR2, the master service is the ResourceManager, which is not a single point of failure.
Yes, the JobTracker is a single point of failure in MRv1. In case of NameNode failure (in an HA setup), the standby NameNode takes charge and acts as the active NameNode. MR2 introduced the ResourceManager concept: YARN can have a number of ResourceManagers, and if one fails, another takes charge. One ResourceManager is active and the others are in standby mode.
No, if the NameNode fails, the Hadoop framework itself does not go down; the framework and a NameNode failure are different things. The Hadoop framework is a layer across all nodes. If the NameNode goes down, the framework no longer knows where data should be stored or where free space is available, so it is not possible to store actual data.
The JobTracker coordinates with the NameNode to get the data to be processed, so when the NameNode fails, the JobTracker also stops working properly. The NameNode therefore has to be working first; in Hadoop this is known as the NameNode single point of failure.
The JobTracker is responsible for scheduling jobs and getting the data processed. If the JobTracker is not working, a client can still try to submit a job request, but there is nowhere for that job to be submitted or processed. So with a JobTracker failure, it is not possible to schedule jobs or process the data.
This was one of the biggest problems for big-data analysis. Hadoop 2.x resolves these two problems: with YARN (and HDFS HA), there is no longer a single point of failure at either level.
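For reference, here is a hedged sketch of the kind of ResourceManager HA settings involved (the hostnames rm1.example.com / rm2.example.com, the cluster id, and the ZooKeeper quorum are made-up placeholders), expressed through the Java Configuration API purely for illustration; in practice these properties live in yarn-site.xml.

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class RmHaConfigSketch {
        public static YarnConfiguration rmHaConf() {
            YarnConfiguration conf = new YarnConfiguration();
            // Enable HA and name the two ResourceManager instances.
            conf.setBoolean("yarn.resourcemanager.ha.enabled", true);
            conf.set("yarn.resourcemanager.cluster-id", "example-cluster");       // placeholder
            conf.set("yarn.resourcemanager.ha.rm-ids", "rm1,rm2");
            conf.set("yarn.resourcemanager.hostname.rm1", "rm1.example.com");     // placeholder
            conf.set("yarn.resourcemanager.hostname.rm2", "rm2.example.com");     // placeholder
            // ZooKeeper coordinates automatic failover between active and standby RMs.
            conf.set("yarn.resourcemanager.zk-address",
                    "zk1.example.com:2181,zk2.example.com:2181");                 // placeholder
            return conf;
        }
    }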

Who communicates with the NameNode in YARN?

Since the JobTracker from MapReduce 1 is replaced by the ApplicationMaster and the ResourceManager in YARN, I wonder who in YARN communicates with the NameNode to find out where the data is stored on the different DataNodes.
Is the ApplicationMaster doing so?
In YARN, the per-application ApplicationMaster is responsible for getting the information about the input splits from the NameNode. Later, when the task attempts are executed on the assigned nodes, the YarnChild fetches the respective split data from HDFS.
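To show the kind of metadata the NameNode provides for this, here is a hedged sketch (the path /data/input.txt is a placeholder) that asks HDFS for the block locations of a file. The MapReduce client and ApplicationMaster rely on this sort of information when computing input splits and requesting containers close to the data.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/input.txt")); // placeholder path
            // Each BlockLocation tells us which DataNode hosts hold a given range
            // of the file -- the basis for data-local container requests.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }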
