YARN and Hadoop

I had a couple of questions regarding job submission to HDFS and the YARN architecture in Hadoop:
So in the Hadoop ecosystem you have one NameNode per cluster, which can manage any number of DataNodes that store your data. In classic MapReduce, when you submit a job to Hadoop, the JobTracker on the NameNode picks up each job and assigns it to the TaskTracker on the DataNode where the file's blocks are present.
So my question is: how do the components of YARN work together with HDFS?
YARN consists of the NodeManager and the ResourceManager. Of these two components: does the NodeManager run on every DataNode, while the ResourceManager runs on the NameNode of each cluster? So when the TaskTracker (on a DataNode) gets assigned a task from the JobTracker (on the NameNode), the NodeManager on that DataNode will create a container, which requests resources from the ResourceManager on the NameNode. In other words, the ResourceManager and NodeManager only come into play when a TaskTracker on a DataNode gets a job from the JobTracker on the NameNode, at which point the NodeManager asks the ResourceManager for the resources needed to execute the job. Is this correct?

You are partially correct. YARN was introduced to relieve the JobTracker of the burden of doing both scheduling and monitoring. With YARN you don't have a JobTracker or TaskTrackers at all. The work of the JobTracker is now done by the ResourceManager, which has two main components: the Scheduler (which allocates resources to applications) and the ApplicationsManager (which accepts job submissions and restarts the ApplicationMaster in case of failure). Each application then has its own ApplicationMaster, which negotiates containers (in which the job runs) from the Scheduler.
The NodeManager runs on every slave/data node. The ResourceManager may or may not be installed on the same machine as the NameNode. On a large cluster you usually separate the master daemons, so that the load doesn't fall on a single machine.
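To make the negotiation concrete, here is a minimal Scala sketch of how an ApplicationMaster talks to the ResourceManager's Scheduler through Hadoop's AMRMClient API. The memory and vcore numbers are illustrative assumptions, not values from the question:

import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
import org.apache.hadoop.yarn.api.records.{Priority, Resource}
import org.apache.hadoop.yarn.conf.YarnConfiguration

// This code runs inside the ApplicationMaster process, which YARN has already launched.
val conf = new YarnConfiguration()
val amrmClient = AMRMClient.createAMRMClient[ContainerRequest]()
amrmClient.init(conf)
amrmClient.start()

// Step 1: register this ApplicationMaster with the ResourceManager.
amrmClient.registerApplicationMaster("", 0, "")

// Step 2: ask the Scheduler for a container (1 GB / 1 vcore here, as an example).
val capability = Resource.newInstance(1024, 1)
amrmClient.addContainerRequest(
  new ContainerRequest(capability, null, null, Priority.newInstance(0)))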

Related

How do distributed computations on RDDs work in Apache Spark?

If I have a very simple Spark program that simply does:
val rdd2 = sc.textFile("hdfs:///text.txt")
println(rdd2.count)
When I submit this Spark program to YARN in yarn-cluster mode:
The YARN ResourceManager will negotiate a container and launch the Spark ApplicationMaster.
Then the ApplicationMaster will register itself with the ResourceManager and ask for resources.
Once the resource specifications are obtained from the ResourceManager, the ApplicationMaster will launch the containers on the NodeManagers.
My question is: since the data in Hadoop is distributed across multiple machines (let's say that text.txt in the example above is divided into 3 blocks):
Does an ApplicationMaster get launched on each and every machine that has a text.txt block?
Is the Spark executor software that is already installed on each and every node of the cluster, or does the executor get instantiated in the container that is launched by the ApplicationMaster on a node?
Good question, but probably too big for this forum.
To begin, your assumptions are generally correct but the timing isn't right.
The YARN ResourceManager will negotiate a container and launch the Spark ApplicationMaster.
Then the ApplicationMaster will register itself with the ResourceManager and ask for resources.
Once the resource specifications are obtained from the ResourceManager, the ApplicationMaster will launch the containers on the NodeManagers.
If you're using sc (the SparkContext) then this has already happened. There can be additional work with the ResourceManager if you add or remove executors, but the SparkContext only exists after initial resources are allocated.
Does an ApplicationMaster get launched on each and every machine that has a text.txt block?
No, but workers or executors could be launched on any machine that has a block. Alternatively, they could be launched on just a single machine. However, every block is (potentially) read by a worker. In the HDFS case, workers can read a block from anywhere on the cluster.
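As an illustration, here is a small Scala sketch (assuming the same sc and text.txt from the question) that prints which hosts HDFS reports for each partition. Spark uses these preferred locations as locality hints when placing tasks, not as hard placement requirements:

// Each HDFS block of text.txt becomes one partition of the RDD.
val rdd2 = sc.textFile("hdfs:///text.txt")
rdd2.partitions.foreach { p =>
  // preferredLocations lists the datanodes holding this partition's block
  println(s"partition ${p.index} -> ${rdd2.preferredLocations(p).mkString(", ")}")
}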
Is the Spark executor software that is already installed on each and every node of the cluster, or does the executor get instantiated in the container that is launched by the ApplicationMaster on a node?
It can be installed on the nodes, or it can be shipped to the nodes at run time. The Spark executor is just a jar. I've seen it put into HDFS itself, or placed on each machine in the cluster as a local resource. You could place it in any location that the worker can access.
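For example, here is a minimal Scala sketch of the "jar in HDFS" approach. The spark.yarn.archive path below is a hypothetical location, not one from the answer, and in practice this setting is usually passed at submit time rather than in code:

import org.apache.spark.{SparkConf, SparkContext}

// Ship the Spark jars from HDFS instead of relying on a per-node install.
// (In yarn-cluster mode this is typically set via --conf at submit time.)
val conf = new SparkConf()
  .setAppName("count-example")
  .set("spark.yarn.archive", "hdfs:///apps/spark/spark-libs.zip") // hypothetical path
val sc = new SparkContext(conf)

println(sc.textFile("hdfs:///text.txt").count())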

Hadoop - Can the name node execute tasks?

Is it possible for the name node to execute tasks? By default, the tasks execute on the data nodes of the cluster.
Assuming you are asking about MapReduce...
With YARN, MapReduce tasks execute in containers coordinated by an application master, not in the namenode and not within the datanode process. The containers are managed by NodeManagers and monitored by the ResourceManager.
NodeManagers (which host the containers, including application masters) are commonly installed alongside the datanode process. You can install one on the namenode host as well, which would let it run tasks, but you really shouldn't in a production environment.

Difference between Hadoop MR1, YARN, and MR2?

Can someone please tell me the difference between MR1, YARN, and MR2?
My understanding is that MR1 has the components below:
NameNode,
secondary NameNode,
DataNode,
JobTracker,
TaskTracker
YARN:
NodeManager,
ResourceManager
Does YARN consist of MR1 or MR2? (Or are MR2 and YARN the same?)
Sorry if I'm asking a basic-level question.
MRv1 uses the JobTracker to create and assign tasks to TaskTrackers, which can become a resource bottleneck when the cluster scales out far enough (usually around 4,000 nodes).
MRv2 (aka YARN, "Yet Another Resource Negotiator") has a ResourceManager for each cluster, and each data node runs a NodeManager. In MRv2, the functions of the JobTracker have been split between three services:
The ResourceManager is a persistent YARN service that receives and runs applications (a MapReduce job is an application) on the cluster. It contains the Scheduler, which, as before, is pluggable.
The MapReduce-specific capabilities of the JobTracker have been moved into the MapReduce ApplicationMaster, one of which is started to manage each MapReduce job and terminated when the job completes.
The JobTracker's function of serving information about completed jobs has been moved to the JobHistory Server.
Finally, the TaskTracker has been replaced with the NodeManager, a YARN service that manages resources and deployment on a host. It is responsible for launching containers, each of which can house a map or reduce task.
YARN is a generic platform on which any form of distributed application can run, while MR2 is one such distributed application: the MapReduce framework running on top of YARN.
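To illustrate "generic platform", here is a minimal Scala sketch using Hadoop's YarnClient API: any client, not just MapReduce, starts by asking the ResourceManager for a new application this way (error handling and the actual submission context are omitted):

import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Connect to the ResourceManager configured in yarn-site.xml.
val yarnClient = YarnClient.createYarnClient()
yarnClient.init(new YarnConfiguration())
yarnClient.start()

// Ask for a new application id; MapReduce, Spark, etc. all begin like this.
val app = yarnClient.createApplication()
println(s"Granted application id: ${app.getNewApplicationResponse.getApplicationId}")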

Is JobTracker a single point of failure too (besides NameNode) in Hadoop?

I am new to Hadoop. In Hadoop, I know that when the NameNode fails the entire Hadoop framework goes down, so it's a single point of failure. Is it the same for the JobTracker? Because if the JobTracker goes down, there would be no daemon to contact the NameNode after a job submission, and also no point in running the TaskTrackers. How is this handled exactly?
Yes, the JobTracker is a single point of failure in MRv1. In case of a JobTracker failure, all running jobs are halted (http://wiki.apache.org/hadoop/JobTracker).
In YARN, the ResourceManager is not a single point of failure.
If you need MRv1, you can use the MapR distribution, which provides JobTracker high availability (http://www.mapr.com/resources/videos/demo-hadoop-jobtracker-failing-and-recovering-mapr-cluster).
JobTracker HA (high availability, using active and standby JobTrackers) can be configured in the Cloudera Hadoop distribution. See the following link; this feature is available from CDH 4.2.1 onwards:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-High-Availability-Guide/cdh4hag_topic_3_1.html
The same can be configured in the Hortonworks distribution:
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_hdp1-system-admin-guide/content/sysadminguides_ha_chap2_5_5.html
In MR2 the master service is the ResourceManager, which is not a single point of failure.
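As a quick way to check this on a given cluster, here is a small Scala sketch that reads the ResourceManager HA flag from yarn-site.xml. It only inspects configuration, and assumes the Hadoop YARN client libraries are on the classpath:

import org.apache.hadoop.yarn.conf.YarnConfiguration

// Reads yarn-site.xml from the classpath and reports whether RM HA is on.
val conf = new YarnConfiguration()
val haEnabled = conf.getBoolean(
  YarnConfiguration.RM_HA_ENABLED,       // "yarn.resourcemanager.ha.enabled"
  YarnConfiguration.DEFAULT_RM_HA_ENABLED)
println(s"ResourceManager HA enabled: $haEnabled")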
Yes, the JobTracker is a single point of failure in MRv1. (Note that the secondary NameNode only checkpoints the NameNode's metadata; it is not a hot standby that takes over on NameNode failure.) In MRv2 the ResourceManager concept was introduced. YARN can run a number of ResourceManagers: one ResourceManager is active and the others are in standby mode, and if the active one fails, a standby takes over.
No, a NameNode failure does not mean the Hadoop framework itself goes down; the framework is a layer running on all nodes, and a NameNode failure is a different thing. But if the NameNode goes down, the framework no longer knows where data is stored or where free space is available, so it is not possible to store or retrieve the actual data.
The JobTracker coordinates with the NameNode to locate the data to be processed, so when the NameNode fails, the JobTracker cannot work properly either. The NameNode has to be healthy first; this is what is meant by the NameNode being a single point of failure in Hadoop.
The JobTracker is responsible for scheduling jobs and for processing the data. If the JobTracker is not working, a client can still submit a job request, but there is nowhere for that job to be submitted and nothing to process it. So on JobTracker failure it is not possible to schedule jobs or process data.
These were the biggest problems in big data analysis. Hadoop 2.x resolved both of them: with HDFS NameNode HA and YARN ResourceManager HA, there is no longer a single point of failure at either level.

Who communicates with the NameNode in YARN?

Since the JobTracker from MapReduce 1 is replaced by the ApplicationMaster and the ResourceManager in YARN, I wonder: who in YARN communicates with the NameNode to find out where the data is stored on the different DataNodes?
Is the ApplicationMaster doing so?
In YARN, the per-application ApplicationMaster is responsible for getting the information about the input splits from the NameNode. Later, when the task attempts are executed on the assigned nodes, the YarnChild fetches the respective splits from HDFS.
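For reference, here is a Scala sketch of the kind of block-location lookup that happens against the NameNode when input splits are computed. It reuses text.txt from the earlier example and goes through the public FileSystem API rather than the MapReduce-internal code paths:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Ask the NameNode which datanodes hold each block of the file.
val fs = FileSystem.get(new Configuration())
val status = fs.getFileStatus(new Path("hdfs:///text.txt"))
fs.getFileBlockLocations(status, 0, status.getLen).foreach { block =>
  println(s"offset ${block.getOffset} -> hosts: ${block.getHosts.mkString(", ")}")
}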
