difference between hadoop mr1 and yarn and mr2? - hadoop

Can someone please explain the difference between MR1, YARN, and MR2?
My understanding is that MR1 has the following components:
NameNode,
Secondary NameNode,
DataNode,
JobTracker,
TaskTracker
YARN has:
NodeManager
ResourceManager
Does YARN consist of MR1 or MR2 (or are MR2 and YARN the same thing)?
Sorry if this is a basic question.

MRv1 uses the JobTracker to create tasks and assign them to TaskTrackers, which can become a resource bottleneck when the cluster scales out far enough (usually around 4,000 nodes).
MRv2 (aka YARN, "Yet Another Resource Negotiator") has a ResourceManager for each cluster, and each data node runs a NodeManager. In MRv2, the functions of the JobTracker have been split between three services.
The ResourceManager is a persistent YARN service that receives and runs applications (a MapReduce job is an application) on the cluster. It contains the scheduler, which, as previously, is pluggable.
The MapReduce-specific capabilities of the JobTracker have been moved into the MapReduce Application Master, one of which is started to manage each MapReduce job and terminated when the job completes.
The JobTracker function of serving information about completed jobs has been moved to the JobHistory Server.
The TaskTracker has been replaced with the NodeManager, a YARN service that manages resources and deployment on a host. It is responsible for launching containers, each of which can house a map or reduce task.
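The split described above can be pictured with a toy model. This is only an illustrative sketch in plain Python, not Hadoop code: the class and method names are made up for the example, and as a simplification the ResourceManager object itself hands the finished job to the history server.

```python
class JobHistoryServer:
    """Toy stand-in: keeps information about completed jobs."""

    def __init__(self):
        self.completed = []

    def record(self, job_id):
        self.completed.append(job_id)


class ApplicationMaster:
    """Toy stand-in: one instance per MapReduce job, alive only while the job runs."""

    def __init__(self, job_id):
        self.job_id = job_id


class ResourceManager:
    """Toy stand-in: persistent service that accepts applications."""

    def __init__(self, history_server):
        self.history_server = history_server
        self.running = {}  # job_id -> ApplicationMaster

    def submit(self, job_id):
        # A new ApplicationMaster is started for every submitted job.
        self.running[job_id] = ApplicationMaster(job_id)

    def complete(self, job_id):
        # The job's ApplicationMaster goes away when the job completes;
        # only the history entry remains (simplified hand-off).
        del self.running[job_id]
        self.history_server.record(job_id)
```

Submitting two jobs and completing one leaves a single ApplicationMaster in `running` and one entry in the history server, which mirrors the per-job lifetime described in the answer.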

YARN is a generic platform on which any form of distributed application can run, while MR2 is one such distributed application: it runs the MapReduce framework on top of YARN.

Related

Understanding mapreduce.framework.name wrt Hadoop

I am learning Hadoop and came to know that there are two versions of the framework, viz. Hadoop1 and Hadoop2.
If my understanding is correct, in Hadoop1 the execution environment is based on two daemons, viz. TaskTracker and JobTracker, whereas in Hadoop2 (aka YARN) the execution environment is based on "new daemons", viz. ResourceManager, NodeManager, and ApplicationMaster.
Please correct me if this is not correct.
I came to know of the following configuration parameter:
mapreduce.framework.name
Possible values it can take: local, classic, yarn
I don't understand what they actually mean; for example, if I install Hadoop 2, how can it have the old execution environment (with TaskTracker and JobTracker)?
Can anyone help me understand what these values mean?
yarn stands for MR version 2.
classic is for MR version 1.
local is for local runs of MR jobs.
MR V1 and MR V2 differ only in how resources are managed and how a job is executed. The current Hadoop release is capable of both (and also of a lightweight local mode). When you set the value to yarn, you are simply instructing the framework to execute the job the YARN way. Similarly, when you set it to local, you are telling the framework that there is no cluster for execution and that everything runs within a single JVM. MR V1 and MR V2 are not different infrastructures; only the way the job is executed changes.
JobTracker, TaskTracker, etc. are all just daemon processes, which are started when needed and stopped afterwards.
MRv1 uses the JobTracker to create and assign tasks to data nodes. This was found to be too inefficient for large clusters, which led to YARN.
MRv2 (aka YARN, "Yet Another Resource Negotiator") has a ResourceManager for each cluster, and each data node runs a NodeManager. For each job, one slave node acts as the Application Master, monitoring resources, tasks, etc.
Local mode is provided to simulate and debug MR applications within a single machine/JVM.
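As a small illustration of the three values discussed above, the sketch below builds the XML property entry for mapreduce.framework.name as it would typically appear in a Hadoop configuration file such as mapred-site.xml. The helper name framework_property is made up for this example, and the snippet only generates text; it does not talk to Hadoop.

```python
import xml.etree.ElementTree as ET

# The three values named in the question.
VALID_FRAMEWORKS = ("local", "classic", "yarn")


def framework_property(value):
    """Return a <configuration> fragment setting mapreduce.framework.name."""
    if value not in VALID_FRAMEWORKS:
        raise ValueError(f"expected one of {VALID_FRAMEWORKS}, got {value!r}")
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "mapreduce.framework.name"
    ET.SubElement(prop, "value").text = value
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(configuration, encoding="unicode")
```

Calling framework_property("yarn") yields a single property element whose value is yarn; any value outside the three listed raises ValueError.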
EDIT: Based on comments
jps (Java Virtual Machine Process Status) is a JVM tool which, according to the official page:
The jps tool lists the instrumented HotSpot Java Virtual Machines
(JVMs) on the target system. The tool is limited to reporting
information on JVMs for which it has the access permissions.
So,
jps is not a big data tool but a Java tool that reports on JVMs; it does not divulge any information about what is running inside a JVM.
It only lists the JVMs it has access to, which means some JVMs may remain undetected.
Keeping the above points in mind, you may have observed that the jps command gives different results depending on the Hadoop deployment mode:
Local (or Standalone) mode: There are no daemons and everything runs in a single JVM.
Pseudo-Distributed mode: Each daemon (NameNode, DataNode, etc.) runs in its own JVM on a single host.
Distributed mode: Each daemon runs in its own JVM across a cluster of hosts.
Hence the processes may or may not run in the same JVM, and the jps output differs accordingly.
In distributed mode, the MR v2 framework runs in its default mode, i.e. yarn; hence you see the following daemons running, including the YARN-specific ones:
Namenode
Datanode
ResourceManager
NodeManager
Apache Hadoop 1.x (MRv1) consists of the following daemons:
Namenode
Datanode
Jobtracker
Tasktracker
Note that NameNode and DataNode are common to both lists because they are HDFS-specific daemons, while the other two in each list are specific to YARN and MR v1 respectively.
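To make the grouping above concrete, here is a minimal sketch that sorts the process names from jps output into HDFS, YARN, and MRv1 daemons. It assumes the usual "pid name" line format of jps; the function name classify_jps and the sample PIDs used with it are made up for illustration.

```python
# Daemon names as discussed in this thread.
HDFS_DAEMONS = {"NameNode", "DataNode", "SecondaryNameNode"}
YARN_DAEMONS = {"ResourceManager", "NodeManager"}
MRV1_DAEMONS = {"JobTracker", "TaskTracker"}


def classify_jps(output):
    """Group the process names in jps output by the layer they belong to."""
    groups = {"hdfs": [], "yarn": [], "mrv1": [], "other": []}
    for line in output.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank lines and lines without a process name
        name = parts[1].strip()
        if name in HDFS_DAEMONS:
            groups["hdfs"].append(name)
        elif name in YARN_DAEMONS:
            groups["yarn"].append(name)
        elif name in MRV1_DAEMONS:
            groups["mrv1"].append(name)
        else:
            groups["other"].append(name)
    return groups
```

On a node whose output contains both NameNode and ResourceManager, the "hdfs" and "yarn" groups are filled and "mrv1" stays empty, which is one quick way to tell a YARN deployment from an MRv1 one.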

YARN and Hadoop

I had a couple of questions regarding job submission to HDFS and the YARN architecture in Hadoop:
So in the Hadoop ecosystem you have one NameNode for each cluster which can contain any number of data nodes that store your data. When you submit a job to Hadoop, the job tracker on the NameNode will pick each job and assign it to the task tracker on which the file is present on the data node.
So my question is: how do the components of YARN work together with HDFS?
YARN consists of the NodeManager and the ResourceManager. Does the NodeManager run on every DataNode, and does the ResourceManager run on the NameNode of each cluster? So when the task tracker (in each DataNode) gets assigned a task from the job tracker (in the NameNode), the NodeManager in a specific data node will create a container, which will request resources from the ResourceManager in the NameNode. So the ResourceManager and NodeManager only come into play when a task tracker in a data node gets a job from the job tracker in the NameNode, at which point the NodeManager asks the ResourceManager for resources for the job to be executed. Is this correct?
You are partially correct. YARN was introduced to relieve the JobTracker, which did both scheduling and monitoring. So with YARN you don't have any JobTracker or TaskTracker. The work done by the JobTracker is now done by the ResourceManager, which has two main components: the Scheduler (allocates resources to applications) and the ApplicationsManager (accepts job submissions and restarts the ApplicationMaster in case of failure). Each application now has an ApplicationMaster, which negotiates containers (where the job will run) from the Scheduler for running the application.
The NodeManager runs on every slave node/data node. The ResourceManager may or may not be installed on the machine where the NameNode runs. For a large cluster we usually need to separate the masters so that the load doesn't fall on a single machine.
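The negotiation between an ApplicationMaster and the Scheduler can be sketched with a toy allocator. This is a hypothetical, heavily simplified model (fixed container slots per node, first-fit in node-name order), not how any real YARN scheduler decides; the class and node names are invented for the example.

```python
class Scheduler:
    """Toy scheduler: hands out free container slots, node by node."""

    def __init__(self, free_containers_per_node):
        # Copy so the caller's dict is not modified.
        self.free = dict(free_containers_per_node)

    def allocate(self, count):
        """Grant up to `count` containers; return the node of each grant."""
        granted = []
        for node in sorted(self.free):
            while self.free[node] > 0 and len(granted) < count:
                self.free[node] -= 1
                granted.append(node)
        return granted


class ApplicationMaster:
    """Toy per-application master that asks the scheduler for containers."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.containers = []

    def request(self, count):
        self.containers.extend(self.scheduler.allocate(count))
        return self.containers
```

With two free slots on node1 and one on node2, a request for two containers is served entirely from node1, and a later, larger request receives only the single slot left on node2.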

MapReduce 2 without YARN

Given that YARN is the better option for running MapReduce 2, is it possible to run MR2 without YARN?
I tried using MR2 but it runs with YARN.
MRv2 is actually YARN! So no, you can't run MapReduce 2 jobs without YARN!
Official documentation:
Apache Hadoop NextGen MapReduce (YARN)
MapReduce has undergone a complete overhaul in hadoop-0.23 and we now
have, what we call, MapReduce 2.0 (MRv2) or YARN.
The fundamental idea of MRv2 is to split up the two major
functionalities of the JobTracker, resource management and job
scheduling/monitoring, into separate daemons. The idea is to have a
global ResourceManager (RM) and per-application ApplicationMaster
(AM). An application is either a single job in the classical sense of
Map-Reduce jobs or a DAG of jobs.

What is the difference between a mapreduce application and a yarn application?

A cluster that runs MapReduce 2 doesn't have a job tracker; instead, it is split into two separate components, a resource manager and a job manager. However, these things are transparent to the user, who doesn't need to know whether the cluster is running MapReduce 1 or 2 when submitting a MapReduce job.
The thing I cannot quite understand is Yarn application. How is it different from a regular mapreduce application? What's the advantage of running a mapreduce job as a yarn application, etc? Could someone shed some light on that for me?
MR1 has a JobTracker and TaskTrackers, which take care of MapReduce applications.
In MR2, Apache separated the management of the map/reduce process from the cluster's resource management by using YARN. YARN is a better resource manager than the one we had in MR1. It also enables versatility. MR2 is built on top of YARN.
Apart from MapReduce, we can run applications like Spark, Storm, HBase, Tez, etc. on top of YARN, which we cannot do using MR1.
The following is the architecture for MR1 and MR2.
HDFS <---> MR
HDFS <----> Yarn <----> MR

who communicates with the namenode in yarn?

Since the JobTracker in MapReduce1 is replaced by the Application Master and Resource Manager in YARN, I wonder who communicates with the NameNode in YARN to find out where the data is stored on the different DataNodes.
Is the Application Master doing so?
In YARN, the per-application ApplicationMaster is responsible for getting the information about the input splits from the NameNode. Later, when the task attempts are executed on the assigned nodes, the YarnChild fetches the respective splits from HDFS.
