Understanding mapreduce.framework.name wrt Hadoop - hadoop

I am learning Hadoop and came to know that there are two versions of the framework, viz. Hadoop1 and Hadoop2.
If my understanding is correct, in Hadoop1, the execution environment is based on two daemons viz TaskTracker and JobTracker whereas in Hadoop2 (aka yarn), the execution environment is based on "new daemons" viz ResourceManager, NodeManager, ApplicationMaster.
Please correct me if this is not correct.
I came to know of the following configuration parameter:
mapreduce.framework.name
possible values which it can take: local, classic, yarn
I don't understand what they actually mean; for example, if I install Hadoop 2, how can it have the old execution environment (which has TaskTracker, JobTracker)?
Can anyone help me understand what these values mean?

yarn stands for MR version 2.
classic is for MR version 1.
local is for local runs of MR jobs.
MR V1 and MR V2 are just about how resources are managed and how a job is executed. The current hadoop release is capable of both (and even of a local lightweight mode). When you set the value to yarn, you are simply instructing the framework to use the yarn way of executing the job. Similarly, when you set it to local, you are just telling the framework that there is no cluster for execution and it all runs within one JVM. It is not a different infrastructure for the MR V1 and MR V2 frameworks; it is just the way of job execution which changes.
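JobTracker, TaskTracker, etc. are daemon processes, each running in its own JVM, which must already be running on the cluster for the corresponding mode to work; the setting only selects which of them the job is submitted to.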
MRv1 uses the JobTracker to create tasks and assign them to task trackers. This was found to be too inefficient when dealing with large clusters, leading to yarn.
MRv2 runs on YARN ("Yet Another Resource Negotiator"), which has a ResourceManager for each cluster, and each worker node runs a NodeManager. For each job, a container on one of the worker nodes hosts the ApplicationMaster, which monitors resources, tasks, etc.
Local mode is provided to simulate and debug an MR application within a single machine/JVM.
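To make the three values concrete, here is a minimal sketch (not Hadoop code) that reads mapreduce.framework.name out of a mapred-site.xml style document and prints what the value selects. The sample XML, the MODES table and the helper names are my own illustration; falling back to local when the property is absent is an assumption based on the stock mapred-default.xml.

```python
import xml.etree.ElementTree as ET

# Illustrative mapred-site.xml content; the value shown is just an example.
MAPRED_SITE = """<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
"""

# What each value selects, paraphrasing the answers above.
MODES = {
    "local": "run the whole job inside a single JVM, no cluster daemons",
    "classic": "submit the job to an MRv1 JobTracker",
    "yarn": "submit the job to a YARN ResourceManager (MRv2)",
}


def framework_name(xml_text, default="local"):
    """Return the configured mapreduce.framework.name, or the default.

    The default of "local" is an assumption taken from mapred-default.xml.
    """
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == "mapreduce.framework.name":
            return (prop.findtext("value") or default).strip()
    return default


def describe(xml_text):
    """Return a one-line description of the configured execution mode."""
    name = framework_name(xml_text)
    return f"{name}: {MODES.get(name, 'unknown value')}"


if __name__ == "__main__":
    print(describe(MAPRED_SITE))
```

Changing the value element to local or classic changes only where the client sends the job; the job code itself stays the same.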
EDIT: Based on comments
jps (Java Virtual Machine Process Status) is a JVM tool which, according to the official page:
The jps tool lists the instrumented HotSpot Java Virtual Machines
(JVMs) on the target system. The tool is limited to reporting
information on JVMs for which it has the access permissions.
So,
jps is not a big data tool but a Java tool that reports on JVMs; it does not divulge any information about what is running within a JVM.
It only lists the JVMs it has access to, which means there may still be JVMs that remain undetected.
Keeping the above points in mind, the jps command emits different results depending on the hadoop deployment mode:
Local (or Standalone) mode: There are no daemons and everything runs on a single JVM.
Pseudo-Distributed mode: Each daemon(Namenode, Datanode etc) runs on its own JVM on a single host.
Distributed mode: Each daemon runs on its own JVM across a cluster of hosts.
Hence the processes may or may not run in the same JVM, and so the jps output will differ.
In distributed mode, the MR v2 framework works in its default mode, i.e. yarn; hence you see the yarn-specific daemons running:
Namenode
Datanode
ResourceManager
NodeManager
Apache Hadoop 1.x (MRv1) consists of the following daemons:
Namenode
Datanode
Jobtracker
Tasktracker
Note that NameNode and DataNode are common to both, because they are HDFS-specific daemons, while the other two in each list are specific to MR v1 and yarn respectively.
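The two daemon lists above can be turned into a small check on jps output. This is only a sketch: the grouping sets and function names are hypothetical, SecondaryNameNode is added to the HDFS group on my own account, and the sample listing is illustrative rather than taken from a real cluster.

```python
# Daemon names grouped by subsystem, following the lists above.
# SecondaryNameNode is included in the HDFS group as an extra assumption.
HDFS = {"NameNode", "DataNode", "SecondaryNameNode"}
MRV1 = {"JobTracker", "TaskTracker"}
YARN = {"ResourceManager", "NodeManager"}


def classify_jps(output):
    """Group process names from jps output ("<pid> <name>" per line)."""
    groups = {"hdfs": [], "mrv1": [], "yarn": [], "other": []}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines and entries without a class name
        name = parts[1]
        if name in HDFS:
            groups["hdfs"].append(name)
        elif name in MRV1:
            groups["mrv1"].append(name)
        elif name in YARN:
            groups["yarn"].append(name)
        else:
            groups["other"].append(name)
    return groups


def execution_framework(output):
    """Guess which execution framework the visible daemons belong to."""
    groups = classify_jps(output)
    if groups["yarn"]:
        return "yarn"
    if groups["mrv1"]:
        return "classic"
    return "no MapReduce daemons visible"


if __name__ == "__main__":
    sample = "775 DataNode\n1053 JobTracker\n1246 TaskTracker\n590 NameNode\n1365 Jps"
    print(classify_jps(sample))
    print(execution_framework(sample))
```

Keep in mind the caveat above: jps only lists JVMs it has permission to see, so an empty group does not prove the daemon is absent.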

Related

Hadoop - Can name node execute task?

Is it possible for the name node to execute tasks? By default, the tasks execute on the data nodes of the cluster.
Assuming you are asking about MapReduce...
With YARN, MapReduce tasks execute in containers launched by the NodeManagers and coordinated by the job's ApplicationMaster, not in the namenode and not within the datanode process. These are monitored by the ResourceManager.
NodeManager services are commonly installed alongside the datanode; you can install one on the namenode host as well, but you really shouldn't in a production environment.

Difference between local and yarn in hadoop

I have been trying to install Hadoop on a single node following the instructions written here. There are two sets of instructions, one for running a MapReduce job locally, and another for YARN.
What is difference between running a MapReduce job locally and running on YARN?
If you use local, the map and reduce tasks run in the same JVM. This mode is usually used when we want to debug the code. If we use yarn, the ResourceManager, which is part of MRv2, comes into play, and mappers and reducers run on different nodes, or in different JVMs within the same node if it is pseudo-distributed mode.

difference between hadoop mr1 and yarn and mr2?

Can someone please tell me what the difference is between MR1, YARN and MR2?
My understanding is that MR1 has the components below:
Namenode,
secondary name node,
datanode,
job tracker,
task tracker
Yarn
Node manager
Resource Manager
Does YARN consist of MR1 or MR2 (or are MR2 and YARN the same)?
Sorry if I asked a basic question.
MRv1 uses the JobTracker to create tasks and assign them to task trackers, which can become a resource bottleneck when the cluster scales out far enough (usually around 4,000 nodes).
MRv2 runs on YARN ("Yet Another Resource Negotiator"), which has a ResourceManager for each cluster and a NodeManager on each worker node. In MapReduce MRv2, the functions of the JobTracker have been split between three services. The ResourceManager is a persistent YARN service that receives and runs applications (a MapReduce job is an application) on the cluster. It contains the scheduler, which, as previously, is pluggable. The MapReduce-specific capabilities of the JobTracker have been moved into the MapReduce Application Master, one of which is started to manage each MapReduce job and terminated when the job completes. The JobTracker function of serving information about completed jobs has been moved to the JobHistory Server. The TaskTracker has been replaced with the NodeManager, a YARN service that manages resources and deployment on a host. It is responsible for launching containers, each of which can house a map or reduce task.
YARN is a generic platform for any form of distributed application to run on, while MR2 is one such distributed application, which runs the MapReduce framework on top of YARN.
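The split described above can be summarised as a lookup table. This is purely a restatement of the paragraph in code, with short responsibility labels that are my own wording; it does not reflect any Hadoop API.

```python
# Where each MRv1 responsibility lives in MRv2, as described above.
# The responsibility labels are informal summaries, not official terms.
MRV1_TO_MRV2 = {
    ("JobTracker", "resource management and scheduling"): "ResourceManager",
    ("JobTracker", "per-job task coordination"): "MapReduce ApplicationMaster",
    ("JobTracker", "history of completed jobs"): "JobHistory Server",
    ("TaskTracker", "launching tasks on a host"): "NodeManager",
}


def successors(mrv1_daemon):
    """Return the MRv2/YARN services that took over from an MRv1 daemon."""
    return sorted(
        {new for (old, _), new in MRV1_TO_MRV2.items() if old == mrv1_daemon}
    )


if __name__ == "__main__":
    for daemon in ("JobTracker", "TaskTracker"):
        print(daemon, "->", ", ".join(successors(daemon)))
```

Only the ApplicationMaster entry is MapReduce-specific; the ResourceManager and NodeManager belong to YARN and serve any application running on it.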

How to start multiple datanode processes on standalone hadoop setup(pseudo-distributed)

I am new to Hadoop. I have configured standalone hadoop setup on single VM running Ubuntu 13.03. After starting the hadoop processes using start-all.sh, jps command shows
775 DataNode
1053 JobTracker
962 SecondaryNameNode
1365 Jps
1246 TaskTracker
590 NameNode
As per my understanding Hadoop has started with 1 namenode and 1 datanode. I want to create multiple datanode processes i.e. multiple instances of datanode. Is there any way I can do that?
There are several ways to install and configure Hadoop.
Local (standalone) Mode - all Hadoop components run in a single Java process.
Pseudo-Distributed Mode - Hadoop runs all its components (datanode, tasktracker, jobtracker, namenode, ...) as separate Java processes. It serves as a simulation of a fully distributed installation, but it runs on the local machine only.
Distributed Mode - fully distributed installation. In short: some machines play the 'slave' role and contain the Datanode+Tasktracker components, and there is a server playing the 'master' role which contains the Namenode+JobTracker.
Back to your question: if you would like to run Hadoop on a single machine, you have the first two options. It is impossible to run it in fully distributed mode on a single node. Maybe you can do a workaround, but it is fundamentally nonsense. Hadoop was designed as a distributed system; the possibility to run it on a single machine serves IMHO for debug/trial purposes only.
For more details, follow the Hadoop documentation. I hope I answered your question.

can the same code be used for both hadoop and yarn

I have been thinking about this question for a while now. I have been trying to compare the performance of hadoop 1 vs yarn by running the basic word count example. I am still unsure how the same .jar file can be executed on both frameworks. As far as I understand, yarn has a different set of APIs which it uses to set up a connection with the resource manager, create an application master, etc.
So if I develop an application(.jar), can it be run on both the frameworks without any change in code?
Also what could be meaningful parameters to differentiate hadoop vs yarn for a particular application?
Ok, let's clear up some terms here.
Hadoop is the umbrella system that contains the various components needed for distributed storage and processing. I believe the term you're looking for when you say hadoop 1 is MapReduce v1 (MRv1)
MRv1 is a component of Hadoop that includes the job tracker and task trackers. It only relies on HDFS.
YARN is a component of Hadoop that abstracts out the resource management part of MRv1.
MRv2 is the mapreduce application rewritten to run on top of YARN.
So when you're asking if hadoop 1 is interchangeable with YARN, you're probably actually asking if MRv1 is interchangeable with MRv2. And the answer is generally, yes. The Hadoop system knows how to run the same mapreduce application on both mapreduce platforms.
Adding to climbage's answer:
HADOOP Version 1
The JobTracker is responsible for resource management (managing the slave nodes). Its major functions involve:
tracking resource consumption/availability
job life-cycle management: scheduling individual tasks of the job, tracking progress, providing fault tolerance for tasks.
Issues with Hadoop v1
The JobTracker is responsible for all spawned MR applications, so it is a single point of failure: if the JobTracker goes down, all applications in the cluster are killed. Moreover, if the cluster has a large number of applications, the JobTracker becomes the performance bottleneck. Hadoop v2 was released to address these issues of scalability and job management.
Hadoop v2
The fundamental idea of YARN is to split the two major responsibilities of the Job-Tracker—that is, resource management and job scheduling/monitoring—into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM). The ResourceManager and per-node slave, the NodeManager (NM), form the new, and generic, operating system for managing applications in a distributed manner.
To interact with the new resource management and scheduling, a Hadoop YARN MapReduce application was developed (MRv2). MRv2 does not change the MapReduce programming API.
Application programmers will see no difference between MRv1 and MRv2; MRv2 is fully backward compatible. Yes, an application (.jar) can be run on both frameworks without any change in code.
MapReduce was previously integrated in Hadoop Core as the only API to interact with data in HDFS. In Hadoop v2 it runs as a separate application, and Hadoop v2 allows other application programming frameworks, e.g. MPI, to process HDFS data.

Resources