Hadoop YARN - performance of LocalJobRunner vs. cluster deployed job

I'm doing some tests with M/R jobs running on a 2-node Hadoop 2.2.0 cluster. One thing I would like to understand is the performance considerations of running a job in local mode (not managed by the ResourceManager) versus running it on YARN. The tests I made show a job runs much, much faster when it is executed via LocalJobRunner than when it is managed by YARN. When setting up the cluster I followed the steps described here: http://raseshmori.wordpress.com/2012/10/14/install-hadoop-nextgen-yarn-multi-node-cluster/ . Perhaps there is some configuration the guide forgot to mention?
Thanks!

You'd run LocalJobRunner for tests and small examples. You'd use the cluster when you need to process amounts of data that would justify using Hadoop in the first place (a.k.a. "big data").
When you run a small example, the overhead of running things distributed overwhelms the benefits of parallelization.

Arnon is right. I found that in one of my use cases, running with LocalJobRunner is much faster than running on YARN. LocalJobRunner runs the map processes in-process, on the local machine; jobs are not submitted to the cluster, so map tasks are not scheduled across multiple machines. Hence, LocalJobRunner should be used for unit testing the code, and that's it. For all other practical purposes, use YARN.
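As a minimal sketch of how the two modes are selected (the class name, paths, and job name here are made up for illustration; the property itself, mapreduce.framework.name, is the standard Hadoop 2 switch), the same driver can target either LocalJobRunner or YARN:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class FrameworkSwitch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // "local" runs the whole job in-process via LocalJobRunner;
            // "yarn" submits it to the ResourceManager instead.
            conf.set("mapreduce.framework.name", "local");
            Job job = Job.getInstance(conf, "framework-switch-demo");
            job.setJarByClass(FrameworkSwitch.class);
            // The identity map/reduce defaults are enough for a smoke test.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

In local mode there are no containers to request, schedule, and launch, which is a large part of why a toy job finishes so much faster there.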

Related

Is there any Difference between YARN and MR2?

I am a bit confused between YARN and MR2. Are they the same? I have seen YARN described as a platform for running distributed applications in some documentation, and MapReduce described as such a distributed application. What is meant by that? Is that a MapReduce application built by a programmer (e.g., a word count program)?
Again, I have seen phrases like 'writing your own YARN application'. What is meant by that? Which category does a word count program belong to: a YARN application or an MR2 application?
Please correct the question if it's not up to Stack Overflow standards; I was trying to express all my doubts.
Firstly, YARN was introduced from Hadoop 2 onward. In Hadoop 1, a single JobTracker managed resources and job scheduling for the entire cluster. With the introduction of YARN (Yet Another Resource Negotiator), cluster resource management is taken care of by YARN.
You can think of YARN as sitting above HDFS, with the different data processing components, like MapReduce and others, running on top of YARN.
In Hadoop 1, users had the option of writing MapReduce programs in Java, in Python or other scripting languages using Streaming, or using Pig, a data transformation language. Regardless of which method was used, all of them fundamentally relied on the MapReduce processing model to run.
YARN supports multiple processing models in addition to MapReduce. One of the most significant benefits of this is that we are no longer limited to working with the often I/O-intensive, high-latency MapReduce framework.
Now to your question: 'writing your own YARN application', what is meant by that?
On YARN you can write any kind of distributed application. MR2 is one such distributed application, and it is responsible for running the MapReduce framework. So a word count program belongs to MR2, the application responsible for running the MapReduce framework.
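To make that concrete, here is the classic word count in the standard Hadoop 2 MapReduce API (essentially the stock example, sketched from memory): notice that nothing in it talks to YARN directly; it only uses MapReduce interfaces, which is exactly why it is an MR2 application rather than a YARN application.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Emits (word, 1) for every token in the input line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Sums the counts emitted for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

A 'YARN application', by contrast, would implement its own ApplicationMaster against the YARN client APIs; MR2 ships that part for you.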
YARN and MR2 are related, but not the same.
MapReduce is a computational approach, and in the Hadoop world, MapReduce is also a name synonymous with Hadoop 1.0; in Hadoop 1.0, MapReduce provided both resource management and the execution framework for running "MapReduce" (1.0) applications.
With Hadoop 2.0, resource management and the execution framework were split into two separate entities: YARN (Yet Another Resource Negotiator), a general resource management system for cluster compute resources, and MapReduce 2.0 (abbreviated as MR2), an execution engine.
MR2 is a YARN application.
You can write a "MapReduce" (2.0) based application that will run on YARN.
If you are on a Hadoop 2.0 cluster and use something like Pig or Hive with the default execution engine, MR2, they will run as MR2 applications on YARN.
Part of the reasoning for this split in Hadoop 2.0 was to allow for other execution engines beyond MR2; today these also include Apache Tez and Apache Spark, which can run on YARN.

Hadoop benchmark tests

I am currently performing some tests in a small cluster (4 nodes) and I would like to know if there are any best practices regarding the node that actually launches the test scripts in the cluster. I am using test tools like hive-testbench and HiBench.
Should there be any impact on performance if I run a Spark script on the same machine that is hosting the Thrift server, or should I add another machine to the cluster just to launch the test/benchmark scripts?
Thanks in advance.

Difference between local and yarn in hadoop

I have been trying to install Hadoop on a single node following the instructions written here. There are two sets of instructions, one for running a MapReduce job locally, and another for YARN.
What is difference between running a MapReduce job locally and running on YARN?
If you use local, the map and reduce tasks run in the same JVM; usually this mode is used when we want to debug the code. Whereas if you use yarn, the ResourceManager introduced in MRv2 comes into play, and mappers and reducers run on different nodes, or in different JVMs within the same node if it is pseudo-distributed mode.
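One way to flip between the two at submit time, without recompiling, is to let ToolRunner parse the generic -D options (a sketch: the driver name is made up, and the identity map/reduce defaults stand in for real job logic):

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class ModeAwareDriver extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            // getConf() already contains any -D overrides parsed by ToolRunner.
            Job job = Job.getInstance(getConf(), "mode-aware-job");
            job.setJarByClass(ModeAwareDriver.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new ModeAwareDriver(), args));
        }
    }

Submitted with -Dmapreduce.framework.name=local the job runs in one JVM; with -Dmapreduce.framework.name=yarn it goes to the ResourceManager.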

One worker with two executors vs. one worker per executor in Spark

I am using Spark 1.6 with the standalone resource manager in client mode. Now that running multiple executors per worker is supported in Spark, can anyone tell me the pros and cons of each layout, and which one should be preferred for a production environment?
Moreover, when Spark comes with pre-built binaries for hadoop-2.x, why do we need to set up another Hadoop cluster to run it in yarn mode? What's the point of packing those jars into Spark? And what's the point of using YARN when the flexibility of multiple executors per worker is available in standalone mode?
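For reference, this is roughly how the multiple-executors-per-worker layout is requested on a standalone cluster (a sketch: the master URL and resource sizes are placeholders, and the right sizing depends on your workers):

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ExecutorLayoutDemo {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setMaster("spark://master:7077")   // placeholder master URL
                .setAppName("executor-layout-demo")
                .set("spark.executor.cores", "2")   // cores per executor
                .set("spark.executor.memory", "2g") // memory per executor
                .set("spark.cores.max", "8");       // total cores for this app
            // On an 8-core worker, the master can now start up to 4 executors
            // on that single worker instead of one big executor.
            JavaSparkContext sc = new JavaSparkContext(conf);
            System.out.println(sc.parallelize(Arrays.asList(1, 2, 3, 4))
                    .reduce((a, b) -> a + b));
            sc.stop();
        }
    }

If spark.executor.cores is left unset in standalone mode, an executor grabs all available cores on a worker, which is why older setups ran multiple workers per machine to get the same effect.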

Can the same code be used for both Hadoop and YARN?

I have been thinking about this question for a while now. I have been trying to compare the performance of Hadoop 1 vs. YARN by running the basic word count example. I am still unsure how the same .jar file can be used to execute on both frameworks. As far as I understand, YARN has a different set of APIs which it uses to establish a connection with the ResourceManager, create an ApplicationMaster, etc.
So if I develop an application (.jar), can it be run on both frameworks without any change in code?
Also, what could be meaningful parameters to differentiate Hadoop 1 vs. YARN for a particular application?
Ok, let's clear up some terms here.
Hadoop is the umbrella system that contains the various components needed for distributed storage and processing. I believe the term you're looking for when you say hadoop 1 is MapReduce v1 (MRv1).
MRv1 is a component of Hadoop that includes the job tracker and task trackers. It only relies on HDFS.
YARN is a component of Hadoop that abstracts out the resource management part of MRv1.
MRv2 is the mapreduce application rewritten to run on top of YARN.
So when you're asking if hadoop 1 is interchangeable with YARN, you're probably actually asking if MRv1 is interchangeable with MRv2. And the answer is generally, yes. The Hadoop system knows how to run the same mapreduce application on both mapreduce platforms.
Adding to climbage's answer:
Hadoop Version 1
The JobTracker is responsible for resource management (managing the slave nodes); its major functions involve:
tracking resource consumption/availability
job life-cycle management: scheduling individual tasks of the job, tracking progress, and providing fault tolerance for tasks.
Issues with Hadoop v1
The JobTracker is responsible for all spawned MR applications, so it is a single point of failure: if the JobTracker goes down, all applications in the cluster are killed. Moreover, if the cluster has a large number of applications, the JobTracker becomes a performance bottleneck. To address these issues of scalability and job management, Hadoop v2 was released.
Hadoop v2
The fundamental idea of YARN is to split the two major responsibilities of the Job-Tracker—that is, resource management and job scheduling/monitoring—into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM). The ResourceManager and per-node slave, the NodeManager (NM), form the new, and generic, operating system for managing applications in a distributed manner.
To interact with the new resource management and scheduling layer, a Hadoop YARN MapReduce application (MRv2) was developed; MRv2 changes nothing about the MapReduce programming API.
Application programmers will see no difference between MRv1 and MRv2; MRv2 is fully backward compatible. So yes, an application (.jar) can be run on both frameworks without any change in code.
MapReduce was previously integrated into Hadoop Core as the only API for processing data in HDFS. In Hadoop v2 it runs as a separate application, and Hadoop v2 allows other application programming frameworks (e.g., MPI) to process HDFS data.
