What additional benefit does Yarn bring to the existing map reduce? - hadoop

Yarn differs in its infrastructure layer from the original map reduce architecture in the following way:
In YARN, the job tracker is split into two different daemons called the Resource Manager and the Node Manager (node specific). The Resource Manager only manages the allocation of resources to the different jobs, and it includes a scheduler that just takes care of scheduling jobs without worrying about any monitoring or status updates. Different resources such as memory, CPU time, network bandwidth etc. are put into one unit called the Resource Container. There are different AppMasters running on different nodes which talk to a number of these resource containers and accordingly update the Node Manager with the monitoring/status details.
I want to know how this kind of approach increases performance from the MapReduce perspective. Also, if there is any definitive content on the motivation behind YARN and its benefits over the existing implementation of MapReduce, please point me to it.

Here are some of the articles (1, 2, 3) about YARN. These talk about the benefits of using YARN.
YARN is more general than MR, and it should be possible to run other computing models like BSP besides MR. Prior to YARN, separate clusters were required for MR, BSP and others. Now they can coexist in a single cluster, which leads to higher usage of the cluster. Here are some of the applications ported to YARN.
From a MapReduce perspective, legacy MR has separate slots for Map and Reduce tasks, but in YARN there is no fixed purpose for a container. The same container can be used for a Map task, a Reduce task, a Hama BSP task or something else. This leads to better utilization.
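To make the slots-versus-containers point concrete, here is a minimal sketch in Python. It is a toy model, not how Hadoop schedules anything: the Node class, the function names and the assumption that every container has the same size as one slot are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical node description: fixed slot counts as in legacy MR.
    map_slots: int
    reduce_slots: int


def running_with_slots(node, pending_maps, pending_reduces):
    """Legacy MR: a map task can only use a map slot, a reduce task only a reduce slot."""
    return min(pending_maps, node.map_slots) + min(pending_reduces, node.reduce_slots)


def running_with_containers(node, pending_maps, pending_reduces):
    """YARN-style: same total capacity, but a container has no fixed purpose.

    Assumption: each container is exactly as large as one legacy slot.
    """
    capacity = node.map_slots + node.reduce_slots
    return min(pending_maps + pending_reduces, capacity)


node = Node(map_slots=4, reduce_slots=4)
# Map-heavy phase: 8 map tasks waiting, no reduce tasks yet.
print(running_with_slots(node, 8, 0))       # 4, the reduce slots sit idle
print(running_with_containers(node, 8, 0))  # 8, all capacity is usable
```

During a map-heavy phase the slot model leaves half of this node idle, while the container model can use all of it.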
Also, it makes it possible to run different versions of Hadoop in the same cluster, which is not possible with legacy MR. This makes maintenance easier.
Here are some of the additional links for YARN. Also, Hadoop: The Definitive Guide, 3rd Edition has an entire section dedicated to YARN.
FYI, it was a bit controversial to develop YARN instead of using one of the frameworks that had been doing something similar and had been running successfully for ages with the bugs ironed out.

I do not think that YARN will speed up the existing MR framework. Looking at the architecture, we can see that the system is now more modular, but modularity usually comes at the expense of performance.
It can be claimed that YARN has nothing to do with MapReduce. MapReduce just became one of the YARN applications. You can see it as moving from an embedded program to an embedded OS with the program running within it.
At the same time, YARN opens the door for different MR implementations with different frameworks. For example, if we assume that our dataset is smaller than the cluster memory, we can get much better performance. I think http://www.spark-project.org/ is one such example.
To summarize it: Yarn does not improve the existing MR, but will enable other MR implementations to be better in all aspects.

All the above answers cover a lot of information. I am simplifying it as follows:
MapReduce vs. YARN:
1. MapReduce: It is platform plus application in Hadoop 1.0, and it is only one of the applications in Hadoop 2.0.
   YARN: It is a platform in Hadoop 2.0 and doesn't exist in Hadoop 1.0.
2. MapReduce: It is a single-use system, i.e., we can run MapReduce jobs only.
   YARN: It is a multi-purpose system. We can run MapReduce, Spark, Tez, Flink, BSP, MPP, MPI, Giraph etc. (general purpose).
3. MapReduce: JobTracker scalability issue, i.e., both resource management and job management are handled by the JobTracker.
   YARN: Resource management and application management are separated and managed by RM+NM and paradigm-specific AMs respectively.
4. MapReduce: Poor resource management system, i.e., slots (map/reduce).
   YARN: Flexible resource management, i.e., containers.
5. MapReduce: It is not highly available.
   YARN: High availability and reliability.
6. MapReduce: Scaled out up to 5000 nodes.
   YARN: Scaled out to 10000 plus nodes.
7. MapReduce: Job -> tasks.
   YARN: Application -> DAG of jobs -> tasks.
8. MapReduce: Classical MapReduce = MapReduce API + MapReduce framework + MapReduce system.
   YARN: YARN MapReduce = MapReduce API + MapReduce framework + YARN system. So MR programs that were written for Hadoop 1.0 also run on YARN without changing a single line of code, i.e., backward compatibility.

Let's look at the Hadoop 1.0 drawbacks that have been addressed by Hadoop 2.0 with the addition of YARN.
Issue of scalability: The Job Tracker runs on a single machine even though you have thousands of nodes in the Hadoop cluster. The responsibilities of the Job Tracker are resource management, job and task scheduling, and monitoring. Since all these processes run on a single node, this model is not scalable.
Issue of availability (single point of failure): The Job Tracker is a single point of failure.
Resource utilization: Due to the predefined number of Map and Reduce task slots, resources are not utilized properly. When all Map slots are busy, idle Reduce slots can't be used to run Map tasks.
Tight integration with the MapReduce framework: Hadoop 1.x can run MapReduce jobs only. Support for jobs other than MapReduce jobs does not exist.
The single Job Tracker bottleneck has been removed with the YARN architecture in Hadoop 2.x.
The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.
The ResourceManager has two main components: Scheduler and ApplicationsManager.
The Scheduler is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is pure scheduler in the sense that it performs no monitoring or tracking of status for the application.
The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application specific ApplicationMaster and provides the service for restarting the ApplicationMaster container on failure.
The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.
Advantages of YARN:
Scalability issues have been resolved.
No single point of failure. All components are highly available.
Resource utilization has been improved by replacing fixed Map and Reduce slots with general-purpose containers.
Non-MapReduce jobs can be submitted.

It looks like this link might be what you're looking for: http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/.
My understanding is that YARN is supposed to be more generic. You can create your own YARN applications that negotiate directly with the Resource Manager for resources (1), and MapReduce is just one of several Application Masters that already exist (2).

Related

Hadoop Resource management

I have a 12 node cluster and I am running a yarn architecture. It seems that my nodes are busy most of the time and many times job fails. How can I check the usage of the resources at any point of time?
Also is there any method to set a limited resource to a user for eg: if a user submits a job he should be given only 25gb of memory and 12 cores.
There are multiple ways to monitor the cluster.
If you are using the Cloudera distribution, you can use Cloudera Manager to monitor and manage the resources.
If you are using the Hortonworks distribution, you can use the Ambari web interface to monitor and manage the resources.
If you are not using any distribution, the cluster is typically monitored using the Ganglia or Nagios web interface.
Even if you do not have any of these, you can go to the ResourceManager web interface, which typically runs on http://:8088. 8088 is the default port number; it can be customized, and you can find the configured value in yarn-site.xml.
If your organization does not provide access to the web interfaces, you can use commands such as yarn application -list and mapred job -list to see what is going on in the cluster.
It is a little tedious to monitor actual usage. You should know Linux commands for monitoring and be able to develop shell scripts.
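If you would rather script this than click through the web interface, the ResourceManager also exposes cluster metrics as JSON under /ws/v1/cluster/metrics on the same port. Below is a rough Python sketch, assuming that endpoint is reachable from where you run it. The host name in the usage comment is a placeholder and the sample numbers are invented; only the field names come from the ResourceManager REST API.

```python
import json
from urllib.request import urlopen


def fetch_cluster_metrics(rm_host, port=8088):
    """GET /ws/v1/cluster/metrics from the ResourceManager and return the inner dict."""
    url = f"http://{rm_host}:{port}/ws/v1/cluster/metrics"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)["clusterMetrics"]


def percent(used, total):
    """Percentage rounded to one decimal; 0.0 when the total is zero."""
    return round(100.0 * used / total, 1) if total else 0.0


def summarize(metrics):
    """Reduce the metrics dict to the few numbers that answer 'how busy is the cluster?'."""
    return {
        "memory_used_pct": percent(metrics["allocatedMB"], metrics["totalMB"]),
        "vcores_used_pct": percent(metrics["allocatedVirtualCores"],
                                   metrics["totalVirtualCores"]),
        "apps_running": metrics["appsRunning"],
        "apps_pending": metrics["appsPending"],
    }


# Invented sample payload (not real cluster output), shaped like "clusterMetrics".
sample = {
    "allocatedMB": 73728, "totalMB": 98304,
    "allocatedVirtualCores": 36, "totalVirtualCores": 48,
    "appsRunning": 5, "appsPending": 2,
}
print(summarize(sample))

# Against a live cluster (placeholder host name):
#   print(summarize(fetch_cluster_metrics("my-rm-host")))
```

Running something like this from cron and logging the result gives a simple usage history without any extra monitoring stack.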
Also is there any method to set a limited resource to a user for eg: if a user submits a job he should be given only 25gb of memory and 12 cores.
Yes, you need to use the queues and pools concept of the schedulers embedded in YARN. There are 3 types of scheduler: FIFO, Capacity and Fair. FIFO should not be used in any of the clusters; it is mainly for development. You need to understand the Capacity and Fair schedulers and set the limits there.
It seems that my nodes are busy most of the time and many times job fails
You can apply some generic performance tuning guidelines to improve the throughput. Have a look at this post: Tips to improve MapReduce Job performance in Hadoop, the Cloudera article and the MapReduce performance article.
Also is there any method to set a limited resource to a user for eg: if a user submits a job he should be given only 25gb of memory and 12 cores.
Adding to Durga's answer,
Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time. Hadoop NextGen is capable of scheduling multiple resource types.
By default, the Fair Scheduler bases scheduling fairness decisions only on memory. It can be configured to schedule with both memory and CPU, using the notion of Dominant Resource Fairness developed by Ghodsi.
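To illustrate the idea behind Dominant Resource Fairness, here is a small Python sketch. It is a simplified model, not the Fair Scheduler's code: the function names, the resource keys and the numbers are made up. An app's dominant share is the largest fraction of any single cluster resource it holds, and the next container goes to the app with the smallest dominant share.

```python
def dominant_share(usage, capacity):
    """Largest fraction of any one cluster resource that this app currently holds."""
    return max(usage[resource] / capacity[resource] for resource in capacity)


def pick_next_app(apps, capacity):
    """DRF-style choice: offer the next container to the app with the lowest dominant share."""
    return min(apps, key=lambda name: dominant_share(apps[name], capacity))


# Hypothetical cluster and usage figures.
capacity = {"memory_mb": 100_000, "vcores": 100}
apps = {
    "A": {"memory_mb": 40_000, "vcores": 10},  # memory-dominant, share 0.4
    "B": {"memory_mb": 10_000, "vcores": 30},  # CPU-dominant, share 0.3
}

for name, usage in apps.items():
    print(name, dominant_share(usage, capacity))
print(pick_next_app(apps, capacity))  # B, because 0.3 < 0.4
```

With memory-only fairness, app A (40% of memory) would clearly be ahead of app B (10% of memory); once CPU is taken into account the gap shrinks to 0.4 versus 0.3.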
The scheduler organizes apps further into “queues”, and shares resources fairly between these queues. By default, all users share a single queue, named “default”. If an app specifically lists a queue in a container resource request, the request is submitted to that queue. It is also possible to assign queues based on the user name included with the request through configuration.
e.g.
<allocations>
  <user name="sample_user">
    <maxRunningApps>30</maxRunningApps>
  </user>
  <!-- default limit for all other users -->
  <userMaxAppsDefault>5</userMaxAppsDefault>
</allocations>
The CapacityScheduler is designed to run Hadoop applications as a shared, multi-tenant cluster in an operator-friendly manner while maximizing the throughput and the utilization of the cluster.
Traditionally each organization has its own private set of compute resources that have sufficient capacity to meet the organization’s SLA under peak or near-peak conditions. This generally leads to poor average utilization and the overhead of managing multiple independent clusters, one per organization.
<property>
  <name>yarn.scheduler.capacity.queue-mappings</name>
  <value>u:user1:queue1,g:group1:queue2,u:%user:%user,u:user2:%primary_group</value>
  <description>
    Here, user1 is mapped to queue1, group1 is mapped to queue2,
    u:%user:%user maps users to queues with the same name as the user, and
    user2 is mapped to the queue with the same name as its primary group.
    The mappings will be evaluated from left to right, and the first valid
    mapping will be used.
  </description>
</property>
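The left-to-right evaluation described in the property above can be sketched in Python as follows. This is an illustration, not the CapacityScheduler's implementation: resolve_queue and its parameters are made up, the first entry of the groups list is treated as the primary group, and "valid" is assumed to mean that the target queue exists.

```python
def resolve_queue(mappings, user, groups, existing_queues):
    """Return the queue for a user, walking the mapping list from left to right.

    Assumptions: groups[0] is the user's primary group, and a mapping only
    counts as valid when its target queue is in existing_queues.
    """
    primary_group = groups[0] if groups else None
    for entry in mappings.split(","):
        kind, source, queue = entry.strip().split(":")
        if kind == "u":
            if source not in ("%user", user):
                continue
        elif kind == "g":
            if source not in groups:
                continue
        else:
            continue  # unknown mapping type, skip it
        if queue == "%user":
            target = user
        elif queue == "%primary_group":
            target = primary_group
        else:
            target = queue
        if target in existing_queues:
            return target
    return None


MAPPINGS = "u:user1:queue1,g:group1:queue2,u:%user:%user,u:user2:%primary_group"
QUEUES = {"queue1", "queue2", "user3", "devs"}  # hypothetical queue names

print(resolve_queue(MAPPINGS, "user1", ["staff"], QUEUES))   # queue1
print(resolve_queue(MAPPINGS, "alice", ["group1"], QUEUES))  # queue2
print(resolve_queue(MAPPINGS, "user3", ["devs"], QUEUES))    # user3
print(resolve_queue(MAPPINGS, "user2", ["devs"], QUEUES))    # devs
print(resolve_queue(MAPPINGS, "bob", ["ops"], QUEUES))       # None
```

Note how user2 only reaches the %primary_group rule because there is no queue named user2 in this example; order matters.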
Have a look at Fair scheduler and Capacity scheduler

How do you setup multiple Spark Streaming jobs with different batch durations?

We are in the beginning phases of transforming the current data architecture of a large enterprise and I am currently building a Spark Streaming ETL framework in which we would connect all of our sources to destinations (source/destinations could be Kafka topics, Flume, HDFS, etc.) through transformations. This would look something like:
SparkStreamingEtlManager.addEtl(Source, Transformation*, Destination)
SparkStreamingEtlManager.streamEtl()
streamingContext.start()
The assumption is that, since we should only have one SparkContext, we would deploy all of the ETL pipelines in one application/jar.
The problem with this is that the batchDuration is an attribute of the context itself and not of the ReceiverInputDStream (Why is this?). Do we need to therefore have multiple Spark Clusters, or, allow for multiple SparkContexts and deploy multiple applications? Is there any other way to control the batch duration per receiver?
Please let me know if any of my assumptions are naive or need to be rephrased. Thanks!
In my experience, different streams have different tuning requirements. Throughput, latency, capacity of the receiving side, SLAs to be respected, etc.
To cater for that multiplicity, we need to configure each Spark Streaming job to address its specific requirements. That covers not only the batch interval but also resources like memory and CPU, data partitioning, and the number of executing nodes (when the loads are network bound).
It follows that each Spark Streaming job becomes a separate job deployment on a Spark Cluster. That will also allow for monitoring and management of separate pipelines independently of each other and help in the further fine-tuning of the processes.
In our case, we use Mesos + Marathon to manage our set of Spark Streaming jobs running 3600x24x7.

Hadoop Performance Monitoring tools for Windows

Are there any tools for monitoring performance on a Hadoop cluster in Windows? We installed Hortonworks HDP 2.2.0 on a Windows single-node cluster and tested our jar. We were able to process 5 million records in 26 minutes. Now we have set up a cluster with 4 slave machines and 1 name node. The RAM of each machine is only 8 GB, but we are just doing a proof of concept. We see no improvement in the processing time on the cluster. Are there any tools that point out the problem? All the available ones are written for Linux.
Thanks,
Kishore.
5 million records doesn't sound like a lot to throw at Hadoop. What's the size of your data in GB?
I don't know any Hadoop monitoring tools for Windows but you should start with the basics - is your data splittable? Have a look at the resource manager's view - how many containers did you have for your map-reduce app? Were they distributed on all machines? (the capacity scheduler tends not to distribute the load on several machines if it can stick all of it on one). CPU usage per task attempt, io per task attempt?
You should also store, compare and analyze Windows performance counters - cpu, i/o, network to see if you have any bottlenecks.
You may not need Windows-native tools to surface the kinds of performance metrics you are looking for. If you're after performance metrics from YARN, MapReduce, or HDFS, you can collect metrics from each of those technologies out of the box from a web interface/HTTP endpoint exposed by each tech in question.
With HDFS, for example, you can collect metrics from the NameNode and DataNodes via HTTP. In addition, you can access the full suite of metrics via JMX, though that option requires a little more configuration.
I wrote a guide to collecting Hadoop performance metrics with native tools which you might find useful. It details methods for collecting metrics for MapReduce, YARN, HDFS, and ZooKeeper.
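As a rough illustration of the HTTP route, the NameNode web server serves its JMX beans as JSON under the /jmx path, so nothing OS-specific is needed to read them. The Python sketch below only parses such a payload. The sample values are invented, and the bean and attribute names shown are the ones I would expect from a Hadoop 2.x NameNode, so check them against your own /jmx output.

```python
def find_bean(payload, bean_name):
    """Return the bean with the given name from a /jmx JSON payload, or None."""
    for bean in payload.get("beans", []):
        if bean.get("name") == bean_name:
            return bean
    return None


def hdfs_used_fraction(payload):
    """Fraction of HDFS capacity in use, or None if the bean or its total is missing."""
    bean = find_bean(payload, "Hadoop:service=NameNode,name=FSNamesystem")
    if bean is None or not bean.get("CapacityTotal"):
        return None
    return bean["CapacityUsed"] / bean["CapacityTotal"]


# Invented sample, shaped like the JSON returned by http://<namenode-host>:<http-port>/jmx
sample = {
    "beans": [
        {"name": "Hadoop:service=NameNode,name=FSNamesystem",
         "CapacityTotal": 1000, "CapacityUsed": 250, "FilesTotal": 42},
        {"name": "java.lang:type=Memory"},
    ]
}
print(hdfs_used_fraction(sample))  # 0.25
```

The same pattern works for the other daemons that expose a /jmx page, with different bean names.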

how YARN manages endless jobs like Storm

A couple of days ago Yahoo posted about the Storm-on-YARN project http://developer.yahoo.com/blogs/ydn/storm-yarn-released-open-source-143745133.html, which makes it possible to run Storm on YARN.
That's a big improvement; however, I have two questions about running tasks like Storm with YARN. Tasks like Storm don't have a limit on execution time. I mean, when you run Storm you expect it to work for days or months, listening to a queue or whatever.
So there is a set of tasks that have no limit on execution time (I'd like to report 0% progress).
1) What about timeouts? A regular M/R task is killed when it hangs; how can that be prevented? I walked through the code but didn't find any special handling.
2) Also, MR1 has a queue where jobs wait for execution: when the cluster finishes one job, it picks up the next job from the queue. What about YARN? If I submit an endless Storm-like job A and then job B, will job B be executed?
Sorry if my questions seem ridiculous; maybe I am missing or misunderstanding something.
Hadoop's JobTracker was(is) responsible for both cluster resources and the application lifecycle. YARN is only responsible for managing cluster resources and the application lifecycle is the responsibility of the application.
This change means that YARN can be used to manage any distributed paradigm. MR2 is of course the initial implementation (map/reduce over YARN), but you can see some other implementations like the Storm-on-YARN you mentioned, or Hortonworks' intention to integrate SQL into Hadoop, etc.
You can take a look at a library called Weave from Continuuity that provides a simple API for building distributed apps on YARN.

Hadoop and map-reduce on multicore machines

I have read a lot about Hadoop and Map-Reduce running on clusters of machines. Does someone know if the Apache distribution can be run on an SMP with several cores? In particular, can multiple Map-Reduce processes be run on the same machine, so that the scheduler takes care of spreading them across multiple cores? Thanks. - KG
Yes. You have multiple map and reduce slots on each machine, and their number is determined by the RAM and CPU (each JVM instance needs 1GB by default, so an 8GB machine with 16 cores should still have 7 task slots).
from hadoop wiki
Use the configuration knob: mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum to control the number of
maps/reduces spawned simultaneously on a TaskTracker. By default, it
is set to 2, hence one sees a maximum of 2 maps and 2 reduces at a
given instance on a TaskTracker.
You can set those on a per-tasktracker basis to accurately reflect
your hardware (i.e. set those to higher nos. on a beefier tasktracker
etc.).
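The slot estimate in the answer above (7 slots on an 8GB machine with 16 cores) can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function name is made up, and the 1GB reserved for the OS and Hadoop daemons is my assumption to explain why the answer arrives at 7 rather than 8.

```python
def estimate_task_slots(ram_gb, cores, gb_per_task=1, reserved_gb=1):
    """Rough slot count: limited by whichever runs out first, memory or cores.

    Assumptions: each task JVM needs gb_per_task of RAM (1GB per the answer),
    and reserved_gb is set aside for the OS and daemons (not stated in the answer).
    """
    by_memory = max((ram_gb - reserved_gb) // gb_per_task, 0)
    return min(by_memory, cores)


print(estimate_task_slots(ram_gb=8, cores=16))  # 7, memory is the limit
print(estimate_task_slots(ram_gb=64, cores=16))  # 16, cores are the limit
```

The result is what you would then split between mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum.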
You can also use lightweight MapReduce frameworks designed for multicore computers.
For example
LeoTask: A lightweight, productive, and reliable mapreduce framework for multicore computers
https://github.com/mleoking/LeoTask
For Apache Hadoop 2.7.3, my experience has been that enabling YARN will also enable multi-core support. Here is a simple guide for enabling YARN on a single node:
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/SingleCluster.html#YARN_on_a_Single_Node
The default configuration seems to work pretty well. If you want to tune your core usage, then perhaps look into setting 'yarn.scheduler.minimum-allocation-vcores' and 'yarn.scheduler.maximum-allocation-vcores' within yarn-site.xml (https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml)
Also, see here for instructions on how to configure a simple Hadoop sandbox with multicore support: https://bitbucket.org/aperezrathke/hadoop-aee

Resources