Differences between existing MapReduce and YARN (MRv2) - hadoop

Could anyone tell me what the differences are between the existing MapReduce (MRv1) and YARN? I can't find a clear explanation of how the two differ.
P.S.: I'm asking for something like a comparison between the two.
Thanks!

MRv1 uses the JobTracker to create tasks and assign them to the TaskTrackers on the worker nodes, which can become a resource bottleneck when the cluster scales out far enough (usually around 4,000 nodes).
MRv2 runs on YARN ("Yet Another Resource Negotiator"), which has one ResourceManager per cluster and a NodeManager on each worker node. For each job, a container on one of the worker nodes acts as the ApplicationMaster, which monitors that job's resources and tasks.
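To make the division of roles concrete, here is a minimal toy model in Python. It is not the YARN API: the class names mirror the daemons, but the `submit_job` helper, the memory sizes, and the "most free memory" placement policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-node agent: tracks free memory and hosts containers."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, app_id, role, mb):
        self.free_mb -= mb
        self.containers.append((app_id, role, mb))


class ResourceManager:
    """Cluster-wide arbiter: hands out containers, knows nothing about map/reduce."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, app_id, role, mb):
        # Toy policy (an assumption): pick the node with the most free memory.
        node = max(self.nodes, key=lambda n: n.free_mb)
        if node.free_mb < mb:
            raise RuntimeError("cluster is full")
        node.launch(app_id, role, mb)
        return node


class ApplicationMaster:
    """Per-job coordinator: asks the ResourceManager for task containers."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm

    def run_tasks(self, n_tasks, mb_per_task):
        return [self.rm.allocate(self.app_id, f"task-{i}", mb_per_task).name
                for i in range(n_tasks)]


def submit_job(rm, app_id, n_tasks, am_mb=1024, task_mb=2048):
    # Step 1: the RM starts one container that hosts the job's ApplicationMaster.
    am_node = rm.allocate(app_id, "app-master", am_mb)
    # Step 2: the AM negotiates the remaining containers for the job's tasks.
    am = ApplicationMaster(app_id, rm)
    return am_node.name, am.run_tasks(n_tasks, task_mb)
```

The point of the sketch is that per-job bookkeeping lives in the ApplicationMaster, so the ResourceManager only has to answer "where is there room?", which is the part the JobTracker used to do in addition to tracking every task.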

MRv1, also called Hadoop 1, tightly couples resource management and scheduling with the MapReduce programming framework.
Because of this, non-batch applications cannot be run on Hadoop 1.
It has a single NameNode, so it does not provide high availability or scale well.
In MRv2 (aka Hadoop 2), resource management and scheduling are separated from MapReduce and handled by YARN (Yet Another Resource Negotiator).
The resource management and scheduling layer lies beneath the MapReduce layer.
Hadoop 2 also provides high availability and better scalability, since we can create redundant NameNodes.
A new snapshot feature lets us take backups of the filesystem, which helps with disaster recovery.

Related

TMs on the same NodeManager, which leads to high pressure on HDFS

We have a 100-node Hadoop cluster. I wrote a Flink app that writes many files to HDFS using BucketingSink. When I run the app on YARN, I found that all TaskManagers are placed on the same NodeManager, which means all subtasks run on that node. This opens many file descriptors on the DataNode of that busy node (I think the Flink filesystem connector prefers to connect to the local DataNode). That puts high pressure on the node, which easily fails the job.
Any good ideas for solving this problem? Thank you very much!
This sounds like a YARN scheduling problem. Please take a look at YARN's capacity scheduler, which lets you schedule containers on nodes based on the available capacity. You can also tell YARN to consider virtual cores when scheduling, which adds a second resource dimension instead of memory only.
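A toy simulation of why counting virtual cores can spread containers out. This is a sketch under assumptions: the node sizes, the container request, and the greedy first-fit loop are made up for illustration and are not how YARN's scheduler is implemented. In a real cluster the relevant setting is, as far as I know, the capacity scheduler's `yarn.scheduler.capacity.resource-calculator` property, which can be pointed at `DominantResourceCalculator` so that vcores are considered alongside memory.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_mb: int
    free_vcores: int


def fits(node, mb, vcores, use_vcores):
    """Return True if the request fits on the node under the chosen accounting."""
    if node.free_mb < mb:
        return False
    # With memory-only accounting the vcore check is skipped entirely.
    return (not use_vcores) or node.free_vcores >= vcores


def place(nodes, n_containers, mb, vcores, use_vcores):
    """Greedy first-fit placement; returns the node name chosen for each container."""
    placement = []
    for _ in range(n_containers):
        target = next((n for n in nodes if fits(n, mb, vcores, use_vcores)), None)
        if target is None:
            raise RuntimeError("no node can host the container")
        target.free_mb -= mb
        # Vcores are always decremented; in memory-only mode the value may go
        # negative, which mirrors a node being oversubscribed on CPU.
        target.free_vcores -= vcores
        placement.append(target.name)
    return placement


def make_cluster():
    # Hypothetical 3-node cluster: 64 GB and 8 vcores per node.
    return [Node(f"node{i}", 64 * 1024, 8) for i in range(1, 4)]
```

With six containers of 4 GB and 4 vcores each, `place(make_cluster(), 6, 4096, 4, use_vcores=False)` puts all six on `node1`, because memory alone never runs out there. With `use_vcores=True` only two fit per node, so the containers end up on three different nodes.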

Ability to offer only part of the node resources?

Using DC/OS, we would like to schedule tasks close to the data they require, which in our case is stored in Hadoop/HDFS (on an HDP cluster). The issue is that the Hadoop cluster is not run from within DC/OS, so we are looking for a way to offer only a subset of the system resources.
For example: say we want to reserve 8 GB of memory for the DataNode services and offer the remainder to DC/OS for scheduling tasks.
From what I have read so far, a task can specify the resources it requires, but I have not found any way to specify what you want to offer from the node's perspective.
I'm aware that a CDH cluster can be run on DC/OS, which would be one way to go, but for now that is not provided for HDP.
Thanks for any ideas/tips,
Paul

Falcon's role in Hadoop ecosystem

I am supposed to work on cluster mirroring, where I have to set up an HDFS cluster similar to an existing one (same master and slaves), copy the data to the new cluster, and then run the same jobs as is.
I have read that Falcon is a feed processing and workflow coordination tool, and that it is also used for mirroring HDFS clusters. Can someone explain what Falcon's role is in the Hadoop ecosystem and how it helps with mirroring in particular? I would like to understand what Falcon offers when it is part of my Hadoop ecosystem (HDP).
Apache Falcon simplifies the configuration of data motion with: replication; lifecycle management; lineage and traceability. This provides data governance consistency across Hadoop components.
Falcon replication is asynchronous and copies delta changes. Recovery is done by running a process and swapping the source and target.
Data loss: delta data that has not yet been replicated may be lost if the primary cluster is completely shut down.
Backups can be scheduled when needed, depending on bandwidth and network availability.

can the same code be used for both hadoop and yarn

I have been thinking about this question for a while now. I have been trying to compare the performance of Hadoop 1 vs. YARN by running the basic word count example. I am still unsure how the same .jar file can be executed on both frameworks. As far as I understand, YARN has a different set of APIs, which it uses to connect to the ResourceManager, create an ApplicationMaster, etc.
So if I develop an application (.jar), can it be run on both frameworks without any change in code?
Also, what would be meaningful parameters for comparing Hadoop vs. YARN for a particular application?
Ok, let's clear up some terms here.
Hadoop is the umbrella system that contains the various components needed for distributed storage and processing. I believe the term you're looking for when you say Hadoop 1 is MapReduce v1 (MRv1).
MRv1 is a component of Hadoop that includes the JobTracker and TaskTrackers. It only relies on HDFS.
YARN is a component of Hadoop that abstracts out the resource management part of MRv1.
MRv2 is the MapReduce application rewritten to run on top of YARN.
So when you're asking if Hadoop 1 is interchangeable with YARN, you're probably actually asking if MRv1 is interchangeable with MRv2. And the answer is generally yes. The Hadoop system knows how to run the same MapReduce application on both MapReduce platforms.
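The reason the same job logic can run on either platform is that the map and reduce functions do not know who schedules them. Here is a small Python sketch of that idea using word count. It is only an analogy: `run_job`, `serial_executor`, and `threaded_executor` are hypothetical names standing in for the two runtimes, not Hadoop APIs. In Hadoop itself, to my knowledge, the runtime is chosen by configuration (the `mapreduce.framework.name` property) rather than by the job code.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(line):
    """Mapper: emit (word, 1) for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def run_job(lines, mapper, reducer, executor):
    """Drive a job; `executor` decides how map tasks run, not what they compute."""
    mapped = executor(mapper, lines)
    # Shuffle step: group all emitted values by key.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in sorted(groups.items()))


def serial_executor(fn, items):
    # Stand-in for one runtime: run the map tasks one by one.
    return [fn(item) for item in items]


def threaded_executor(fn, items):
    # Stand-in for another runtime: run the map tasks on a thread pool.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(fn, items))
```

Calling `run_job(lines, map_words, reduce_counts, serial_executor)` and the same call with `threaded_executor` return identical counts, because only the execution layer was swapped.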
Adding to climbage's answer:
Hadoop version 1
The JobTracker is responsible for resource management (managing the slave nodes). Its major functions are:
tracking resource consumption/availability
job life-cycle management: scheduling individual tasks of the job, tracking progress, and providing fault tolerance for tasks.
Issues with Hadoop v1
The JobTracker is responsible for all spawned MR applications, so it is a single point of failure: if the JobTracker goes down, all applications in the cluster are killed. Moreover, if the cluster has a large number of applications, the JobTracker becomes the performance bottleneck. Hadoop v2 was released to address these scalability and job-management issues.
Hadoop v2
The fundamental idea of YARN is to split the two major responsibilities of the JobTracker (resource management and job scheduling/monitoring) into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM). The ResourceManager and the per-node slave, the NodeManager (NM), form the new, generic operating system for managing applications in a distributed manner.
To work with the new resource management and scheduling layer, MapReduce was reimplemented as a Hadoop YARN application (MRv2). MRv2 does not change the MapReduce programming API.
Application programmers will see no difference between MRv1 and MRv2, since MRv2 is backward compatible. So yes, an application (.jar) can generally be run on both frameworks without any change in code.
MapReduce was previously integrated into Hadoop Core and was the only API for interacting with data in HDFS. In Hadoop v2 it runs as a separate application, and Hadoop v2 allows other application programming frameworks (e.g. MPI) to process HDFS data.

HBase and Hadoop

From what I have read so far, HBase requires a Hadoop installation. It looks like HBase can be set up either to use an existing Hadoop cluster (which is shared with some other users) or to use a dedicated Hadoop cluster. I guess the latter would be the safer configuration, but I am wondering whether anybody has experience with the former (and I am not sure my understanding of the HBase setup is correct).
I know that Facebook and other large organizations separate their HBase cluster (real time access) from their Hadoop cluster (batch analytics) for performance reasons. Large MapReduce jobs on the cluster have the ability to impact performance of the real-time interface, which can be problematic.
In a smaller organization or in a situation in which your HBase response time doesn't necessarily need to be consistent, you can just use the same cluster.
There aren't many (or any) concerns with coexistence other than performance concerns.
We've set it up with an existing Hadoop cluster that's 1,000 cores strong. Short answer: it works just fine, at least with Cloudera CH2 +149.88. But by Hadoop version, your mileage may vary.
In distributed mode, Hadoop is used for its HDFS storage. HBase stores its HFiles on HDFS, and thus benefits from the replication strategies and data-locality principles provided by the DataNodes.
RegionServers mostly handle local data, but may still have to fetch data from other DataNodes.
Hope that helps you understand why and how Hadoop is used with HBase.
