Is there any difference between YARN and MR2? - hadoop

I am a bit confused between YARN and MR2. Are they the same? I have seen YARN described in some documentation as a platform for running distributed applications, and MapReduce as one such distributed application. What is meant by that? Is that a MapReduce application built by a programmer (e.g. a wordcount program)?
I have also seen the phrase 'writing your own YARN application'. What is meant by that? Which category does a wordcount program belong to: a YARN application or an MR2 application?
Please correct the question if it does not meet Stack Overflow standards; I was trying to express all my doubts.

First, YARN was introduced from Hadoop 2 onward. In Hadoop 1, a single JobTracker handled resource management and job scheduling for the entire cluster. With the introduction of YARN (Yet Another Resource Negotiator), cluster resource management is taken care of by YARN.
You can think of YARN as running on top of HDFS, with different data processing components like MapReduce and others running on YARN.
In Hadoop 1, users had the option of writing MapReduce programs in Java; in Python or other scripting languages via Hadoop Streaming; or in Pig, a data transformation language. Regardless of which method was used, all fundamentally relied on the MapReduce processing model to run.
YARN supports multiple processing models in addition to MapReduce. One of the most significant benefits of this is that we are no longer limited to working with the often I/O-intensive, high-latency MapReduce framework.
Now to your question: 'writing your own YARN application', what is meant by that?
On YARN we can write any kind of distributed application. MR2 is one such distributed application, and it is responsible for running the MapReduce framework. So a wordcount program belongs to the MR2 category of application, and MR2 in turn runs on YARN.
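For example, the classic wordcount is written purely against the MapReduce API; MR2 handles submitting it to the cluster, and YARN only provides the containers it runs in. A minimal sketch, essentially the standard Hadoop tutorial example:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Sums the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that nothing in the code mentions YARN; the framework underneath is a deployment concern.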

YARN and MR2 are related, but not the same.
MapReduce is a computational approach, and in the Hadoop world, MapReduce is also a name synonymous with Hadoop 1.0: in Hadoop 1.0, MapReduce provided both resource management and the execution framework for running "MapReduce" (1.0) applications.
With Hadoop 2.0, resource management and the execution framework were split into two separate entities: YARN (Yet Another Resource Negotiator), a general resource management system for cluster compute resources, and MapReduce 2.0 (abbreviated as MR2), an execution engine.
MR2 is a YARN application
You can write a "MapReduce" (2.0) based application that will run on YARN
If you are on a Hadoop 2.0 cluster and use something like Pig or Hive with the default execution engine, MR2, they will run as MR2 applications on YARN
Part of the reasoning for this split in Hadoop 2.0 was to allow for other execution engines, beyond MR2; today this also includes Apache Tez and Apache Spark, which can run on YARN.
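One way to see that MR2 is "just another" YARN application: which framework a MapReduce job is submitted to is chosen by configuration, not by the application code. A small illustrative sketch (the class name is made up; on a real cluster mapreduce.framework.name is normally set in mapred-site.xml rather than in code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: the execution framework for a MapReduce job is a config choice.
// "yarn" submits the job to YARN as an MR2 application; "local" runs it
// in-process with the LocalJobRunner, which is handy for testing.
public class FrameworkSelection {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapreduce.framework.name", "yarn"); // or "local"
    Job job = Job.getInstance(conf, "runs-on-yarn");
    // ... set mapper/reducer/paths as usual, then job.waitForCompletion(true)
  }
}
```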

Related

Hadoop Ecosystem: Map Reduce needed for Pig/Hive

There are a whole lot of Hadoop ecosystem pictures on the internet, so I struggle to get an understanding of how the tools work together.
E.g. in the picture attached, why are Pig and Hive based on MapReduce, whereas other tools like Spark or Storm sit on YARN?
Would you be so kind as to explain this?
Thanks!
BR
[image: Hadoop ecosystem diagram]
The picture shows Pig and Hive on top of MapReduce. This is because MapReduce is a distributed computing engine that is used by Pig and Hive. Pig and Hive queries get executed as MapReduce jobs. It is easier to work with Pig and Hive, since they give a higher-level abstraction to work with MapReduce.
Now let's take a look at Spark/Storm/Flink on YARN in the picture. YARN is a cluster manager that allows various applications to run on top of it. Storm, Spark and Flink are all examples of applications that can run on top of YARN. MapReduce is also considered an application that can run on YARN, as shown in the diagram. YARN handles the resource management piece so that multiple applications can share the same cluster. (If you are interested in another example of a similar technology, check out Mesos.)
Finally, at the bottom of the picture is HDFS. This is the distributed storage layer that allows applications to store and access data. It provides features such as distributed storage, replication and fault tolerance.
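To make the storage layer concrete, here is a minimal sketch of talking to HDFS directly through the Java FileSystem API, which is ultimately what the engines above it do as well (the file path is hypothetical; connection details are picked up from core-site.xml):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: reading and writing HDFS through the FileSystem API.
public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // loads core-site.xml
    FileSystem fs = FileSystem.get(conf);

    // Write a file; HDFS replicates its blocks across DataNodes.
    Path path = new Path("/tmp/hello.txt"); // hypothetical path
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}
```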
If you are interested in deeper-dives, check out the Apache Projects page.

What is the difference between a mapreduce application and a yarn application?

A cluster which runs MapReduce 2 doesn't have a JobTracker; instead, the JobTracker's role is split into two separate components, the ResourceManager and the per-application ApplicationMaster. However, these things are transparent to the user, who doesn't need to know whether the cluster is running MapReduce 1 or 2 when submitting a MapReduce job.
The thing I cannot quite understand is a YARN application. How is it different from a regular MapReduce application? What's the advantage of running a MapReduce job as a YARN application, etc.? Could someone shed some light on that for me?
MR1 has a JobTracker and TaskTrackers, which take care of the MapReduce application.
In MR2, Apache separated the management of the map/reduce process from the cluster's resource management by using YARN. YARN is a better resource manager than what we had in MR1. It also enables versatility. MR2 is built on top of YARN.
Apart from MapReduce, we can run applications like Spark, Storm, HBase, Tez, etc. on top of YARN, which we cannot do using MR1.
The following is the architecture for MR1 and MR2.
MR1: HDFS <---> MapReduce
MR2: HDFS <---> YARN <---> MapReduce
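To give a feel for the difference, below is a heavily trimmed skeleton of submitting a "native" YARN application with the YARN client API; the application name and ApplicationMaster command are illustrative placeholders. An MR job hides all of this behind the MapReduce client:

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

// Skeleton: a custom YARN application talks to the ResourceManager
// directly and supplies its own ApplicationMaster command.
public class YarnSubmitSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("my-yarn-app"); // illustrative name

    // Command that launches the ApplicationMaster in its container.
    ContainerLaunchContext amContainer =
        Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(
        Collections.singletonList("java -Xmx128m MyAppMaster")); // hypothetical AM class
    ctx.setAMContainerSpec(amContainer);

    // Resources the ResourceManager should reserve for the AM.
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(256);
    capability.setVirtualCores(1);
    ctx.setResource(capability);

    ApplicationId appId = yarnClient.submitApplication(ctx);
    System.out.println("Submitted application " + appId);
  }
}
```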

What is "Hadoop" - the definition of Hadoop?

It is kind of obvious, and we will all agree, that we can call HDFS + YARN + MapReduce "Hadoop". But what happens with other combinations and other products in the Hadoop ecosystem?
Is, for example, HDFS + YARN + Spark still Hadoop? Is HBase Hadoop? I guess we consider HDFS + YARN + Pig Hadoop, since Pig uses MapReduce.
Are only the MapReduce tools considered Hadoop, while anything else that runs on HDFS + YARN (like Spark) is non-Hadoop?
I agree with your impression that the "Hadoop" term does not have a useful definition. "We have a Hadoop cluster" may mean various things.
There is an official answer though at http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F:
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
So "Hadoop" is the name of a project and a software library. Any other usage is ill-defined.
In addition to the Apache Hadoop definition from the official website, I would like to highlight that Hadoop is a framework and there are many subsystems in the Hadoop ecosystem.
I am quoting this content from the official website so that broken links do not cause any issue to this answer in the future.
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
More or less,
Hadoop = Distributed Storage (HDFS) + Distributed Processing ( YARN + Map Reduce)
But these four modules do not cover the complete Hadoop ecosystem. There are many Hadoop-related projects and 40+ subsystems in the Hadoop ecosystem.
Other Hadoop-related projects at Apache include:
Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop.
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A Scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
ZooKeeper™: A high-performance coordination service for distributed applications.
Coming back to your question:
Just have a look at the 40+ subsystems in the Hadoop ecosystem. Not everything you have quoted is necessarily Hadoop itself, but most of it is related to Hadoop.
Spark is part of the Hadoop ecosystem, but it does not have to use HDFS or YARN: HDFS data sets can be replaced with RDDs (resilient distributed datasets) built from other storage, and Spark can run in standalone mode without YARN.
Have a look at this article and this article for Hadoop & Spark comparison.
Spark use cases over Hadoop:
Iterative Algorithms in Machine Learning
Interactive Data Mining and Data Processing
Stream processing
Sensor data processing
Since Spark doesn't have its own storage system, it has to depend on one of the distributed storage systems, and HDFS is one of them.
Have a look at related SE question:
Can apache spark run without hadoop?
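As a small illustration of that answer, here is a hedged Java sketch (assuming only spark-core on the classpath) of Spark running with a local master and in-memory data, i.e. with neither YARN nor HDFS involved:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch: Spark with no YARN and no HDFS. The master is the in-process
// "local" scheduler and the input is an in-memory collection, so neither
// Hadoop component is needed.
public class SparkWithoutHadoop {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("no-hadoop-needed")
        .setMaster("local[*]"); // use "yarn" instead to run on a YARN cluster

    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> words = sc.parallelize(Arrays.asList("a", "b", "a"));
      System.out.println(words.countByValue()); // e.g. {a=2, b=1}
    }
  }
}
```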
The most generally accepted understanding of Hadoop: HDFS and Map/Reduce and their related processes and tooling.
A related term is the Hadoop ecosystem: Hive/Pig/HBase, ZooKeeper, Oozie; also vendor-specific ones such as Impala and Ambari.
Why do we require a big data system?
STORE (to store massive amounts of data)
PROCESS (to process the data/requests in a timely manner)
SCALE (to scale easily as data grows)
There was a big data solution provided by Google:
Google File System: to solve distributed storage.
MapReduce: to solve distributed computing.
Google published the research papers, and Apache developed an open-source system similar to the one developed by Google; it is known as Hadoop.
The Hadoop counterpart of the Google File System is HDFS (Hadoop Distributed File System): a file system to manage the storage of data.
MapReduce: a framework to process data across multiple servers.
Note: In 2013, Apache released Hadoop 2.0, in which MapReduce was broken into two components:
MapReduce: a framework to define the data processing task.
YARN: a framework to run the data processing task.
HADOOP ECOSYSTEM
Hadoop was not easy to understand, and it was limited to hard-core developers. To make the usage of Hadoop easier, lots of tools sprang up, which as a whole are known as the Hadoop ecosystem.
Hadoop Ecosystem contains tools like:
HIVE:
provides an SQL interface to Hadoop (see the JDBC sketch after this list).
a bridge to Hadoop for folks who don't have exposure to OOP in Java.
HBASE:
a database management system on top of Hadoop.
integrates with your app just like a traditional database.
PIG:
a data manipulation language.
transforms unstructured data into a structured format.
this structured data can be queried using interfaces like Hive.
SPARK:
a distributed computing engine used along with Hadoop.
provides an interactive shell to quickly process datasets.
has a bunch of built-in libraries for machine learning, stream processing, graph processing, etc.
OOZIE:
a tool to schedule workflows across the Hadoop ecosystem technologies.
FLUME/SQOOP:
tools to transfer data between other systems and Hadoop.
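To make the Hive bullet concrete, here is a hedged sketch of querying Hive over JDBC; it assumes a running HiveServer2 with the hive-jdbc driver on the classpath, and the host, port, and docs table are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: Hive's "SQL interface to Hadoop" in practice. A query sent over
// JDBC to HiveServer2 is compiled into MapReduce (or Tez/Spark) jobs that
// run on the cluster; the client only ever sees SQL.
public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // Explicit driver load; newer JDBC setups pick it up automatically.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String url = "jdbc:hive2://localhost:10000/default"; // assumed host/port
    try (Connection conn = DriverManager.getConnection(url, "user", ""); // placeholder user
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT word, COUNT(*) FROM docs GROUP BY word")) { // hypothetical table
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}
```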
This concludes a very high level overview of Hadoop.
Hadoop is not a stack like the LAMP or MEAN stack. Hadoop is a collection of frameworks and tools that work together to solve complex big data problems.
It is basically a project under the Apache foundation. Various subprojects like MapReduce, Ambari, Sqoop, Spark, ZooKeeper, etc. together make up what is called the Hadoop ecosystem.
Source : https://www.datacloudschool.com/2020/01/introduction-what-is-hadoop.html

Can the same code be used for both Hadoop and YARN?

I have been thinking about this question for a while now. I have been trying to compare the performance of Hadoop 1 vs YARN by running the basic word count example. I am still unsure about how the same .jar file can be used to execute on both frameworks. As far as I understand, YARN has a different set of APIs which it uses to set up a connection with the resource manager, create an application master, etc.
So if I develop an application (.jar), can it be run on both frameworks without any change in code?
Also, what could be meaningful parameters to differentiate Hadoop 1 vs YARN for a particular application?
Ok, let's clear up some terms here.
Hadoop is the umbrella system that contains the various components needed for distributed storage and processing. I believe the term you're looking for when you say Hadoop 1 is MapReduce v1 (MRv1).
MRv1 is a component of Hadoop that includes the job tracker and task trackers. It only relies on HDFS.
YARN is a component of Hadoop that abstracts out the resource management part of MRv1.
MRv2 is the mapreduce application rewritten to run on top of YARN.
So when you're asking if hadoop 1 is interchangeable with YARN, you're probably actually asking if MRv1 is interchangeable with MRv2. And the answer is generally, yes. The Hadoop system knows how to run the same mapreduce application on both mapreduce platforms.
Adding to climbage's answer:
HADOOP Version 1
The JobTracker is responsible for resource management (managing the slave nodes); its major functions involve:
tracking resource consumption/availability
job life-cycle management: scheduling individual tasks of the job, tracking progress, and providing fault tolerance for tasks.
Issues with Hadoop v1
The JobTracker is responsible for all spawned MR applications, so it is a single point of failure: if the JobTracker goes down, all applications in the cluster are killed. Moreover, if the cluster has a large number of applications, the JobTracker becomes a performance bottleneck. To address these issues of scalability and job management, Hadoop v2 was released.
Hadoop v2
The fundamental idea of YARN is to split the two major responsibilities of the Job-Tracker—that is, resource management and job scheduling/monitoring—into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM). The ResourceManager and per-node slave, the NodeManager (NM), form the new, and generic, operating system for managing applications in a distributed manner.
To interact with the new resource management and scheduling, a Hadoop YARN MapReduce application (MRv2) was developed; MRv2 has nothing to do with the MapReduce programming API.
Application programmers will see no difference between MRv1 and MRv2; MRv2 is fully backward compatible. So yes, an application (.jar) can be run on both frameworks without any change in code.
MapReduce was previously integrated into Hadoop Core as the only API to interact with data in HDFS. In Hadoop v2 it runs as a separate application, and Hadoop v2 allows other application programming frameworks (e.g. MPI) to process HDFS data.

What are sites for Hadoop best practices?

What are some sites for Hadoop best practices (not books) where I can get a step-by-step process for creating new projects and small examples? I am not able to find a single site like this; please share.
There is an awesome article from Yahoo developers on Apache Hadoop: Best Practices and Anti-Patterns.
Hadoop is not a single application; rather, it is a distributed processing framework used by several applications that sit on top of it. Pig, Hive, HBase, Cassandra, etc. are a few of many such applications, each designed for a specific requirement. Underneath, all of these applications consume the Hadoop framework, which mainly consists of a distributed file system (HDFS) and distributed processing (MapReduce).
Technically, when you have a bare-minimum Hadoop cluster (HDFS + MapReduce only), you can start writing MapReduce-based applications (in Java, or in other languages supported through Hadoop Streaming) to process some data.
What you could do first is download a pre-built/configured Hadoop virtual machine image from the Cloudera or Hortonworks distribution and get it running on your machine. After that, start learning to write MapReduce jobs in Java and run them in your virtual machine.
Here is the URL to download Cloudera Hadoop Distribution VM
Here is the link to learn to write the simplest wordcount job.
