MapReduce Architecture - hadoop

I have created a diagram that represents how the MapReduce framework works. Could somebody please validate that this is an accurate representation?
P.S. For the purpose of this example, we are also interested in the system components shown in this diagram.

The MapReduce Architecture works in various different phases for executing a job. Here are the different stages of running a MapReduce Application -
The first stage involves the user writing his data into the HDFS for further processing. This data is stored on different nodes in the form of blocks in the HDFS.
Now the client submits its MapReduce job.
Then, the resource manager launches a container to start the App master.
The App master sends a resource request to the resource manager.
The resource manager now allocates containers on slaves via the node manager.
The App master starts respective tasks in the containers.
The job is now been executed in the container.
When the processing is complete, the resource manager deallocates the resources.
Source: Cloudera

JobTracker, TaskTracker, and MasterNode aren't real things in Hadoop 2+ w/ YARN. Jobs are submitted to a ResourceManager, which creates an ApplicationMaster on one of the NodeManagers.
"Slave Nodes" are commonly also your DataNodes because that is the core tenant of Hadoop - move the processing to the data.
The "Recieve the data" arrow is bi-directional, and there is no arrow from the NameNode to the DataNode. 1) Get the file locations from the NameNode, then locations are sent back to clients. 2) The clients (i.e. NodeManager processes running on a DataNode, or "slave nodes"), will directly read from the DataNodes themselves - the datanodes don't know directly where the other slave nodes exist.
That being said, HDFS and YARN are typically all part of the same "bubble", so the "HDFS" labelled circle you have should really be around everything.

Related

When do YARN and NameNode interact

When a job is submitted, when do YARN and NameNode interact? When a job is submitted, who does it get sent to? Could someone explain the end-to-end flow - how hadoop ecosystem works?
Thanks!
Namenode: Stores the meta-data of all the data stored in data nodes and monitors the health of data nodes. Basically, it is a master-slave architecture.
YARN: It stands for Yet Another Resource Negotiator. The yarn has mainly two components.
1.> Scheduling
2.> Application Manager
Yarn also contains the master, i.e Resource Manager and Slave, i.e Node Manager.
For scheduling purpose, there are 3 Schedulers:
1.> FIFO 2.> Capacity 3.> Fair-share
There is a component called Application Master assigned by Resource Manager under the Node Manager.
One application master is assigned to one application.
The job is directly submitted by the client and Resource Manager assigns the job to the Application Master and Node manager monitors the liveliness of Application Master
Now, whenever the job comes in, Resource manager creates a job id and assign an Application Master for that job. Resource Manager contacts to the Namenode to retrieve the information about the metadata of the required data on which the task has to be performed. And the information received by Resource Manager is then passed to Application Master.
This is the basic overview of the working of Yarn with Namenode. You can also read in detail from YARN
Also, NameNode interaction is just in the Hadoop applications running within YARN that talk to the NameNode. Not all YARN applications need to communicate with HDFS
Basically there is no direct interaction between YARN and HDFS, see https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
However YARN jobs require some files (libraries, configuration, etc) which usually resides on HDFS

difference between hadoop mr1 and yarn and mr2?

Can someone pls tell what is the differece between MR1 and yarn and MR2
My understanding is MR1 will be having below components
Namenode,
secondary name node,
datanode,
job tracker,
task tracker
Yarn
Node manager
Resource Manager
Is Yarn consists of MR1 or MR2 ( or both MR2 and Yarn are same?)
sorry if i asked basic level question
MRv1 uses the JobTracker to create and assign tasks to task trackers, which can become a resource bottleneck when the cluster scales out far enough (usually around 4,000 clusters).
MRv2 (aka YARN, "Yet Another Resource Negotiator") has a Resource Manager for each cluster, and each data node runs a Node Manager. In MapReduce MRv2, the functions of the JobTracker have been split between three services. The ResourceManager is a persistent YARN service that receives and runs applications (a MapReduce job is an application) on the cluster. It contains the scheduler, which, as previously, is pluggable. The MapReduce-specific capabilities of the JobTracker have been moved into the MapReduce Application Master, one of which is started to manage each MapReduce job and terminated when the job completes. The JobTracker function of serving information about completed jobs has been moved to the JobHistory Server. The TaskTracker has been replaced with the NodeManager, a YARN service that manages resources and deployment on a host. It is responsible for launching containers, each of which can house a map or reduce task.
YARN is a generic platform for any form of distributed application to run on, while MR2 is one such distributed application that runs the MapReduce framework on top of YARN

How does Apache Spark handles system failure when deployed in YARN?

Preconditions
Let's assume Apache Spark is deployed on a hadoop cluster using YARN. Furthermore a spark execution is running. How does spark handle the situations listed below?
Cases & Questions
One node of the hadoop clusters fails due to a disc error. However replication is high enough and no data was lost.
What will happen to tasks that where running at that node?
One node of the hadoop clusters fails due to a disc error. Replication was not high enough and data was lost. Simply spark couldn't find a file anymore which was pre-configured as resource for the work flow.
How will it handle this situation?
During execution the primary namenode fails over.
Did spark automatically use the fail over namenode?
What happens when the secondary namenode fails as well?
For some reasons during a work flow the cluster is totally shut down.
Will spark restart with the cluster automatically?
Will it resume to the last "save" point during the work flow?
I know, some questions might sound odd. Anyway, I hope you can answer some or all.
Thanks in advance. :)
Here are the answers given by the mailing list to the questions (answers where provided by Sandy Ryza of Cloudera):
"Spark will rerun those tasks on a different node."
"After a number of failed task attempts trying to read the block, Spark would pass up whatever error HDFS is returning and fail the job."
"Spark accesses HDFS through the normal HDFS client APIs. Under an HA configuration, these will automatically fail over to the new namenode. If no namenodes are left, the Spark job will fail."
Restart is part of administration and "Spark has support for checkpointing to HDFS, so you would be able to go back to the last time checkpoint was called that HDFS was available."

Is JobTracker a single point of failure too (besides NameNode) in Hadoop?

I am new to Hadoop. In hadoop, I know that when a NameNode fails the entire Hadoop framework goes down. So it's a single point of failure in Hadoop. Is it same for JobTracker? Because if the JobTracker goes down, there would be no daemon to contact Namenode after a job submission and also no point for running the TaskTrackers. How is this handled exactly?
Yes, JobTracker is a single point of failure in MRv1. In case of JobTracker failure all running jobs are halted (http://wiki.apache.org/hadoop/JobTracker).
In YARN, Resource manager is not a single point of failure.
If you need MRv1, you can use MapR distribution, which provides the JobTracker high availability (http://www.mapr.com/resources/videos/demo-hadoop-jobtracker-failing-and-recovering-mapr-cluster).
Jobtracker HA(High Availability using Active and Standby) can be configured in Cloudera Hadoop distribution. See the following link, this feature is available from CDH4.2.1 onwards:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-High-Availability-Guide/cdh4hag_topic_3_1.html
The same can be configured in Hortwonworks distribution also
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_hdp1-system-admin-guide/content/sysadminguides_ha_chap2_5_5.html
In MR2 master service is ResourceManager, which is not Single Point of Failure
Yes job tracker is a single point of failure. In case of namenode failure, secondary namenode will take a charge and act as namenode. In MR-II, there is a resource manager concept introduced. YARN has no. of resource manager, if one fails another resource manager will take a charge.One resource manager is active and other resource manager's are in stand by mode.
No no If NN failure, not Hadoop Framework goes down. Framework different NN failure is different. Hadoop framework is a layer on all nodes. If Name Node goes down, Framework doesn't no where the data should store, and doesn't no where space available to be store. So it's not possible to sore actual data.
Job tracker coordinates with Namenode to get a data to be processed. So when Namenode failure, job tracker also not work properly. So first namenode should work properly. In Hadoop this mechanism is called Namenode Single point of failure.
Job tracker is responsible for job schedule and process the data. If Job tracker not working, Client submits a job request, but the client donesn't no where should that job should submit and where should process. But that logic (you should submit) should know how to resolve the problem, but doesn't know where should submit. So Job tracker failure, it's not possible to process the data and schedule job.
It's a biggest problem in Bigdata analysis problem.
Now Hadoop 2.x resolved these two problems. YERN don't have any single point of failure in namenode level and datanode level.

who communicates with the namenode in yarn?

since the jobTracker in MapReduce1 is replaced by the Application Master and Resouce Manager in Yarn I wonder who is communication in Yarn with the namenode to find out where the data is stored in the different datanodes?
Is the Application Master doing so?
In YARN, the per-application ApplicationMaster is responsible for getting the information about the input splits from Namenode. Later when the task attempts are executed over the assigned nodes, the YarnChild fetches the respective splits from HDFS.

Resources