How do distributed computations on RDDs work in Apache Spark? - hadoop

If I have a very simple Spark program that simply does:
val rdd = sc.textFile("hdfs:///text.txt")
println(rdd.count)
When I submit this spark program to YARN using yarn-cluster:
The YARN ResourceManager will negotiate a container and launch the Spark ApplicationMaster.
Then the ApplicationMaster will register itself with the ResourceManager and ask for resources.
Once the resource specifications are obtained from the ResourceManager, the ApplicationMaster will launch the container on the NodeManager.
My question is: since the data in Hadoop is distributed across multiple machines (let's say that text.txt in the example above is divided up into 3 blocks):
Does an Application Master get launched on each and every machine that has a text.txt block?
Is the Spark executor software that is already installed on each and every node of the cluster, or does the executor get instantiated in the container that is launched by the ApplicationMaster on the node?

Good question, but probably too big for this forum.
To begin, your assumptions are generally correct but the timing isn't right.
The YARN ResourceManager will negotiate a container and launch the Spark ApplicationMaster.
Then the ApplicationMaster will register itself with the ResourceManager and ask for resources.
Once the resource specifications are obtained from the ResourceManager, the ApplicationMaster will launch the container on the NodeManager.
If you're using sc (the SparkContext) then this has already happened. There can be additional work with the ResourceManager if you add or remove executors, but the SparkContext only exists after initial resources are allocated.
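For instance, here is a minimal sketch of what the driver does in yarn-cluster mode (the executor count and memory below are illustrative values, not defaults); by the time the SparkContext constructor returns, the ApplicationMaster is up and the initial executors have already been negotiated with the ResourceManager:
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: in yarn-cluster mode spark-submit supplies the master, so it is not set here.
val conf = new SparkConf()
  .setAppName("count-text")
  .set("spark.executor.instances", "3") // could land one per block, or all on one machine
  .set("spark.executor.memory", "2g")
// Once this returns, the initial executors have been requested from the ResourceManager.
val sc = new SparkContext(conf)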
Does an Application Master get launched on each and every machine that has a text.txt block?
No, but workers or executors could be launched on any machine that has a block. Alternatively they could be launched on just a single machine. However every block is (potentially) read by a worker. In this HDFS case, workers can read from anywhere on the cluster.
Is the Spark executor software that is already installed on each and every node of the cluster, or does the executor get instantiated in the container that is launched by the ApplicationMaster on the node?
It can be installed on the nodes, or it can be given to the nodes at run time. The Spark executor is just a jar. I've seen it put into HDFS itself, or placed on each machine in the cluster as a local resource. You could place it in any location that the worker can access.
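Continuing that sketch with the same sc and the hypothetical 3-block text.txt: each HDFS block normally becomes one RDD partition, and count runs one task per partition on whichever executors YARN granted, preferring executors co-located with the block.
val rdd = sc.textFile("hdfs:///text.txt")
println(s"partitions (roughly one per HDFS block): ${rdd.getNumPartitions}")
println(s"line count: ${rdd.count()}")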

Related

MapReduce Architecture

I have created a diagram that represents how the MapReduce framework works. Could somebody please validate that this is an accurate representation?
P.S. For the purpose of this example, we are also interested in the system components shown in this diagram.
The MapReduce architecture works in several phases to execute a job. Here are the different stages of running a MapReduce application -
The first stage involves the user writing the data into HDFS for further processing. This data is stored on different nodes in the form of blocks.
Now the client submits its MapReduce job.
Then, the resource manager launches a container to start the App master.
The App master sends a resource request to the resource manager.
The resource manager now allocates containers on slaves via the node manager.
The App master starts respective tasks in the containers.
The job is now executed in the containers.
When the processing is complete, the resource manager deallocates the resources.
Source: Cloudera
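As a rough sketch of where the client sits in that flow (paths and job name are hypothetical, and the default identity mapper/reducer are used to keep it short), the driver below only packages and submits the job; the remaining stages are handled by the ResourceManager, ApplicationMaster, and NodeManagers:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

object SubmitSketch {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "submit-sketch") // hypothetical job name
    job.setJarByClass(SubmitSketch.getClass)
    FileInputFormat.addInputPath(job, new Path("/user/demo/input"))     // hypothetical paths
    FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"))
    // waitForCompletion submits the job to the ResourceManager and blocks
    // until the ApplicationMaster reports that all tasks have finished.
    job.waitForCompletion(true)
  }
}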
JobTracker, TaskTracker, and MasterNode aren't real things in Hadoop 2+ w/ YARN. Jobs are submitted to a ResourceManager, which creates an ApplicationMaster on one of the NodeManagers.
"Slave Nodes" are commonly also your DataNodes because that is the core tenant of Hadoop - move the processing to the data.
The "Recieve the data" arrow is bi-directional, and there is no arrow from the NameNode to the DataNode. 1) Get the file locations from the NameNode, then locations are sent back to clients. 2) The clients (i.e. NodeManager processes running on a DataNode, or "slave nodes"), will directly read from the DataNodes themselves - the datanodes don't know directly where the other slave nodes exist.
That being said, HDFS and YARN are typically all part of the same "bubble", so the "HDFS" labelled circle you have should really be around everything.

Understanding mapreduce.framework.name wrt Hadoop

I am learning Hadoop and came to know that there are two versions of the framework, viz. Hadoop1 and Hadoop2.
If my understanding is correct, in Hadoop1 the execution environment is based on two daemons, viz. TaskTracker and JobTracker, whereas in Hadoop2 (aka YARN) the execution environment is based on "new daemons", viz. ResourceManager, NodeManager, and ApplicationMaster.
Please correct me if this is not correct.
I came to know of the following configuration parameter:
mapreduce.framework.name
Possible values which it can take: local, classic, yarn.
I don't understand what they actually mean; for example, if I install Hadoop 2, how can it have the old execution environment (which has the TaskTracker and JobTracker)?
Can anyone help me understand what these values mean?
yarn stands for MR version 2.
classic is for MR version 1
local for local runs of the MR jobs.
MR v1 and MR v2 differ only in how resources are managed and how a job is executed. The current Hadoop release is capable of both (and even of a lightweight local mode). When you set the value to yarn, you are simply instructing the framework to use the YARN way of executing the job. Similarly, when you set it to local, you are telling the framework that there is no cluster for execution and everything runs within a single JVM. It is not a different infrastructure for the MR v1 and MR v2 frameworks; it is just the way of job execution that changes.
JobTracker, TaskTracker, etc. are all just daemons, which are spawned when needed and killed afterwards.
MRv1 uses the JobTracker to create and assign tasks to TaskTrackers on the data nodes. This was found to be too inefficient when dealing with large clusters, leading to YARN.
MRv2 (aka YARN, "Yet Another Resource Negotiator") has a Resource Manager for each cluster, and each data node runs a Node Manager. For each job, one slave node will act as the Application Master, monitoring resources/tasks, etc.
Local mode is provided to simulate and debug MR applications within a single machine/JVM.
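As a small illustration of the point above (the property normally lives in mapred-site.xml; setting it programmatically here is just for demonstration), the same job code runs unchanged and only this value selects the execution framework:
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
conf.set("mapreduce.framework.name", "yarn")      // submit through the ResourceManager
// conf.set("mapreduce.framework.name", "local")   // run everything inside a single JVM
// conf.set("mapreduce.framework.name", "classic") // old JobTracker/TaskTracker runtime (MR v1)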
EDIT: Based on comments
jps (Java Virtual Machine Process Status) is a JVM tool which, according to the official page:
The jps tool lists the instrumented HotSpot Java Virtual Machines (JVMs) on the target system. The tool is limited to reporting information on JVMs for which it has the access permissions.
So,
jps is not a big data tool, but rather a Java tool that reports on JVMs; however, it does not divulge any information about the processes running within each JVM.
It only lists the JVMs it has access to, which means there may still be certain JVMs that remain undetected.
Keeping the above points in mind, you will observe that the jps command emits different results based on the Hadoop deployment mode:
Local (or Standalone) mode: There are no daemons and everything runs on a single JVM.
Pseudo-distributed mode: Each daemon (NameNode, DataNode, etc.) runs on its own JVM on a single host.
Distributed mode: Each daemon runs on its own JVM across a cluster of hosts.
Hence the processes may or may not run in the same JVM, and the jps output will differ accordingly.
Now, in distributed mode, the MR v2 framework works in its default mode, i.e. yarn; hence you see the YARN-specific daemons running:
Namenode
Datanode
ResourceManager
NodeManager
Apache Hadoop 1.x (MRv1) consists of the following daemons:
Namenode
Datanode
Jobtracker
Tasktracker
Note that the NameNode and DataNode are common between the two, because they are HDFS-specific daemons, while the other two in each list are MR v1 and YARN specific respectively.

YARN and Hadoop

I had a couple of questions regarding job submission to HDFS and the YARN architecture in Hadoop:
So in the Hadoop ecosystem you have one NameNode for each cluster which can contain any number of data nodes that store your data. When you submit a job to Hadoop, the job tracker on the NameNode will pick each job and assign it to the task tracker on which the file is present on the data node.
So my question is: how do the components of YARN work together with HDFS?
So YARN consists of the NodeManager and the ResourceManager. Out of these two components: does the NodeManager run on every DataNode while the ResourceManager runs on the NameNode of each cluster? So when the task tracker (in each DataNode) gets assigned a task from the job tracker (in the NameNode), the NodeManager on that specific data node will create a container which will request resources from the ResourceManager on the NameNode. So the ResourceManager and NodeManager only come into play when a task tracker in a data node gets a job from the job tracker in the NameNode, at which point the NodeManager will ask the ResourceManager for the resources needed to execute the job. Is this correct?
You are partially correct. YARN was brought into the picture to avoid the burden on the JobTracker, which does both scheduling and monitoring. So with YARN you don't have any JobTracker or TaskTracker. The job done by the JobTracker is now done by the ResourceManager, which has two main components: the Scheduler (allocating resources to applications) and the ApplicationsManager (accepting job submissions and restarting the ApplicationMaster in case of any failure). Now each application has an ApplicationMaster, which negotiates containers (where the job would be run) from the Scheduler for running the application.
The NodeManager runs on every slave node/data node. The ResourceManager may or may not be installed where the NameNode is present. For a large cluster we usually need to separate the masters, so that the load doesn't fall on a single machine.
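To make that division of labour concrete, here is a minimal sketch using the YARN client API (it assumes a reachable ResourceManager configured via the yarn-site.xml on the classpath): the client only ever talks to the ResourceManager, while the per-application ApplicationMaster and the per-node NodeManagers are handled by YARN itself.
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import scala.collection.JavaConverters._

// Sketch: list the applications the ResourceManager currently knows about.
val yarnClient = YarnClient.createYarnClient()
yarnClient.init(new YarnConfiguration()) // picks up yarn-site.xml from the classpath
yarnClient.start()
yarnClient.getApplications.asScala.foreach { report =>
  println(s"${report.getApplicationId}  ${report.getName}  ${report.getYarnApplicationState}")
}
yarnClient.stop()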

difference between hadoop mr1 and yarn and mr2?

Can someone please tell me what is the difference between MR1, YARN, and MR2?
My understanding is that MR1 has the below components:
Namenode,
secondary name node,
datanode,
job tracker,
task tracker
Yarn
Node manager
Resource Manager
Does YARN consist of MR1 or MR2 (or are MR2 and YARN the same)?
Sorry if I asked a basic-level question.
MRv1 uses the JobTracker to create and assign tasks to task trackers, which can become a resource bottleneck when the cluster scales out far enough (usually around 4,000 nodes).
MRv2 (aka YARN, "Yet Another Resource Negotiator") has a Resource Manager for each cluster, and each data node runs a Node Manager. In MapReduce MRv2, the functions of the JobTracker have been split between three services. The ResourceManager is a persistent YARN service that receives and runs applications (a MapReduce job is an application) on the cluster. It contains the scheduler, which, as previously, is pluggable. The MapReduce-specific capabilities of the JobTracker have been moved into the MapReduce Application Master, one of which is started to manage each MapReduce job and terminated when the job completes. The JobTracker function of serving information about completed jobs has been moved to the JobHistory Server. The TaskTracker has been replaced with the NodeManager, a YARN service that manages resources and deployment on a host. It is responsible for launching containers, each of which can house a map or reduce task.
YARN is a generic platform for any form of distributed application to run on, while MR2 is one such distributed application that runs the MapReduce framework on top of YARN

Queries about YARN (failure modes, container size, practical example)

I want to ask a few questions to understand the workings of YARN:
Can anyone explain, or refer me to a document that clearly describes, the failure modes in YARN (i.e. task failure, ApplicationMaster failure, NodeManager failure, ResourceManager failure)?
What is the container size in YARN? Is it the same as a slot in MapReduce 1?
Any practical/working example of YARN?
Thank you
Refer to the Hadoop: The Definitive Guide textbook ... Apart from that, there is a lot of info on the Apache website.
Container size is not fixed; it is dynamically allocated by the ResourceManager based on the application's requirements.
From a developer's perspective, the same old MapReduce will work on YARN.
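To illustrate the "no fixed slots" point, here is a sketch of the API calls an ApplicationMaster makes when asking for a container (the memory/vcore numbers are illustrative, and the AM registration boilerplate around it is omitted):
import org.apache.hadoop.yarn.api.records.{Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest

// Sketch: each request names its own size; the ResourceManager's scheduler
// grants whatever fits within its configured minimum/maximum limits.
val amrmClient = AMRMClient.createAMRMClient[ContainerRequest]()
val capability = Resource.newInstance(2048, 2) // 2 GB of memory, 2 vcores (illustrative)
val request = new ContainerRequest(capability, null, null, Priority.newInstance(0))
amrmClient.addContainerRequest(request)        // allocation happens on a later AM-RM heartbeat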
ResourceManager failures
In the initial versions of the YARN framework, a ResourceManager failure meant a total cluster failure, as it was a single point of failure. The ResourceManager stores the state of the cluster, such as the metadata of the submitted applications, information on cluster resource containers, information on the cluster's general configuration, and so on. Therefore, if the ResourceManager went down because of a hardware failure, there was no way to avoid manually debugging the cluster and restarting the ResourceManager. During the time the ResourceManager was down, the cluster was unavailable, and once it was restarted, all jobs needed a restart, so half-completed jobs lost their progress and had to be run again. In short, a restart of the ResourceManager used to restart all the running ApplicationMasters. The latest versions of YARN address this problem in two ways. One way is by creating an active-passive ResourceManager architecture, so that when one goes down, another becomes active and takes responsibility for the cluster. The other is by using a ZooKeeper ResourceManager quorum, so that the ResourceManager state is stored externally in ZooKeeper, with one ResourceManager in the active state and one or more ResourceManagers in passive mode, waiting for an event that brings them to the active state.
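The knobs behind those two recovery mechanisms live in yarn-site.xml; here is a minimal sketch of the relevant properties (shown programmatically only for illustration, with a hypothetical ZooKeeper quorum address):
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Sketch: these are normally set in yarn-site.xml, not in application code.
val conf = new YarnConfiguration()
conf.set("yarn.resourcemanager.ha.enabled", "true")       // active/standby ResourceManager pair
conf.set("yarn.resourcemanager.recovery.enabled", "true") // restore state after a restart
conf.set("yarn.resourcemanager.store.class",
  "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore")
conf.set("yarn.resourcemanager.zk-address", "zk1:2181,zk2:2181,zk3:2181") // hypothetical quorum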
ApplicationMaster failures
When the ApplicationMaster fails, the ResourceManager simply starts another container with a new ApplicationMaster running in it for another application attempt. It is the responsibility of the new ApplicationMaster to recover the state of the older ApplicationMaster, and this is possible only when ApplicationMasters persist their state in an external location so that it can be used for future reference. An ApplicationMaster that stores its state to persistent storage in this way allows all of the progress up to the failure to be recovered.
NodeManager Failures
If a NodeManager fails, the ResourceManager detects this failure using a time-out (that is, it stops receiving heartbeats from the NodeManager). The ResourceManager then removes the NodeManager from its pool of available NodeManagers. It also kills all the containers running on that node and reports the failure to all running AMs. The AMs are then responsible for reacting to the node failure by redoing the work done by any containers that were running on that node during the fault.
Container Failures
Container failures are reported by the NodeManager to the ResourceManager, and the ResourceManager informs the corresponding ApplicationMaster. The ApplicationMaster then restarts the work, typically by requesting a new container for the failed task.
