Understanding MapReduce in Hadoop 1.x

I am a bit confused about what the term "MapReduce" means with respect to Hadoop 1.x. In this context I come across various terms like JobTracker and TaskTracker (the daemons in MapReduce). Now, when we say MapReduce, does it refer to these daemons, or to the API which a developer uses to code MapReduce applications?
Does the user application execute on the TaskTracker and JobTracker? Is MapReduce itself a run-time environment?
Can anyone please help me understand this in simple words?

MapReduce is the programming model for data processing (in Hadoop).
Its implementation in Hadoop 1.x is often referred to as the classic MapReduce implementation (or MapReduce v1). It uses Hadoop's JobTracker and TaskTrackers for the execution of jobs, and its corresponding APIs (the user-facing, client-side features) for writing them.
JobTracker coordinates the Job run.
TaskTrackers run the tasks that the job has been split into.
To sum up, the MapReduce APIs determine how a program in the MapReduce programming model has to be written, whereas the implementation determines how a job written using this programming model is executed.
The YARN implementation (MapReduce v2) of the MapReduce programming model differs both in the APIs used for writing jobs and in the daemons (ResourceManager, ApplicationMaster and NodeManagers) used for executing them.
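To make the distinction concrete, here is a minimal word-count sketch written against the user-facing API (the org.apache.hadoop.mapreduce package; the Hadoop 2-style Job.getInstance call is shown, older releases construct Job directly). The "programming model" half is the Mapper and Reducer; the "implementation" half takes over at submission time:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The programming-model half: the user writes only map() and reduce().
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    // The implementation half takes over here: the daemons (JobTracker and
    // TaskTrackers in v1) split the submitted job into tasks and run them.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}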

Related

Multi-threading in Hadoop/Spark

I have an idea about Multi-threading in general but not sure how it is used in Hadoop.
Based on my knowledge, YARN is responsible for managing/controlling Spark/MapReduce job resources; I can't see multi-threading there. I'm not sure whether it is used anywhere else in the Hadoop ecosystem.
I would appreciate it if anybody could provide some information on this.
Many thanks,
Actually, YARN is responsible for managing the resource allocation and de-allocation for containers requested by an application master (the MR AppMaster or the Spark driver). The RPCs between them are all about negotiating resource agreements; YARN does not consider any details of how tasks run inside MapReduce or Spark.
For MapReduce on Hadoop, each task (mapper or reducer) is a single process running in its own JVM; no multi-threading is employed here.
For Spark, each executor is actually composed of many worker threads. A Spark task corresponds to a task (a single process) in MapReduce, so Spark does build on a multi-threaded model to lower JVM overhead and the cost of shuffling data between tasks.
Based on my experience, multi-threaded models lower overhead but pay a large cost in fault tolerance: if an executor in Spark fails, all the tasks running inside that executor have to re-run, whereas only the single failed task needs to re-run in MapReduce. Spark also suffers from higher memory pressure, because all the tasks inside an executor need to cache data as RDDs, while a MapReduce task only processes one block at a time.
Hope this is helpful.
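One nuance worth adding to the above: classic MapReduce does ship an opt-in multi-threaded mapper wrapper, MultithreadedMapper, which runs several copies of the map function on threads inside the single task JVM. A minimal sketch, where MyMapper stands in for your own mapper class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "multithreaded-map");
// The job's visible mapper is the wrapper ...
job.setMapperClass(MultithreadedMapper.class);
// ... which delegates to the real mapper on a pool of threads; this mainly
// helps when map() is I/O-bound, e.g. one external lookup per record.
MultithreadedMapper.setMapperClass(job, MyMapper.class);
MultithreadedMapper.setNumberOfThreads(job, 8);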
It is possible to run multithreaded code in Spark. Take this example of Java code in a Spark driver:
anyCollection.parallelStream().forEach(item -> {
    // Add your Spark code here.
});
Now, based on the number of cores available to the driver, it will run the lambda on multiple driver-side threads in parallel, and each Spark action triggered inside is submitted as a separate, concurrent job.
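A more complete, self-contained sketch of that pattern (the class name, local master and input paths are illustrative):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelJobs {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("parallel-jobs").setMaster("local[4]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<String> paths = Arrays.asList("in/a.txt", "in/b.txt", "in/c.txt");
        // Each forEach body runs on a driver-side thread from the common
        // ForkJoinPool; every count() is its own Spark job, so the jobs
        // are submitted and scheduled concurrently.
        paths.parallelStream().forEach(path -> {
            long n = sc.textFile(path).count();
            System.out.println(path + ": " + n);
        });

        sc.stop();
    }
}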

mapreduce parameters in Spark

I wanted to know if mapreduce.* parameters are applicable in Spark.
As far as I know, in Spark there is no buffer for the map output, and for the reduce task the whole process is also different. Parameters like mapreduce.task.io.sort.mb, mapreduce.reduce.shuffle.input.buffer.percent or mapreduce.reduce.input.buffer.percent control these kinds of buffers.
I'm working on optimising parameters for Spark tasks/jobs running in a Hadoop/YARN cluster.
Is it safe to say that these mapreduce parameters don't matter and that I should only care about spark.* parameters, since the map, shuffle and reduce parts are different?
It's safe because Spark doesn't use MapReduce as its processing engine; it interacts directly with YARN to submit operations. Thus, when you use Spark, no MapReduce job is scheduled; instead you have a Spark application and Spark jobs.
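To illustrate the point, the knobs playing a role comparable to those buffers live under spark.* keys instead; a small sketch (the values are illustrative, not recommendations):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
        .setAppName("shuffle-tuning")
        // Size of the in-memory buffer per shuffle file output stream
        .set("spark.shuffle.file.buffer", "64k")
        // Max size of map output fetched simultaneously by each reduce task
        .set("spark.reducer.maxSizeInFlight", "96m")
        // Whether to compress data spilled during shuffles
        .set("spark.shuffle.spill.compress", "true");
JavaSparkContext sc = new JavaSparkContext(conf);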

Differences between MapReduce and Yarn

I was reading about Hadoop and MapReduce with respect to straggler problems, and about the papers on this problem, but yesterday I found that there is Hadoop 2 with YARN; unfortunately, no paper discusses the straggler problem in YARN.
So I want to know: what is the difference between MapReduce and YARN with regard to stragglers?
Does YARN suffer from the straggler problem?
And when the MR master asks the resource manager for resources, will the resource manager give the MR master all the resources it needs, or does it depend on the cluster's computing capabilities?
Thanks so much.
Here is an overview of MapReduce 1.0 and MapReduce 2.0 (YARN).
MapReduce 1.0
In a typical Hadoop cluster, racks are interconnected via core switches, and core switches connect to top-of-rack switches. Enterprises using Hadoop should consider 10GbE, bonded Ethernet and redundant top-of-rack switches to mitigate risk in the event of failure. A file is broken into 64MB chunks by default and distributed across DataNodes. Each chunk has a default replication factor of 3, meaning there will be 3 copies of the data at any given time. Hadoop is "Rack Aware", and HDFS replicates chunks on nodes on different racks. The JobTracker assigns tasks to nodes closest to the data, and rack awareness helps the NameNode determine the 'closest' chunk for a client during reads. The administrator supplies a script which tells Hadoop which rack each node is in, for example: /enterprisedatacenter/rack2.
Limitations of MapReduce 1.0 – Hadoop can scale up to about 4,000 nodes. Beyond that limit it shows unpredictable behavior, such as cascading failures and serious deterioration of the overall cluster. Another issue is multi-tenancy – it is impossible to run frameworks other than MapReduce 1.0 on such a Hadoop cluster.
MapReduce 2.0
MapReduce 2.0 has two components – YARN that has cluster resource management capabilities and MapReduce.
In MapReduce 2.0, the JobTracker is divided into three services:
ResourceManager, a persistent YARN service that receives and runs applications on the cluster. A MapReduce job is an application.
JobHistoryServer, to provide information about completed jobs
ApplicationMaster, to manage each MapReduce job; it is terminated when the job completes.
TaskTracker has been replaced by the NodeManager, a YARN service that manages resources and deployment on a node. The NodeManager is responsible for launching containers, each of which can run a map or reduce task.
This new architecture breaks up the JobTracker model by letting a new ResourceManager manage resource usage across applications, with ApplicationMasters taking responsibility for managing the execution of jobs. This change removes a bottleneck and lets Hadoop clusters scale to configurations larger than 4,000 nodes. The architecture also allows simultaneous execution of a variety of programming models, such as graph processing, iterative processing, machine learning, and general cluster computing, including traditional MapReduce.
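From the client's perspective, which implementation executes a job is largely a configuration matter; a minimal sketch using the Hadoop 2 property name (to my knowledge, early Hadoop 2 releases also accepted "classic" for the old JobTracker/TaskTracker runtime):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// "yarn" selects the MRv2 runtime (ResourceManager/ApplicationMaster/
// NodeManagers); "local" runs the job in-process for testing.
conf.set("mapreduce.framework.name", "yarn");
Job job = Job.getInstance(conf, "runs-on-yarn");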
You say "Differences between MapReduce and YARN". MapReduce and YARN definitely different. MapReduce is Programming Model, YARN is architecture for distribution cluster. Hadoop 2 using YARN for resource management. Besides that, hadoop support programming model which support parallel processing that we known as MapReduce. Before hadoop 2, hadoop already support MapReduce. In short, MapReduce run above YARN Architecture. Sorry, i don't mention in part of straggler problem.
"when MRmaster asks resource manger for resources?"
when user submit MapReduce Job. After MapReduce job has done, resource will be back to free.
"resource manger will give MRmaster all resources it needs or it is according to cluster computing capabilities"
I don't get this question point. Obviously, the resources manager will give all resource it needs no matter what cluster computing capabilities. Cluster computing capabilities will influence on processing time.
There is no YARN in MapReduce 1; YARN is introduced in MapReduce 2.
If by the straggler problem you mean that one task waits on something, which then causes further waits down the road for everything that depends on it, then I guess this problem always exists in MR jobs. Getting allocated resources naturally contributes to this problem, along with everything else that may cause components to wait.
Tez, which is supposed to be a drop-in replacement for the MR job runtime, does things differently: instead of running tasks the way the current MR AppMaster does, it uses a DAG of tasks, which does a much better job of avoiding bad straggler situations.
You need to understand the relationship between MR and YARN. YARN is simply a generic resource scheduler, meaning it doesn't schedule 'tasks'. What it gives to the MR AppMaster is a set of resources (in a sense, just a combination of memory, CPU and location). It is then the MR AppMaster's responsibility to decide what to do with those resources.
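Worth adding on the straggler side: MapReduce's classic mitigation, speculative execution, is still there under YARN; the AppMaster launches backup attempts of unusually slow tasks and keeps whichever attempt finishes first. A sketch with the Hadoop 2 property names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Launch backup attempts of unusually slow ("straggler") tasks on other
// nodes; the first attempt to finish wins and the rest are killed.
conf.setBoolean("mapreduce.map.speculative", true);
conf.setBoolean("mapreduce.reduce.speculative", true);
Job job = Job.getInstance(conf, "straggler-tolerant-job");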

In which types of use cases is MapReduce superior to Spark?

I just attended an introductory class on Spark and asked the speaker if Spark could fully replace MapReduce. I was told that Spark can be used in place of MapReduce for any use case, but that there are particular use cases in which MapReduce is actually faster than Spark.
What are the characteristics of the use cases that MapReduce can solve faster than Spark?
Pardon me for quoting myself from Quora, but:
For the data-parallel, one-pass, ETL-like jobs MapReduce was designed for, MapReduce is lighter-weight compared to the Spark equivalent
Spark is fairly mature, and so is YARN now, but Spark-on-YARN is still pretty new; the two may not be optimally integrated yet. For example, until recently I don't think Spark could ask YARN for allocations based on the number of cores. That is: MapReduce might be easier to understand, manage and tune.
You can reproduce almost all of MapReduce's behavior in Spark, since Spark has narrow, simpler functions that can be combined to produce the same kinds of execution. But you don't always want to mimic MapReduce.
One thing Spark can't do yet is an out-of-core sort of the kind you happen to get for free from how classic MapReduce works, but that's coming. I suppose there aren't very direct analogs of a few things like MultipleOutputs either.
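For readers who haven't met it, MultipleOutputs lets one MR job write to several named outputs. A minimal sketch of its reducer-side use (the output name "stats" and the reducer are illustrative; the driver must also register the named output via MultipleOutputs.addNamedOutput):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public static class StatsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context ctx) {
        mos = new MultipleOutputs<>(ctx);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        // Route this record to the side output named "stats" rather than
        // (or in addition to) the job's default output.
        mos.write("stats", key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        mos.close();  // flush the extra output streams
    }
}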

What's the equivalent of 'controller' in MapReduce as in Hadoop

In the original MapReduce paper, it is said that the controller controls the MapReduce job flow.
But some papers use 'controller' for more specific tasks, like collecting information from each mapper and controlling different partitions of the result.
This doesn't seem to have a direct 'MapReduce' equivalent, yet multiple papers refer to the same concept. So... what is its equivalent in Hadoop?
This is the original paper: http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf
There is only one use of the word "controller" in the paper. Google has its own implementation of MapReduce, and I am not sure who, beyond those working at Google, knows much about their implementation.
Hadoop, on the other hand, is an open source implementation of MapReduce. Hadoop has 2 parts.
Storage
Processing
The storage system in Hadoop is called HDFS (Hadoop Distributed File System). The processing paradigm in Hadoop is MapReduce. Hadoop works in a master/slave architecture. For HDFS, there is a master (NameNode) and slaves (DataNodes), and for MapReduce, there is a master (JobTracker) and slaves (TaskTrackers).
Back to your question, if anything has the role of "controlling," then it should be the master (NameNode for HDFS/storage and JobTracker for MapReduce/processing).
