What's the equivalent of 'controller' in MapReduce as in Hadoop - hadoop

The original MapReduce paper says that the controller controls the MapReduce job flow.
But some papers use 'controller' for more specific tasks, such as collecting information from each mapper and controlling how the results are partitioned.
That doesn't look like the usage in the MapReduce paper, yet multiple papers refer to the same concept. So what is the equivalent of it in Hadoop?

This is the original paper: http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf
The word "controller" is used only once in the paper. Google has its own implementation of MapReduce, and I am not sure who, beyond those working at Google, knows much about that implementation.
Hadoop, on the other hand, is an open source implementation of MapReduce. Hadoop has two parts:
Storage
Processing
The storage system in Hadoop is called HDFS (Hadoop Distributed File System). The processing paradigm in Hadoop is MapReduce. Hadoop works in a master/slave architecture. For HDFS, there is a master (NameNode) and slaves (DataNodes), and for MapReduce, there is a master (JobTracker) and slaves (TaskTrackers).
Back to your question, if anything has the role of "controlling," then it should be the master (NameNode for HDFS/storage and JobTracker for MapReduce/processing).

Related

Understanding MapReduce in Hadoop 1.x

I am a bit confused about what the term "MapReduce" means with respect to Hadoop 1.x. In this context I come across various terms such as JobTracker and TaskTracker (the daemons in MapReduce). When we say MapReduce, does it refer to these daemons or to the API which a developer uses to code MapReduce applications?
Does the user application execute on the TaskTracker or the JobTracker? Is MapReduce itself a run-time environment?
Can anyone please help me understand this in simple words?
MapReduce is the programming model for data processing (in Hadoop).
Its implementation in Hadoop 1.x is often referred to as the Classic MapReduce implementation (or MapReduce v1). It uses Hadoop's JobTracker and TaskTrackers to execute jobs, and the corresponding APIs (user-facing, client-side features) to write them.
JobTracker coordinates the Job run.
TaskTrackers run the tasks that the job has been split into.
To sum up, the MapReduce APIs determine how a program following the MapReduce programming model has to be written, whereas the implementation determines how a job written using this programming model is executed.
The YARN implementation (MapReduce v2) of the MapReduce programming model differs in the APIs used for writing jobs and in the daemons (ResourceManager, ApplicationMaster and NodeManagers) used for execution.
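To make the split between programming model and execution concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API: the names `map_fn`, `reduce_fn` and `run_job` are hypothetical, and the in-memory grouping inside `run_job` only stands in for the work that the JobTracker and TaskTrackers (or the YARN daemons) would distribute across machines.

```python
from collections import defaultdict

def map_fn(line):
    """User-written map step: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """User-written reduce step: sum the counts collected for one word."""
    return key, sum(values)

def run_job(lines, mapper, reducer):
    """Stand-in for the runtime: map, group by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))

if __name__ == "__main__":
    print(run_job(["Hadoop runs MapReduce", "MapReduce runs tasks"], map_fn, reduce_fn))
```

The user only writes `map_fn` and `reduce_fn`; everything in `run_job` is what the implementation (MapReduce v1 or YARN) takes care of.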

Is HDFS necessary for Spark workloads?

HDFS is not necessary but recommendations appear in some places.
To help evaluate the effort spent in getting HDFS running:
What are the benefits of using HDFS for Spark workloads?
Spark is a distributed processing engine and HDFS is a distributed storage system.
If HDFS is not an option, then Spark has to use some alternative such as Apache Cassandra or Amazon S3.
Have a look at this comparison:
S3 – Non urgent batch jobs. S3 fits very specific use cases, when data locality isn’t critical.
Cassandra – Perfect for streaming data analysis and an overkill for batch jobs.
HDFS – Great fit for batch jobs without compromising on data locality.
When to use HDFS as storage engine for Spark distributed processing?
If you already have a big Hadoop cluster in place and are looking for real-time analytics of your data, Spark can use the existing Hadoop cluster. This will reduce development time.
Spark is an in-memory computing engine. Since data can't always fit into memory, it has to be spilled to disk for some operations. Spark will benefit from HDFS in this case. The sort benchmark record achieved by Spark used HDFS storage for the sorting operation.
HDFS is a scalable, reliable and fault-tolerant distributed file system (since the Hadoop 2.x release). With the data locality principle, processing speed is improved.
Best for Batch-processing jobs.
The shortest answer is: "No, you don't need it". You can analyse data even without HDFS, but of course you then need to replicate the data on all your nodes.
The long answer is quite counterintuitive and I'm still trying to understand it with the help of the Stack Overflow community.
Spark local vs hdfs performance
HDFS (or any distributed filesystem) makes distributing your data much simpler. Using a local filesystem, you would have to partition/copy the data by hand to the individual nodes and be aware of the data distribution when running your jobs. In addition, HDFS also handles node failures.
As for the integration between Spark and HDFS, you can imagine Spark knowing about the data distribution, so it will try to schedule tasks on the same nodes where the required data resides.
Second: which problems did you face exactly with the instruction?
BTW: if you are just looking for an easy setup on AWS, DCOS allows you to install HDFS with a single command...
So you could go with a Cloudera or Hortonworks distro and load up an entire stack very easily. CDH will be used with YARN, though I find it much more difficult to configure Mesos in CDH. Hortonworks is much easier to customize.
HDFS is great because DataNodes give you data locality (process where the data is), as shuffling/data transfer is very expensive. HDFS also naturally splits files into blocks, which allows Spark to partition on the blocks (128 MB blocks; you can change this).
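As a rough illustration of the block arithmetic, the sketch below computes how many blocks a file occupies for a given block size. The function name `num_blocks` is hypothetical, and the assumption that Spark typically creates about one input partition per block is a simplification; the actual count depends on the input format and configuration.

```python
import math

BLOCK_SIZE_MB = 128  # block size mentioned above; it is configurable

def num_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks a file of the given size is split into."""
    if file_size_mb <= 0:
        return 0
    return math.ceil(file_size_mb / block_size_mb)

if __name__ == "__main__":
    # A 1000 MB file: 7 full 128 MB blocks plus one partial block.
    print(num_blocks(1000))  # 8
```

Under that assumption, a 1000 MB file would be read as roughly 8 partitions, each of which can be processed on a node that holds the corresponding block.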
You could use S3 and Redshift.
See here:
https://github.com/databricks/spark-redshift

Differences between MapReduce and Yarn

I was reading about Hadoop and MapReduce with respect to the straggler problem, and the papers on that problem,
but yesterday I found out that there is Hadoop 2 with YARN.
Unfortunately, no paper talks about the straggler problem in YARN.
So I want to know: what is the difference between MapReduce and YARN as far as stragglers are concerned?
Does YARN suffer from the straggler problem?
And when the MR master asks the ResourceManager for resources, will the ResourceManager give the MR master all the resources it needs, or does that depend on the cluster's computing capabilities?
Thanks so much.
Here is how MapReduce 1.0 and MapReduce 2.0 (YARN) compare.
MapReduce 1.0
In a typical Hadoop cluster, racks are interconnected via core switches. Core switches should connect to top-of-rack switches. Enterprises using Hadoop should consider using 10GbE, bonded Ethernet and redundant top-of-rack switches to mitigate risk in the event of failure. A file is broken into 64 MB chunks by default and distributed across DataNodes. Each chunk has a default replication factor of 3, meaning there will be 3 copies of the data at any given time. Hadoop is “rack aware”, and HDFS places replicated chunks on nodes in different racks. The JobTracker assigns tasks to the nodes closest to the data, based on node location, and rack awareness helps the NameNode determine the ‘closest’ chunk to a client during reads. The administrator supplies a script which tells Hadoop which rack a node is in, for example: /enterprisedatacenter/rack2.
Limitations of MapReduce 1.0: Hadoop can scale up to 4,000 nodes. Beyond that limit, it shows unpredictable behavior such as cascading failures and serious deterioration of the overall cluster. Another issue is multi-tenancy: it is impossible to run frameworks other than MapReduce 1.0 on a Hadoop cluster.
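A rack-mapping script of the kind described above could look like the following sketch. It assumes the usual contract for such scripts: Hadoop passes one or more node addresses as command-line arguments and reads one rack path per address from standard output. The address table and the `resolve` helper are hypothetical; a real script would load the mapping from the site's own inventory.

```python
#!/usr/bin/env python3
import sys

# Hypothetical address-to-rack table for illustration only.
RACKS = {
    "10.0.1.11": "/enterprisedatacenter/rack1",
    "10.0.2.21": "/enterprisedatacenter/rack2",
}
# Fallback for nodes that are not in the table.
DEFAULT_RACK = "/default-rack"

def resolve(hosts):
    """Return the rack path for each host, in the same order as the input."""
    return [RACKS.get(host, DEFAULT_RACK) for host in hosts]

if __name__ == "__main__":
    # One rack path per argument, one per line.
    print("\n".join(resolve(sys.argv[1:])))
```

Invoked with the arguments `10.0.2.21 10.0.9.99`, this sketch prints `/enterprisedatacenter/rack2` followed by `/default-rack`.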
MapReduce 2.0
MapReduce 2.0 has two components – YARN that has cluster resource management capabilities and MapReduce.
In MapReduce 2.0, the JobTracker is divided into three services:
ResourceManager, a persistent YARN service that receives and runs applications on the cluster. A MapReduce job is an application.
JobHistoryServer, which provides information about completed jobs.
ApplicationMaster, which manages a single MapReduce job and is terminated when the job completes.
TaskTracker has been replaced with the NodeManager, a YARN service that manages resources and deployment on a node. NodeManager is responsible for launching containers that could either be a map or reduce task.
This new architecture breaks up the JobTracker model by letting a new ResourceManager manage resource usage across applications, while ApplicationMasters take responsibility for managing the execution of jobs. This change removes a bottleneck and lets Hadoop clusters scale to configurations larger than 4,000 nodes. The architecture also allows simultaneous execution of a variety of programming models such as graph processing, iterative processing, machine learning, and general cluster computing, including traditional MapReduce.
You say "Differences between MapReduce and YARN". MapReduce and YARN are definitely different things. MapReduce is a programming model; YARN is an architecture for managing a distributed cluster. Hadoop 2 uses YARN for resource management. Besides that, Hadoop supports a programming model for parallel processing, which is what we know as MapReduce. Hadoop already supported MapReduce before Hadoop 2. In short, MapReduce runs on top of the YARN architecture. Sorry, I can't say anything about the straggler problem.
"when MRmaster asks resource manger for resources?"
When the user submits a MapReduce job. After the MapReduce job has finished, the resources are freed again.
"resource manger will give MRmaster all resources it needs or it is according to cluster computing capabilities"
I don't quite get the point of this question. The ResourceManager will give the job all the resources it needs, whatever the cluster's computing capabilities are. Those capabilities will influence the processing time.
There is no YARN in MapReduce 1. In MapReduce 2 there is YARN.
If by the straggler problem you mean that one task waiting on 'something' causes further waits down the line for everything that depends on it, then I guess this problem always exists in MR jobs. Waiting for allocated resources naturally contributes to it, along with everything else that may cause components to wait.
Tez, which is supposed to be a drop-in replacement for the MR job runtime, does things differently. Instead of running tasks the way the current MR AppMaster does, it uses a DAG of tasks, which does a much better job of avoiding bad straggler problems.
You need to understand the relationship between MR and YARN. YARN is simply a dumb resource scheduler, meaning it doesn't schedule 'tasks'. What it gives to the MR AppMaster is a set of resources (essentially just a combination of memory, CPU and location). It is then the MR AppMaster's responsibility to decide what to do with those resources.
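The division of labour described here can be sketched with a toy model. This is not YARN's scheduler or API: the `Node` class and the `allocate` function are hypothetical, and real YARN schedulers also apply queue and fairness policies that are left out. The sketch only shows that a request is answered with containers (memory, CPU, location), that the grant is bounded by what the nodes have free, and that nothing in the allocator knows about map or reduce tasks.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_mem_mb: int
    free_vcores: int

def allocate(nodes, requested, mem_mb, vcores):
    """Grant up to `requested` containers, limited by free capacity on each node.

    Returns the node name for each granted container. Deciding which task
    runs in which container is left to the caller (the AppMaster's role).
    """
    granted = []
    for node in nodes:
        while (len(granted) < requested
               and node.free_mem_mb >= mem_mb
               and node.free_vcores >= vcores):
            node.free_mem_mb -= mem_mb
            node.free_vcores -= vcores
            granted.append(node.name)
    return granted

if __name__ == "__main__":
    cluster = [Node("n1", 4096, 2), Node("n2", 2048, 4)]
    # Ask for 5 containers of 1024 MB and 1 vcore each.
    print(allocate(cluster, 5, 1024, 1))  # ['n1', 'n1', 'n2', 'n2']
```

In this toy run only four of the five requested containers are granted: n1 runs out of vcores and n2 runs out of memory. The caller would have to wait for capacity to free up before the fifth container could be granted.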

What is the Hadoop ecosystem and how does Apache Spark fit in?

I'm having a lot of trouble grasping what exactly a 'Hadoop ecosystem' is conceptually. I understand that you have some data processing tasks that you want to run, and so you use MapReduce to split the job up into smaller pieces, but I'm unsure about what people mean when they say 'Hadoop ecosystem'. I'm also unclear as to what the benefits of Apache Spark are and why it is seen as so revolutionary. If it's all in-memory calculation, wouldn't that just mean that you would need higher-RAM machines to run Spark jobs? How is Spark different from writing some parallelized Python code or something of that nature?
Your question is rather broad - the Hadoop ecosystem is a wide range of technologies that either support Hadoop MapReduce, make it easier to apply, or otherwise interact with it to get stuff done.
Examples:
The Hadoop Distributed Filesystem (HDFS) stores data to be processed by MapReduce jobs, in a scalable redundant distributed fashion.
Apache Pig provides a language, Pig Latin, for expressing data flows that are compiled down into MapReduce jobs
Apache Hive provides an SQL-like language for querying huge datasets stored in HDFS
There are many, many others - see for example https://hadoopecosystemtable.github.io/
Spark is not all in-memory; it can perform calculations in-memory if enough RAM is available, and can spill data over to disk when required.
It is particularly suitable for iterative algorithms, because data from the previous iteration can remain in memory. It provides a very different (and much more concise) programming interface, compared to plain Hadoop. It can provide some performance advantages even when the work is mostly done on disk rather than in-memory. It supports streaming as well as batch jobs. It can be used interactively, unlike Hadoop.
Spark is relatively easy to install and play with, compared to Hadoop, so I suggest you give it a try to understand it better - for experimentation it can run off a normal filesystem and does not require HDFS to be installed. See the documentation.

Data movement HDFS Vs Parallel file system Vs MPI

I'm currently working on implementing machine learning algorithms on MR-MPI (MapReduce on MPI). I'm also trying to understand other MapReduce frameworks, especially Hadoop, so the following is my basic question (I'm new to MapReduce frameworks; I apologize if my question doesn't make sense).
Question: MapReduce can be implemented on top of many things, such as a parallel file system (GPFS), HDFS, MPI, etc. After the map step there is a collate operation, followed by a reduce operation. The collate operation requires some data movement across the nodes. In this regard, I would like to know how the data movement mechanisms (between nodes) differ in HDFS vs. GPFS vs. MPI.
I would appreciate a good explanation and some good references on each of these so that I can get into further details.
Thanks.
MapReduce as a paradigm can be implemented on many storage systems. Indeed, Hadoop has a so-called DFS (distributed file system) abstraction which enables the integration of different storage systems so that MapReduce can run over them. For example, there are integrations for Amazon S3, the local file system, OpenStack Swift and others.
At the same time, the HDFS integration has one special property: it reports to the MR engine (the JobTracker, to be more specific) where the data resides, which enables smart scheduling of map tasks so that the data to be processed by each mapper is usually colocated with that mapper.
As a result, during the map phase data is generally not moved over the network when MR runs over HDFS.
More generally, the idea of Hadoop MR is to move the code to the data and not the other way around. This should be an important criterion when evaluating any scalable MR implementation: does the system make sure that mappers process local data?
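The locality-aware scheduling described above can be sketched as a toy function. This is not the JobTracker's actual algorithm: `assign_map_tasks`, the slot counts and the block-location table are all hypothetical. The sketch prefers a node that holds a replica of the block and falls back to any node with a free slot, in which case the data would have to travel over the network.

```python
def assign_map_tasks(block_locations, free_slots):
    """Assign each block's map task to a node holding a replica when possible.

    block_locations: {block_id: [nodes holding a replica of the block]}
    free_slots: {node: number of free map slots}; updated as slots are used
    Returns {block_id: (node, is_local)}.
    """
    assignment = {}
    for block, replicas in block_locations.items():
        # Prefer a node that already stores the block (move code to data).
        node = next((n for n in replicas if free_slots.get(n, 0) > 0), None)
        is_local = node is not None
        if node is None:
            # Fall back to any node with a free slot; the block is read remotely.
            node = next((n for n, slots in free_slots.items() if slots > 0), None)
            if node is None:
                raise RuntimeError("no free map slots left")
        free_slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment

if __name__ == "__main__":
    blocks = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n1"]}
    slots = {"n1": 2, "n2": 1}
    print(assign_map_tasks(blocks, slots))
    # {'b1': ('n1', True), 'b2': ('n1', True), 'b3': ('n2', False)}
```

In this run only the task for b3 reads its input remotely, because n1 has no free slot left. A storage integration that cannot report block locations would give the scheduler nothing to prefer, so every assignment would effectively be the fallback case.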
The OP has mixed up a couple of things, messaging and file systems, so there are multiple answers.
Hadoop/MPI is a WIP and you can find more details here.
Hadoop/GPFS is still open.
Hadoop/HDFS comes out of the box from Apache Hadoop. For data transfer between the mappers and reducers HTTP is used, not sure why.

Resources