Why is MapReduce not a good fit for applications that require node coordination?
Also, when is it better than MPI for large-scale distributed applications?
Any clustered system, including Hadoop, has both benefits and costs. The benefit is the ability to compute in parallel; the cost is the overhead of task distribution.
Suppose I don't need those benefits and am using only one node. How can I run Hadoop so that this overhead is avoided completely? Is running a single-node pseudo-distributed setup sufficient, or will it still carry some parallelization overhead?
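For what it's worth, a pseudo-distributed setup still starts all the daemons and launches each task in its own JVM, so some overhead remains; the lowest-overhead way to run a job on one node is the local job runner, which runs the whole job in a single JVM. A minimal mapred-site.xml sketch selecting it:

```xml
<!-- mapred-site.xml: minimal sketch selecting the local job runner -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>local</value>  <!-- run the whole job in a single JVM, no YARN -->
  </property>
</configuration>
```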
I am writing a MapReduce program that will run on AWS EMR.
My program calculates probabilities from the Google Ngram corpus.
I was wondering whether there is a difference between running a single MapReduce job that handles all the calculations at once and multiple MapReduce jobs that each handle one calculation at a time.
Both are done without using any data structures (arrays, lists...).
Is there a difference in terms of efficiency or network communication?
Both do exactly the same thing in the same manner; I only split up the calculations done by the reducer.
Yes, there will be a difference between them, but the magnitude of the difference depends on your MapReduce program.
The reason for the difference is that when you run multiple light MapReduce programs, there is the overhead of starting and executing multiple mappers and reducers. Each MapReduce program, when it starts, requires the allocation of containers, for which the application master has to communicate back and forth with the resource manager and node managers; new log files are generated; network communication between the name node and data nodes is needed; and there are many other overheads besides. So a single heavy MapReduce job is better than several light MapReduce jobs if your program is not that large.
But if your single MapReduce program is so large and complex that it clogs the JVM and memory (which, in my view, is highly unlikely unless your cluster hardware is very minimal), then multiple small MapReduce jobs are more feasible.
From your question, my intuition is that your MapReduce program is not that large, so I would suggest you go ahead with a single heavy MapReduce job.
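Here is a rough driver sketch of the two options. The identity Mapper and Reducer stand in for the real ngram classes, and the calculation names and paths are made up; the point is that every waitForCompletion() call pays the per-job startup cost described above.

```java
// Rough sketch only: identity Mapper/Reducer stand in for the real ngram classes,
// and calculation names/paths are invented. Every waitForCompletion() call pays
// the per-job startup cost: application master launch, container negotiation with
// the ResourceManager/NodeManagers, new log files, NameNode/DataNode traffic.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NgramDriver {

    private static Job newJob(Configuration conf, String name, Path in, Path out) throws Exception {
        Job job = Job.getInstance(conf, name);
        job.setJarByClass(NgramDriver.class);
        job.setMapperClass(Mapper.class);      // replace with the real ngram mapper
        job.setReducerClass(Reducer.class);    // replace with the real probability reducer
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, in);
        FileOutputFormat.setOutputPath(job, out);
        return job;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);

        // Option A: one heavy job whose reducer emits every probability per key.
        newJob(conf, "ngram-all-calcs", in, new Path(out, "all")).waitForCompletion(true);

        // Option B: one light job per calculation; the startup cost is paid each time.
        for (String calc : new String[] {"unigram", "bigram", "conditional"}) {
            newJob(conf, "ngram-" + calc, in, new Path(out, calc)).waitForCompletion(true);
        }
    }
}
```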
I'm no expert at all on Hadoop, but it is my understanding that Hadoop is well suited for parallel algorithms where the parallelism takes the form of map-reduce or some other kind of divide and conquer.
Are there other classes of algorithmic techniques that are well suited as well?
Hadoop is suited for embarrassingly parallel workloads (no dependencies between parallel tasks). There is no mechanism for message passing between processes. Map and Reduce processes follow an IO-based communication pattern, which is itself a large overhead.
MapReduce is not suitable for programming iterative algorithms (for example, KMeans or PageRank), because each iteration is a separate MapReduce application, and the huge IO overhead degrades the algorithm's performance. For iterative algorithms you can use the Message Passing Interface (MPI). It supports socket-based communication between processes, so you can achieve a significant performance improvement compared to MapReduce. Since a large number of machine learning algorithms are iterative in nature, MapReduce should not be used to program them.
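To illustrate what "each iteration is a separate MapReduce application" looks like, here is a rough driver sketch; the identity Mapper and Reducer stand in for real KMeans assign/update steps, and the iteration count and paths are arbitrary. Every pass launches a fresh job and moves its state through HDFS.

```java
// Rough sketch: identity Mapper/Reducer stand in for real KMeans assign/update
// steps; the point is the shape of the driver, not the math. Every iteration is
// submitted as a brand-new MapReduce job, and state crosses iterations only by
// being written to and re-read from HDFS, which is where the IO overhead comes from.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path data = new Path(args[0]);      // input points, re-read from HDFS every iteration
        Path work = new Path(args[1]);      // per-iteration output directories
        int maxIterations = 10;             // arbitrary stopping condition for the sketch

        for (int i = 0; i < maxIterations; i++) {
            Job job = Job.getInstance(conf, "kmeans-iteration-" + i);
            job.setJarByClass(IterativeDriver.class);
            job.setMapperClass(Mapper.class);    // real code: assign each point to a centroid
            job.setReducerClass(Reducer.class);  // real code: recompute centroids
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, data);
            FileOutputFormat.setOutputPath(job, new Path(work, "iter-" + i));
            // A real driver would feed iteration i's output into iteration i+1 (e.g. via
            // the distributed cache); either way, intermediate state lives on disk.
            if (!job.waitForCompletion(true)) {
                System.exit(1);   // job startup cost (AM, containers, logs) paid each pass
            }
        }
    }
}
```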
If fault tolerance is necessary for your application, Hadoop is a better option than MPI.
I am on my way to becoming a Cloudera Hadoop administrator. Since I started, I have been hearing a lot about computing slots per machine in a Hadoop cluster, such as defining the number of map slots and reduce slots.
I have searched the internet for a long time for a beginner-level definition of a MapReduce slot but didn't find any.
I am really pissed off with going through PDFs explaining the configuration of MapReduce.
Please explain what exactly a computing slot means on a machine in a cluster.
In MapReduce v1, mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum are used in mapred-site.xml to configure the number of map slots and reduce slots respectively.
Starting from MapReduce v2 (YARN), "container" is a more generic term used instead of "slot"; containers determine the maximum number of tasks that can run in parallel on a node, regardless of whether they are map tasks, reduce tasks, or application master tasks (in YARN).
Generally it depends on CPU and memory.
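For illustration only (the numbers are made up and would be sized to the node's CPU and memory), the MRv1 slot properties go in mapred-site.xml, while in YARN the number of containers that fit on a node follows from the resources the NodeManager advertises in yarn-site.xml:

```xml
<!-- mapred-site.xml (MRv1): slot counts per TaskTracker; values are examples -->
<property>
  <name>mapreduce.tasktracker.map.tasks.maximum</name>
  <value>20</value>
</property>
<property>
  <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
  <value>15</value>
</property>

<!-- yarn-site.xml (MRv2/YARN): resources a NodeManager offers for containers -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>57344</value>  <!-- e.g. 56 GB of a 64 GB node left for containers -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>32</value>
</property>
```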
In our cluster, we set 20 map slots and 15 reduce slots for a machine with 32 cores and 64 GB of memory.
1. Approximately one slot needs one CPU core.
2. The number of map slots should be a little higher than the number of reduce slots.
In MRv1, each machine had a fixed number of slots dedicated to maps and reduces.
In general, each machine is configured with a 4:1 ratio of map slots to reduce slots.
Logically, one would be reading a lot of data (maps) and crunching it down to a small set (reduce).
In MRv2, the concept of containers came in, and any container can run a map, a reduce, or a shell script.
A bit late, but I'll answer anyway.
Computing slot: think of all the various computations in Hadoop that require some resource, i.e., memory, CPU cores, or disk size.
Resource = memory, CPU cores, or disk size required
Allocating resources to start a container, allocating resources to perform a map or a reduce task, and so on.
It is all about how you want to manage the resources you have in hand. And what are those? RAM, cores, and disk size.
The goal is to ensure your processing is not constrained by any one of these cluster resources. You want your processing to be as dynamic as possible.
As an example, Hadoop YARN allows you to configure the minimum RAM required to start a YARN container, the minimum RAM required to start a map/reduce task, the JVM heap size (for map and reduce tasks), and the amount of virtual memory each task gets.
Unlike Hadoop MR1, you do not pre-configure slots (for example, by RAM size) before you even begin executing MapReduce tasks, in the sense that you want your resource allocation to be as elastic as possible, i.e., to dynamically increase RAM/CPU cores for either a map or a reduce task.
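The knobs just mentioned correspond to standard properties; here is a sketch with example values only (real numbers depend on the node and the workload):

```xml
<!-- yarn-site.xml: example values only -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>  <!-- min RAM to start a YARN container -->
  <value>1024</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>      <!-- virtual memory allowed per MB of physical -->
  <value>2.1</value>
</property>

<!-- mapred-site.xml: example values only -->
<property>
  <name>mapreduce.map.memory.mb</name>               <!-- container RAM for a map task -->
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>               <!-- JVM heap inside that container -->
  <value>-Xmx1638m</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>            <!-- container RAM for a reduce task -->
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx3276m</value>
</property>
```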
Apache Hadoop is inspired by the Google MapReduce paper. The flow of MapReduce can be considered as two sets of SIMD (single instruction, multiple data) operations: one for the mappers, another for the reducers. The reducers consume the output of the mappers through predefined "keys". The essence of the MapReduce framework (and Hadoop) is to automatically partition the data, determine the number of partitions and parallel jobs, and manage distributed resources.
I have a general algorithm (not necessarily MapReducible) to be run in parallel. I am not implementing the algorithm itself the MapReduce way. Instead, the algorithm is just a single-machine Python/Java program. I want to run 64 copies of this program in parallel (assuming there is no concurrency issue in the program). That is, I am more interested in the computing resources of the Hadoop cluster than in the MapReduce framework. Is there any way I can use the Hadoop cluster in this old-fashioned way?
Another way of thinking about MapReduce is that Map does the transformation and Reduce does some sort of aggregation.
Hadoop also allows for a map-only job. This way it should be possible to run 64 copies of the program in parallel as map tasks.
Hadoop has the concept of slots. By default there are 2 map and 2 reduce slots per node/machine. So, for 64 processes in parallel, 32 nodes are required. If the nodes have a higher-end configuration, the number of map/reduce slots per node can also be bumped up.
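A rough sketch of such a map-only launcher (class names are made up, and the actual invocation of the external program is only indicated by a comment): with zero reduce tasks the reduce phase is skipped entirely, and NLineInputFormat hands each map task one line of a task-list file, so a 64-line file yields 64 parallel map tasks.

```java
// Map-only sketch: no reduce phase, one map task per line of the task-list file.
// The external single-machine program would be launched inside map(), e.g. with
// ProcessBuilder; that part is left as a comment.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {

    public static class LaunchMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text taskLine, Context context)
                throws IOException, InterruptedException {
            // Launch your single-machine Java/Python program here, using taskLine
            // (one line of the task-list file) as its argument.
            context.write(new Text(taskLine), new Text("done"));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt(NLineInputFormat.LINES_PER_MAP, 1);  // one map task per input line

        Job job = Job.getInstance(conf, "map-only-launcher");
        job.setJarByClass(MapOnlyDriver.class);
        job.setInputFormatClass(NLineInputFormat.class);
        job.setMapperClass(LaunchMapper.class);
        job.setNumReduceTasks(0);                        // map-only: skip the reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));  // file listing the 64 tasks
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
```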