Understanding number of map and reduce tasks in Hadoop MapReduce

Assume that a node in a Hadoop cluster has 8 GB of memory available.
If the TaskTracker and DataNode consume 2 GB, and each task requires 200 MB, how many map and reduce tasks can be started?
8 GB - 2 GB = 6 GB
So, 6144 MB / 200 MB = 30.72
So, 30 map and reduce tasks in total will be started.
Am I right, or am I missing something?

The number of mappers and reducers is not determined by the available resources. You have to set the number of reducers in your code by calling setNumReduceTasks().
The number of mappers is more complicated, because it is set by Hadoop. By default, there is roughly one map task per input split. You can tweak that by changing the default block size, the record reader, or the number of input files.
You should also set, in the Hadoop configuration files, the maximum number of map and reduce tasks that run concurrently, as well as the memory allocated to each task. Those last two settings are the ones based on the available resources. Keep in mind that map and reduce tasks run on your CPUs, so you are practically restricted by the number of available cores (one core cannot run two tasks at the same time).
This guide may help you with more details.
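For concreteness, here is a minimal sketch of a driver under those assumptions (the class name, input/output paths, and the reducer count of 4 are illustrative, not taken from the question): the reducer count is fixed explicitly in code, while the mapper count follows from the input splits.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyJobDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "my job");
            job.setJarByClass(MyJobDriver.class);
            // job.setMapperClass(...) and job.setReducerClass(...) would go here.
            // The number of reducers is set explicitly by the driver:
            job.setNumReduceTasks(4);
            // The number of mappers is NOT set here; it follows from the number
            // of input splits (roughly one map task per HDFS block by default).
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }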

The number of concurrent tasks is not decided based only on the memory available on a node; it depends on the number of cores as well. If your node has 8 vcores and each of your tasks takes 1 core, then only 8 tasks can run at a time.
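As a back-of-the-envelope illustration of this point, using the numbers from the question above (nothing here is computed by Hadoop itself; it is just the arithmetic):

    public class ConcurrentTaskEstimate {
        public static void main(String[] args) {
            long availableMemoryMb = 8 * 1024 - 2 * 1024; // 8 GB node minus ~2 GB for the daemons
            long taskMemoryMb = 200;                      // memory per task, from the question
            int availableCores = 8;                       // vcores on the node
            int coresPerTask = 1;

            long memoryBound = availableMemoryMb / taskMemoryMb; // 6144 / 200 = 30
            long coreBound = availableCores / coresPerTask;      // 8 / 1 = 8

            // The node can run only as many concurrent tasks as BOTH limits allow.
            System.out.println("Concurrent tasks: " + Math.min(memoryBound, coreBound)); // 8
        }
    }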

Related

Limiting a MapReduce job to use fewer resources

I am using Hadoop 2.5.0-cdh5.3.3. I created a MapReduce job that results in a large number of map tasks, so it consumes all the resources. How can I restrict it to use only a limited amount of resources?
You can either increase the block size from the default 64 MB to a higher value,
or
you can set the parameter "mapred.map.tasks" to suggest a lower number of map tasks (the framework treats it as a hint).
Note: this will affect performance, since more blocks will be allocated to each TaskTracker node.
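A hedged sketch of both options as they might look in a job driver (the class name, split size, and map-task hint are placeholder values; mapred.map.tasks is only a hint to the framework, not a hard limit):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class LimitedMapsDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Option 2 from the answer: hint a lower map task count (not a hard limit).
            conf.setInt("mapred.map.tasks", 10);
            Job job = Job.getInstance(conf, "resource-limited job");
            // Per-job alternative to raising the HDFS block size: force larger
            // input splits so fewer map tasks are created.
            FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // 256 MB splits
            // ... mapper/reducer classes, input/output paths, and submission go here ...
        }
    }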

How does Hadoop decide how many nodes will perform the Map and Reduce tasks?

I'm new to Hadoop and I'm trying to understand it. I'm talking about Hadoop 2. When I have an input file on which I want to run a MapReduce job, I set the split parameter in the MapReduce program, so it will create as many map tasks as there are splits, right?
The resource manager knows where the files are and will send the tasks to the nodes that hold the data, but who decides how many nodes will run the tasks? After the maps are done there is the shuffle; which node runs a given reduce task is decided by the partitioner, which hashes the keys, right? How many nodes will run reduce tasks? Will the nodes that ran map tasks also run reduce tasks?
Thank you.
TLDR: If I have a cluster and I run a MapReduce job, how does Hadoop decide how many nodes will do map tasks, and then which nodes will do the reduce tasks?
How Many Maps?
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set up to 300 maps for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
If you have 10 TB of input data and a block size of 128 MB, you'll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
How Many Reduces?
The right number of reduces seems to be 0.95 or 1.75 multiplied by ( < no. of nodes > * < no. of maximum containers per node > ).
With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.
Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.
Reducer NONE
It is legal to set the number of reduce tasks to zero if no reduction is desired.
Which nodes for Reduce tasks?
You can configure the number of mappers and reducers per node via configuration parameters like mapreduce.tasktracker.reduce.tasks.maximum.
If you set this parameter to zero, that node won't be considered for reduce tasks. Otherwise, all nodes in the cluster are eligible for reduce tasks.
Source: the MapReduce Tutorial from Apache.
Note: for a given job, you can set mapreduce.job.maps and mapreduce.job.reduces, but they may not be effective (the map count is only a hint). It is usually better to leave the decision on the number of map and reduce tasks to the MapReduce framework.
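As a rough sketch of that heuristic in driver code (the helper name and the node/container counts are illustrative; they are values you would know about your own cluster, not something Hadoop looks up for you):

    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCountHelper {
        static void configureReducers(Job job, int nodes, int containersPerNode) {
            // 0.95: all reduces can launch immediately and start pulling map output.
            // 1.75: faster nodes finish a first wave and run a second, balancing load better.
            int reduces = (int) (0.95 * nodes * containersPerNode);
            job.setNumReduceTasks(reduces);
            // Reducer NONE: job.setNumReduceTasks(0) would skip the reduce phase entirely.
        }
    }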
EDIT:
How to decide which Reducer node?
Assume that you have equal reduce slots available on two nodes N1 and N2 and the current load on N1 > N2; then the reduce task will be assigned to N2. If both the load and the number of slots are the same, whichever node sends the first heartbeat to the resource manager will get the task. This is the code block for reduce assignment: http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/mapred/JobQueueTaskScheduler.java#207
how does Hadoop decide how many nodes will do map tasks
By default, the number of mappers will be the same as the number of splits (blocks) of the input to the MapReduce job.
Now about the nodes: in Hadoop 2, each node runs its own NodeManager (NM). The job of the NM is to manage the application containers assigned to it by the ResourceManager (RM). So basically, each task runs in an individual container. To run the mapper tasks, the ApplicationMaster negotiates containers from the ResourceManager. Once the containers are allocated, the NodeManager launches the tasks and monitors them.
which nodes will do the reduce tasks?
Again, the reduce tasks also run in containers. The ApplicationMaster (one per application, i.e. per job) negotiates the containers from the RM and launches the reducer tasks. Mostly they run on different nodes than the mapper nodes.
The default number of reducers for any job is 1. The number of reducers can be set in the job configuration.
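To make the "one mapper per split" point concrete, here is a rough sketch of the arithmetic (the file size and block size are example numbers, not anything read from a real cluster):

    public class SplitCountEstimate {
        public static void main(String[] args) {
            long fileSizeBytes = 10L * 1024 * 1024 * 1024; // e.g. a 10 GB input file
            long blockSizeBytes = 128L * 1024 * 1024;      // 128 MB HDFS block size
            // With the default FileInputFormat, split size ~= block size, so the
            // framework schedules roughly one map task per block.
            long expectedMappers = (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
            System.out.println("Expected map tasks: " + expectedMappers); // 80
            // Reducers default to 1 unless set explicitly via job.setNumReduceTasks(n).
        }
    }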

Map Reduce Slot Definition

I am on my way to becoming a Cloudera Hadoop administrator. Since I started, I have been hearing a lot about computing slots per machine in a Hadoop cluster, such as defining the number of map slots and reduce slots.
I have searched the internet for a long time for a beginner's definition of a MapReduce slot but didn't find any.
I am really pissed off by going through PDFs explaining the configuration of MapReduce.
Please explain what exactly a computing slot means for a machine in a cluster.
In MapReduce v1, mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum are used to configure the number of map slots and reduce slots, respectively, in mapred-site.xml.
Starting from MapReduce v2 (YARN), "containers" is the more generic term used instead of slots; the container count represents the maximum number of tasks that can run in parallel on the node, regardless of whether they are map tasks, reduce tasks, or the ApplicationMaster (in YARN).
Generally it depends on CPU and memory.
In our cluster, we set 20 map slots and 15 reduce slots for a machine with 32 cores and 64 GB of memory.
1. Approximately one slot needs one CPU core.
2. The number of map slots should be a little higher than the number of reduce slots.
In MRv1, each machine had a fixed number of slots dedicated to maps and reduces.
In general, each machine is configured with a 4:1 ratio of map slots to reduce slots,
since logically one would be reading a lot of data (maps) and crunching it down to a small set (reduce).
In MRv2, the concept of containers came in, and any container can run either a map task, a reduce task, or a shell script.
A bit late, but I'll answer anyway.
Computing slot: think of all the various computations in Hadoop that require some resource, i.e. memory, CPU cores, or disk space.
Resource = memory, CPU cores, or disk space required.
Examples: allocating resources to start a container, allocating resources to perform a map or a reduce task, etc.
It is all about how you want to manage the resources you have in hand. And what would those be? RAM, cores, and disk space.
The goal is to ensure your processing is not constrained by any one of these cluster resources; you want your processing to be as dynamic as possible.
As an example, Hadoop YARN allows you to configure the minimum RAM required to start a YARN container, the minimum RAM required to start a map/reduce task, the JVM heap size (for map and reduce tasks), and the amount of virtual memory each task gets.
Unlike Hadoop MR1, you do not pre-configure these (for example, the RAM size) before you even begin executing map and reduce tasks. In that sense, you want your resource allocation to be as elastic as possible, i.e. dynamically increasing RAM/CPU cores for either a map or a reduce task.
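As an illustrative per-job sketch of the memory-related knobs that answer refers to (the sizes are example values, and cluster-level minimums such as yarn.scheduler.minimum-allocation-mb are normally set in yarn-site.xml rather than in job code):

    import org.apache.hadoop.conf.Configuration;

    public class MemorySettingsSketch {
        public static Configuration build() {
            Configuration conf = new Configuration();
            // Container memory requested for each map / reduce task (MB).
            conf.setInt("mapreduce.map.memory.mb", 1536);
            conf.setInt("mapreduce.reduce.memory.mb", 3072);
            // JVM heap inside those containers, kept below the container size.
            conf.set("mapreduce.map.java.opts", "-Xmx1229m");
            conf.set("mapreduce.reduce.java.opts", "-Xmx2458m");
            return conf;
        }
    }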

Pseudo-distributed: need to change number of mapper nodes

I am using an Intel(R) Core(TM)2 Duo processor. I have installed Hadoop in pseudo-distributed mode. I have written a program which needs 50 mappers. Is it possible to have 50 mappers running in pseudo-distributed mode, or will I be limited to 4 (2 * number of cores)? I have tried setting "mapred.tasktracker.map.tasks.maximum" to 50, but there is no change in concurrency.
The maximum number of concurrent map and reduce tasks depends on the number of TaskTrackers in your cluster and the maximum number of map/reduce tasks per node, defined using the properties mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum.
I assume your MapReduce job needs 50 map tasks with the default block size configuration. The number of map tasks needed for a job depends on the number of InputSplits for the processed data. You definitely should not depend on the number of needed map tasks or hard-code such a limit in your program; this would impact the scaling of your MapReduce job.
One option would be to set the maximum number of mapper tasks to 50. The number of available map task slots should be visible in the cluster summary section of the JobTracker web UI. However, as your processor has only two cores, you should reconsider whether launching 50 mappers concurrently will have any positive impact on the performance of your MapReduce job.

Hadoop slowstart configuration

What's an ideal value for "mapred.reduce.slowstart.completed.maps" for a Hadoop job? What are the rules to follow to set it appropriately?
Thanks!
It depends on a number of characteristics of your job, cluster and utilization:
How many map slots will your job require vs. the maximum map capacity: if you have a job that spawns thousands of map tasks but you only have 10 map slots in total (an extreme case to demonstrate the point), then starting your reducers early could deprive other reduce tasks of the chance to execute. In this case I would set your slowstart to a large value (0.999 or 1.0). This is also true if your mappers take an age to complete - let someone else use the reducers.
If your cluster is relatively lightly loaded (there isn't contention for the reducer slots) and your mappers output a good volume of data, then a low value for slowstart will help your job finish earlier (the map output data gets moved to the reducers while the remaining map tasks execute).
There are probably more
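For reference, this is one way the property from the question can be set per job (0.8 is just an example value; in Hadoop 2 the same setting also goes by the newer name mapreduce.job.reduce.slowstart.completedmaps):

    import org.apache.hadoop.conf.Configuration;

    public class SlowstartSketch {
        public static Configuration build() {
            Configuration conf = new Configuration();
            // Fraction of map tasks that must complete before reducers are scheduled.
            // Use a value close to 1.0 on a busy cluster (don't hold reduce slots while
            // maps run); use a lower value on a lightly loaded cluster with heavy map
            // output, so shuffling starts early.
            conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.8f);
            return conf;
        }
    }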
