Limiting a MapReduce job to use fewer resources - hadoop

I am using Hadoop 2.5.0-cdh5.3.3. I created a MapReduce job that results in a large number of map tasks, and hence it consumes all of the cluster's resources. How can I restrict it to use only a limited amount of resources?

You can either increase the block size (e.g. from the default 64 MB) to a higher value,
or
you can set the parameter "mapred.map.tasks" as a hint for a lower number of map tasks for the job.
Note: this will affect performance, since with larger blocks each map task has to process more data.
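For illustration, here is a minimal driver sketch (assuming a FileInputFormat-based job using the new mapreduce API; the 256 MB split size and the class name are just examples) that reduces the number of map tasks by forcing larger input splits:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class LimitedJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hint for the number of map tasks; Hadoop may still create one task per input split.
    // ("mapreduce.job.maps" is the newer name for this deprecated key.)
    conf.setInt("mapred.map.tasks", 10);
    Job job = Job.getInstance(conf, "resource-limited-job");
    // Force larger input splits so fewer map tasks are created (256 MB here is illustrative).
    FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
    // set mapper/reducer classes and input/output paths as usual, then:
    // job.waitForCompletion(true);
  }
}

Fewer, larger map tasks keep the job from grabbing every available slot at once, at the cost of longer individual tasks.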

Related

Understanding number of map and reduce tasks in Hadoop MapReduce

Assume that a node in a Hadoop cluster has 8 GB of memory available.
If the task tracker and data node daemons consume 2 GB and the memory required for each task is 200 MB, how many map and reduce tasks can be started?
8 GB - 2 GB = 6 GB
6144 MB / 200 MB = 30.72
So, about 30 map and reduce tasks will be started in total.
Am I right or am I missing something?
The number of mappers and reducers is not determined by the resources available. You have to set the number of reducers in your code by calling setNumReduceTasks().
For the number of mappers, it is more complicated, as they are set by Hadoop. By default, there is roughly one map task per input split. You can tweak that by changing the default block size, the record reader, or the number of input files.
You should also set, in the Hadoop configuration files, the maximum number of map tasks and reduce tasks that run concurrently, as well as the memory allocated to each task. Those last two configurations are the ones that are based on the available resources. Keep in mind that map and reduce tasks run on your CPUs, so you are practically restricted by the number of available cores (one core cannot run two tasks at the same time).
This guide may help you with more details.
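As a minimal sketch of the reducer side of this (the class name is hypothetical, and the value 4 is just an example), the reducer count is set explicitly in the driver while the mapper count falls out of the input splits:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "example-job");
    // The number of reducers is whatever you ask for.
    job.setNumReduceTasks(4);
    // There is no setNumMapTasks() on the new-API Job: the number of map tasks
    // is derived from the input splits (roughly one per HDFS block).
    // set mapper/reducer classes and input/output paths, then job.waitForCompletion(true);
  }
}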
The number of concurrent tasks is not decided just based on the memory available on a node; it depends on the number of cores as well. If your node has 8 vcores and each of your tasks takes 1 core, then only 8 tasks can run at a time.
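Putting the two answers together for the numbers in the question, a back-of-the-envelope sketch (pure arithmetic, assuming 8 GB per node, 2 GB of daemon overhead, 200 MB per task and 8 vcores):

public class TaskCapacityEstimate {
  public static void main(String[] args) {
    long nodeMemoryMb = 8 * 1024;   // total memory on the node
    long daemonMemoryMb = 2 * 1024; // task tracker + data node overhead
    long taskMemoryMb = 200;        // memory required per map/reduce task
    int vcores = 8;                 // cores available on the node (assumption)

    long byMemory = (nodeMemoryMb - daemonMemoryMb) / taskMemoryMb; // ~30 tasks
    // The real limit is the smaller of the memory-based and core-based bounds.
    long concurrentTasks = Math.min(byMemory, vcores);              // 8 tasks
    System.out.println("Concurrent tasks per node: " + concurrentTasks);
  }
}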

Controlling reducer shuffle merge memory in Hadoop 2

I want to understand how memory is used in the reduce phase of a MapReduce job, so I can control the settings in a deliberate way.
If I understand correctly, the reducer first fetches its map outputs and keeps them in memory up to a certain threshold. The settings that control this are:
mapreduce.reduce.shuffle.merge.percent: The usage threshold at which an in-memory merge will be initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapreduce.reduce.shuffle.input.buffer.percent.
mapreduce.reduce.input.buffer.percent: The percentage of memory, relative to the maximum heap size, to retain map outputs during the reduce. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin.
Next, these spilled blocks are merged. It seems the following option controls how much memory is used for the shuffle:
mapreduce.reduce.shuffle.input.buffer.percent: The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.
But then, there is the setting:
mapreduce.reduce.shuffle.memory.limit.percent: Maximum percentage of the in-memory limit that a single shuffle can consume.
But it is not clear to what value this percentage applies. Is there more information available regarding these values, i.e. what they control and how they differ?
Finally, after the merge completes, the reduce process is run on the inputs. In the Hadoop book [1], I found that the final merge step directly feeds the reducers. However, the default value of mapreduce.reduce.input.buffer.percent=0 contradicts this, indicating that everything is spilled to disk BEFORE the reducers start. Is there any reference indicating which of these explanations is correct?
[1]: Hadoop: The Definitive Guide, Fourth Edition, p. 200
Here is how mapreduce.reduce.shuffle.memory.limit.percent is used: its percentage applies to the shuffle buffer, which is itself mapreduce.reduce.shuffle.input.buffer.percent (0.70 by default) of the whole reducer heap. The result is the maximum number of bytes up to which the data of a single shuffle (one map output) can be kept in memory.
maxSingleShuffleLimit = (long)(maxSize * MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
// MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION = mapreduce.reduce.shuffle.memory.limit.percent (0.25f)

maxSize = (int)(conf.getInt("mapred.job.reduce.total.mem.bytes",
        (int) Math.min(Runtime.getRuntime().maxMemory(), Integer.MAX_VALUE))
        * maxInMemCopyUse);
// maxInMemCopyUse = mapred.job.shuffle.input.buffer.percent (0.70f)
This limit is used in the copy phase of the reducer: if a map output is larger than maxSingleShuffleLimit, the data is shuffled to disk, else it is kept in memory.
The property mapreduce.reduce.input.buffer.percent is completely different.
Once all the data has been copied and all the merges are done, just before the reducer starts, it checks whether the map output retained in memory exceeds this limit.
You could refer to this code (it is for the old mapred API, but it should still give an insight) to see how maxSingleShuffleLimit and the other properties are used.
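To make the interplay concrete, here is a small sketch (the values are only the documented defaults plus an assumed 1 GB reducer heap, not a tuning recommendation) that sets the shuffle-related knobs and prints the resulting single-shuffle limit:

import org.apache.hadoop.conf.Configuration;

public class ShuffleMemorySketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f); // share of the heap used as shuffle buffer
    conf.setFloat("mapreduce.reduce.shuffle.memory.limit.percent", 0.25f); // max share of that buffer for a single map output
    conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);        // buffer usage that triggers an in-memory merge
    conf.setFloat("mapreduce.reduce.input.buffer.percent", 0.0f);          // map output allowed to stay in memory when reduce starts

    long reducerHeapBytes = 1024L * 1024 * 1024; // assume a 1 GB reducer heap for illustration
    long shuffleBuffer = (long) (reducerHeapBytes
        * conf.getFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f));
    long maxSingleShuffleLimit = (long) (shuffleBuffer
        * conf.getFloat("mapreduce.reduce.shuffle.memory.limit.percent", 0.25f));
    // 1 GB * 0.70 * 0.25 ≈ 179 MB: a single map output larger than this is written straight to disk.
    System.out.println("maxSingleShuffleLimit = " + maxSingleShuffleLimit + " bytes");
  }
}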

Map Reduce Slot Definition

I am on my way to becoming a Cloudera Hadoop administrator. Since I started, I have been hearing a lot about computing slots per machine in a Hadoop cluster, such as defining the number of map slots and reduce slots.
I have searched the internet for a long time for a beginner's definition of a MapReduce slot, but didn't find any.
I am really pissed off from going through PDFs that merely explain the configuration of MapReduce.
Please explain what exactly a computing slot means on a machine in a cluster.
In MapReduce v1, mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum are used in mapred-site.xml to configure the number of map slots and reduce slots, respectively.
Starting from MapReduce v2 (YARN), the more generic term "container" is used instead of slots; the number of containers represents the maximum number of tasks that can run in parallel on the node, regardless of whether they are map tasks, reduce tasks, or the application master (in YARN).
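If you are on YARN and want to see what each node can actually hold, a sketch along these lines (using the YarnClient API; the output format is only illustrative) lists the per-node container capability:

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NodeCapacityReport {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();
    // One report per running NodeManager, with the resources it can hand out as containers.
    for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
      System.out.println(node.getNodeId() + " -> "
          + node.getCapability().getMemory() + " MB, "
          + node.getCapability().getVirtualCores() + " vcores");
    }
    yarn.stop();
  }
}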
Generally it depends on CPU and memory.
In our cluster, we set 20 map slots and 15 reduce slots for a machine with 32 cores and 64 GB of memory.
1. Approximately one slot needs one CPU core.
2. The number of map slots should be a little higher than the number of reduce slots.
In MRv1, each machine had a fixed number of slots dedicated to maps and reduces.
In general, each machine is configured with a 4:1 ratio of map slots to reduce slots, since logically you read a lot of data (maps) and crunch it down to a small set (reduce).
In MRv2, the concept of containers came in, and any container can run either a map, a reduce, or a shell script.
A bit late, but I'll answer anyway.
Computing slot: think of all the various computations in Hadoop that require some resource, i.e. memory, CPU cores, or disk space.
Resource = memory, CPU core, or disk space required.
Examples are allocating resources to start a container, or allocating resources to perform a map or a reduce task.
It is all about how you want to manage the resources you have in hand: RAM, cores, and disk space.
The goal is to ensure your processing is not constrained by any one of these cluster resources; you want your processing to be as dynamic as possible.
As an example, Hadoop YARN allows you to configure the minimum RAM required to start a YARN container, the minimum RAM required to start a map/reduce task, the JVM heap size (for map and reduce tasks), and the amount of virtual memory each task gets.
Unlike Hadoop MRv1, you do not pre-configure fixed slots (for example, a fixed RAM size per slot) before you even begin executing MapReduce tasks. In that sense you want your resource allocation to be as elastic as possible, i.e. to be able to increase RAM/CPU cores dynamically for either a map or a reduce task.
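For instance, the per-job memory settings mentioned above can be set on the job configuration; a hedged sketch (the numbers are only examples, and cluster-side minimums such as yarn.scheduler.minimum-allocation-mb live in yarn-site.xml, not here):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnMemorySketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Container size requested for each map/reduce task (example values).
    conf.setInt("mapreduce.map.memory.mb", 2048);
    conf.setInt("mapreduce.reduce.memory.mb", 4096);
    // JVM heap inside those containers, commonly around 80% of the container size.
    conf.set("mapreduce.map.java.opts", "-Xmx1638m");
    conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");
    // Virtual memory is capped by yarn.nodemanager.vmem-pmem-ratio (a cluster-side setting).
    Job job = Job.getInstance(conf, "memory-tuned-job");
    // set mapper/reducer classes and input/output paths, then job.waitForCompletion(true);
  }
}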

Increase the number of map and reduce functions

I have a question.
I want to increase the number of my map and reduce functions to match the size of my input data. When I execute System.out.println(conf.getNumReduceTasks()) and System.out.println(conf.getNumMapTasks()), it shows me:
1 1
And when I execute conf.setNumReduceTasks(1000000) and conf.setNumMapTasks(1000000) and run the println statements again, it shows me:
1000000 1000000
But there seems to be no change in my MapReduce program's execution time. My input comes from Cassandra; it is a Cassandra column family with about 362,000 rows.
I want to set the number of my map and reduce functions to the number of input rows.
What should I do?
Setting the number of map/reduce tasks for your map/reduce job does define how many map/reduce processes will be used to process it. Consider whether you really need that many Java processes.
That said, the number of map tasks is mostly determined automatically; setting the number of map tasks is only a hint that can increase the number of maps that were determined by Hadoop.
For reduce tasks, the default is 1 and the practical limit is around 1,000.
See: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
It's also important to understand that each node of your cluster also has a maximum number of map/reduce tasks that can execute concurrently. This is set by the following configuration settings:
mapred.tasktracker.map.tasks.maximum
and
mapred.tasktracker.reduce.tasks.maximum
The default for both of these is 2.
So increasing the number of map/reduce tasks will be limited to the number of tasks that can run simultaneously per node. This may be one reason you aren't seeing a change in execution time for your job.
See: http://hadoop.apache.org/docs/stable/mapred-default.html
The summary is:
Let Hadoop determine the number of maps, unless you want more map tasks.
Use the mapred.tasktracker.{map|reduce}.tasks.maximum settings to control how many tasks can run at one time on each node.
The maximum number of reduce tasks should be somewhere between 1 and 2 times (mapred.tasktracker.reduce.tasks.maximum * #nodes). You also have to take into account how many map/reduce jobs you expect to run at once, so that a single job doesn't consume all the available reduce slots.
A value of 1,000,000 is almost certainly too high for either setting; it's not practical to run that many Java processes. I expect that such high values are simply being ignored.
After setting mapred.tasktracker.{map|reduce}.tasks.maximum to the number of tasks your nodes are able to run simultaneously, try increasing your job's map/reduce tasks incrementally.
You can see the actual number of tasks used by your job in the job.xml file to verify your settings.
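As a rough sketch of that guidance (pure arithmetic; the node count is an assumption you would replace with your cluster's actual size):

public class ReducerCountEstimate {
  public static void main(String[] args) {
    int nodes = 10;              // number of task tracker nodes (assumption)
    int reduceSlotsPerNode = 2;  // mapred.tasktracker.reduce.tasks.maximum (default 2)
    int clusterReduceSlots = nodes * reduceSlotsPerNode;

    // Guideline from the answer above: between 1x and 2x the cluster's reduce slot capacity.
    int lowEnd = clusterReduceSlots;
    int highEnd = 2 * clusterReduceSlots;
    System.out.println("Reasonable reduce task count: " + lowEnd + " to " + highEnd);
    // Then, in the driver: job.setNumReduceTasks(chosenValue);
  }
}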

Pseudo distributed: Need to change number of mapper nodes

I am using an Intel(R) Core(TM)2 Duo processor. I have installed Hadoop in pseudo-distributed mode. I have written a program which needs 50 map tasks. Is it possible to have 50 map tasks running in pseudo-distributed mode, or will I be limited to 4 (2 * number of cores)? I have tried setting "mapred.tasktracker.map.tasks.maximum" to 50, but there is no change in concurrency.
The maximum number of map and reduce tasks depends on the number of task trackers in your cluster and the maximum number of map/reduce tasks per node defined using the properties mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum.
I assume your MapReduce job needs 50 map tasks with the default block size configuration. The number of map tasks needed for a job depends on the number of InputSplits for the processed data. You definitely should not depend on the number of needed map tasks or hard-code this limit in your program, as that would impact the scaling of your MapReduce job.
One option would be to set the maximum number of map tasks to 50. The number of available map task slots should be visible in the cluster summary section of the JobTracker web UI. However, as your processor has only two cores, you should reconsider whether launching 50 mappers concurrently will have any positive impact on the performance of your MapReduce job.
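If you want to verify from code (rather than the web UI) whether the new maximum took effect, a sketch using the old-API JobClient could look like this; note that mapred.tasktracker.map.tasks.maximum has to be set in mapred-site.xml and the TaskTracker restarted, since setting it in a job configuration has no effect:

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SlotCapacityCheck {
  public static void main(String[] args) throws Exception {
    JobClient client = new JobClient(new JobConf());
    ClusterStatus status = client.getClusterStatus();
    // Cluster-wide slot capacity as the JobTracker sees it (a single node in pseudo-distributed mode).
    System.out.println("Task trackers:    " + status.getTaskTrackers());
    System.out.println("Max map tasks:    " + status.getMaxMapTasks());
    System.out.println("Max reduce tasks: " + status.getMaxReduceTasks());
    client.close();
  }
}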
