How to set executor number in MapReduce? - hadoop

In Spark, we can set the number of executors.
In MapReduce, how do I set the number of executors? Not the number of map or reduce tasks, but the number of executors.
I know how to set the vcores and memory each map or reduce task uses.
But there are so many map tasks, and I don't want my MR job to use too much of the cluster's resources.

The number of mappers depends on the number of splits for your input data, which in turn depends on the InputFormat. A user can hint at the number of mappers via mapreduce.job.maps, but the InputFormat may choose to ignore that hint. The number of reducers is configurable via mapreduce.job.reduces.
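To make the "mappers depend on splits" point concrete, here is a sketch of the split-size arithmetic that FileInputFormat uses (the property names are real Hadoop keys; the simplified math below is an illustration of the formula, not Hadoop's actual code path, and the sizes are made-up example values):

```java
// Sketch: splitSize = max(minSize, min(maxSize, blockSize)),
// and one map task is spawned per split.
public class SplitMath {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // 128 MB HDFS block
        long minSize   = 1L;                  // mapreduce.input.fileinputformat.split.minsize
        long maxSize   = Long.MAX_VALUE;      // mapreduce.input.fileinputformat.split.maxsize
        long fileSize  = 1024L * 1024 * 1024; // a hypothetical 1 GB input file

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        long numSplits = (fileSize + splitSize - 1) / splitSize; // ceiling division
        System.out.println(splitSize); // 134217728
        System.out.println(numSplits); // 8 -> roughly 8 map tasks
    }
}
```

This is why mapreduce.job.maps is only a hint: the split computation, not the property, ultimately decides the mapper count.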

Related

In hadoop, 1 reduce or number of reduces = number of mappers

In Hadoop, what is the difference between using n mappers and n reducers, versus n mappers and 1 reducer?
In the case of 1 reducer, which computer runs the reduce phase if I have 3 computers (mappers)?
The number of mappers is controlled by the amount of data being processed. Reducers are controlled either by the developer or different system parameters.
To override the number of reducers:
set mapreduce.job.reduces=#;
or, if it is a Hive job and you want finer control over how much work each reducer does, you can tweak parameters such as:
hive.exec.reducers.bytes.per.reducer
You can still override with mapreduce.job.reduces; using bytes-per-reducer just lets you control how much data each reducer processes.
As for controlling where the reducers run, you really cannot, except by using Node Labels, and that would mean controlling where all of the tasks in the job run, not just the reducers.

number of mapper and reducer tasks in MapReduce

If I set the number of reduce tasks to something like 100 and, when I run the job, the number of reduce tasks exceeds that, how will MapReduce handle it? (As per my understanding, the number of reduce tasks depends on the key-value pairs we get from the mapper. Suppose I emit (1,abc) and (2,bcd) as key-value pairs in the mapper; then the number of reduce tasks will be 2.)
as per my understanding the number of reduce tasks depends on the key-value we get from the mapper
Your understanding seems to be wrong. The number of reduce tasks does not depend on the key-value pairs emitted by the mapper.
In a MapReduce job the number of reducers is configurable on a per job basis and is set in the driver class.
For example, if we need 2 reducers for our job, we set it in the driver class of our MapReduce job as below:
job.setNumReduceTasks(2);
In the Hadoop: The Definitive Guide book, Tom White notes that setting the number of reducers is more of an art than a science.
So we have to decide how many reducers our job needs. For your example, if the intermediate mapper output is (1,abc) and (2,bcd) and you have not set the number of reducers in the driver class, then MapReduce runs only 1 reducer by default; both key-value pairs are processed by that single reducer, and you get a single output file in the specified output directory.
The default number of reducers in MapReduce is 1, irrespective of the number of (key,value) pairs.
If you set the number of reducers for a MapReduce job, the number of reducers will not exceed the defined value, irrespective of how many distinct (key,value) pairs there are.
Once the map tasks are completed, the output is processed by the Partitioner, which divides the data among the reducers. The default partitioner in Hadoop is HashPartitioner, which partitions the data based on the hash value of the key. Its getPartition method computes (key.hashCode() & Integer.MAX_VALUE) modulo the number of reduce tasks.
So the number of reducers will never exceed what you defined in the driver class.
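The getPartition arithmetic described above can be re-implemented in a few lines to see why no key can ever reach a reducer index beyond the configured count (this is an illustrative re-implementation of HashPartitioner's logic, not Hadoop's actual class):

```java
// (hash & Integer.MAX_VALUE) clears the sign bit, so the modulo result
// always lands in [0, numReduceTasks).
public class PartitionDemo {
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 2;
        for (String key : new String[] {"abc", "bcd", "cde", "def"}) {
            // Every key maps to reducer 0 or 1, never to a third reducer.
            System.out.println(key + " -> reducer " + getPartition(key, reducers));
        }
    }
}
```

However many distinct keys the mappers emit, the modulo folds them onto exactly numReduceTasks partitions.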

Hadoop Basics:Number of map tasks mappers reduce tasks reducers

What is the difference between a mapper and a map task?
Similarly, a reducer and a reduce task?
Also, how are the numbers of mappers, map tasks, reducers, and reduce tasks determined during the execution of a MapReduce job?
Describe the interrelationships between them, if there are any.
Simply put, a map task is a running instance of the Mapper class. Mapper and Reducer are classes in a MapReduce job; map() and reduce() are their methods.
When we run a MapReduce job, the number of map tasks spawned depends on the number of blocks in the input (more precisely, on the input splits, which usually correspond to blocks). The number of reduce tasks, however, can be specified in the MapReduce driver code: either by setting the property mapred.reduce.tasks in the job configuration object, or by calling the org.apache.hadoop.mapreduce.Job#setNumReduceTasks(int reducerCount) method.
The old JobConf API had a setNumMapTasks() method, but it was removed in the new org.apache.hadoop.mapreduce.Job API, with the intention that the number of mappers should be calculated from the input splits.

Hadoop streaming api - limit number of mappers on a per job basis

I have a job running on a small Hadoop cluster, and I want to limit the number of mappers it spawns per datanode. When I use -Dmapred.map.tasks=12, it still spawns 17 mappers for some reason. I've figured out how to limit it globally, but I want to do it on a per-job basis.
In MapReduce, the total number of mappers spawned depends on the input splits created from your data.
One map task is spawned per input split, so you cannot directly decrease the mapper count in MapReduce.
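Although the mapper count cannot be set directly, it can be reduced indirectly by making the splits larger, typically by raising mapreduce.input.fileinputformat.split.minsize. A worked example under assumed sizes (pure arithmetic, not Hadoop code; the ~2.1 GB file size is chosen to reproduce the 17 mappers from the question):

```java
// Raising the minimum split size merges blocks into fewer, larger splits,
// which caps the number of mappers for a single job.
public class FewerMappers {
    static long splits(long fileSize, long blockSize, long minSize) {
        long splitSize = Math.max(minSize, blockSize); // max split size left unbounded
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long fileSize  = 2176L * 1024 * 1024; // hypothetical input: 17 blocks of 128 MB
        long blockSize = 128L * 1024 * 1024;

        System.out.println(splits(fileSize, blockSize, 1L));                 // 17 mappers
        System.out.println(splits(fileSize, blockSize, 256L * 1024 * 1024)); // 9 mappers
    }
}
```

This is per-job: the property can be passed on the command line with -D without touching the cluster-wide configuration.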

Hadoop slowstart configuration

What's an ideal value for "mapred.reduce.slowstart.completed.maps" for a Hadoop job? What are the rules to follow to set it appropriately?
Thanks!
It depends on a number of characteristics of your job, cluster and utilization:
How many map slots your job requires vs. the cluster's maximum map capacity: if you have a job that spawns thousands of map tasks but only 10 map slots in total (an extreme case to demonstrate the point), then starting your reducers early could deprive other reduce tasks of the chance to execute. In this case I would set slowstart to a large value (0.999 or 1.0). This is also true if your mappers take an age to complete: let someone else use the reducers.
If your cluster is relatively lightly loaded (there is no contention for reducer slots) and your mappers output a good volume of data, then a low slowstart value will help your job finish earlier (the map output is shuffled to the reducers while the remaining map tasks are still executing).
There are probably more factors to consider.
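For reference, a hedged sketch of setting slowstart per job in a driver (assumes the Hadoop client libraries on the classpath; mapred.reduce.slowstart.completed.maps is the old property name used in the question, and mapreduce.job.reduce.slowstart.completedmaps is the YARN-era name):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SlowstartExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Heavily loaded cluster: start reducers only after all maps finish.
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 1.0f);
        Job job = Job.getInstance(conf, "slowstart-demo");
        // ... set mapper/reducer classes, input/output paths, then submit
    }
}
```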
