Hadoop streaming API - limit number of mappers on a per-job basis

I have a job running on a small Hadoop cluster, and I want to limit the number of mappers it spawns per datanode. When I use -Dmapred.map.tasks=12, it still spawns 17 mappers for some reason. I've figured out a way to limit it globally, but I want to do it on a per-job basis.

In MapReduce, the total number of mappers spawned depends on the input splits created from your data.
There is one map task spawned per input split, so you cannot decrease the mapper count directly in MapReduce; the only way to change it is to change how the input is split.
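If the goal is simply fewer map tasks for a streaming job, one indirect lever is the minimum split size, which makes each split (and therefore each mapper) cover more data. A rough sketch, with hypothetical paths, a hypothetical 256 MB minimum split, and a streaming jar location that depends on your distribution (the property is mapreduce.input.fileinputformat.split.minsize in Hadoop 2; older releases call it mapred.min.split.size):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapreduce.input.fileinputformat.split.minsize=268435456 \
  -input /data/input \
  -output /data/output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc

With 64 MB blocks, a 256 MB minimum split folds roughly four blocks into one split, so you end up with roughly a quarter of the mappers.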

Related

In Hadoop, 1 reducer or number of reducers = number of mappers

In Hadoop, what is the difference between using n mappers and n reducers versus n mappers and 1 reducer?
In the case of 1 reducer, which computer runs the reduce phase if I have 3 computers (the ones that ran the mappers)?
The number of mappers is controlled by the amount of data being processed. Reducers are controlled either by the developer or different system parameters.
To override the number of reducers:
set mapreduce.job.reduces=#;
or, if it is a Hive job and you want finer control over how much work each reducer has to do, you can tweak parameters such as:
hive.exec.reducers.bytes.per.reducer.
You can still override the count with mapreduce.job.reduces; using bytes per reducer just lets you control how much data each reducer processes.
As for controlling where the reducers run, you really cannot, except by using Node Labels. And that would mean controlling where all of the tasks in the job run, not just the reducers.
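For instance (the numbers here are purely illustrative), in a Hive session you could either fix the reducer count or cap the amount of data each reducer receives:

set mapreduce.job.reduces=8;
-- or let Hive derive the count, e.g. roughly one reducer per 256 MB of input:
set hive.exec.reducers.bytes.per.reducer=268435456;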

Estimation Of Mappers for a cluster

I need some clarification regarding the estimation of mappers for a particular job in a Hadoop cluster. As I understand it, the number of mappers depends on the input splits taken for processing, but that applies when the input data already resides in HDFS. I also need clarification regarding the mappers and reducers triggered by a Sqoop job. Please find my questions below.
How is the mapper count estimated for a dedicated cluster: based on RAM, or based on the input splits/blocks? (General)
How is the mapper count estimated for a Sqoop job that retrieves data from an RDBMS into HDFS, based on the input size? (Sqoop)
What is meant by CPU cores, and how do they affect the number of mappers that can run in parallel? (General)
Thanks.
How is the mapper count estimated for a dedicated cluster: based on RAM, or based on the input splits/blocks? (General)
You are right. The number of mappers is usually based on the number of DFS blocks in the input.
How is the mapper count estimated for a Sqoop job that retrieves data from an RDBMS into HDFS, based on the input size? (Sqoop)
By default, Sqoop uses four tasks in parallel to import/export data.
You can change this with the -m <number of mappers> option.
Refer: Sqoop parallelism
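For example (connection string, credentials, table, and mapper count below are all hypothetical), an import that asks Sqoop to run 8 parallel map tasks:

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl -P \
  --table orders \
  --target-dir /data/orders \
  -m 8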
What is meant by CPU cores, and how do they affect the number of mappers that can run in parallel? (General)
CPU cores are processing units. In simple words, "the more cores, the better": with more cores, more work can be processed in parallel.
Example: if you have 4 cores, 4 mappers can run in parallel (theoretically!).
I need some clarification regarding the estimation of mappers for a particular job in a Hadoop cluster. As I understand it, the number of mappers depends on the input splits taken for processing, but that applies when the input data already resides in HDFS. I also need clarification regarding the mappers and reducers triggered by a Sqoop job. Please find my questions below.
How is the mapper count estimated for a dedicated cluster: based on RAM, or based on the input splits/blocks? (General)
Answer: No, it has nothing to do with the RAM size; it all depends on the number of input splits.
How is the mapper count estimated for a Sqoop job that retrieves data from an RDBMS into HDFS, based on the input size? (Sqoop)
Answer: By default, the number of mappers for a Sqoop job is 4. You can change the default with -m (1, 2, 3, 4, 5, ...) or the --num-mappers parameter, but you have to make sure that either your table has a primary key or you use the --split-by parameter; otherwise only one mapper can run and you have to explicitly say -m 1.
What is meant by CPU cores, and how do they affect the number of mappers that can run in parallel? (General)
Answer: a core in a CPU is a processing unit that can run a task, so a 4-core processor can run 4 tasks at a time. The number of cores does not participate in the MapReduce framework's calculation of the number of mappers, but if there are 4 cores and MapReduce calculates 12 mappers, then 4 mappers will run in parallel at a time and the rest will run after them.
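A sketch of both cases (table and column names are hypothetical): with --split-by, Sqoop can parallelize on a suitable column even without a primary key; with no usable split column, you fall back to a single mapper:

sqoop import --connect jdbc:mysql://dbhost/sales --table order_items \
  --split-by item_id --num-mappers 4 --target-dir /data/order_items

# no primary key and no --split-by, so force a single mapper
sqoop import --connect jdbc:mysql://dbhost/sales --table audit_log \
  -m 1 --target-dir /data/audit_log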

Hadoop map process

If there is a job that has only a map and no reduce, and all the data values to be processed are mapped to a single key, will the job be processed on only a single node?
No.
Basically, the number of nodes is determined by the number of mappers: 1 mapper runs on 1 node, N mappers on N nodes, one node per mapper.
The number of mappers needed for your job is set by Hadoop, depending on the amount of data and on the size of the blocks your data is split into. Each block of data is processed by 1 mapper.
So if, for instance, your data is split into N blocks, you will need N mappers to process it.
Directly from Hadoop: The Definitive Guide, Chapter 6, Anatomy of a MapReduce Job Run:
"To create the list of tasks to run, the job scheduler first retrieves
the input splits computed by the client from the shared filesystem. It
then creates one map task for each split. The number of reduce tasks
to create is determined by the mapred.reduce.tasks property in the
Job, which is set by the setNumReduceTasks() method, and the scheduler
simply creates this number of reduce tasks to be run. Tasks are given
IDs at this point."
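As a minimal sketch (the paths come from the command line, and the identity Mapper is used purely for illustration), a map-only driver just sets the reduce count to zero; the number of map tasks, and therefore the number of nodes used, still follows from the input splits:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "map-only example");
    job.setJarByClass(MapOnlyDriver.class);
    job.setMapperClass(Mapper.class);   // identity mapper, purely for illustration
    job.setNumReduceTasks(0);           // 0 reducers = map-only job; map output is written straight to HDFS
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}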

Hadoop Map/Reduce Job distribution

I have 4 nodes and I am running a MapReduce sample project to see if the job is being distributed across all 4 nodes. I ran the project multiple times and noticed that the mapper task is split among all 4 nodes, but the reducer task is only done by one node. Is this how it is supposed to be, or is the reducer task supposed to be split among all 4 nodes as well?
Thank you
Distribution of mappers depends on which block of data the mapper will operate on. By default the framework tries to assign the task to a node that has that block of data stored locally; this avoids transferring the data over the network.
For reducers it again depends on the number of reducers your job requires. If your job uses only one reducer, it may be assigned to any of the nodes.
Speculative execution also has an impact. If it is on, multiple instances of a map or reduce task can be started on different nodes, and the JobTracker, based on percentage completion, decides which one goes through; the other instances are killed.
Let us say you have a 224 MB file. When you add that file to HDFS with the default block size of 64 MB, it is split into 4 blocks [blk1=64M, blk2=64M, blk3=64M, blk4=32M]. Let us assume blk1 is on node1 (represented as blk1::node1), blk2::node2, blk3::node3, blk4::node4. Now when you run the MR job, the map tasks need to access the input file, so the MR framework creates 4 mappers, one executed on each node. Now comes the reducer; as Venkat said, it depends on the number of reducers configured for your job. The reducers can be configured using the Hadoop org.apache.hadoop.mapreduce.Job setNumReduceTasks(int tasks) API.
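A short sketch of that call (the count of 4 simply matches the 4-node cluster in the question); the scheduler still decides which nodes actually run the reduce tasks:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Job job = Job.getInstance(new Configuration(), "sample job");
job.setNumReduceTasks(4);   // ask for 4 reduce tasks so the reduce work can spread across the 4 nodes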

hadoop: limit number of concurrent map / reduce tasks per job

I want to submit a research job into a production cluster. As I don't need this job to finish quickly, and I don't want to delay production jobs, I want to limit the number of tasks that are executing for that job at any given time. Can I do that in Hadoop 2?
To limit Hadoop MapReduce resources (map/reduce slots), the Fair Scheduler can be used. Create a new Fair Scheduler pool with the desired maximum number of mappers and reducers, and submit the job to that newly created pool.
You can also do the following:
job.getConfiguration().setInt("mapred.map.tasks", 1);
job.setNumReduceTasks(1);
job.setPriority(JobPriority.VERY_LOW);
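In Hadoop 2 (YARN) the Fair Scheduler expresses per-pool limits as resource caps rather than map/reduce slot counts. A sketch of such a queue in fair-scheduler.xml (the queue name and limits are made up):

<allocations>
  <queue name="research">
    <!-- caps how many containers (map/reduce tasks) this queue can run at once -->
    <maxResources>8192 mb, 8 vcores</maxResources>
    <maxRunningApps>1</maxRunningApps>
    <weight>0.25</weight>
  </queue>
</allocations>

The job is then pointed at that queue on submission, for example by adding job.getConfiguration().set("mapreduce.job.queuename", "research"); next to the snippet above.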
