Limiting the number of mappers running on Hadoop Streaming - hadoop

Is it possible to limit the number of mappers running for a job at any given time using Hadoop Streaming? For example, I have a 28 node cluster that can run 1 task per node. If I have a job with 100 tasks, I'd like to only use say 20 out of the 28 nodes at any point in time. I'd like to do limit some jobs because they may contain many long running tasks and I sometimes want to run some faster running jobs and be sure that they can run immediately, rather than wait for the long running job to finish.
I saw this question and the title is spot on but the answers don't seem to address this particular issue.
Thanks!

While i am not aware about "node-wise" capacity scheduling, there is alternative scheduler built for the very similar case: Capacity Scheduler.
http://hadoop.apache.org/common/docs/r0.19.2/capacity_scheduler.html
You should define special queue for potentially long jobs and queue for short jobs and this scheduler will care to have some capacity to be always available for each queue's jobs.

Following option may make sense if the amount of work in each mapper is substantial, since this strategy does involve overhead of reading up to 20 counters in each map invocation.
Create a group of counters and make the groupname MY_TASK_MAPPERS . make the key equal to MAPPER<1..K> where K is the max #of mappers you want. Then in the Mapper iterate through the counters until one of them is found to be 0. Place the machine's un-dotted ip address as a long value in the counter - effectively assigning that one machine to that mapper. If instead all K are already taken, then just quit the mapper without doing anything.

Related

How many reducers can simultaneously run?

Learning Big Data at Uni and I'm kind of confused on the topic of MapReduce. I was wondering how many reducers can run simultaneously. For example lets say if we had 864 reducers, how many could run simultaneously?
All of them can run simultaneously depending upon what is the state(health, i.e. no rouge/bad node) of cluster is, what is the capacity of the cluster is and also how free the cluster is. If there are other MR jobs running on the same cluster then out of your 864 reducers only few will go in running state, and once the capacity is free then another set of reducer will start running.
Also there is one case which happens sometimes is when your reducer/mapper keep on preempting each other and takes up the whole memory. Job fails in majority of this case. To avoid this we generally set less number of reducer.
One line answer is - all of them can run simultaneously; as each of the reducer performs an independent unit of task in map reduce framework.
Now, how many would actually run in parallel, or more precisely when each of them would be scheduled to run depends on many factors including but not limited to resource availability, scheduling mechanism, cluster configuration etc.

Why the running time of a mapper should be more than 1 minute?

I have read from many blogs/web pages that state
the running time of a mapper should be more than X minutes
I understand there are overheads involved in setting up a mapper but how exactly is this calculated? Why is it after X minutes the overhead then is justified? And when we talk about overheads, what are the Hadoop overheads?
Its not a hard code rule but makes sense. At the background so many small process are handled out before a mapper is started. Its initialization,other stuffs apart from the real processing would itself take 10-15 seconds. So to reduce the number of split which in turn would reduce the mapper count, maxsplitsize could be set to some higher value that is what that blog conveys. If we fail to do that. Below are the overheads the MR framework has to handle while creating a mapper.
Calculating splits for that mapper.
The job scheduler in jobtracker has to create a separarte map task this would increase the latency a bit.
When it comes to assignment the jobtracker will have to look for a tasktracker based on its data locality. This will again involving creating local temp directories in the tasktracker which would be used up by the setup and cleanup task for that mapper, for example in the setup the if we are reading from a distributed cache and putting that intop a hashmap or initializing and cleaning up something in the mapper.And if already there are enough map and reduce tasks running in that tasktracker this would put a overhead on the tasktracker.
In worst case the number of fixed map task is full, then the JT will have to look for a different TT which would lead to a remote read.
Also TT would only send the heartbeat to JT once in 3 seconds this would cause a delay in the job initialization because the TT would have to contact the JT to run a job as well as sending the completed status.
Unfortunately if your mapper fails then that task would be run 3 times before it finally fails.

How many Mapreduce Jobs can be run simultaneously

I want to know How many Mapreduce Jobs can be submit/run simultaneously in a single node hadoop envirnment.Is there any limit?
From a configuration standpoint, there's no limit I'm aware of. You can set the number of map and reduce slots to whatever you want. Practically, though, each slot has to spin up a JVM capable of running some hadoop code, which requires some amount of memory, so eventually you would run out of memory on your machine. You might also have to configure job queues cleverly in order to run a ton at the same time.
Now, what is possible is a very different question than what is a good idea...
You can submit as many jobs you want, they will be queued up and scheduler will run them based on FIFO(by default) and available resources.The number of jobs being executed by hadoop will depend as described by John above.
The number of Reducer slots is set when the cluster is configured. This will limit the number of MapReduce jobs based on the number of Reducers each job requests. Mappers are generally more limited by number of DataNodes and # of processors per node.

increase the number of map and reduce function

I have a question.
I want to increase my map and reduce functions to the number of my input data. when I execute System.out.println(conf.getNumReduceTasks()) and System.out.println(conf.getNumMapTasks()) it shows me:
1 1
and when I execute conf.setNumReduceTasks(1000000) and conf.setNumMapTasks(1000000) and again execute the println method it shows me:
1000000 1000000
but I think there is no change in my mapreduce program execution time. my input is from cassandra, actually it is the cassandra column family rows that is about 362000 rows.
I want to set the number of my map and reduce function to the number of input rows..
what should I do?
Setting the number of map/reduce tasks for your map/reduce job does define how many map/reduce processes will be used to process your job. Consider if you really need so many java processes.
That said, the number of map tasks is mostly determined automatically; setting the number of map tasks is only a hint that can increase the number of maps that were determined by Hadoop.
For reduce tasks, the default is 1 and the practical limit is around 1,000.
See: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
It's also important to understand that each node of your cluster also has a maximum number of map/reduce tasks that can execute concurrently. This is set by the following configuration settings:
mapred.tasktracker.map.tasks.maximum
and
mapred.tasktracker.reduce.tasks.maximum
The default for both of these is 2.
So increasing the number of map/reduce tasks will be limited to the number of tasks that can run simultaneously per node. This may be one reason you aren't seeing a change in execution time for your job.
See: http://hadoop.apache.org/docs/stable/mapred-default.html
The summary is:
Let Hadoop determine the number of maps, unless you want more map tasks.
Use the mapred.tasktracker..tasks.maximum settings to control how many tasks can run at one time.
The max value for number of reduce tasks should be somewhere between 1 or 2 * (mapred.tasktracker.reduce.tasks.maximum * #nodes). You also have to take into account how many map/reduce jobs you expect to run at once, so that a single job doesn't consume all the available reduce slots.
A value of 1,000,000 is almost certainly too high for either setting; it's not practical to run that many java processes. I expect that such high values are simply being ignored.
After setting the mapred.tasktracker..tasks.maximum to the number of tasks your nodes are able to run simultaneously, then try increasing your job's map/reduce tasks incrementally.
You can see the actual number of tasks used by your job in the job.xml file to verify your settings.

Why submitting job to mapreduce takes so much time in General?

So usually for 20 node cluster submitting job to process 3GB(200 splits) of data takes about 30sec and actual execution about 1m.
I want to understand what is the bottleneck in job submitting process and understand next quote
Per-MapReduce overhead is significant: Starting/ending MapReduce job costs time
Some process I'm aware:
1. data splitting
2. jar file sharing
A few things to understand about HDFS and M/R that helps understand this latency:
HDFS stores your files as data chunk distributed on multiple machines called datanodes
M/R runs multiple programs called mapper on each of the data chunks or blocks. The (key,value) output of these mappers are compiled together as result by reducers. (Think of summing various results from multiple mappers)
Each mapper and reducer is a full fledged program that is spawned on these distributed system. It does take time to spawn a full fledged programs, even if let us say they did nothing (No-OP map reduce programs).
When the size of data to be processed becomes very big, these spawn times become insignificant and that is when Hadoop shines.
If you were to process a file with a 1000 lines content then you are better of using a normal file read and process program. Hadoop infrastructure to spawn a process on a distributed system will not yield any benefit but will only contribute to the additional overhead of locating datanodes containing relevant data chunks, starting the processing programs on them, tracking and collecting results.
Now expand that to 100 of Peta Bytes of data and these overheads looks completely insignificant compared to time it would take to process them. Parallelization of the processors (mappers and reducers) will show it's advantage here.
So before analyzing the performance of your M/R, you should first look to benchmark your cluster so that you understand the overheads better.
How much time does it take to do a no-operation map-reduce program on a cluster?
Use MRBench for this purpose:
MRbench loops a small job a number of times
Checks whether small job runs are responsive and running efficiently on your cluster.
Its impact on the HDFS layer is very limited
To run this program, try the following (Check the correct approach for latest versions:
hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 50
Surprisingly on one of our dev clusters it was 22 seconds.
Another issue is file size.
If the file sizes are less than the HDFS block size then Map/Reduce programs have significant overhead. Hadoop will typically try to spawn a mapper per block. That means if you have 30 5KB files, then Hadoop may end up spawning 30 mappers eventually per block even if the size of file is small. This is a real wastage as each program overhead is significant compared to the time it would spend processing the small sized file.
As far as I know, there is no single bottleneck which causes the job run latency; if there was, it would have been solved a long time ago.
There are a number of steps which takes time, and there are reasons why the process is slow. I will try to list them and estimate where I can:
Run hadoop client. It is running Java, and I think about 1 second overhead can be assumed.
Put job into the queue and let the current scheduler to run the job. I am not sure what is overhead, but, because of async nature of the process some latency should exists.
Calculating splits.
Running and syncronizing tasks. Here we face with the fact that TaskTrackes poll the JobTracker, and not opposite. I think it is done for the scalability sake. It mean that when JobTracker wants to execute some task, it do not call task tracker, but wait that approprieate tracker will ping it to get the job. Task trackers can not ping JobTracker to frequently, otherwise they will kill it in large clusters.
Running tasks. Without JVM reuse it takes about 3 seconds, with it overhead is about 1 seconds per task.
Client poll job tracker for the results (at least I think so) and it also add some latency to getting information that job is finished.
I have seen similar issue and I can state the solution to be broken in following steps :
When the HDFS stores too many small files with fixed chunk size, there will be issues on efficiency in HDFS, the best way would be to remove all unnecessary files and small files having data. Try again.
Try with the data nodes and name nodes:
Stop all the services using stop-all.sh.
Format name-node
Reboot machine
Start all services using start-all.sh
Check data and name nodes.
Try installing lower version of hadoop (hadoop 2.5.2) which worked in two cases and it worked in hit and trial.

Resources