I want to know how many MapReduce jobs can be submitted/run simultaneously in a single-node Hadoop environment. Is there any limit?
From a configuration standpoint, there's no limit I'm aware of. You can set the number of map and reduce slots to whatever you want. Practically, though, each slot has to spin up a JVM capable of running some hadoop code, which requires some amount of memory, so eventually you would run out of memory on your machine. You might also have to configure job queues cleverly in order to run a ton at the same time.
Now, what is possible is a very different question than what is a good idea...
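The memory argument above can be put into rough numbers. This is only a back-of-the-envelope sketch: the function name, the 16 GB node, the 2 GB reserved for the OS and daemons, and the 1 GB per task JVM are all made-up illustration values, not Hadoop defaults.

```python
def max_task_slots(node_memory_mb, reserved_mb, heap_per_task_mb):
    """Upper bound on concurrent task JVMs that fit in one node's memory."""
    usable = node_memory_mb - reserved_mb
    if usable <= 0 or heap_per_task_mb <= 0:
        return 0
    return usable // heap_per_task_mb


# Hypothetical node: 16 GB RAM, 2 GB kept for the OS and Hadoop daemons,
# 1 GB of heap per task JVM.
print(max_task_slots(16384, 2048, 1024))  # 14
```

Configuring more slots than this bound is allowed, but the machine would start swapping or killing tasks well before all of them were busy.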
You can submit as many jobs as you want; they will be queued up and the scheduler will run them based on FIFO (by default) and available resources. The number of jobs actually being executed by Hadoop will depend on the factors John described above.
The number of Reducer slots is set when the cluster is configured. This will limit the number of MapReduce jobs based on the number of Reducers each job requests. Mappers are generally more limited by number of DataNodes and # of processors per node.
Learning Big Data at Uni and I'm kind of confused on the topic of MapReduce. I was wondering how many reducers can run simultaneously. For example, let's say we had 864 reducers: how many could run simultaneously?
All of them can run simultaneously, depending on the state of the cluster (health, i.e. no rogue/bad nodes), the capacity of the cluster, and how free the cluster is. If other MR jobs are running on the same cluster, then only a few of your 864 reducers will go into the running state, and once capacity frees up, another set of reducers will start running.
There is also a case that sometimes happens where your reducers/mappers keep preempting each other and take up all the memory. The job fails in the majority of these cases. To avoid this, we generally set a smaller number of reducers.
The one-line answer is: all of them can run simultaneously, as each reducer performs an independent unit of work in the MapReduce framework.
Now, how many would actually run in parallel, or more precisely when each of them would be scheduled to run depends on many factors including but not limited to resource availability, scheduling mechanism, cluster configuration etc.
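To make the "only a few at a time" point concrete, here is a small sketch that counts how many rounds ("waves") of reducers are needed for a given number of free reduce slots. The function name and the slot counts are hypothetical, and it assumes every wave fills all free slots, which a real scheduler does not guarantee.

```python
import math


def reducer_waves(total_reducers, free_reduce_slots):
    """Number of scheduling rounds needed if each round fills every free slot."""
    if free_reduce_slots <= 0:
        raise ValueError("at least one free reduce slot is required")
    return math.ceil(total_reducers / free_reduce_slots)


print(reducer_waves(864, 864))  # 1: enough capacity, all run at once
print(reducer_waves(864, 100))  # 9: eight full waves plus a partial one
```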
We have a nice, big, complicated elastic-mapreduce job that has wildly different constraints on hardware for the Mapper vs Collector vs Reducer.
The issue is: for the Mappers, we need tonnes of lightweight machines to run several mappers in parallel (all good there); the collectors are more memory hungry, but it should still be OK to give them about 6GB of peak heap each . . . but, the problem is the Reducers. When one of those kicks off, it will grab up about 32-64GB for processing.
The result is that we get a round-robin type of task death because the full memory of a box is consumed, which causes that one mapper and reducer to both be restarted elsewhere.
The simplest approach would be if we could somehow specify a way to have the reducer run on a different "group" (a handful of ginormous boxes) while having the mappers/collectors running on smaller boxes. This could also lead to significant cost-savings as well, as we really shouldn't be sizing the nodes mappers are running on to the demands of the reducers.
An alternative would be to "break up" the job so that there's a 2nd cluster that can be spun up to process the mappers' and collectors' output, but that's obviously "sub-optimal".
So, the questions are:
Is there a way to specify which "groups" a mapper or a reducer will run on in Elastic MapReduce and/or Hadoop?
Is there a way to prevent the reducers from starting until all the mappers are done?
Does anyone have other ideas on how to approach this?
Cheers!
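During a Hadoop MapReduce job, the reduce function itself only starts running after all the Mappers are done. Map output is partitioned on the map side to decide which Reducer receives which data, and is then shuffled to the Reducers and sorted. Reduce tasks may be launched earlier so they can begin copying finished map output (the threshold is the mapred.reduce.slowstart.completed.maps setting, mapreduce.job.reduce.slowstart.completedmaps in MR v2; setting it to 1.0 holds reducers back until every mapper has finished), but the actual reducing starts only after the Shuffle/Sort phase has ended, which is after the mappers are done.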
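For the question about keeping reducers from starting early, the relevant knob is the "slow start" fraction (mapred.reduce.slowstart.completed.maps in MR v1, default 0.05): reduce tasks become eligible for launch once that fraction of map tasks has completed. The sketch below only models that rule; the function name is made up and this is not Hadoop's scheduler code.

```python
def reducers_may_launch(completed_maps, total_maps, slowstart=0.05):
    """True once the completed fraction of map tasks reaches the slow-start threshold."""
    if total_maps == 0:
        return True
    return completed_maps / total_maps >= slowstart


# With the default threshold, reducers can be launched very early.
print(reducers_may_launch(10, 100))                 # True
# With the threshold raised to 1.0, they wait for every mapper.
print(reducers_may_launch(99, 100, slowstart=1.0))  # False
print(reducers_may_launch(100, 100, slowstart=1.0)) # True
```

Raising the threshold to 1.0 means a memory-hungry reducer never shares a box with still-running mappers from the same job, at the cost of losing the overlap between the copy phase and the map phase.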
I know that Hadoop divides the work into independent chunks. But imagine one mapper finishes handling its tasks before the other mappers: can the master program give this mapper work (i.e. some tasks) that was already assigned to another mapper? If yes, how?
Read up on speculative execution. From the Yahoo Tutorial:
One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.
By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.
Speculative execution is enabled by default. You can disable speculative execution for the mappers and reducers by setting the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false, respectively.
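The effect described in the tutorial can be illustrated with a toy timing model. Everything here is an assumption for illustration: the function name, the rule "launch a backup copy for any task still running at a fixed time", and the durations. Hadoop's real heuristics for picking stragglers are more involved.

```python
def job_finish_time(durations, speculate_at=None, backup_duration=None):
    """Finish time of a phase whose tasks all start at t=0.

    If speculate_at is given, any task still running at that time gets a
    backup copy that takes backup_duration; whichever copy ends first wins.
    """
    finish_times = []
    for duration in durations:
        if speculate_at is not None and duration > speculate_at:
            duration = min(duration, speculate_at + backup_duration)
        finish_times.append(duration)
    return max(finish_times)


# 99 map tasks take 10 s each; one on a slow node would take 100 s.
tasks = [10] * 99 + [100]
print(job_finish_time(tasks))                                      # 100
print(job_finish_time(tasks, speculate_at=12, backup_duration=10)) # 22
```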
The Yahoo Tutorial information, which only covers MapReduce v1, is a little out of date, though the concepts are the same. The new options for MR v2 are now:
mapreduce.map.speculative
mapreduce.reduce.speculative
Is it possible to limit the number of mappers running for a job at any given time using Hadoop Streaming? For example, I have a 28-node cluster that can run 1 task per node. If I have a job with 100 tasks, I'd like to use only, say, 20 of the 28 nodes at any point in time. I'd like to limit some jobs because they may contain many long-running tasks, and I sometimes want to run some faster jobs and be sure that they can run immediately, rather than wait for the long-running job to finish.
I saw this question and the title is spot on but the answers don't seem to address this particular issue.
Thanks!
While I am not aware of "node-wise" capacity scheduling, there is an alternative scheduler built for a very similar case: the Capacity Scheduler.
http://hadoop.apache.org/common/docs/r0.19.2/capacity_scheduler.html
You should define a dedicated queue for potentially long jobs and another queue for short jobs; this scheduler will take care that some capacity is always available for each queue's jobs.
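As a rough illustration of what splitting capacity between two queues means for the 28-slot cluster in the question, the sketch below turns percentage shares into slot counts. The queue names, the 70/30 split, and the function are hypothetical, and rounding down is a simplification; the real scheduler typically also lends idle capacity to a busy queue.

```python
def guaranteed_slots(total_slots, queue_percent):
    """Slots each queue is guaranteed, rounding each share down."""
    if sum(queue_percent.values()) > 100:
        raise ValueError("queue shares must not exceed 100 percent")
    return {name: total_slots * pct // 100 for name, pct in queue_percent.items()}


# Hypothetical split: 70% for long-running jobs, 30% for short ones.
print(guaranteed_slots(28, {"long": 70, "short": 30}))
# {'long': 19, 'short': 8}
```

With this split, short jobs always have about 8 slots they can start on immediately, which is the behavior the question asks for.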
The following option may make sense if the amount of work in each mapper is substantial, since this strategy does involve the overhead of reading up to 20 counters in each map invocation.
Create a group of counters with the group name MY_TASK_MAPPERS. Make the keys MAPPER<1..K>, where K is the maximum number of mappers you want. Then, in the Mapper, iterate through the counters until one of them is found to be 0. Place the machine's un-dotted IP address as a long value in that counter, effectively assigning that one machine to that mapper. If instead all K are already taken, then just quit the mapper without doing anything.
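The claiming logic in this answer can be sketched as follows. This is only a model of the idea: a plain dict stands in for the MY_TASK_MAPPERS counter group, the function names are made up, and "un-dotted IP" is assumed to mean the address as a 32-bit integer. Whether tasks in a real job can see each other's counter values, and do so atomically, is not established here.

```python
import ipaddress


def ip_to_long(ip):
    """Convert a dotted IPv4 address to its integer form."""
    return int(ipaddress.IPv4Address(ip))


def claim_slot(counters, k, ip):
    """Claim the first free MAPPER<i> slot (value 0), or return None if all K are taken."""
    for i in range(1, k + 1):
        key = f"MAPPER{i}"
        if counters.get(key, 0) == 0:
            counters[key] = ip_to_long(ip)
            return key
    return None  # caller should exit the mapper without doing any work


counters = {}  # stand-in for the MY_TASK_MAPPERS counter group
print(claim_slot(counters, 2, "10.0.0.5"))  # MAPPER1
print(claim_slot(counters, 2, "10.0.0.6"))  # MAPPER2
print(claim_slot(counters, 2, "10.0.0.7"))  # None
```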
I am currently using the wordcount application in Hadoop as a benchmark. I find that the CPU usage is nearly constant at around 80-90%. I would like to have a fluctuating CPU usage. Is there any Hadoop application that can give me this capability? Thanks a lot.
I don't think there's a way to throttle or specify a range for hadoop to use. Hadoop will use the CPU available to it. When I'm running a lot of jobs, I'm constantly in the 90%+ range.
One way you can control the CPU usage is to change the maximum number of mappers/reducers each tasktracker can run simultaneously. This is done through the mapred.tasktracker.{map|reduce}.tasks.maximum setting in $HADOOP_HOME/conf/mapred-site.xml. That tasktracker will use less CPU when the number of mappers/reducers is limited.
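For reference, Hadoop site files are XML documents made of name/value property entries. The sketch below renders such a fragment for the two tasktracker limits; the helper name and the values 2 and 1 are arbitrary examples, not recommendations.

```python
import xml.etree.ElementTree as ET


def render_properties(props):
    """Render a dict as a Hadoop-style <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Example values only: at most 2 concurrent mappers and 1 reducer per tasktracker.
xml = render_properties({
    "mapred.tasktracker.map.tasks.maximum": 2,
    "mapred.tasktracker.reduce.tasks.maximum": 1,
})
print(xml)
```

These limits are read by the tasktracker when it starts, so a change only takes effect after the tasktracker is restarted.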
Another way is to set the configuration value for mapred.map.tasks or mapred.reduce.tasks when setting up the job. This asks the job to use that many mappers/reducers; the reducer count is honored, while the mapper count is only a hint, because the number of input splits ultimately decides it. This number will be split across the available tasktrackers, so if you have 4 nodes and want each node to have 1 mapper you'd set mapred.map.tasks to 4. It's also possible that if a node can run 4 mappers, it will run all 4. I don't know exactly how Hadoop will split out the tasks, but setting a number per job is an option.
I hope that helps get you to where you're going. I still don't quite understand what you are looking for. :)