Hadoop Fair Scheduler not assigning tasks to some nodes - hadoop

I'm trying to run the Fair Scheduler, but it's not assigning Map tasks to some nodes with only one job running. My understanding is that the Fair Scheduler will use the conf slot limits unless multiple jobs exist, at which point the fairness calculations kick in. I've also tried setting all queues to FIFO in fair-scheduler.xml, but I get the same results.
I've set the scheduler in all mapred-site.xml files with the mapreduce.jobtracker.taskscheduler parameter (although I believe only the JobTracker needs it), and some nodes have no problem receiving and running Map tasks. However, other nodes either never get any Map tasks, or get one round of Map tasks (i.e., all slots filled once) and then never get any again.
I tried this as a prerequisite to developing my own LoadManager, so I went ahead and put a debug LoadManager together. From log messages, I can see that the problem nodes keep requesting Map tasks, and that their slots are empty. However, they're never assigned any.
All nodes work perfectly with the default scheduler. I just started having this issue when I enabled the Fair Scheduler.
Any ideas? Does someone have this working, and has taken a step that I've missed?
EDIT: It's worth noting that the Fair Scheduler web UI page indicates the correct Fair Share count, but that the Running column is always less. I'm using the default per-user pools and only have 1 user and 1 job at a time.

The reason was the undocumented mapred.fairscheduler.locality.delay parameter. The problematic nodes were located on a different rack with HDFS disabled, making all tasks on these nodes non-rack local. Because of this, they were incurring large delays due to the Fair Scheduler's Delay Scheduling algorithm, described here.
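To illustrate how a long locality delay can leave such nodes idle, here is a minimal Python sketch of the delay scheduling idea. It is a toy model, not the Fair Scheduler's actual code: the function name simulate_offers, the heartbeat tuples, and the millisecond values are all assumptions made for illustration, and the real scheduler distinguishes more locality levels.

```python
def simulate_offers(offers, locality_delay_ms):
    """Toy delay-scheduling model (illustrative only, not Hadoop's code).

    offers: list of (time_ms, node, is_local) slot offers in time order.
    A local offer is always accepted and restarts the wait clock.
    A non-local offer is accepted only once the job has gone at least
    locality_delay_ms without launching a local task.
    Returns the list of nodes on which a task was launched.
    """
    launched = []
    wait_start = 0
    for time_ms, node, is_local in offers:
        if is_local:
            launched.append(node)
            wait_start = time_ms  # a local launch resets the wait
        elif time_ms - wait_start >= locality_delay_ms:
            launched.append(node)  # waited long enough, accept non-local
        # otherwise the offer is declined and the slot stays empty
    return launched


# A node on another rack (never local) heartbeats between local offers.
offers = [(0, "local", True), (1000, "remote", False),
          (2000, "local", True), (3000, "remote", False)]
print(simulate_offers(offers, locality_delay_ms=5000))
print(simulate_offers(offers, locality_delay_ms=0))
```

With the 5000 ms delay, the remote node keeps requesting work and is declined every time, which matches the symptom described in the question; with a delay of 0 it receives tasks on each heartbeat.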

Related

How many reducers can simultaneously run?

I'm learning Big Data at uni and I'm kind of confused on the topic of MapReduce. I was wondering how many reducers can run simultaneously. For example, let's say we had 864 reducers: how many could run simultaneously?
All of them can run simultaneously, depending on the state of the cluster (its health, i.e. no rogue or bad nodes), its capacity, and how free it is. If other MR jobs are running on the same cluster, only some of your 864 reducers will go into the running state; once capacity frees up, another set of reducers will start running.
There is also a case that sometimes happens where your reducers and mappers keep preempting each other and take up all the memory. The job fails in the majority of these cases. To avoid this, we generally set a smaller number of reducers.
The one-line answer is: all of them can run simultaneously, as each reducer performs an independent unit of work in the MapReduce framework.
Now, how many would actually run in parallel, or more precisely when each of them would be scheduled to run depends on many factors including but not limited to resource availability, scheduling mechanism, cluster configuration etc.
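As a rough illustration of the capacity point, the following Python sketch estimates how many reducers run at once and how many "waves" are needed. The function name reducer_waves and the slot counts are hypothetical; real scheduling also depends on other jobs, memory, and the scheduler in use.

```python
import math


def reducer_waves(num_reducers, available_slots):
    """Back-of-the-envelope estimate, assuming a fixed number of free
    reduce slots and no competing jobs (both are simplifications)."""
    if available_slots <= 0:
        raise ValueError("available_slots must be positive")
    running_at_once = min(num_reducers, available_slots)
    waves = math.ceil(num_reducers / available_slots)
    return running_at_once, waves


# With enough free capacity, all 864 reducers run in a single wave.
print(reducer_waves(864, 864))
# With only 200 free slots, they run 200 at a time over several waves.
print(reducer_waves(864, 200))
```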

Master and Slaves in Hadoop

I know that Hadoop divides the work into independent chunks. But imagine one mapper finishes handling its tasks before the other mappers: can the master program give this mapper work (i.e. some tasks) that was already assigned to another mapper? If yes, how?
Read up on speculative execution. From the Yahoo Tutorial:
One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.
By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.
Speculative execution is enabled by default. You can disable speculative execution for the mappers and reducers by setting the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false, respectively.
The Yahoo Tutorial information, which only covers MapReduce v1, is a little out of date, though the concepts are the same. The new options for MR v2 are now:
mapreduce.map.speculative
mapreduce.reduce.speculative
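As a small sketch of how these options might be passed on the command line, the Python helper below builds -D arguments from the property names quoted above. The helper name speculative_flags is hypothetical, and it assumes the job's driver parses generic options (for example via ToolRunner) so that -D settings take effect.

```python
# Property names as given in the answers above, keyed by MapReduce version.
SPECULATIVE_PROPERTIES = {
    1: ("mapred.map.tasks.speculative.execution",
        "mapred.reduce.tasks.speculative.execution"),
    2: ("mapreduce.map.speculative",
        "mapreduce.reduce.speculative"),
}


def speculative_flags(enabled, mr_version=2):
    """Return -D arguments that switch speculative execution on or off
    for both mappers and reducers (illustrative helper, not a Hadoop API)."""
    value = "true" if enabled else "false"
    flags = []
    for name in SPECULATIVE_PROPERTIES[mr_version]:
        flags += ["-D", f"{name}={value}"]
    return flags


# Arguments to append after the main class in a `hadoop jar ...` invocation.
print(" ".join(speculative_flags(False)))
print(" ".join(speculative_flags(False, mr_version=1)))
```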

For a large mapreduce job, with a few lingering reducers, can this job be safely downsized?

Chris Smith answered this question and said I could post it.
If you have a 200-node mapreduce job, with just 3 running reduce jobs
left lingering, is it safe to switch off all nodes except the master
and the 3 with the running jobs?
Plus maybe a handful more in case of a bad node needing replacement?
If the answer to this question is "yes" it's strange that emr doesn't
automatically turn off most of the nodes when they're not in
use.
Lately there's been several jobs that mostly finished, but with a few
reduces lingering. I think this is costing us since the not-in-use
nodes stay up.
There are these issues I can think of:
-- when does data get copied to S3? If a node is not in use in terms
of running reduce, could it still be needed for copying to S3? In that
case, answer to my question is you're basically never safe to switch
off nodes
-- what happens if one of 3 jobs fails? Master/job coordinator should
reassign it to another node. I guess you're safe as long as it can
keep track of what boxes are up, and not wrongly assign to a box that
has been shut off.
If you have a 200-node mapreduce job, with just 3 running reduce jobs
left lingering, is it safe to switch off all nodes except the master
and the 3 with the running jobs?
Plus maybe a handful more in case of a bad node needing replacement?
If the answer to this question is "yes" it's strange that emr doesn't
automatically turn off most of the nodes when they're not in
use.
Keep in mind, EMR is a very thin layer over Hadoop. If you were doing distributed computation on Amazon's fabric, you could be a TON more efficient with something customized for its specific needs which would not really resemble Hadoop or Map/Reduce at all. If you're doing a lot of heavy work with Hadoop, you are often better off with your own cluster or at least with a dedicated cluster in the cloud (that way data is already sliced up on local disk and output need only be persisted to local disk). EMR's main virtue is that it is quick and dirty and hooks in nicely to other parts of AWS (like S3).
Lately there's been several jobs that mostly finished, but with a few
reduces lingering. I think this is costing us since the not-in-use
nodes stay up.
It most definitely is costing you, particularly in terms of runtime. I'd start by being concerned about why the completion times are so non-uniform.
There are these issues I can think of:
-- when does data get copied to S3? If a node is not in use in terms
of running reduce, could it still be needed for copying to S3? In that
case, answer to my question is you're basically never safe to switch
off nodes
If you are referring to the output of a job, if you have S3 as your output path for your job configuration, then data from a given task will be written out to S3 before the task exits.
-- what happens if one of 3 jobs fails? Master/job coordinator should
reassign it to another node. I guess you're safe as long as it can
keep track of what boxes are up, and not wrongly assign to a box that
has been shut off.
Well... it's a bit more complicated than that... When the new node is assigned the job, it has to pull the data from somewhere. That somewhere is typically the mappers that generated the data in the first place. If they aren't there anymore, the map tasks may need to be rerun (or more likely: the job will fail). Normally the replication factor on map output is 1, so this is an entirely plausible scenario. This is one of a few reasons why Hadoop jobs can have their "% complete" go backwards... mappers can even go back from 100% to <100%.
Related to this: it's conceivable, depending on the stage those reducer jobs are in, that they have yet to receive all of the map output that feeds in to them. Obviously in THAT case killing the wrong mapper is deadly.
I think it is important to highlight the difference between taking offline TaskTracker only nodes, vs. nodes running TaskTracker + DataNode service. If you take off more than a couple of the latter, you're going to lose blocks in HDFS, which is usually not a great thing for your job (unless you really don't use HDFS for anything other than distributing your job). You can take off a couple of nodes at a time, and then run a rebalancer to "encourage" HDFS to get the replication factor of all blocks back up to 3. Of course, this triggers network traffic and disk I/O, which might slow down your remaining tasks.
tl;dr: there can be problems killing nodes. While you can be confident that a completed task, which writes its output to S3, has completely written out all of its output by the time the JobTracker is notified the task has completed, the same can't be said for map tasks, which write out to their local directory and transfer data to reducers asynchronously. Even if all the map output has been transferred to its target reducers, if your reducers fail (or if speculative execution triggers the spinning up of a task on another node), you may really need those other nodes, as Hadoop will likely turn to them for input data for a reassigned reducer.
--
Chris
P.S. This can actually be a big pain point for non-EMR Hadoop setups as well (instead of paying for nodes longer than you need them, it presents as having nodes sitting idle when you have work they could be doing, along with massive compute time loss due to node failures). As a general rule, the tricks to avoid the problem are: keep your task sizes pretty consistent and in the 1-5 minute range, enable speculative execution (really crucial in the EMR world where node performance is anything but consistent), keep replication factors up well above your expected node losses for a given job (depending on your node reliability, once you cross >400 nodes with day long job runs, you start thinking about a replication factor of 4), and use a job scheduler that allows new jobs to kick off while old jobs are still finishing up (these days this is usually the default, but it was a totally new thing introduced ~Hadoop 0.20 IIRC). I've even heard of crazy things like using SSDs for mapout dirs (while they can wear out fast from all the writes, their failure scenarios tend to be less catastrophic for a Hadoop job).

Suspending hadoop nodes temporarily - background hadoop cluster

I wonder if it is possible to install a "background" hadoop cluster. I mean, after all it is meant to be able to deal with nodes being unavailable or slow sometimes.
So assume some university has a computer lab. Say, 100 boxes, all with upscale desktop hardware, gigabit Ethernet, probably even identical software installations. Linux is really popular here, too.
However, these 100 boxes are of course meant to be desktop systems for students. There are times where the lab will be full, but also times where the lab will be empty. User data is mostly stored on a central storage - say NFS - so the local disks are not used a lot.
Sounds like a good idea to me to use the systems as Hadoop cluster in their idle time. The simplest setup would be of course to have a cron job start the cluster at night, and shut down in the morning. However, also during the day many computers will be unused.
However, how would Hadoop react to e.g. nodes being shut down when any user logs in? Is it possible to easily "pause" (preempt!) a node in hadoop, and moving it to swap when needed? Ideally, we would give Hadoop a chance to move away the computation before suspending the task (also to free up memory). How would one do such a setup? Is there a way to signal Hadoop that a node will be suspended?
As far as I can tell, datanodes should not be stopped, and maybe replication needs to be increased to have more than 3 copies. With YARN there might also be a problem that by moving the task tracker to an arbitrary node, it may be the one that gets suspended at some point. But maybe it can be controlled that there is a small set of nodes that is always on, and that will run the task trackers.
Is it appropriate to just stop the tasktracker or send a SIGSTOP (then resume with SIGCONT)? The first would probably give hadoop the chance to react, the second would continue faster when the user logs out soon (as the job can then continue). How about YARN?
First of all, Hadoop doesn't support preemption in the way you describe. Hadoop simply restarts a task if it detects that the task tracker is dead. So in your case, when a user logs into a host, some script simply kills the tasktracker, and the jobtracker marks all mappers and reducers that were running on the killed tasktracker as FAILED. After that, these tasks are rescheduled on different nodes.
Of course, such a scenario is not free. By design, mappers and reducers keep all intermediate data on their local hosts. Moreover, reducers fetch mapper output directly from the tasktrackers where the mappers were executed. So when a tasktracker is killed, all of that data is lost. For mappers this is not a big problem, since a mapper usually works on a relatively small amount of data (gigabytes?), but reducers suffer more. A reducer runs the shuffle, which is costly in terms of network bandwidth and CPU. If a tasktracker is running a reducer, restarting that reducer means all of its data has to be downloaded again onto the new host.
And as I recall, the jobtracker doesn't see immediately that a tasktracker is dead, so killed tasks won't restart immediately.
If your workload is light, the datanodes can stay up permanently; don't take them offline when a user logs in. A datanode uses a small amount of memory (256M should be enough for a small amount of data) and, if your workload is light, doesn't use much CPU or disk I/O.
In conclusion, you can set up such a configuration, but don't rely on good, predictable job execution under moderate workloads.

Can Hadoop distribute tasks and code base?

I'm starting to play around with Hadoop (but don't have access to a cluster yet, so I'm just playing around in standalone mode). My question is, once it's in a cluster setup, how are tasks distributed, and can the code base be transferred to new nodes?
Ideally, I would like to run large batch jobs and, if I need more capacity, add new nodes to the cluster. But I'm not sure whether I'll have to copy the same code that's running locally, or do something special so that I can add capacity while the batch job is running. I thought I could store my codebase on HDFS and have it pulled locally to run every time I need it, but that still means I need some kind of initial script on the server and need to run it manually first.
Any suggestions or advice on if this is possible would be great!
Thank you.
When you schedule a mapreduce job using the hadoop jar command, the jobtracker will determine how many mappers are needed to execute your job. This is usually determined by the number of blocks in the input file, and this number is fixed, no matter how many worker nodes you have. It then will enlist one or more tasktrackers to execute your job.
The application jar (along with any other jars that are specified using the -libjars argument), is copied automatically to all of the machines running the tasktrackers that are used to execute your jars. All of that is handled by the Hadoop infrastructure.
Adding additional tasktrackers will increase the parallelism of your job assuming that there are as-yet-unscheduled map tasks. What it will not do is automatically re-partition the input to parallelize across additional map capacity. So if you have a map capacity of 24 (assuming 6 mappers on each of 4 data nodes), and you have 100 map tasks with the first 24 executing, and you add another data node, you'll get some additional speed. If you have only 12 map tasks, adding machines won't help you.
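The arithmetic in this paragraph can be sketched in Python as follows. The function name map_waves is hypothetical, and the model assumes every task takes about the same time and ignores data locality, so it only approximates the effect of adding a node.

```python
import math


def map_waves(num_map_tasks, nodes, slots_per_node):
    """Number of rounds needed to run all map tasks, assuming equal-length
    tasks and that every slot is usable (a deliberate simplification)."""
    capacity = nodes * slots_per_node
    if capacity <= 0:
        raise ValueError("cluster must have at least one map slot")
    return math.ceil(num_map_tasks / capacity)


# 100 map tasks, 6 map slots per node: a fifth node saves one wave.
print(map_waves(100, nodes=4, slots_per_node=6))
print(map_waves(100, nodes=5, slots_per_node=6))

# Only 12 map tasks: they already fit in one wave, so a new node adds nothing.
print(map_waves(12, nodes=4, slots_per_node=6))
print(map_waves(12, nodes=5, slots_per_node=6))
```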
Finally, you need to be aware of data reference locality. Since the data should ideally be processed on the same machines that store it initially, adding new task trackers will not necessarily add proportional processing speed, since the data will not be local on those nodes initially and will need to be copied over the network.
I do not quite agree with Daniel's reply.
Primarily because if "on starting a job, jar code will be copied to all the nodes that the cluster knows of" is true, then even if you use 100 mappers and there are 1000 nodes, code for all jobs will always be copied to all the nodes. Does not make sense.
Instead, Chris Shain's reply makes more sense: whenever the JobScheduler on the JobTracker chooses a job to execute and identifies a task to be executed by a particular datanode, at that point it somehow tells the tasktracker where to copy the codebase from.
Initially (before mapreduce job start), the codebase was copied to multiple locations as defined by mapred.submit.replication parameter. Hence, tasktracker can copy the codebase from several locations a list of which may be sent by jobtracker to it.
Before attempting to build a Hadoop cluster I would suggest playing with Hadoop using Amazon's Elastic MapReduce.
With respect to the problem that you are trying to solve, I am not sure that Hadoop is a proper fit. Hadoop is useful for trivially parallelizable batch jobs (parsing thousands or more documents, sorting, re-bucketing data). Hadoop Streaming will allow you to create mappers and reducers using any language that you like, but the inputs and outputs must be in a fixed format. There are many uses but, in my opinion, process control was not one of the design goals.
[EDIT] Perhaps ZooKeeper is closer to what you are looking for.
You could add capacity to the batch job if you want, but it needs to be presented as a possibility in your codebase. For example, if you have a mapper that holds a set of inputs that you want to spread across multiple nodes to take the pressure off, you can. All of this can be done, but not with the default Hadoop install.
I'm currently working on a Nested Map-Reduce framework that extends the Hadoop codebase and allows you to spawn more nodes based on the inputs that the mapper or reducer gets. If you're interested, drop me a line and I'll explain more.
Also, when it comes to the -libjars option, this only works for the nodes that are assigned by the jobtracker as instructed by the job you write. So if you specify 10 mappers, -libjars will copy your code there. If you want to start with 10 but work your way up, the nodes you add will not have the code.
The easiest way to bypass this is to add your jar to the classpath in the hadoop-env.sh script. That way, on starting a job, the jar is always copied to all the nodes that the cluster knows of.
