My organization has a server cluster running Univa Grid Engine 8.4.1, with users submitting various kinds of jobs, some using a single CPU core, and some using OpenMPI to utilize multiple cores, all with varying and unpredictable run-times.
We've enabled a ticketing system so that one user can't hog the entire queue, but if the grid and queue are full of single-CPU jobs, no multi-CPU job can ever start (they just sit at the top of the queue waiting for the required number of CPU slots to become free, which generally never happens). We're looking to configure Resource Reservation such that, if an MPI job is next in the queue, the grid will hold slots open as they become free until there are enough to start the MPI job, rather than filling them with the single-CPU jobs that are further down in the queue.
I've read (here for example) that the grid decides which slots to "reserve" based on how much time is remaining on the jobs running in those slots. The problem we have is that our jobs have unknown run-times. Some take a few seconds, some take weeks, and while we have a rough idea how long a job will take, we can never be sure. Thus, we don't want to start running qsub with hard and soft time limits through -l h_rt and -l s_rt, or else our jobs could be killed prematurely. Resource Reservation appears to be using the default_duration, which we set to infinity for lack of a better number to use, and treating all jobs equally. It's picking slots filled by month-long jobs which have already been running for a few days, instead of slots filled by minute-long jobs which have only been running for a few seconds.
Is there a way to tell the scheduler to reserve slots for a multi-CPU MPI job as they become available, rather than pre-select slots based on some perceived run-time of the jobs in them?

Unfortunately I'm not aware of a way to do what you ask - I think that the reservation is created once at the time that the job is submitted, not progressively as slots become free. If you haven't already seen the design document for the Resource Reservation feature, it's worth a look to get oriented to the feature.
Instead, I'm going to suggest some strategies for confidently setting job runtimes. The main problem when none of your jobs have runtimes is that Grid Engine can't reserve space infinitely in the future, so even if you set some really rough runtimes (within an order of magnitude of the true runtime), you may get some positive results.
If you've run a similar job previously, one simple rule of thumb is to set max runtime to 150% of the typical or maximum runtime of the job, based on historical trends. Use qacct or parse the accounting file to get hard data. Of course, tweak that percentage to whatever suits your risk threshold.
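The 150% rule of thumb can be sketched in a few lines of Python. This is only an illustration: the runtimes in `history` are made-up values standing in for numbers you would pull from qacct, and `suggest_h_rt` is a hypothetical helper name, not part of Grid Engine.

```python
import math

def suggest_h_rt(past_runtimes_s, safety_factor=1.5):
    """Return an HH:MM:SS limit from past wall-clock runtimes (in seconds)."""
    if not past_runtimes_s:
        raise ValueError("need at least one historical runtime")
    # Take the worst case seen so far and pad it by the safety factor.
    limit = math.ceil(max(past_runtimes_s) * safety_factor)
    hours, rem = divmod(limit, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

# Hypothetical runtimes (seconds) of earlier runs of the same job.
history = [3100, 3600, 2950]
print(suggest_h_rt(history))  # 01:30:00
```

The resulting string is in the form accepted by -l h_rt; raise or lower `safety_factor` to match your risk threshold.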
Another rule of thumb is to set the max runtime not based on the job's true runtime, but based on a sense around "after this date, the results won't be useful" or "if it takes this long, something's definitely wrong". If you need an answer by Friday, there's no sense in setting the runtime limit for three months out. Similarly, if you're running md5sum on typically megabyte-sized files, there's no sense in setting a 1-day runtime limit; those jobs ought to only take a few seconds or minutes, and if it's really taking a long time, then something is broken.
If you really must allow true indefinite-length jobs, then one option is to divide your cluster into infinite and finite queues. Jobs specifying a finite runtime will be able to use both queues, while infinite jobs will have fewer resources available; this will incentivize users to work a little harder at picking runtimes, without forcing them to do so.
Finally, be sure that the multi-slot jobs are submitted with the -R y qsub flag to enable the resource reservation system. This could go in the system default sge_request file, but that's generally not recommended as it can reduce scheduling performance:
Since reservation scheduling performance consumption is known to grow with the number of pending jobs, use of -R y option is recommended only for those jobs actually queuing for bottleneck resources.


How long does it take to process the file if I have only one worker node?

Let's say I have data with 25 blocks and the replication factor is 1. The mapper requires about 5 minutes to read and process a single block of the data. How can I calculate the time for one worker node? What about 15 nodes? Will the time change if we change the replication factor to 3?
I really need help.
First of all, I would advise reading some scientific papers on the issue (Google Scholar is a good starting point).
Now a bit of discussion. From my latest experiments I have concluded that processing time is very strongly related to the amount of data you want to process (which makes sense). On our cluster, it takes around 7-8 seconds on average for a Mapper to read a 128 MB block. There are several factors you need to consider in order to predict the overall execution time:
How much data does the Mapper produce? This more or less determines the time Hadoop needs for shuffling.
What is the Reducer doing? Does it do some iterative processing? (This might be slow!)
How are the resources configured? (How many Mappers and Reducers are allowed to run on the same machine?)
Finally, are there other jobs running simultaneously? (This might slow the jobs down significantly, since your Reducer slots can be occupied waiting for data instead of doing useful work.)
So even for one machine you can see how complex it is to predict job execution time. During my study I was able to conclude that, on average, one machine is capable of processing 20-50 MB/second (the rate is calculated as total input size / total job running time). The processing rate includes the staging time (for example, when your application is starting and uploading required files to the cluster). The processing rate differs between use cases and is greatly influenced by the input size and, more importantly, by the amount of data produced by the Mappers (once again, these values are for our infrastructure; on a different machine configuration you will see completely different execution times).
When you start scaling your experiments you will see improved performance on average, but once again my study showed that the improvement is not linear. You would need to fit a model for your own infrastructure, with the relevant variables, to approximate the job execution time.
Just to give you an idea, I will share part of the results. The rate when executing a given use case on 1 node was ~46 MB/second, for 2 nodes it was ~73 MB/second, and for 3 nodes it was ~85 MB/second (in my case the replication factor was equal to the number of nodes).
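To make the non-linearity concrete, here is a small sketch that turns the three measured rates above into speedup and per-node efficiency. The function name `scaling_table` is just an illustrative choice; the only inputs are the figures quoted in this answer.

```python
def scaling_table(rates_by_nodes):
    """Map node count -> (speedup vs. 1 node, efficiency per node)."""
    base = rates_by_nodes[1]
    table = {}
    for nodes, rate in sorted(rates_by_nodes.items()):
        speedup = rate / base
        table[nodes] = (speedup, speedup / nodes)
    return table

# Measured rates in MB/second from the experiments described above.
measured = {1: 46, 2: 73, 3: 85}
for nodes, (speedup, efficiency) in scaling_table(measured).items():
    print(f"{nodes} node(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

With these figures the second node adds roughly 59% more throughput and the third adds less again, which is why a simple "time divided by number of nodes" estimate tends to be too optimistic.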
The problem is complex and requires time, patience and some analytical skills to solve. Have fun!
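For the numbers in the question (25 blocks, about 5 minutes per block), a very rough lower bound on the map phase can be sketched as below. The assumptions are mine, not measured: one map task per block, one map slot per node (the `slots_per_node` parameter), and no time spent on startup, shuffle, reduce, or retries. Replication does not appear in this simplified model; in practice it mainly affects data locality, so treat the result as a best case only.

```python
import math

def map_phase_minutes(blocks, minutes_per_block, nodes, slots_per_node=1):
    """Idealized map-phase time: tasks run in 'waves' that fill all slots."""
    slots = nodes * slots_per_node
    waves = math.ceil(blocks / slots)
    return waves * minutes_per_block

print(map_phase_minutes(25, 5, nodes=1))   # 125
print(map_phase_minutes(25, 5, nodes=15))  # 10
```

Note that 15 nodes need two waves for 25 blocks (15 tasks, then 10), so the ideal time is 10 minutes rather than 125 / 15.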

Job scheduling algorithm for cluster

I'm searching for an algorithm suitable for the problem below:
There are multiple computers (the exact number is unknown). Each computer pulls a job from a central queue, completes it, then pulls the next one. Jobs are produced by a group of users. Some users submit lots of jobs, some only a few. Jobs consume equal CPU time (not really, but it's a fair approximation).
The central queue should be fair when scheduling jobs. Also, users who submit lots of jobs should have some minimal share of resources.
I'm searching for a good algorithm for this scheduling.
I have considered two candidates:
Hadoop-like fair scheduler. The problem here is: where do I get minimal shares from when my cluster size is unknown?
Associate some penalty with each user. Increment the penalty when a user's job is scheduled. Use 1 - (normalized penalty) as the probability of scheduling a job for that user. This is something like stride scheduling, but I could not find any good explanation of it.
When I implemented a very similar job runner (for a production system), I ended up having each server choose jobtypes at random. This was my reasoning:
a glut of jobs from one user should not impact the chance of other users having their jobs run (user-user fairness)
a glut of one jobtype should not impact the chance of other jobtypes being run (user-job and job-job fairness)
if there is only one jobtype from one user waiting to run, all servers should be running those jobs (no wasted capacity)
the system should run the jobs "fairly", i.e. proportionate to the number of waiting users and jobtypes and not the total waiting jobs (a large volume of one jobtype should not cause scheduling to favor it) (jobtype fairness)
the number of servers can vary, and is not known beforehand
the waiting jobs, jobtypes and users metadata is known to the scheduler, but not the job data (ie, the usernames, jobnames and counts, but not the payloads)
I also wanted each server to be standalone, to schedule its own work autonomously without having to know about the other servers
The solution I settled on was to track the waiting jobs by their {user,jobtype} attribute tuple, and have each scheduling step randomly select 5 tuples and, from each tuple, up to 10 jobs to run next. The selected jobs were shortlisted to be run by the next available runner. Whenever capacity freed up to run more jobs (either because jobs finished or because secondary restrictions kept them from running), the server ran another scheduling step to fetch more work.
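A single scheduling step of this kind might look roughly like the following sketch. It is a simplified, in-memory illustration rather than the production code: `waiting` is an assumed dict keyed by (user, jobtype), and removing jobs from it stands in for the atomic locking a real shared queue would need.

```python
import random

def scheduling_step(waiting, max_tuples=5, max_jobs_per_tuple=10, rng=random):
    """Pick up to max_tuples random (user, jobtype) tuples and shortlist
    up to max_jobs_per_tuple jobs from each.

    waiting: dict mapping (user, jobtype) -> list of pending job ids.
    """
    # Only tuples that still have pending jobs take part in the draw.
    keys = [key for key, jobs in waiting.items() if jobs]
    chosen = rng.sample(keys, min(max_tuples, len(keys)))
    shortlist = []
    for key in chosen:
        batch = waiting[key][:max_jobs_per_tuple]
        # Removing the jobs here plays the role of "locking" them.
        del waiting[key][:max_jobs_per_tuple]
        shortlist.extend((key, job) for job in batch)
    return shortlist

# Hypothetical queue: user A has a glut of one jobtype, user B has one job.
queue = {("A", "resize"): list(range(25)), ("B", "email"): [100]}
picked = scheduling_step(queue, rng=random.Random(0))
print(len(picked), len(queue[("A", "resize")]))  # 11 15
```

Because the draw is over tuples rather than over individual jobs, user A's 25 queued jobs do not crowd out user B's single job.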
Jobs were locked atomically as part of being fetched; the locks prevented them from being fetched again or participating in further scheduling decisions. If they failed to run they were unlocked, effectively returning them to the pool. The locks timed out, so the server running them was responsible for keeping the locks refreshed (if a server crashed, the others would time out its locks and would pick up and run the jobs it started but didn't complete)
For my use case I wanted users A and B with jobs A.1, A.2, A.3 and B.1 to each get 25% of the resources (even though that means user A was getting 75% to user B's 25%). Choosing randomly between the four tuples probabilistically converges to that 25%.
If you want users A and B to each have a 50-50 split of resources, and have A's A.1, A.2 and A.3 get an equal share to B's B.1, you can run a two-level scheduler, and randomly choose users and from those users choose jobs. That will distribute the resources among users equally, and within each user's jobs equally among the jobtypes.
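The two-level variant can be sketched as follows. Again this is only an illustration with an assumed data layout (`waiting` maps each user to a dict of jobtype -> pending jobs), and `pick_two_level` is a hypothetical name.

```python
import random
from collections import Counter

def pick_two_level(waiting, rng=random):
    """Choose a user uniformly at random, then one of that user's jobtypes.

    waiting: dict mapping user -> dict mapping jobtype -> list of pending jobs.
    Returns (user, jobtype), or None if nothing is waiting.
    """
    users = [u for u, types in waiting.items() if any(types.values())]
    if not users:
        return None
    user = rng.choice(users)
    jobtypes = [t for t, jobs in waiting[user].items() if jobs]
    return user, rng.choice(jobtypes)

# Users A and B from the example above; the job lists are placeholders.
waiting = {"A": {"A.1": [1], "A.2": [2], "A.3": [3]}, "B": {"B.1": [4]}}
rng = random.Random(1)
draws = Counter(pick_two_level(waiting, rng)[0] for _ in range(10000))
print(draws["A"] / 10000, draws["B"] / 10000)
```

Over many draws each user ends up with about half of the picks, and A's half is split roughly evenly between A.1, A.2 and A.3.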
A huge number of jobs of a particular jobtype will take a long time to all complete, but that's always going to be the case. By picking from across users then jobtypes the responsiveness of the job processing will not be adversely impacted.
There are lots of secondary restrictions that can be added (e.g., no more than 5 calls per second to linkedin), but the above is the heart of the system.
You could try Torque resource management and Maui batch job scheduling software from Adaptive Computing. Maui policies are flexible enough to fit your needs. It supports backfill, configurable job and user priorities and resource reservations.

What's the best Task scheduling algorithm for some given tasks?

We have a list of tasks of different lengths, a number of CPU cores, and a context switch time.
We want to find the best schedule of tasks across the cores to maximize processor utilization.
How could we find this?
If we pick the biggest available tasks from the list and hand them one by one to the currently ready cores, will that be the best schedule, or must we try all orderings to find out which is best?
I must add that all cores are ready at time unit 0 and the tasks are supposed to run concurrently.
The idea here is that there's no silver bullet: you must consider what types of tasks are being executed and try to schedule them as well as possible.
CPU-bound tasks don't use much communication (I/O), and thus, need to be continuously executed, and interrupted only when necessary -- according to the policy being used;
I/O-bound tasks may be repeatedly put aside during execution, allowing other processes to work, since they will be sleeping for long periods, waiting for data to be retrieved into primary memory;
interactive tasks must be executed continuously, but need not run without interruption, since they generate interruptions themselves while waiting for user input; they do need a high priority, so that the user does not notice delays in the execution.
Considering this, and the context switch costs, you must evaluate what types of tasks you have, choosing, thus, one or more policies for your scheduler.
I thought this was a purely conceptual question. Since you have to implement a solution, you must analyze the requirements.
Since you have the lengths of the tasks and the context switch times, and you have to keep the cores busy, this becomes an optimization problem: you want as few cores as possible sitting idle as the workload finishes, while also keeping the number of context switches to a minimum so that the overall execution time does not grow too much.
As pointed out by svick, this sounds like a partition problem, which is NP-complete, and in which you need to divide a sequence of numbers into a given number of lists so that the sums of the lists are equal to each other.
In your problem you'd have a relaxation on the objective, so that you no longer need all the cores to execute the same amount of time, but you want the difference between any two cores execution time to be as small as possible.
In the reference given by svick, you can see a dynamic programming approach that you may be able to map onto your problem.
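The greedy idea from the question (give the largest remaining task to whichever core becomes free first) is commonly known as the longest-processing-time-first (LPT) heuristic. The sketch below is a minimal version of it, not the dynamic programming approach from the reference, and it assumes tasks run to completion once started, so context switch cost is ignored.

```python
import heapq

def lpt_schedule(task_lengths, cores):
    """Greedy LPT: assign tasks, longest first, to the least-loaded core.

    Returns (makespan, per-core task lists).
    """
    heap = [(0, core) for core in range(cores)]  # (current load, core id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(cores)]
    for length in sorted(task_lengths, reverse=True):
        load, core = heapq.heappop(heap)
        assignment[core].append(length)
        heapq.heappush(heap, (load + length, core))
    makespan = max(load for load, _ in heap)
    return makespan, assignment

print(lpt_schedule([7, 5, 4, 3, 3, 2], cores=2))  # (12, [[7, 3, 2], [5, 4, 3]])
print(lpt_schedule([3, 3, 2, 2, 2], cores=2)[0])  # 7, although 6 is achievable
```

The second call shows that the greedy answer is not always the best one: putting both 3s on one core and the three 2s on the other finishes at time 6. So the heuristic is fast and usually close, but finding the true optimum in general means solving the partition-style problem described above.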

Hadoop FairScheduler doesn't utilize all map slots

I'm running a 12-node Hadoop cluster with a total of 48 map slots available. I submit a bunch of jobs, but never see all map slots being utilized. The maximum number of busy slots floats around 30-35, but never gets close to 48. Why?
Here's the configuration of fairscheduler.
<?xml version="1.0"?>
<pool name="big">
<pool name="medium">
<pool name="small">
The idea is that jobs in the small queue should always have priority, the next most important queue is 'medium', and the least important is 'big'. Sometimes I see jobs in the medium or big queue starve even though there are unused map slots available.
I think the issue may arise because the maxRunningJobs option is not taken into account when shares are computed for jobs. I think that parameter is handled only after slots (for the exceeding job) have already been assigned to a tasktracker. That happens every n seconds in the UpdateThread.update() -> updateRunnability() method of the FairScheduler class. I suppose that in your case, after some time, jobs from the “medium” and “big” pools accumulate a bigger deficit than jobs from the “small” pool, which means the next task will be scheduled from a job in the medium or big pool. When the task is scheduled, the maxRunningJobs restriction kicks in and puts the exceeding jobs into a non-runnable state. The same thing happens on the following update.
This is just my guess after looking through some of the FairScheduler source. If you can, I would try removing maxRunningJobs from the config to see how the scheduler behaves without that limitation and whether it then uses all of your slots.
The weights for the pools seem too high, in my opinion. A weight of 100 would mean that this pool should get 100x more slots than the default pool. I would try lowering these numbers considerably if you want fair sharing between your pools. Otherwise jobs from other pools will be launched only once they meet their deficit (it is calculated from the running tasks and minShare).
Another possible reason jobs are starving is delay scheduling, which is included in the FairScheduler with the aim of improving data locality. This can probably be improved by increasing the replication factor, but I do not think this is your case.
Some docs on the FairScheduler.
The starvation probably occurs because the priority of the small pool is really, really high (2^100 more than big, 2^97 more than medium). All the jobs are ordered by priority, and you have waiting jobs in the small pool. The next job in that pool needs 20 slots and has higher priority than anything else, so the open slots just wait there until a currently running job frees enough of them. There are no "unneeded slots" to divide among other priorities.
See these highlights from the implementation notes of the fair scheduler:
"The fair shares are calculated by dividing the capacity of the
cluster among runnable jobs according to a "weight" for each job. By
default the weight is based on priority, with each level of priority
having 2x higher weight than the next (for example, VERY_HIGH has 4x
the weight of NORMAL). However, weights can also be based on job sizes
and ages, as described in the Configuring section. For jobs that are
in a pool, fair shares also take into account the minimum guarantee
for that pool. This capacity is divided among the jobs in that pool
according again to their weights."
Finally, when limits on a user's running jobs or a pool's running jobs
are in place, we choose which jobs get to run by sorting all jobs in
order of priority and then submit time, as in the standard Hadoop
scheduler. Any jobs that fall after the user/pool's limit in this
ordering are queued up and wait idle until they can be run. During
this time, they are ignored from the fair sharing calculations and do
not gain or lose deficit (their fair share is set to zero).
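As a rough illustration of the first quoted paragraph, dividing capacity "according to a weight" can be sketched as a simple proportional split. This deliberately ignores minimum guarantees, running-job limits and per-job demand, and the job names are hypothetical; the weights follow the quoted rule that VERY_HIGH has 4x the weight of NORMAL.

```python
def fair_shares(capacity, weights):
    """Split capacity among runnable jobs in proportion to their weights."""
    total = sum(weights.values())
    return {name: capacity * weight / total for name, weight in weights.items()}

# 48 map slots shared by two NORMAL jobs (weight 1) and one VERY_HIGH job (weight 4).
print(fair_shares(48, {"job_a": 1, "job_b": 1, "job_c": 4}))
# {'job_a': 8.0, 'job_b': 8.0, 'job_c': 32.0}
```

Even a modest weight ratio shifts most of the slots to one job, which suggests why very large pool weights can leave the other pools with almost nothing.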

How can Oracle User Profiles be put to practical use?

Oracle 10g has Profiles that allow various resources to be limited. Here are some references for clarity - orafaq.com, Oracle Documentation. I am particularly interested in limiting CPU_PER_CALL, LOGICAL_READS_PER_CALL, and COMPOSITE_LIMIT with the goal of preventing a poorly formed statement from destroying performance for every other session.
My problem is that I don't know what values to use for these parameters that will allow your typical long running resource intensive operations while preventing the truly bad ones. I realize that the values will differ based on the hardware, tolerance levels, and queries involved, which is why I am more interested in a method to follow to determine what values are best.
There are a variety of approaches, depending on the situation. The simplest possible approach that has any hope of working is to ask how long the longest running realistic operation would run (that's obviously system-dependent, and depends on whether this is a system you're building or something existing) and to back in to a CPU_PER_CALL based on that time limit and the degree of parallelism. Assuming single-threaded operation, if you can reasonably say that if a query hasn't returned in 30 minutes you want to kill it, you can set CPU_PER_CALL to allocate 30 minutes worth of CPU (obviously most queries aren't going to use 100% constantly, so that 30 minute limit gives you some amount of breathing room).
If this is an existing system, you (or your DBA) can go through AWR/ statspack reports for a reasonable number of days (some systems will need to make sure to look at reports from month/ quarter/ year-end where additional processing may be done) and find the real statements that use the most CPU and I/O. You can then set your profile limits appropriately (i.e. the maximum CPU recorded for a statement in the past month + 30% of breathing room).
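As a small sketch of the "maximum + 30%" arithmetic: CPU_PER_CALL is expressed in hundredths of a second, so the observed maximum has to be converted before it goes into the profile. The profile name `app_users` and the 1200-second maximum below are made-up placeholders, not values from any real AWR report.

```python
import math

def cpu_per_call_limit(max_cpu_seconds, headroom_percent=30):
    """Convert observed CPU seconds plus headroom into hundredths of a second."""
    # seconds * 100 * (1 + headroom/100) == seconds * (100 + headroom)
    return math.ceil(max_cpu_seconds * (100 + headroom_percent))

def alter_profile_sql(profile, limit):
    """Build the ALTER PROFILE statement for a given CPU_PER_CALL limit."""
    return f"ALTER PROFILE {profile} LIMIT CPU_PER_CALL {limit};"

# Hypothetical: the most expensive statement last month used 1200 CPU seconds.
limit = cpu_per_call_limit(1200)
print(alter_profile_sql("app_users", limit))
# ALTER PROFILE app_users LIMIT CPU_PER_CALL 156000;
```

With no headroom, the 30-minute example above works out to `cpu_per_call_limit(1800, 0)`, i.e. 180000.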
Of course, for any limit you pick, someone has to monitor the system to make sure that the limits keep pace. If queries get more and more expensive over time because of increases in data volume, for example, that max + 30% limit might be insufficient in 6 months. You don't want to find that out when the nightly processing aborts, so someone has to stay on top of it.
If you are using the enterprise edition, you may be better served looking at Resource Manager rather than profiles. While profiles allow you to kill runaway sessions, Resource Manager allows you to change session priority based on a variety of factors. Rather than killing a query that has used more than 30 minutes of CPU, it may be better to make it lower priority so that it doesn't interfere with other sessions without killing it, in case it is just running long.
