Job scheduling algorithm for cluster - algorithm

I'm searching for algorithm suitable for problem below:
There are multiple computers(exact number is unknown). Each computer pulls job from some central queue, completes job, then pulls next one. Jobs are produced by some group of users. Some users submit lots of jobs, some a little. Jobs consume equal CPU time(not really, just approximation).
Central queue should be fair when scheduling jobs. Also, users who submitted lots of jobs should have some minimal share of resources.
I'm searching a good algorithm for this scheduling.
Considered two candidates:
Hadoop-like fair scheduler. The problem here is: where can I take minimal shares here when my cluster size is unknown?
Associate some penalty with each user. Increment penalty when user's job is scheduled. Use probability of scheduling job to user as 1 - (normalized penalty). This is something like stride scheduling, but I could not find any good explanation on it.

when I implemented a very similar job runner (for a production system), I ended having each server up choose jobtypes at random. This was my reasoning --
a glut of jobs from one user should not impact the chance of other users having their jobs run (user-user fairness)
a glut of one jobtype should not impact the chance of other jobtypes being run (user-job and job-job fairness)
if there is only one jobtype from one user waiting to run, all servers should be running those jobs (no wasted capacity)
the system should run the jobs "fairly", i.e. proportionate to the number of waiting users and jobtypes and not the total waiting jobs (a large volume of one jobtype should not cause scheduling to favor it) (jobtype fairness)
the number of servers can vary, and is not known beforehand
the waiting jobs, jobtypes and users metadata is known to the scheduler, but not the job data (ie, the usernames, jobnames and counts, but not the payloads)
I also wanted each server to be standalone, to schedule its own work autonomously without having to know about the other servers
The solution I settled on was to track the waiting jobs by their {user,jobtype} attribute tuple, and have each scheduling step randomly select 5 tuples and from each tuple up to 10 jobs to run next. The selected jobs were shortlisted to be run by the next available runner. Whenever capacity freed up to run more jobs (either because jobs finished or because of secondary restrictions they could not run), ran another scheduling step to fetch more work.
Jobs were locked atomically as part of being fetched; the locks prevented them from being fetched again or participating in further scheduling decisions. If they failed to run they were unlocked, effectively returning them to the pool. The locks timed out, so the server running them was responsible for keeping the locks refreshed (if a server crashed, the others would time out its locks and would pick up and run the jobs it started but didn't complete)
For my use case I wanted users A and B with jobs A.1, A.2, A.3 and B.1 to each get 25% of the resources (even though that means user A was getting 75% to user B's 25%). Choosing randomly between the four tuples probabilistically converges to that 25%.
If you want users A and B to each have a 50-50 split of resources, and have A's A.1, A.2 and A.3 get an equal share to B's B.1, you can run a two-level scheduler, and randomly choose users and from those users choose jobs. That will distribute the resources among users equally, and within each user's jobs equally among the jobtypes.
A huge number of jobs of a particular jobtype will take a long time to all complete, but that's always going to be the case. By picking from across users then jobtypes the responsiveness of the job processing will not be adversely impacted.
There are lots of secondary restrictions that can be added (e.g., no more than 5 calls per second to linkedin), but the above is the heart of the system.

You could try Torque resource management and Maui batch job scheduling software from Adaptive Computing. Maui policies are flexible enough to fit your needs. It supports backfill, configurable job and user priorities and resource reservations.

Related

Is there any way to know which job will start next in qsub

In our institute (IISc Bangalore)Supercomputer ,we submit jobs using qsub. The jobs will start running according to the following-
(1) Its wall time(Expected completion time)
(2) Its position in the respected queue(small,medium,large etc).
So,it is very difficult to know which job will start after finishing one job which is currently running. But qsub is probably has a list of its own,by which it is starting a new job after finishing another job immediately.
Is there any way to know which job will start next.Is there any command for this.
Thank you.
Unfortunately, there is no clear way to know which job will be run next in a supercomputing system. The job start is depending not only on it's wall time or position in the queue but also many other factors based on the site-level policy, scheduling strategies and priorities. There can be some internal job ranking (priorities) chosen by the institute based on factors like power management, load balancing etc.
On the other side, there are many researches to predict the waiting time for job allocation. TeraGrid systems provides estimated waiting time. Also, see link1, link2 (by SERC) for more information about predicting the waiting time.

Enabling Univa Grid Engine Resource Reservation without a time limit on jobs

My organization has a server cluster running Univa Grid Engine 8.4.1, with users submitting various kinds of jobs, some using a single CPU core, and some using OpenMPI to utilize multiple cores, all with varying and unpredictable run-times.
We've enabled a ticketing system so that one user can't hog the entire queue, but if the grid and queue are full of single-CPU jobs, no multi-CPU job can ever start (they just sit at the top of the queue waiting for the required number of cpu slots to become free, which generally never happens). We're looking to configure Resource Reservation such that, if the MPI job is the next in the queue, the grid will hold slots open as they become free until there's enough to submit the MPI job, rather than filling them with the single-CPU jobs that are further down in the queue.
I've read (here for example) that the grid makes the decision of which slots to "reserve" based on how much time is remaining on the jobs running in those slots. The problem we have is that our jobs have unknown run-times. Some take a few seconds, some take weeks, and while we have a rough idea how long a job will take, we can never be sure. Thus, we don't want to start running qsub with hard and soft time limits through -l h_rt and -l s_rt, or else our jobs could be killed prematurely. Resource Reservation appears to be using the default_duration, which we set to infinity for lack of a better number to use, and treating all jobs equally. Its picking slots filled by month-long jobs which have already been running for a few days, instead of slots filled by minute-long jobs which have only been running for a few seconds.
Is there a way to tell the scheduler to reserve slots for a multi-CPU MPI job as they become available, rather than pre-select slots based on some perceived run-time of the jobs in them?
Unfortunately I'm not aware of a way to do what you ask - I think that the reservation is created once at the time that the job is submitted, not progressively as slots become free. If you haven't already seen the design document for the Resource Reservation feature, it's worth a look to get oriented to the feature.
Instead, I'm going to suggest some strategies for confidently setting job runtimes. The main problem when none of your jobs have runtimes is that Grid Engine can't reserve space infinitely in the future, so even if you set some really rough runtimes (within an order of magnitude of the true runtime), you may get some positive results.
If you've run a similar job previously, one simple rule of thumb is to set max runtime to 150% of the typical or maximum runtime of the job, based on historical trends. Use qacct or parse the accounting file to get hard data. Of course, tweak that percentage to whatever suits your risk threshold.
Another rule of thumb is to set the max runtime not based on the job's true runtime, but based on a sense around "after this date, the results won't be useful" or "if it takes this long, something's definitely wrong". If you need an answer by Friday, there's no sense in setting the runtime limit for three months out. Similarly, if you're running md5sum on typically megabyte-sized files, there's no sense in setting a 1-day runtime limit; those jobs ought to only take a few seconds or minutes, and if it's really taking a long time, then something is broken.
If you really must allow true indefinite-length jobs, then one option is to divide your cluster into infinite and finite queues. Jobs specifying a finite runtime will be able to use both queues, while infinite jobs will have fewer resources available; this will incentivize users to work a little harder at picking runtimes, without forcing them to do so.
Finally, be sure that the multi-slot jobs are submitted with the -R y qsub flag to enable the resource reservation system. This could go in the system default sge_request file, but that's generally not recommended as it can reduce scheduling performance:
Since reservation scheduling performance consumption is known to grow with the number of pending jobs, use of -R y option is recommended only for those jobs actually queuing for bottleneck resources.

Difference in type of schedulers and scheduling algorithms

I have studied about the topic of Job Schedulers and there are different types like Long term, medium and short term schedulers and finally got confused with the things.
So my question is, "Among these three schedulers, which scheduler type will make use of the scheduling algorithms(like FCFS, SJF etc.)"
My understanding so far is, "The scheduling algorithm will take the job from the ready queue (which contains the list of jobs to be executed which is in ready more) and keeps the CPU busy as much as possible".
And the Long Term Scheduler is the one which decides what are all the jobs to be allowed in the ready queue.
So, the long term scheduler is the one which is going to make use of those scheduling algols..?.
And also, I have seen the link, https://en.wikipedia.org/wiki/Scheduling_(computing)
where I have seen that,
Note: The following lines is excerpted from Wiki...
"Thus the short-term scheduler makes scheduling decisions much more frequently than the long-term or mid-term schedulers...."
So, whether all these 3 schedulers will make use of the scheduling algol.??
Finally, I got tucked at this point and got confused with the difference between these types of schedulers ..
Could some one kindly do briefly explain this one?
So I can able to understand this one.
Thanks in advance.
So, whether all these 3 schedulers will make use of the scheduling
algo??
Basically, the scheduling algorithms are chosen by all three of them depending on whichever is functional at that point. All of them require some kind of scheduling decisions at any point as all of them are schedulers. So, it all depends on which is executing at what instant (short-term scheduler executes more frequently as compared to others).
Wikipedia is right in mentioning that. I hope you got your answer in short.
Description :
As mentioned in Process Scheduling page on tutorialspoint :-
Schedulers are special system softwares which handles process scheduling in various ways. Their main task is to select the jobs to be submitted into the system and to decide which process to run.
Long Term Scheduler ------> It selects processes from pool and loads them into memory for execution
Medium Term Scheduler -----> It selects those processes which are ready to execute.
Short Term Scheduler ------> It can re-introduce the process into memory and execution can be continued.
The below list (click here for source) shows the function of each of the three types of schedulers (long-term, short-term, and medium-term) for each of three types of operating systems (batch, interactive, and real-time).
batch
longterm -----> job admission based on characteristics and resource
needs
mediumterm -----> usually none—jobs remain in storage until done
shortterm -----> processes scheduled by priority; continue until wait
voluntarily, request service, or are terminated
interactive
longterm -----> sessions and processes normally accepted unless
capacity reached
mediumterm -----> processes swapped when necessary
shortterm -----> processes scheduled on rotating basis; continue until
service requested, time quantum expires, or pre-empted
real-time
longterm -----> processes either permanent or accepted at once
mediumterm -----> processes never swapped
shortterm -----> scheduling based on strict priority with immediate
preemption; may time-share processes with equal priorities

Hadoop Fairschduler doesn't utilize all map slots

Running a 12-node hadoop cluster with total 48 map-slots available. Submitting bunch of jobs, but never see all map slots being utilized. Maximum number of busy slots is floating around 30-35, but never close to 48. Why?
Here's the configuration of fairscheduler.
<?xml version="1.0"?>
<allocations>
<pool name="big">
<minMaps>10</minMaps>
<minReduces>10</minReduces>
<maxRunningJobs>3</maxRunningJobs>
</pool>
<pool name="medium">
<minMaps>10</minMaps>
<minReduces>10</minReduces>
<maxRunningJobs>3</maxRunningJobs>
<weight>3.0</weight>
</pool>
<pool name="small">
<minMaps>20</minMaps>
<minReduces>20</minReduces>
<maxRunningJobs>20</maxRunningJobs>
<weight>100.0</weight>
</pool>
</allocations>
The idea is that jobs in small queue should always have a priority, the next important queue is 'medium' and the less important is 'big'. Sometimes I see jobs in medium or big queue starve although there are more map slots available that are not used.
I think that the issue can be caused because the maxRunningJobs option is not taken into account while computing shares for jobs. I think that parameter is handled after slots (from the exceeding job) has been already assigned to a tasktracker. That is happening every n seconds from the UpdateThread.update()-> update Runability() method from FairScheduler class. I suppose that in your case after some time jobs from “medium” and “big” pool gets a bigger deficit than jobs from the “small” pool, that means that the next task will be scheduled from the job in medium or big pool. When the task is scheduled the restriction of maxRunningJobs take place and puts the exceeding jobs into a non runnable state. The same thing appears on the following update.
This is just my guess after looking after some source of fscheduler. If you can I would probably try to remove maxRunningJobs from the config and see how the scheduler behaves without that limitation and if it takes all of your slots..
Weigths for the pools in my oppinion seems to be to high. Weigh of 100 would mean that this pool should get 100x more slots than the default pool. I would try to lower this number by few factors if you want to have fair sharing between your pools. Otherwise jobs from others pools will be launched just when they will meet their deficit (it is calculated from the running tasks and minShare)
Another option why jobs are starving is maybe because of delay scheduling that is included in the fsched with the aim of improving computation locality? This can be probably improved by increasing a repclication factor but I do not think this is your case..
some docs on the fairscheduler..
The starvation probably occurs because the priority of the small pool is really really high (2^100 more than big 2^97 more than medium). When all the jobs are are ordered by priority and you have waiting jobs in the small pool. The next job in that pool needs 20 slots and it has higher priority than anything else so the open slots just wait there until a currently running job will free them. there are no "unneeded slots" to divide to other priorities
see highlights from the implementation notes of the fair schedulere:
"The fair shares are calculated by dividing the capacity of the
cluster among runnable jobs according to a "weight" for each job. By
default the weight is based on priority, with each level of priority
having 2x higher weight than the next (for example, VERY_HIGH has 4x
the weight of NORMAL). However, weights can also be based on job sizes
and ages, as described in the Configuring section. For jobs that are
in a pool, fair shares also take into account the minimum guarantee
for that pool. This capacity is divided among the jobs in that pool
according again to their weights."
Finally, when limits on a user's running jobs or a pool's running jobs
are in place, we choose which jobs get to run by sorting all jobs in
order of priority and then submit time, as in the standard Hadoop
scheduler. Any jobs that fall after the user/pool's limit in this
ordering are queued up and wait idle until they can be run. During
this time, they are ignored from the fair sharing calculations and do
not gain or lose deficit (their fair share is set to zero).

Fair job processing algorithm

I've got a machine that accepts user uploads, performs some processing on them, and then returns the result. It usually takes a few minutes to process each upload received.
The problem is, a few users can upload a lot of jobs that basically deny processing to other users for a long time. I thought of just setting a hard cap and using priority queues, e.g. after 5 uploads in an hour, all new uploads are given a lower processing priority. I basically want to process ALL jobs, but I don't want the user who uploaded 1000 jobs to make everyone wait.
My question is, is there a better way to do this?
My goal is to minimize the time between the upload and the result being returned. It would be ideal if the algorithm could work in a distributed manner as well.
Thanks
Implementation will vary widely depending on what these jobs are and how long they take and how varied the processing times are, as well as how likely there is to be a fatal error during the process.
That being said, an easy way to maintain an even distribution of jobs across users is to maintain a list of all the users who have submitted jobs. When you are ready to get a new job, rather than just taking the next job out of a random queue, cycle through the users taking the top job from each user each time.
Again, this can be accomplished a number of ways, I would recommend a map from users to their respective list of jobs submitted. Cycle through the keys of the map each time you are ready for a new job. then get the list of jobs for whatever key you are on, and do the first job.
This is assuming that each job is "atomic" in that one job is not dependent on being executed next to the jobs it was submitted with.
Hope that helps, of course I could have completely misunderstood what you are asking for.
You don't have to roll-your-own. There is Sun Grid Engine. An open-source tool that is built to do that sort of thing, and if you are willing to pay, there is Platform LSF, which I use at work.
What is the maximum # of jobs a user can submit? Can users submit 1 job a a time OR is it a batch of jobs?
So your algorithm would go something like this
If the User has submitted jobs Then
Check how many jobs per hour
If the jobs per hour > than the average Then
Modify the users profile to a lower priority
Else
Check Users priority level and restore
End If
If the priority = HIGH
process right away
Else If priority = MEDIUM
Check Queue for High Priority
If High Priority Found (rerun this loop)
Else Process
Else If priority = LOW
Check Queue for High Priority
If High Priority Found (rerun this loop)
Else Process
Check Queue for Medium Priority
If Medium Priority Found (rerun this loop)
Else Process
Process Queue
End If
You can use a graph algorithm like Edmond's Blossom V to assign all users and jobs to a process. If a user can upload more then another user it would be more simplier for him to find a process. With the Blossom V algorithm you can define a threshold to not exceed the maximum process the server can handle.

Resources