Fair job processing algorithm

I've got a machine that accepts user uploads, performs some processing on them, and then returns the result. It usually takes a few minutes to process each upload received.
The problem is, a few users can upload a lot of jobs that basically deny processing to other users for a long time. I thought of just setting a hard cap and using priority queues, e.g. after 5 uploads in an hour, all new uploads are given a lower processing priority. I basically want to process ALL jobs, but I don't want the user who uploaded 1000 jobs to make everyone wait.
My question is, is there a better way to do this?
My goal is to minimize the time between the upload and the result being returned. It would be ideal if the algorithm could work in a distributed manner as well.
Thanks

Implementation will vary widely depending on what these jobs are, how long they take, how varied the processing times are, and how likely a fatal error is during processing.
That being said, an easy way to maintain an even distribution of jobs across users is to maintain a list of all the users who have submitted jobs. When you are ready to get a new job, rather than just taking the next job out of a random queue, cycle through the users taking the top job from each user each time.
Again, this can be accomplished a number of ways; I would recommend a map from users to their respective lists of submitted jobs. Cycle through the keys of the map each time you are ready for a new job, then get the list of jobs for whatever key you are on and do the first job.
This is assuming that each job is "atomic" in that one job is not dependent on being executed next to the jobs it was submitted with.
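If it helps, here is a minimal Python sketch of that map-of-queues idea (the class and method names are just illustrative, not from the question):

from collections import OrderedDict, deque

class FairQueue:
    def __init__(self):
        self._queues = OrderedDict()  # user -> deque of that user's jobs

    def submit(self, user, job):
        self._queues.setdefault(user, deque()).append(job)

    def next_job(self):
        # Pop the oldest job of the least recently served user, or None.
        if not self._queues:
            return None
        user, jobs = self._queues.popitem(last=False)
        job = jobs.popleft()
        if jobs:                      # user still has work: move to the back
            self._queues[user] = jobs
        return user, job

With jobs A.1, A.2, A.3 from user A and B.1 from user B queued, next_job() yields A.1, B.1, A.2, A.3, so B isn't stuck behind A's backlog.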
Hope that helps; of course, I could have completely misunderstood what you are asking for.

You don't have to roll your own. There is Sun Grid Engine, an open-source tool built to do exactly this sort of thing, and if you are willing to pay, there is Platform LSF, which I use at work.

What is the maximum number of jobs a user can submit? Can users submit one job at a time, or is it a batch of jobs?
So your algorithm would go something like this:

On each submission:
If the user has submitted jobs Then
    Check how many jobs per hour
    If the jobs per hour > the average Then
        Modify the user's profile to a lower priority
    Else
        Check the user's priority level and restore it
    End If
End If

When picking the next job from the queue:
If its priority = HIGH Then
    Process it right away
Else If its priority = MEDIUM Then
    Check the queue for a HIGH priority job
    If one is found, process that instead (and rerun this loop)
    Else process the MEDIUM job
Else If its priority = LOW Then
    Check the queue for a HIGH or MEDIUM priority job
    If one is found, process that instead (and rerun this loop)
    Else process the LOW job
End If
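As a rough Python sketch of that scheme (the thresholds, names, and the use of a fixed hourly cap in place of a running average are all my assumptions):

import heapq
import time
from collections import defaultdict

HIGH, MEDIUM, LOW = 0, 1, 2          # lower number = served first
RATE_LIMIT = 5                        # submissions per hour before demotion

queue = []                            # heap of (priority, enqueue_time, user, job)
submissions = defaultdict(list)       # user -> recent submission timestamps

def submit(user, job, now=None):
    now = now if now is not None else time.time()
    # Keep only the past hour's submissions for this user.
    stamps = [t for t in submissions[user] if now - t < 3600] + [now]
    submissions[user] = stamps
    if len(stamps) > 2 * RATE_LIMIT:
        prio = LOW
    elif len(stamps) > RATE_LIMIT:
        prio = MEDIUM
    else:
        prio = HIGH
    heapq.heappush(queue, (prio, now, user, job))

def next_job():
    # Heap order serves HIGH before MEDIUM before LOW,
    # and FIFO within each level (via the timestamp).
    return heapq.heappop(queue) if queue else None

Because priority is recomputed at every submission, a user's uploads automatically get HIGH priority again once their hourly rate drops.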

You can use a matching algorithm such as Edmonds' blossom algorithm (for example, the Blossom V implementation) to assign users' jobs to processes. If one user is allowed to upload more than another, you can weight the edges so that it is easier for that user's jobs to find a process. With the blossom algorithm you can also define a threshold so that you never exceed the maximum number of processes the server can handle.
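For what it's worth, here is a small illustration using networkx's max_weight_matching (an implementation of Edmonds' blossom algorithm). The weighting rule, giving heavier edges to users with fewer queued jobs, is my own assumption for the fairness part:

import networkx as nx

jobs = {"A.1": "alice", "A.2": "alice", "B.1": "bob"}  # job -> owner
queued = {"alice": 2, "bob": 1}                         # queued jobs per user
workers = ["w1", "w2"]                                  # free process slots

G = nx.Graph()
for job, owner in jobs.items():
    for w in workers:
        # Users with fewer queued jobs get heavier edges,
        # so their jobs are matched to a slot first.
        G.add_edge(job, w, weight=1.0 / queued[owner])

# Each worker is matched to at most one job, so the matching
# never exceeds the number of processes the server can handle.
matching = nx.max_weight_matching(G, maxcardinality=True)
print(matching)  # e.g. {("B.1", "w1"), ("A.1", "w2")}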

Related

Is there any way to know which job will start next in qsub

At our institute's (IISc Bangalore) supercomputer, we submit jobs using qsub. Jobs start running according to the following:
(1) their wall time (expected completion time), and
(2) their position in the respective queue (small, medium, large, etc.).
So it is very difficult to know which job will start after the currently running job finishes. But qsub probably has a list of its own, by which it starts a new job immediately after another finishes.
Is there any way to know which job will start next? Is there any command for this?
Thank you.
Unfortunately, there is no clear way to know which job will run next in a supercomputing system. The job start depends not only on the job's wall time or position in the queue but also on many other factors set by site-level policy, scheduling strategies, and priorities. There can be some internal job ranking (priorities) chosen by the institute based on factors like power management, load balancing, etc.
On the other hand, there is a good deal of research on predicting the waiting time for job allocation. TeraGrid systems provide estimated waiting times. Also, see link1 and link2 (by SERC) for more information about predicting waiting times.

Why Shortest Job First(SJF) algorithm is not used instead of FCFS at final level in Multilevel Feedback Scheduling

In Multilevel Feedback Scheduling at the base level queue, the processes circulate in round robin fashion until they complete and leave the system. Processes in the base level queue can also be scheduled on a first come first served basis.
Why can't they be scheduled with the Shortest Job First (SJF) algorithm instead of First Come First Served (FCFS), which seems like it would improve the average performance of the algorithm?
One simple reason:
The processes fall into the base-level queue after they fail to finish within the time quantum allotted to them in the higher-level queues. If you implement the SJF algorithm in the base-level queue, you may starve a process, because shorter jobs may keep arriving before a longer-running process ever gets the CPU.
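To make the starvation concrete, here is a toy Python simulation: a steady stream of 1-tick jobs keeps a single 10-tick job waiting indefinitely under SJF:

def sjf_starves(ticks=100):
    waiting = [10]             # one long job with a 10-tick burst
    t = 0
    while t < ticks:
        waiting.append(1)      # a new short job arrives every tick
        waiting.sort()         # SJF: pick the shortest burst first
        job = waiting.pop(0)
        if job == 10:
            return False       # the long job finally ran
        t += job
    return True                # the long job never got the CPU

print(sjf_starves())           # True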
The SJF algorithm gives more throughput only when processes differ a lot in their burst times. It is not always the case that it will perform better than FCFS. Take a look at this answer.
Since in the Multilevel Feedback Scheduling algorithm all the processes that are unable to complete execution within the defined time quantum of the first two queues are put into the last queue, which uses FCFS, it's very likely that they all have large CPU bursts and therefore won't differ much in their burst times. Hence, FCFS is the preferred scheduling for the last queue.

Enabling Univa Grid Engine Resource Reservation without a time limit on jobs

My organization has a server cluster running Univa Grid Engine 8.4.1, with users submitting various kinds of jobs, some using a single CPU core, and some using OpenMPI to utilize multiple cores, all with varying and unpredictable run-times.
We've enabled a ticketing system so that one user can't hog the entire queue, but if the grid and queue are full of single-CPU jobs, no multi-CPU job can ever start (they just sit at the top of the queue waiting for the required number of cpu slots to become free, which generally never happens). We're looking to configure Resource Reservation such that, if the MPI job is the next in the queue, the grid will hold slots open as they become free until there's enough to submit the MPI job, rather than filling them with the single-CPU jobs that are further down in the queue.
I've read (here for example) that the grid decides which slots to "reserve" based on how much time is remaining on the jobs running in those slots. The problem we have is that our jobs have unknown run-times. Some take a few seconds, some take weeks, and while we have a rough idea how long a job will take, we can never be sure. Thus, we don't want to start running qsub with hard and soft time limits through -l h_rt and -l s_rt, or else our jobs could be killed prematurely. Resource Reservation appears to be using the default_duration, which we set to infinity for lack of a better number to use, and treating all jobs equally. It's picking slots filled by month-long jobs which have already been running for a few days, instead of slots filled by minute-long jobs which have only been running for a few seconds.
Is there a way to tell the scheduler to reserve slots for a multi-CPU MPI job as they become available, rather than pre-select slots based on some perceived run-time of the jobs in them?
Unfortunately I'm not aware of a way to do what you ask - I think that the reservation is created once at the time that the job is submitted, not progressively as slots become free. If you haven't already seen the design document for the Resource Reservation feature, it's worth a look to get oriented to the feature.
Instead, I'm going to suggest some strategies for confidently setting job runtimes. The main problem when none of your jobs have runtimes is that Grid Engine can't reserve space infinitely far into the future, so even setting some really rough runtimes (within an order of magnitude of the true runtime) may get you some positive results.
If you've run a similar job previously, one simple rule of thumb is to set max runtime to 150% of the typical or maximum runtime of the job, based on historical trends. Use qacct or parse the accounting file to get hard data. Of course, tweak that percentage to whatever suits your risk threshold.
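As an example of that rule of thumb, here is a rough Python sketch that pulls historical wallclock times out of qacct and suggests an h_rt value. The exact qacct output format assumed here (an "ru_wallclock <seconds>" line per job record) should be checked against your own installation:

import re
import subprocess

def suggest_h_rt(job_name, factor=1.5):
    # qacct -j accepts a job id or name and prints one record per run.
    out = subprocess.run(["qacct", "-j", job_name],
                         capture_output=True, text=True).stdout
    times = [float(m.group(1))
             for m in re.finditer(r"ru_wallclock\s+([\d.]+)", out)]
    if not times:
        return None
    seconds = int(max(times) * factor)
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"  # pass as: qsub -l h_rt=HH:MM:SS ...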
Another rule of thumb is to set the max runtime not based on the job's true runtime, but based on a sense around "after this date, the results won't be useful" or "if it takes this long, something's definitely wrong". If you need an answer by Friday, there's no sense in setting the runtime limit for three months out. Similarly, if you're running md5sum on typically megabyte-sized files, there's no sense in setting a 1-day runtime limit; those jobs ought to only take a few seconds or minutes, and if it's really taking a long time, then something is broken.
If you really must allow true indefinite-length jobs, then one option is to divide your cluster into infinite and finite queues. Jobs specifying a finite runtime will be able to use both queues, while infinite jobs will have fewer resources available; this will incentivize users to work a little harder at picking runtimes, without forcing them to do so.
Finally, be sure that the multi-slot jobs are submitted with the -R y qsub flag to enable the resource reservation system. This could go in the system default sge_request file, but that's generally not recommended as it can reduce scheduling performance:
Since reservation scheduling performance consumption is known to grow with the number of pending jobs, use of -R y option is recommended only for those jobs actually queuing for bottleneck resources.

Job scheduling algorithm for cluster

I'm searching for an algorithm suitable for the problem below:
There are multiple computers (the exact number is unknown). Each computer pulls a job from a central queue, completes it, then pulls the next one. Jobs are produced by some group of users. Some users submit lots of jobs, some only a few. Jobs consume equal CPU time (not really, just an approximation).
The central queue should be fair when scheduling jobs. Also, users who submitted lots of jobs should have some minimal share of resources.
I'm searching for a good algorithm for this kind of scheduling.
I've considered two candidates:
A Hadoop-like fair scheduler. The problem here is: where do I get minimal shares from when my cluster size is unknown?
Associate some penalty with each user. Increment the penalty when a user's job is scheduled, and make the probability of scheduling a user's next job proportional to 1 - (normalized penalty). This is something like stride scheduling, but I could not find any good explanation of it.
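For what it's worth, a minimal Python sketch of that second candidate (the decay rule here, weighting users by 1 / (1 + penalty), is a variation on the 1 - normalized-penalty idea):

import random
from collections import defaultdict

penalty = defaultdict(float)   # user -> accumulated penalty

def pick_user(users):
    # Each user's weight shrinks as their penalty grows, so heavy
    # users still run (no starvation) but light users are favoured.
    weights = [1.0 / (1.0 + penalty[u]) for u in users]
    user = random.choices(users, weights=weights, k=1)[0]
    penalty[user] += 1.0       # charge the chosen user
    return user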
When I implemented a very similar job runner (for a production system), I ended up having each server choose jobtypes at random. This was my reasoning:
a glut of jobs from one user should not impact the chance of other users having their jobs run (user-user fairness)
a glut of one jobtype should not impact the chance of other jobtypes being run (user-job and job-job fairness)
if there is only one jobtype from one user waiting to run, all servers should be running those jobs (no wasted capacity)
the system should run the jobs "fairly", i.e. proportionate to the number of waiting users and jobtypes and not the total waiting jobs (a large volume of one jobtype should not cause scheduling to favor it) (jobtype fairness)
the number of servers can vary, and is not known beforehand
the metadata of the waiting jobs, jobtypes and users is known to the scheduler, but not the job data (i.e., the usernames, jobnames and counts, but not the payloads)
I also wanted each server to be standalone, to schedule its own work autonomously without having to know about the other servers
The solution I settled on was to track the waiting jobs by their {user, jobtype} attribute tuple, and have each scheduling step randomly select 5 tuples and, from each tuple, up to 10 jobs to run next. The selected jobs were shortlisted to be run by the next available runner. Whenever capacity freed up to run more jobs (either because jobs finished or because secondary restrictions meant they could not run), another scheduling step ran to fetch more work.
Jobs were locked atomically as part of being fetched; the locks prevented them from being fetched again or participating in further scheduling decisions. If they failed to run they were unlocked, effectively returning them to the pool. The locks timed out, so the server running them was responsible for keeping the locks refreshed (if a server crashed, the others would time out its locks and would pick up and run the jobs it started but didn't complete)
For my use case I wanted users A and B with jobs A.1, A.2, A.3 and B.1 to each get 25% of the resources (even though that means user A was getting 75% to user B's 25%). Choosing randomly between the four tuples probabilistically converges to that 25%.
If you want users A and B to each have a 50-50 split of resources, and have A's A.1, A.2 and A.3 get an equal share to B's B.1, you can run a two-level scheduler, and randomly choose users and from those users choose jobs. That will distribute the resources among users equally, and within each user's jobs equally among the jobtypes.
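A minimal Python sketch of that two-level selection (the data layout is an assumption for illustration):

import random
from collections import defaultdict

waiting = defaultdict(lambda: defaultdict(list))  # user -> jobtype -> jobs

def submit(user, jobtype, job):
    waiting[user][jobtype].append(job)

def next_job():
    if not waiting:
        return None
    user = random.choice(list(waiting))           # level 1: pick a user
    jobtype = random.choice(list(waiting[user]))  # level 2: pick a jobtype
    job = waiting[user][jobtype].pop(0)
    if not waiting[user][jobtype]:                # clean up empty entries so
        del waiting[user][jobtype]                # they don't skew the draw
    if not waiting[user]:
        del waiting[user]
    return user, jobtype, job

With users A and B queued, A is picked 50% of the time regardless of how many jobs or jobtypes A has waiting.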
A huge number of jobs of a particular jobtype will take a long time to all complete, but that's always going to be the case. By picking from across users then jobtypes the responsiveness of the job processing will not be adversely impacted.
There are lots of secondary restrictions that can be added (e.g., no more than 5 calls per second to linkedin), but the above is the heart of the system.
You could try the Torque resource manager and the Maui batch job scheduler from Adaptive Computing. Maui's policies are flexible enough to fit your needs. It supports backfill, configurable job and user priorities, and resource reservations.

What is a good way to design and build a task scheduling system with lots of recurring tasks?

Imagine you're building something like a monitoring service, which has thousands of tasks that need to be executed in given time interval, independent of each other. This could be individual servers that need to be checked, or backups that need to be verified, or just anything at all that could be scheduled to run at a given interval.
You can't just schedule the tasks via cron though, because when a task is run it needs to determine when it's supposed to run the next time. For example:
schedule server uptime check every 1 minute
first time it's checked the server is down, schedule next check in 5 seconds
5 seconds later the server is available again, check again in 5 seconds
5 seconds later the server is still available, continue checking at 1 minute interval
A naive solution that came to mind is to simply have a worker that runs every second or so, checks all the pending jobs, and executes the ones that are due. But how would this work if the number of jobs is something like 100,000? It might take longer to check them all than the worker's tick interval, and the more tasks there are, the longer each poll takes.
Is there a better way to design a system like this? Are there any hidden challenges in implementing this, or any algorithms that deal with this sort of a problem?
Use a priority queue (with the priority based on the next execution time) to hold the tasks to execute. When you're done executing a task, you sleep until it is time for the task at the front of the queue. When a task comes due, you remove and execute it, then (if it's recurring) compute the next time it needs to run, and insert it back into the priority queue based on its next run time.
This way you have one sleep active at any given time. Insertions and removals have logarithmic complexity, so it remains efficient even if you have millions of tasks (e.g., inserting into a priority queue that has a million tasks should take about 20 comparisons in the worst case).
There is one point that can be a little tricky: if the execution thread is waiting until a particular time to execute the item at the head of the queue, and you insert a new item that goes at the head of the queue, ahead of the item that was previously there, you need to wake up the thread so it can re-adjust its sleep time for the item that's now at the head of the queue.
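Here is a minimal Python sketch of that design, including the wake-up for a newly inserted earlier task (names are illustrative):

import heapq
import itertools
import threading
import time

class Scheduler:
    def __init__(self):
        self._heap = []                # entries: (run_at, seq, task, interval)
        self._seq = itertools.count()  # tie-breaker so tasks never get compared
        self._cv = threading.Condition()

    def add(self, task, delay, interval=None):
        with self._cv:
            heapq.heappush(self._heap,
                           (time.time() + delay, next(self._seq), task, interval))
            self._cv.notify()          # the new head may be earlier: wake the loop

    def run(self):
        while True:
            with self._cv:
                while not self._heap:
                    self._cv.wait()
                run_at, _, task, interval = self._heap[0]
                delay = run_at - time.time()
                if delay > 0:
                    self._cv.wait(timeout=delay)  # add() can cut this short
                    continue                      # re-check the (possibly new) head
                heapq.heappop(self._heap)
            task()                                # execute outside the lock
            if interval is not None:              # recurring: reschedule
                self.add(task, interval, interval)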
We encountered this same issue while designing Revalee, an open source project for scheduling triggered callbacks. In the end, we wrote our own priority queue class (we called ours a ScheduledDictionary) to handle the use case you outlined in your question. As a free, open source project, the complete source code (C#, in this case) is available on GitHub. I'd recommend that you check it out.
