Is there any way to know which job will start next in qsub - cluster-computing

At our institute's supercomputer (IISc Bangalore), we submit jobs using qsub. Jobs start running according to the following:
(1) Their wall time (expected completion time)
(2) Their position in the respective queue (small, medium, large, etc.)
So it is very difficult to know which job will start once a currently running job finishes. But the scheduler presumably keeps a list of its own, from which it starts a new job immediately after another finishes.
Is there any way to know which job will start next? Is there any command for this?
Thank you.

Unfortunately, there is no clear way to know which job will run next on a supercomputing system. A job's start time depends not only on its wall time or position in the queue but also on many other factors determined by site-level policy, scheduling strategies, and priorities. There can also be an internal job ranking (priorities) chosen by the institute based on factors like power management, load balancing, etc.
On the other hand, there is a good deal of research on predicting the waiting time before a job is allocated. TeraGrid systems provide an estimated waiting time. Also, see link1 and link2 (by SERC) for more information about predicting the waiting time.
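If your site happens to run the Maui or Moab scheduler on top of the Torque/PBS resource manager, the showstart command reports the scheduler's current estimate of when a queued job will start, and showq -i lists idle jobs in priority order, which is the closest thing to "which job will start next". A minimal sketch wrapping those commands from Python follows; it assumes showstart and showq are installed on the submit host, which is not the case on every qsub installation:

import subprocess

def estimated_start(job_id):
    # Ask the Maui/Moab scheduler for its current estimate of when
    # the given queued job will start; returns the raw report text.
    result = subprocess.run(["showstart", str(job_id)],
                            capture_output=True, text=True, check=True)
    return result.stdout

def idle_jobs_by_priority():
    # showq -i lists idle (queued) jobs ordered by their current priority.
    result = subprocess.run(["showq", "-i"],
                            capture_output=True, text=True, check=True)
    return result.stdout

print(idle_jobs_by_priority())

Note that any such estimate is only the scheduler's guess at this instant; backfill and newly submitted higher-priority jobs can reorder the queue at any time.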

Related

Enabling Univa Grid Engine Resource Reservation without a time limit on jobs

My organization has a server cluster running Univa Grid Engine 8.4.1, with users submitting various kinds of jobs, some using a single CPU core, and some using OpenMPI to utilize multiple cores, all with varying and unpredictable run-times.
We've enabled a ticketing system so that one user can't hog the entire queue, but if the grid and queue are full of single-CPU jobs, no multi-CPU job can ever start (they just sit at the top of the queue waiting for the required number of cpu slots to become free, which generally never happens). We're looking to configure Resource Reservation such that, if the MPI job is the next in the queue, the grid will hold slots open as they become free until there's enough to submit the MPI job, rather than filling them with the single-CPU jobs that are further down in the queue.
I've read (here for example) that the grid makes the decision of which slots to "reserve" based on how much time is remaining on the jobs running in those slots. The problem we have is that our jobs have unknown run-times. Some take a few seconds, some take weeks, and while we have a rough idea of how long a job will take, we can never be sure. Thus, we don't want to start running qsub with hard and soft time limits through -l h_rt and -l s_rt, or else our jobs could be killed prematurely. Resource Reservation appears to be using the default_duration, which we set to infinity for lack of a better number to use, and treating all jobs equally. It's picking slots filled by month-long jobs which have already been running for a few days, instead of slots filled by minute-long jobs which have only been running for a few seconds.
Is there a way to tell the scheduler to reserve slots for a multi-CPU MPI job as they become available, rather than pre-select slots based on some perceived run-time of the jobs in them?
Unfortunately I'm not aware of a way to do what you ask - I think that the reservation is created once at the time that the job is submitted, not progressively as slots become free. If you haven't already seen the design document for the Resource Reservation feature, it's worth a look to get oriented to the feature.
Instead, I'm going to suggest some strategies for setting job runtimes with confidence. The main problem when none of your jobs have runtimes is that Grid Engine can't reserve space infinitely far into the future, so even setting some really rough runtimes (within an order of magnitude of the true runtime) may get you some positive results.
If you've run a similar job previously, one simple rule of thumb is to set max runtime to 150% of the typical or maximum runtime of the job, based on historical trends. Use qacct or parse the accounting file to get hard data. Of course, tweak that percentage to whatever suits your risk threshold.
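As a rough illustration of that rule of thumb (the job name and the field parsing are assumptions on my part, so check them against your own qacct output): pull the historical wall-clock times for a job name out of qacct, take the worst case, and pad it by 50%.

import subprocess

def historical_wallclocks(job_name):
    # "qacct -j <name>" prints one accounting record per finished run;
    # each record contains a "ru_wallclock <seconds>" line.
    out = subprocess.run(["qacct", "-j", job_name],
                         capture_output=True, text=True, check=True).stdout
    times = []
    for line in out.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "ru_wallclock":
            times.append(float(parts[1]))
    return times

def suggested_h_rt(job_name, factor=1.5):
    # 150% of the longest historical runtime, in whole seconds,
    # suitable for passing as qsub -l h_rt=<seconds>.
    return int(max(historical_wallclocks(job_name)) * factor)

# Example (hypothetical job name):
# print(suggested_h_rt("my_mpi_job"))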
Another rule of thumb is to set the max runtime not based on the job's true runtime, but based on a sense around "after this date, the results won't be useful" or "if it takes this long, something's definitely wrong". If you need an answer by Friday, there's no sense in setting the runtime limit for three months out. Similarly, if you're running md5sum on typically megabyte-sized files, there's no sense in setting a 1-day runtime limit; those jobs ought to only take a few seconds or minutes, and if it's really taking a long time, then something is broken.
If you really must allow true indefinite-length jobs, then one option is to divide your cluster into infinite and finite queues. Jobs specifying a finite runtime will be able to use both queues, while infinite jobs will have fewer resources available; this will incentivize users to work a little harder at picking runtimes, without forcing them to do so.
Finally, be sure that the multi-slot jobs are submitted with the -R y qsub flag to enable the resource reservation system. This could go in the system default sge_request file, but that's generally not recommended as it can reduce scheduling performance:
Since reservation scheduling performance consumption is known to grow with the number of pending jobs, use of -R y option is recommended only for those jobs actually queuing for bottleneck resources.

Difference in type of schedulers and scheduling algorithms

I have studied the topic of job schedulers; there are different types, such as long-term, medium-term, and short-term schedulers, and I ended up confused.
So my question is: "Among these three schedulers, which scheduler type makes use of the scheduling algorithms (like FCFS, SJF, etc.)?"
My understanding so far is: "The scheduling algorithm takes jobs from the ready queue (which contains the list of jobs that are ready to execute) and keeps the CPU as busy as possible."
And the long-term scheduler is the one that decides which jobs are admitted into the ready queue.
So, is the long-term scheduler the one that is going to make use of those scheduling algorithms?
I have also seen the page https://en.wikipedia.org/wiki/Scheduling_(computing), which says (the following is excerpted from the wiki):
"Thus the short-term scheduler makes scheduling decisions much more frequently than the long-term or mid-term schedulers...."
So, do all three of these schedulers make use of the scheduling algorithms?
Finally, I got stuck at this point and am confused about the difference between these types of schedulers.
Could someone kindly explain this briefly, so I can understand it?
Thanks in advance.
So, do all three of these schedulers make use of the scheduling algorithms?
Basically, the scheduling algorithms are used by all three of them, depending on whichever one is active at that point. All of them have to make some kind of scheduling decision at some point, since all of them are schedulers. So it all depends on which one is executing at a given instant (the short-term scheduler executes much more frequently than the others).
Wikipedia is right to mention that. I hope that answers your question in short.
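To make "the short-term scheduler uses a scheduling algorithm" concrete, here is a small illustrative sketch (the process list and attributes are invented): two interchangeable policies, FCFS and SJF, each picking the next process from the ready queue.

from collections import namedtuple

Process = namedtuple("Process", ["pid", "arrival_time", "burst_time"])

def fcfs_pick(ready_queue):
    # First-Come, First-Served: run the process that arrived earliest.
    return min(ready_queue, key=lambda p: p.arrival_time)

def sjf_pick(ready_queue):
    # Shortest Job First: run the process with the smallest CPU burst.
    return min(ready_queue, key=lambda p: p.burst_time)

ready = [Process("P1", 0, 8), Process("P2", 1, 4), Process("P3", 2, 2)]
print(fcfs_pick(ready).pid)  # P1 (earliest arrival)
print(sjf_pick(ready).pid)   # P3 (shortest burst)

The long-term scheduler makes an analogous decision (which jobs to admit to the ready queue), just far less often.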
Description:
As mentioned on the Process Scheduling page on tutorialspoint:
Schedulers are special system software that handles process scheduling in various ways. Their main task is to select the jobs to be submitted into the system and to decide which process to run.
Long Term Scheduler ------> It selects processes from the job pool and loads them into memory for execution.
Medium Term Scheduler -----> It can swap a process out of memory and later re-introduce it into memory so that its execution can be continued.
Short Term Scheduler ------> It selects from among the processes that are ready to execute and allocates the CPU to one of them.
The list below (click here for source) shows the function of each of the three types of schedulers (long-term, short-term, and medium-term) for each of three types of operating systems (batch, interactive, and real-time).
batch
longterm -----> job admission based on characteristics and resource needs
mediumterm -----> usually none; jobs remain in storage until done
shortterm -----> processes scheduled by priority; continue until they wait voluntarily, request service, or are terminated
interactive
longterm -----> sessions and processes normally accepted unless capacity reached
mediumterm -----> processes swapped when necessary
shortterm -----> processes scheduled on rotating basis; continue until service requested, time quantum expires, or pre-empted
real-time
longterm -----> processes either permanent or accepted at once
mediumterm -----> processes never swapped
shortterm -----> scheduling based on strict priority with immediate preemption; may time-share processes with equal priorities

Job scheduling algorithm for cluster

I'm searching for an algorithm suitable for the problem below:
There are multiple computers (the exact number is unknown). Each computer pulls a job from a central queue, completes it, then pulls the next one. Jobs are produced by a group of users. Some users submit lots of jobs, some only a few. Jobs consume roughly equal CPU time (not really, but it's an acceptable approximation).
The central queue should be fair when scheduling jobs. Also, users who have submitted lots of jobs should still have some minimal share of resources.
I'm searching for a good algorithm for this scheduling.
I've considered two candidates:
A Hadoop-like fair scheduler. The problem here is: where do I get the minimal shares from when my cluster size is unknown?
Associate a penalty with each user. Increment the penalty whenever one of the user's jobs is scheduled. Use 1 - (normalized penalty) as the probability of scheduling that user's next job. This is something like stride scheduling, but I could not find any good explanation of it (a rough sketch of this idea follows below).
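Here is that rough sketch of the second candidate, purely to illustrate the idea (the smoothing constant and data structure are my own choices, not taken from any particular scheduler):

import random

def pick_next_user(penalties):
    # penalties: dict mapping user -> number of times that user's jobs
    # have been scheduled so far. Each user is weighted by
    # 1 - (normalized penalty); the +1 keeps every weight positive, so
    # heavily served users become unlikely, but never impossible, to pick.
    max_penalty = max(penalties.values()) + 1
    users = list(penalties)
    weights = [1.0 - penalties[u] / max_penalty for u in users]
    chosen = random.choices(users, weights=weights, k=1)[0]
    penalties[chosen] += 1
    return chosen

penalties = {"alice": 0, "bob": 0, "carol": 0}
for _ in range(9):
    print(pick_next_user(penalties))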
When I implemented a very similar job runner (for a production system), I ended up having each server choose jobtypes at random. This was my reasoning:
a glut of jobs from one user should not impact the chance of other users having their jobs run (user-user fairness)
a glut of one jobtype should not impact the chance of other jobtypes being run (user-job and job-job fairness)
if there is only one jobtype from one user waiting to run, all servers should be running those jobs (no wasted capacity)
the system should run the jobs "fairly", i.e. proportionate to the number of waiting users and jobtypes and not the total waiting jobs (a large volume of one jobtype should not cause scheduling to favor it) (jobtype fairness)
the number of servers can vary, and is not known beforehand
the waiting jobs, jobtypes, and users metadata is known to the scheduler, but not the job data (i.e., the usernames, jobnames, and counts, but not the payloads)
I also wanted each server to be standalone, to schedule its own work autonomously without having to know about the other servers
The solution I settled on was to track the waiting jobs by their {user,jobtype} attribute tuple, and have each scheduling step randomly select 5 tuples and from each tuple up to 10 jobs to run next. The selected jobs were shortlisted to be run by the next available runner. Whenever capacity freed up to run more jobs (either because jobs finished or because of secondary restrictions they could not run), ran another scheduling step to fetch more work.
Jobs were locked atomically as part of being fetched; the locks prevented them from being fetched again or participating in further scheduling decisions. If they failed to run they were unlocked, effectively returning them to the pool. The locks timed out, so the server running them was responsible for keeping the locks refreshed (if a server crashed, the others would time out its locks and would pick up and run the jobs it started but didn't complete).
For my use case I wanted each of the four {user,jobtype} tuples A.1, A.2, A.3 (from user A) and B.1 (from user B) to get 25% of the resources (even though that means user A gets 75% to user B's 25%). Choosing randomly between the four tuples probabilistically converges to that 25%.
If you instead want users A and B to have a 50-50 split of resources, with A's A.1, A.2 and A.3 together getting an equal share to B's B.1, you can run a two-level scheduler: randomly choose users first, then randomly choose jobs from the chosen user. That will distribute the resources equally among users, and equally among the jobtypes within each user's jobs.
A huge number of jobs of a particular jobtype will take a long time to all complete, but that's always going to be the case. By picking from across users and then jobtypes, the responsiveness of the job processing will not be adversely impacted.
There are lots of secondary restrictions that can be added (e.g., no more than 5 calls per second to linkedin), but the above is the heart of the system.
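A rough sketch of the tuple-sampling step described above (the 5-tuple/10-job limits come from the description; the data representation and function name are mine, for illustration only):

import random
from collections import defaultdict

def schedule_step(waiting_jobs, tuples_per_step=5, jobs_per_tuple=10):
    # Group waiting jobs by their {user, jobtype} attribute tuple.
    by_tuple = defaultdict(list)
    for job in waiting_jobs:
        by_tuple[(job["user"], job["jobtype"])].append(job)

    # Randomly select up to 5 tuples, then up to 10 jobs from each;
    # sampling tuples rather than raw jobs is what stops a glut of one
    # jobtype from crowding out the others.
    chosen = random.sample(list(by_tuple), k=min(tuples_per_step, len(by_tuple)))
    shortlist = []
    for t in chosen:
        jobs = by_tuple[t]
        shortlist.extend(random.sample(jobs, k=min(jobs_per_tuple, len(jobs))))
    return shortlist  # handed to the next available runner(s)

# Four tuples (A.1, A.2, A.3, B.1); B's single jobtype is picked with the
# same probability as each of A's, despite A having many more jobs waiting.
waiting = ([{"user": "A", "jobtype": "A.%d" % i} for i in (1, 2, 3) for _ in range(20)]
           + [{"user": "B", "jobtype": "B.1"} for _ in range(20)])
print(len(schedule_step(waiting)))

For the two-level, user-fair variant, sample a user first and then a {user,jobtype} tuple belonging to that user; the locking, lock refresh, and secondary restrictions sit outside this sketch.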
You could try the Torque resource manager and the Maui batch job scheduler from Adaptive Computing. Maui's policies are flexible enough to fit your needs; it supports backfill, configurable job and user priorities, and resource reservations.

Actual processing time of hadoop job

My cluster is currently occupied by a job A that takes a long time and has VERY_LOW priority.
I started another job B yesterday while A was already running, and I think it should have run quite fast.
However, the job details say it took 47 minutes.
I don't think this is the actual processing time.
I'm trying to find out when the job really started.
Where can I look?
I can't seem to find anywhere that states exactly what you're after, but you could open the job in the JobTracker web UI on port 50030 and look at the individual mapper and reducer details. There you can see how long each individual mapper and reducer took to complete its tasks, from their start and end times.
If there weren't any mapper or reducer slots free when you started the second job, the second job wouldn't be able to make any progress until the first job released them, which might explain why it appeared to take so long: the two jobs might not have actually been running simultaneously. The gap between the job being submitted and the first mapper actually starting should tell you whether it was just waiting around for resources, and you can deduct that gap from the overall 47 minutes.
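For example (the numbers are invented, just to show the arithmetic): if job B was submitted at 10:00 but its first map task only started at 10:35, about 35 of the 47 minutes were spent waiting for slots, and the actual processing time was roughly 47 - 35 = 12 minutes.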

Fair job processing algorithm

I've got a machine that accepts user uploads, performs some processing on them, and then returns the result. It usually takes a few minutes to process each upload received.
The problem is, a few users can upload a lot of jobs that basically deny processing to other users for a long time. I thought of just setting a hard cap and using priority queues, e.g. after 5 uploads in an hour, all new uploads are given a lower processing priority. I basically want to process ALL jobs, but I don't want the user who uploaded 1000 jobs to make everyone wait.
My question is, is there a better way to do this?
My goal is to minimize the time between the upload and the result being returned. It would be ideal if the algorithm could work in a distributed manner as well.
Thanks
Implementation will vary widely depending on what these jobs are, how long they take, how varied the processing times are, and how likely a fatal error is during processing.
That being said, an easy way to maintain an even distribution of jobs across users is to maintain a list of all the users who have submitted jobs. When you are ready to get a new job, rather than just taking the next job out of a single shared queue, cycle through the users, taking the top job from each user in turn.
Again, this can be accomplished a number of ways; I would recommend a map from users to their respective lists of submitted jobs. Cycle through the keys of the map each time you are ready for a new job, get the list of jobs for whichever key you are on, and do the first job.
This assumes that each job is "atomic", in the sense that one job does not depend on being executed next to the jobs it was submitted with.
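A small sketch of that per-user round robin (the class and method names are mine, purely for illustration):

from collections import deque, OrderedDict

class FairJobQueue:
    # Map of user -> queue of that user's submitted jobs, cycled so
    # each user with waiting work gets one turn per pass.
    def __init__(self):
        self.queues = OrderedDict()

    def submit(self, user, job):
        self.queues.setdefault(user, deque()).append(job)

    def next_job(self):
        # Take the oldest job from the first user that has work waiting,
        # then move that user to the back of the rotation.
        for user in list(self.queues):
            if self.queues[user]:
                job = self.queues[user].popleft()
                self.queues.move_to_end(user)
                return user, job
            del self.queues[user]  # drop users with nothing pending
        return None

q = FairJobQueue()
for i in range(1000):
    q.submit("heavy_user", "job-%d" % i)
q.submit("light_user", "only-job")
print(q.next_job())  # heavy_user's first job
print(q.next_job())  # light_user is served next, not stuck behind 999 jobs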
Hope that helps, of course I could have completely misunderstood what you are asking for.
You don't have to roll your own. There is Sun Grid Engine, an open-source tool built to do that sort of thing, and if you are willing to pay, there is Platform LSF, which I use at work.
What is the maximum number of jobs a user can submit? Can users submit one job at a time, or is it a batch of jobs?
So your algorithm would go something like this:
If the user has submitted jobs Then
    Check how many jobs per hour they have submitted
    If jobs per hour > the average Then
        Modify the user's profile to a lower priority
    Else
        Check the user's priority level and restore it
    End If
    If priority = HIGH Then
        Process right away
    Else If priority = MEDIUM Then
        Check the queue for HIGH priority jobs
        If a HIGH priority job is found Then rerun this loop
        Else process
    Else If priority = LOW Then
        Check the queue for HIGH or MEDIUM priority jobs
        If a HIGH or MEDIUM priority job is found Then rerun this loop
        Else process
    End If
    Process the queue
End If
You can use a graph matching algorithm like Edmonds' blossom algorithm (for example the Blossom V implementation) to assign users and jobs to processes. If one user is allowed to upload more than another user, it becomes easier for that user's jobs to find a process. With a matching-based approach you can also define a threshold so the assignment does not exceed the maximum number of processes the server can handle.
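The answer above doesn't spell out how the matching would be set up, so the following is only one possible reading of it: build a bipartite graph of free worker processes versus waiting jobs, weight each job inversely to how many jobs its user has pending, and let networkx's max_weight_matching (a blossom-style algorithm) pick the assignment.

import networkx as nx

def assign_jobs(workers, jobs_by_user):
    # Bipartite graph: worker nodes on one side, job nodes on the other.
    # A job's edge weight is 1 / (pending jobs of its user), so users
    # with fewer pending jobs are favoured when workers are scarce.
    G = nx.Graph()
    for user, jobs in jobs_by_user.items():
        for job in jobs:
            for worker in workers:
                G.add_edge(("worker", worker), ("job", user, job),
                           weight=1.0 / len(jobs))
    return nx.max_weight_matching(G, maxcardinality=True)

workers = ["w1", "w2"]
jobs_by_user = {"alice": ["a1", "a2", "a3"], "bob": ["b1"]}
for a, b in assign_jobs(workers, jobs_by_user):
    print(a, "<->", b)

With this weighting, the two free workers go to bob's single job plus one of alice's, rather than both going to alice.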
