Prevent execution of non-SGE programs - job-scheduling

From the point of view of the system administration of an SGE node, is it possible to force users to run long-running programs through qsub instead of running them stand-alone?
The problem is that the same machine is acting as the control node and the computation node. So, I can't distinguish a long-running program from a user who is compiling with "gcc". Ideally, I would like to force users to submit long-running jobs (i.e., more than an hour) through qsub. I don't even mind being a bit mean and killing jobs that have run longer than an hour but weren't submitted through qsub.
Until now, all that I can do is send e-mails out asking users to "Please use qsub!"...
I've looked through the SGE configuration and nothing seems relevant. But maybe I've just missed something...any help would be appreciated! Thanks!

I'm a little confused about your setup, but I'm assuming users are submitting jobs by logging into what is also a computation node. Here are some ideas, best to worst:
Obviously, the best thing is to have a separate control node for users.
Barring that, run a resource-limited VM as the control node.
Configure user-level resource limits (e.g. ulimit) on the nodes. You can restrict CPU, memory, and process usage, which are probably what you care about rather than clock time.
It sounds like the last one may be best for you. It's not hard, either.
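To make the resource-limit idea concrete, here is a minimal Python sketch using the standard `resource` module (Unix only). The helper names `run_with_cpu_limit` and `_cap_cpu_seconds` and the 3600-second figure are assumptions for illustration; on a real node an administrator would more likely set such limits system-wide than per command. Note that this caps CPU time consumed, not wall-clock time.

```python
import resource
import subprocess
import sys


def _cap_cpu_seconds(seconds):
    """Return a function that lowers the soft CPU-time limit of the calling process."""
    def apply_limit():
        soft, hard = resource.getrlimit(resource.RLIMIT_CPU)
        new_soft = seconds
        # Never ask for more than an existing hard limit
        # (RLIM_INFINITY means there is no hard limit).
        if hard != resource.RLIM_INFINITY and new_soft > hard:
            new_soft = hard
        resource.setrlimit(resource.RLIMIT_CPU, (new_soft, hard))
    return apply_limit


def run_with_cpu_limit(argv, cpu_seconds=3600):
    """Run argv as a child process whose CPU time is capped at cpu_seconds.

    Once the child has used that much CPU time, the kernel sends it SIGXCPU.
    """
    return subprocess.run(
        argv,
        preexec_fn=_cap_cpu_seconds(cpu_seconds),  # runs in the child before exec
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # The child only reports the soft CPU limit it inherited.
    probe = "import resource; print(resource.getrlimit(resource.RLIMIT_CPU)[0])"
    result = run_with_cpu_limit([sys.executable, "-c", probe], cpu_seconds=3600)
    print(result.stdout.strip())
```

A short compile with gcc stays far below such a limit, while a CPU-bound program that runs for more than an hour of CPU time would be terminated.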

Related

Is there any way to know which job will start next in qsub

On our institute's supercomputer (IISc Bangalore), we submit jobs using qsub. When a job starts running depends on the following:
(1) Its wall time (expected completion time).
(2) Its position in the respective queue (small, medium, large, etc.).
So it is very difficult to know which job will start once the currently running job finishes. But qsub presumably keeps a list of its own, since it starts a new job immediately after another one finishes.
Is there any way to know which job will start next? Is there a command for this?
Thank you.
Unfortunately, there is no clear way to know which job will run next on a supercomputing system. When a job starts depends not only on its wall time or position in the queue but also on many other factors arising from site-level policy, scheduling strategies, and priorities. The institute may also apply an internal job ranking (priorities) based on factors like power management and load balancing.
On the other hand, there is a good deal of research on predicting the waiting time before a job is allocated. TeraGrid systems provide an estimated waiting time. Also, see link1, link2 (by SERC) for more information about predicting the waiting time.

Is there a reason I shouldn't intentionally hit walltime?

I am submitting a job (via qsub) that is not in any way worse off for being killed partway through, and the longer it runs, the better. Results are output as it goes along.
It will be submitted to a large cluster that is well managed. It is safe to assume that whoever has control over the cluster has set it up in a reasonable and sensible way.
From my point of view, it is more useful to tell it to loop more times than will fit inside walltime and let it be killed than to tell it to loop fewer times and have it finish before walltime. If it finishes before walltime, it has not done as many loops as it possibly could have.
Is there any problem or annoyance caused by this approach? It's working well but I'm worried I could be upsetting someone.
So I put this question to the administrators of my cluster. The answer I got was:
Thanks for your query. I can see no reason that this would be an issue - simply ensuring you get as much run from your job as possible in the requested time.

Enabling Univa Grid Engine Resource Reservation without a time limit on jobs

My organization has a server cluster running Univa Grid Engine 8.4.1, with users submitting various kinds of jobs, some using a single CPU core, and some using OpenMPI to utilize multiple cores, all with varying and unpredictable run-times.
We've enabled a ticketing system so that one user can't hog the entire queue, but if the grid and queue are full of single-CPU jobs, no multi-CPU job can ever start (they just sit at the top of the queue waiting for the required number of cpu slots to become free, which generally never happens). We're looking to configure Resource Reservation such that, if the MPI job is the next in the queue, the grid will hold slots open as they become free until there's enough to submit the MPI job, rather than filling them with the single-CPU jobs that are further down in the queue.
I've read (here for example) that the grid makes the decision of which slots to "reserve" based on how much time is remaining on the jobs running in those slots. The problem we have is that our jobs have unknown run-times. Some take a few seconds, some take weeks, and while we have a rough idea how long a job will take, we can never be sure. Thus, we don't want to start running qsub with hard and soft time limits through -l h_rt and -l s_rt, or else our jobs could be killed prematurely. Resource Reservation appears to be using the default_duration, which we set to infinity for lack of a better number to use, and treating all jobs equally. It's picking slots filled by month-long jobs which have already been running for a few days, instead of slots filled by minute-long jobs which have only been running for a few seconds.
Is there a way to tell the scheduler to reserve slots for a multi-CPU MPI job as they become available, rather than pre-select slots based on some perceived run-time of the jobs in them?
Unfortunately I'm not aware of a way to do what you ask - I think that the reservation is created once at the time that the job is submitted, not progressively as slots become free. If you haven't already seen the design document for the Resource Reservation feature, it's worth a look to get oriented to the feature.
Instead, I'm going to suggest some strategies for confidently setting job runtimes. The main problem when none of your jobs have runtimes is that Grid Engine can't reserve space infinitely in the future, so even if you set some really rough runtimes (within an order of magnitude of the true runtime), you may get some positive results.
If you've run a similar job previously, one simple rule of thumb is to set max runtime to 150% of the typical or maximum runtime of the job, based on historical trends. Use qacct or parse the accounting file to get hard data. Of course, tweak that percentage to whatever suits your risk threshold.
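The 150% rule of thumb can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `suggest_h_rt` is made up, the historical runtimes are taken to be wall-clock seconds that you have already extracted from the accounting data, and the result is formatted as h:mm:ss on the assumption that this is the form you pass to -l h_rt.

```python
import math


def suggest_h_rt(past_runtimes_s, safety_factor=1.5):
    """Suggest a hard runtime limit from past runtimes given in seconds.

    Takes the longest observed runtime, multiplies it by safety_factor,
    rounds up to a whole second, and returns an h:mm:ss string.
    """
    if not past_runtimes_s:
        raise ValueError("need at least one historical runtime")
    limit = math.ceil(max(past_runtimes_s) * safety_factor)
    hours, rest = divmod(limit, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    history = [3200, 3550, 4100]  # hypothetical runtimes of similar past jobs
    print(suggest_h_rt(history))  # prints 1:42:30
```

Using the maximum rather than the typical runtime is the more conservative choice; swap in a median or a percentile if a tighter limit suits your risk threshold.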
Another rule of thumb is to set the max runtime not based on the job's true runtime, but based on a sense around "after this date, the results won't be useful" or "if it takes this long, something's definitely wrong". If you need an answer by Friday, there's no sense in setting the runtime limit for three months out. Similarly, if you're running md5sum on typically megabyte-sized files, there's no sense in setting a 1-day runtime limit; those jobs ought to only take a few seconds or minutes, and if it's really taking a long time, then something is broken.
If you really must allow true indefinite-length jobs, then one option is to divide your cluster into infinite and finite queues. Jobs specifying a finite runtime will be able to use both queues, while infinite jobs will have fewer resources available; this will incentivize users to work a little harder at picking runtimes, without forcing them to do so.
Finally, be sure that the multi-slot jobs are submitted with the -R y qsub flag to enable the resource reservation system. This could go in the system default sge_request file, but that's generally not recommended as it can reduce scheduling performance:
Since reservation scheduling performance consumption is known to grow with the number of pending jobs, use of -R y option is recommended only for those jobs actually queuing for bottleneck resources.
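As a small sketch of requesting reservation only for the multi-slot jobs, the Python helper below assembles a qsub command line with -R y, a parallel environment request, and an optional h_rt limit. The helper name `build_reserved_qsub`, the parallel environment name "mpi", and the script name are hypothetical; the helper only builds the argument list and does not call qsub itself.

```python
import shlex


def build_reserved_qsub(script, pe_name, slots, h_rt=None):
    """Build a qsub argument list that asks for resource reservation.

    pe_name and script are site-specific; the values used in the demo
    below are placeholders, not real configuration.
    """
    args = ["qsub", "-R", "y", "-pe", pe_name, str(slots)]
    if h_rt is not None:
        # A finite runtime gives the scheduler something to plan around.
        args += ["-l", f"h_rt={h_rt}"]
    args.append(script)
    return args


if __name__ == "__main__":
    cmd = build_reserved_qsub("run_mpi.sh", "mpi", 16, h_rt="48:00:00")
    print(shlex.join(cmd))
```

Single-CPU jobs would be submitted without -R y, in line with the recommendation quoted above.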

Difference in type of schedulers and scheduling algorithms

I have been studying job schedulers and the different types (long-term, medium-term, and short-term schedulers), and I ended up confused.
So my question is: "Among these three schedulers, which type makes use of the scheduling algorithms (like FCFS, SJF, etc.)?"
My understanding so far is: "The scheduling algorithm takes a job from the ready queue (which contains the list of jobs that are in the ready state and waiting to be executed) and keeps the CPU as busy as possible."
And the long-term scheduler is the one that decides which jobs are allowed into the ready queue.
So, is the long-term scheduler the one that makes use of those scheduling algorithms?
I have also looked at https://en.wikipedia.org/wiki/Scheduling_(computing)
where I found the following.
Note: the following line is excerpted from the wiki page.
"Thus the short-term scheduler makes scheduling decisions much more frequently than the long-term or mid-term schedulers...."
So, do all three of these schedulers make use of the scheduling algorithms?
I got stuck at this point and am confused about the difference between these types of schedulers.
Could someone kindly explain this briefly, so that I can understand it?
Thanks in advance.
"So, whether all these 3 schedulers will make use of the scheduling algo??"
Basically, yes. All three are schedulers, so each of them has to make scheduling decisions and applies some scheduling algorithm whenever it runs. Which one is making the decision depends on which one is executing at that instant (the short-term scheduler executes much more frequently than the others).
Wikipedia is right in mentioning that. I hope you got your answer in short.
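To illustrate the kind of decision the short-term scheduler makes over the ready queue, here is a minimal Python sketch comparing the average waiting time under FCFS and SJF. It assumes non-preemptive scheduling and that every job is already in the ready queue at time 0; the function names and the burst times are made up for the example.

```python
def average_wait(burst_times):
    """Average waiting time when jobs run back to back in the given order."""
    waited = 0
    elapsed = 0
    for burst in burst_times:
        waited += elapsed   # this job waited for everything that ran before it
        elapsed += burst
    return waited / len(burst_times)


def fcfs(burst_times):
    """First come, first served: run jobs in arrival order."""
    return average_wait(burst_times)


def sjf(burst_times):
    """Shortest job first (non-preemptive): run the shortest burst first."""
    return average_wait(sorted(burst_times))


if __name__ == "__main__":
    ready_queue = [24, 3, 3]   # CPU burst times in arbitrary units
    print(fcfs(ready_queue))   # prints 17.0
    print(sjf(ready_queue))    # prints 3.0
```

The same set of jobs waits much less on average when the short ones go first, which is why the choice of algorithm matters most for the scheduler that runs most often.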
Description:
As mentioned on the Process Scheduling page on tutorialspoint:
Schedulers are special system software which handle process scheduling in various ways. Their main task is to select the jobs to be submitted into the system and to decide which process to run.
Long Term Scheduler ------> It selects processes from pool and loads them into memory for execution
Short Term Scheduler ------> It selects those processes which are ready to execute.
Medium Term Scheduler -----> It can re-introduce the process into memory and execution can be continued.
The below list (click here for source) shows the function of each of the three types of schedulers (long-term, short-term, and medium-term) for each of three types of operating systems (batch, interactive, and real-time).
batch
long-term -----> job admission based on characteristics and resource needs
medium-term -----> usually none; jobs remain in storage until done
short-term -----> processes scheduled by priority; continue until wait voluntarily, request service, or are terminated
interactive
long-term -----> sessions and processes normally accepted unless capacity reached
medium-term -----> processes swapped when necessary
short-term -----> processes scheduled on rotating basis; continue until service requested, time quantum expires, or pre-empted
real-time
long-term -----> processes either permanent or accepted at once
medium-term -----> processes never swapped
short-term -----> scheduling based on strict priority with immediate preemption; may time-share processes with equal priorities

Best method of having a single process distributed across a cluster

I'm very new to cluster computing, and wanted to know more about the various software used for cluster computing, and which is best for particular tasks. In particular, the problem I am trying to solve involves a Manager/Workers type scenario, where a single Manager is responsible for the creation of 100s to 1000s of jobs. Each job, while relatively large, must execute on a small frame-by-frame basis. I.e. the Manager will tell each job, "advance one frame and report back to me". The execution of a single frame will be very small, so latency between the Manager and the worker machines must be very small, on the order of microseconds.
Thank you! Any information would be appreciated, even stuff that doesn't perfectly fit the scenario I described, just to give me a starting point. Some that I have researched so far are Hadoop, HTCondor, and Akka.
Since communication latency is important to you, you should probably consider using MPI. It's not too difficult to write simple Master/Worker programs using MPI, and it will probably give you the best performance, especially if your cluster has high-performance networking such as InfiniBand.
If, as it seems, you're using Java, you will have to do some research to determine a good Java/MPI package. You'll find some suggestions here: Java openmpi.
