Scheduling Policies and Mechanics
- Main page: Job Control
Most cluster users are interested in questions like
- How long will I have to wait for my job to finish?
- Or for that matter, to start?
- Is there anything I can do to make it start sooner?
These questions cannot be precisely answered on a shared resource like ACENET, but understanding how and why the scheduling system works may let you make useful estimates and adjustments.
Every job that runs gets assigned to a queue or queues. The queues available to general users are given below. Other queues may be present which are restricted to certain users - these are not described here.
- short.q: Contains jobs with a run time <=48 hours
- medium.q: Contains jobs with a run time <=336 hours (2 weeks)
- long.q: Contains jobs with a run time <=720 hours (1 month)
- test queue: Contains test jobs (all test jobs should request <=1 hour)
- subordinate queues: For jobs which can be arbitrarily suspended; see Subordinate Queues
Depending on the run time you specify, your job will be classified as either short (less than two days, 48 hours), medium (less than two weeks, 336 hours), or long (less than one month, 720 hours). Short jobs are eligible to run almost anywhere. Medium jobs are limited to no more than about 30% of the slots in a given cluster, and long jobs to no more than about 10% of each cluster.
Note that short jobs are eligible to run in the medium.q and long.q cluster queues, and medium jobs are likewise eligible to run in long.q. It is not necessary to specify a queue in the job script or at submission time. Let Grid Engine assign jobs to queues based on the run time you have supplied and the changing availability of resources.
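As a concrete sketch, a minimal submission script might look like the following (the program and file names are hypothetical; the essential part is the -l h_rt line). A 36-hour estimate classifies the job as short, making it eligible for all three queues:

```shell
#!/bin/bash
#$ -cwd                  # run the job from the directory it was submitted in
#$ -l h_rt=36:00:00      # estimated run time: 36 hours, so this is a "short" job
./my_program             # hypothetical executable
```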
Queue policies and run time estimates
ACENET determines job priority by the so-called Fairshare method, which attempts to equalize access to CPU time by each project or research group. Perfect equality is not to be expected since the computing requirements of different projects vary from one another, and over time. In addition to the basic Fairshare priority model, the following queue policies are in force. These are intended to balance the competing criteria of (i) fairness, (ii) minimal waiting times, and (iii) maximum system usage or "efficiency".
- All jobs must be submitted with an estimated run time, given with the flag -l h_rt in the submission script.
- Approximately 70% of each cluster will be dedicated to jobs with less than a 48-hour run time.
- Approximately 20% of each cluster will be dedicated to jobs with less than a two-week run time (336 hours).
- The remaining 10% of each cluster will accept jobs with up to a 1-month run time (720 hours).
- Jobs with longer than a 1-month run time will not be accepted without administrator intervention. Please contact support if you have jobs that need to run longer than 1 month.
- A small segment of each cluster will be set aside for test jobs which are limited to 1 hour of run time.
- Users with large parallel jobs (i.e. jobs requiring many slots) are encouraged to use the reservation option -R y for their largest jobs.
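For instance, a submission script for a large MPI job might combine a run-time estimate with slot reservation. The parallel environment name ompi and the program name below are assumptions for illustration; use the PE name your site actually provides:

```shell
#!/bin/bash
#$ -cwd
#$ -pe ompi 64           # hypothetical parallel environment: request 64 slots
#$ -l h_rt=48:00:00      # 48-hour estimate keeps this job in the short class
#$ -R y                  # reserve slots so narrower jobs cannot starve this one
mpirun ./my_mpi_program  # hypothetical executable
```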
The precise allocations may vary somewhat from 70-20-10 due to circumstances such as demand or allocations.
Leave yourself a good margin of error when estimating the run times. There is no penalty for overestimating your run time, unless this forces your job into one of the longer-running queues and it therefore has to wait longer to be scheduled. The penalty for underestimating your run time, however, is that your job will be killed when the assigned time has elapsed.
"Why shouldn't I just pick the maximum run time for each queue then?" you might ask. If you must, you can, but remember first that this is a shared resource and it's polite to use it efficiently. The more accurate information the scheduler has, the more efficiently it operates.
For example, when large parallel jobs are reserving slots the scheduler can use jobs with short run times to backfill the schedule. Besides the improved efficiency for everyone, if your jobs have short predicted run times they may get scheduled as backfill and therefore run sooner.
How priority is determined
Each job in the waiting list has a priority assigned to it by the Grid Engine, which is visible in qstat under the label "prior":
job-ID  prior    name    user      state  submit/start at       queue  slots
-----------------------------------------------------------------------------
  8393  0.25002  qtest   rdickson  qw     02/22/2008 15:48:38          1
Jobs already running have priorities too, but they don't matter much. Priorities are calculated via a method we call "fairshare", which attempts to equalize the amount of CPU time available to each project or research group. A record of past CPU usage is kept, and projects which have used less CPU get a higher priority. The record is tempered by a time-decay function with a one-month half-life, so unused priority "tickets" cannot be saved up indefinitely.
Because priority is calculated relative to other submitted jobs, the priority value of any given job will change as other jobs finish and new ones are submitted.
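The effect of the one-month half-life can be sketched numerically. This is an illustrative model only, not Grid Engine's actual fairshare formula: under a 30-day half-life, CPU time used 60 days ago counts only a quarter as much as CPU time used today.

```shell
# Illustrative only: weight of past CPU usage under a 30-day half-life.
# 100 units of CPU used 60 days ago count as 100 * 0.5^(60/30) = 25 units.
awk 'BEGIN { usage = 100; age_days = 60; half_life = 30;
             printf "%g\n", usage * 0.5 ^ (age_days / half_life) }'
```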
But simply having the highest priority in the waiting list does not guarantee that your job will be the next one to run. In order to maximize the use of the many CPUs and other resources, the Grid Engine will try to fill any newly available slots with the highest-priority job that can use those resources. If there aren't enough CPUs or memory to run your job, then it may be passed over in favour of a "smaller" job.
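The "highest-priority job that fits" rule can be mimicked with a toy waiting list. All numbers and job names here are made up: with 4 slots free, an 8-slot job at the top of the list is passed over in favour of the next job that fits.

```shell
# Toy model of greedy slot-filling. Waiting list: "priority slots name",
# scanned in descending priority order; 4 slots are currently free.
free_slots=4
printf '%s\n' \
  '0.75 8 big_mpi_job' \
  '0.60 4 medium_job' \
  '0.40 1 small_job' |
sort -rn |
while read prio slots name; do
  if [ "$slots" -le "$free_slots" ]; then
    echo "scheduled: $name"   # highest-priority job that fits the free slots
    break
  fi
done
```

Here big_mpi_job has the highest priority but needs 8 slots, so medium_job is scheduled instead; this is exactly the starvation effect described below.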
Users with "large" parallel jobs, e.g. those requiring a large number of slots, or a great deal of memory per slot, may find that their jobs take a long time to get scheduled. This is a known consequence of the "greedy" scheduling algorithm sketched in the previous section, and it even has a name: Parallel job starvation.
What constitutes "large"? The number varies dramatically not only with the resources available on a cluster but also with the number and nature of competing jobs. Typical numbers, though, might be greater than 16 slots for an MPI job, or greater than 8 slots for an OpenMP job, or greater than 4G memory per slot.
The solution is to have the parallel job reserve slots. If there is a reserving job at the top of the waiting list, the scheduler will "save up" free slots as jobs end, and not assign them to lower-priority jobs until the reserving job has enough to start running.
Reservation is rather demanding for the scheduler, and of course it's not an efficient use of the shared resources, so reservation does not happen automatically. You must explicitly request reservation for a job with the "-R y" option to qsub or qalter (that's a capital R, not lower-case r).
For example, if you have a wide parallel job which seems to be getting pre-empted by narrower jobs despite its high priority value, you can turn on reservation like this:
$ qalter -R y job_id
or you can modify your submission script and add the following line:
#$ -R y
You can check whether reservation is turned on for a job with qstat -j job_id | grep reserv, or with showq.
If all jobs have known time limits, the scheduler can tell how long it will take in the worst case until there are enough slots free for a reserving job. If there are narrower jobs around that are qualified to run and their run times are short enough, then the scheduler can backfill the short jobs into the slots that are being reserved for the wide job. This should be an inducement to supply accurate run lengths for your short jobs.
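As a simplified illustration of the backfill test (the numbers are made up): a waiting job can run in reserved slots only if its declared run time fits inside the window before the reserving job is due to start.

```shell
# Simplified backfill check with made-up numbers: the reserving job will
# have enough free slots in 10 hours; a job declaring h_rt=04:00:00 fits
# in that window, so it can run in the reserved slots in the meantime.
job_hours=4
window_hours=10
if [ "$job_hours" -le "$window_hours" ]; then
    echo "backfill: job fits in the reservation window"
else
    echo "wait: job would delay the reserving job"
fi
```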
What's a queue?
To most English speakers, a "queue" suggests a waiting line, or a first-in-first-out data structure to those who've taken some computer science courses. The term has a rather different technical meaning in Grid Engine, and that can be a source of confusion.
A queue is a notional container for jobs. At ACENET, it is a container for jobs which meet certain time constraints: short.q is an abstract container for any job with less than a 48-hour run time, medium.q is a container for jobs with less than a two-week run time, and long.q is a container for jobs with run times between two weeks and one month. The containers are restricted to hold no more than a fixed number of slots (10% of the cluster for long.q, 20% for medium.q). A single node can be associated with several queues simultaneously, as in the case of Subordinate Queues. So it is not always true that a specific node belongs to a specific queue.
A job is not associated with a specific queue until it begins to run. There might be only one queue that a job is eligible for, but even so Grid Engine does not assign the job to that queue until the job starts. Notice also that any job which qualifies for short.q with a run-time limit of less than 48 hours also qualifies for the medium.q and long.q, and it is even possible for a parallel job to be spread across several queues simultaneously.
In summary, a queue is a notional space for running jobs which share certain features (i.e., time limits). It is not strictly a collection of nodes, nor is it a waiting list.
Two associated terms may also appear:
- Queue instance: An individual host as part of a queue. Since a host can be part of more than one queue, it's not exactly the same as saying host, but it's pretty close.
- Cluster queue: A synonym for queue as defined above, to distinguish it from queue instance.