The subordinate queue "sub.q" is associated with the Green ACENET program. Briefly, Green ACENET is an arrangement whereby researchers purchase computing equipment which is administered on their behalf by ACENET. In return for this service, the "spare cycles" are made available to general ACENET users under the condition that the purchasing researchers can always get priority access to their own resources.
This transaction is mediated by the Sun Grid Engine (SGE). Within SGE, Green hardware is organized into groups (technically "cluster queues") to which the owning group members have sole access. For example, members of the CMMS research group are the only ones whose jobs will run in the "cmms.q" cluster queue.
However, the same hardware is also a member of the sub.q cluster queue. When one of these hosts has no job running on it for the owning group, jobs in sub.q can run there. If a member of the owning group starts a job via SGE, any jobs running in sub.q on that host are automatically suspended.
Is it for me?
If you
- have serial jobs or shared-memory parallel jobs which run for a long time, and
- can make use of intermediate results without interrupting the job which generates them, or
- don't mind your jobs being interrupted and resumed arbitrarily,
then you might be able to use the subordinate queue profitably.
How to use sub.q
A job must explicitly request the "suspendable" resource in order to qualify for the subordinate queue. For example:
#$ -l h_rt=1000:0:0,susp=true
./my_application
Conversely, no job with "susp=true" will go into the regular production queues.
You can probe the availability of sub.q nodes coarsely with qsum, or in more detail with qstat -f -q sub.q. Nodes marked "S" are suspended; nodes with nothing in the "state" column are available.
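As a rough sketch of automating this check, the filter below reads the output of qstat -f -q sub.q and prints the queue instances whose state column is empty. It assumes that queue-instance lines begin with "sub.q@" and that the state, when present, is the sixth whitespace-separated field; this matches typical Grid Engine output but may differ between versions. The function name free_subq_hosts is hypothetical.

```shell
#!/bin/sh
# List sub.q queue instances with an empty "states" column.
# Assumption: queue-instance lines start with "sub.q@" and carry the
# state, when present, as a sixth whitespace-separated field.
free_subq_hosts() {
    awk '$1 ~ /^sub\.q@/ && NF < 6 { print $1 }'
}

# Typical use on a cluster login node:
#   qstat -f -q sub.q | free_subq_hosts
```

An empty state column only means the queue instance is not suspended or disabled; its slots may still be occupied by other sub.q jobs.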
Run times and memory
The hard run time limit h_rt is a wall-clock limit, not a CPU time limit. Because jobs in sub.q can be suspended, and an unpredictable amount of time can therefore elapse before they complete, there is no restriction on what h_rt you can request. But you must still supply one!
Another practical limitation is that Grid Engine will not start a job that it expects to run into the next scheduled outage. Scheduled outages are announced on the Cluster Status page and in the login message for each cluster. Take these into account when choosing h_rt for a suspendable job.
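One way to pick such a value is to compute the time remaining before the announced outage, less a safety margin. The helper below is a minimal sketch of that arithmetic; the function name hrt_before_outage and the one-hour default margin are assumptions for illustration, not part of Grid Engine.

```shell
#!/bin/sh
# Print the largest h_rt (as H:MM:SS) that still ends before an outage.
# Arguments: current time and outage start, both in seconds since the epoch,
# plus an optional safety margin in seconds (default 3600).
hrt_before_outage() {
    now=$1
    outage=$2
    margin=${3:-3600}
    avail=$((outage - now - margin))
    if [ "$avail" -le 0 ]; then
        echo "no time left before the outage" >&2
        return 1
    fi
    printf '%d:%02d:%02d\n' $((avail / 3600)) $((avail % 3600 / 60)) $((avail % 60))
}

# Example (GNU date assumed; the outage date is a placeholder):
#   hrt_before_outage "$(date +%s)" "$(date -d '2030-01-15 08:00' +%s)"
```

The result can be passed as h_rt alongside susp=true. It only helps the scheduler accept the job; it says nothing about how much of that window the job will spend suspended.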
Jobs in sub.q cannot reserve more than 2G memory per slot (
The subordinate queue is suspended and resumed host-by-host. It is difficult to ensure that a parallel job will behave properly if it is suspended on one host while other parts continue to run. Therefore shared-memory jobs (-pe openmp and -pe gaussian) are the only kind of parallel jobs permitted in the subordinate queue.
You can be notified when one of your sub.q jobs is suspended by adding the "s" flag to the "-m" Grid Engine directive:
-M firstname.lastname@example.org -m eas
You may find this useful if you wish to manually interrupt and resubmit a suspended job.
Termination instead of suspension
If you want your jobs to be terminated when they are pre-empted, instead of being suspended indefinitely, you should submit jobs to the subordinate queue with qsub -notify. This will cause Grid Engine to send a SIGUSR1 signal to your application 15 seconds before sending a SIGSTOP. The intended purpose of this function is to allow your application to trap SIGUSR1 and save state, but the default action on receipt of SIGUSR1 is to terminate the application.
man 7 signal and
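For jobs that should save state rather than simply die, a shell job script can trap SIGUSR1 itself. The following is a minimal sketch, assuming the work is done in short steps inside the script; STATE_FILE, save_state and run_steps are hypothetical names, and the sleep stands in for real computation.

```shell
#!/bin/sh
# Sketch: save state when Grid Engine sends SIGUSR1 (qsub -notify), then exit.
# STATE_FILE, save_state and run_steps are hypothetical names.
STATE_FILE=${STATE_FILE:-state.txt}
step=0

save_state() {
    # Record the last completed step so a resubmitted job can resume.
    echo "$step" > "$STATE_FILE"
}

run_steps() {
    # On SIGUSR1: write the checkpoint and exit before SIGSTOP arrives.
    trap 'save_state; exit 0' USR1
    # Resume from a previous checkpoint if one exists.
    if [ -s "$STATE_FILE" ]; then
        step=$(cat "$STATE_FILE")
    fi
    while [ "$step" -lt "$1" ]; do
        sleep 1              # placeholder for one unit of computation
        step=$((step + 1))
    done
    save_state
}

# In a real job script the last line would be something like:
#   run_steps 100000
```

A shell runs its trap handler only after the current foreground command returns, so this pattern suits work whose individual steps finish well within the warning period. If the real work is a single long-running program, that program needs its own SIGUSR1 handler.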