Job Control
All production-class jobs must be submitted via the scheduler, which manages the available computing resources and assigns them to the waiting jobs. The scheduler used on all ACEnet clusters is Sun Grid Engine (SGE).
Main commands
The three most important SGE commands are:
- qsub: Submits a batch job
- qstat: Shows the status of jobs and queues
- qdel: Deletes or kills a job
Submitting a simple job
Write a script that describes your job. Here's a trivial example:
#$ -cwd
#$ -j y
#$ -l h_rt=00:03:00
echo Hello from inside a Grid Engine job running on `hostname`
echo Job beginning at `date`
sleep 120
echo Job ending at `date`
Note that this is just a shell script: a list of commands (echo, sleep) to be executed in order. The comment lines beginning with #$ provide extra information to the job scheduler; they are described under Parameters below. The default execution shell is bash unless specified otherwise with the -S option.
Save the script with some name, like trivial.sh. Then submit the script to the scheduler by typing
$ qsub trivial.sh
The system will reply something like
Your job 7635 ("trivial.sh") has been submitted
and queue up your job to wait its turn. The number is called the JOB_ID and will be different for every job.
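When submission is scripted, it can be useful to capture the JOB_ID from that reply. The following is a minimal sketch that assumes the reply has exactly the form shown above; `parse_job_id` is a hypothetical helper name, not a Grid Engine command.

```shell
#!/bin/sh
# parse_job_id is a hypothetical helper, not an SGE command.
# It assumes a reply of the form:
#   Your job 7635 ("trivial.sh") has been submitted
parse_job_id() {
    # Print the third whitespace-separated field of a line starting "Your job".
    printf '%s\n' "$1" | awk '$1 == "Your" && $2 == "job" { print $3 }'
}

# Sample reply copied from the text above.
reply='Your job 7635 ("trivial.sh") has been submitted'
job_id=$(parse_job_id "$reply")
echo "JOB_ID is $job_id"
```

In real use the reply would come from the submission itself, for example `reply=$(qsub trivial.sh)`.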
Monitoring jobs
How can you tell if your job has run? The usual way is with
$ qstat
The output from qstat is very wide and looks something like this:
job-ID  prior   name     user  state  submit/start at      queue          slots  ja-task-ID
-------------------------------------------------------------------------------------------
  7635  0.5404  trivial  jdoe  r      11/18/2011 23:16:10  short.q@cl061  1
While your job is waiting to run there will be "qw" in the state column. When it starts to run it changes to "r". When the job ends, either because it finished or because it crashed, it disappears from the list, and qstat will return nothing at all unless you have other jobs submitted.
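The state of a single job can also be picked out of the qstat listing in a script. This is a rough sketch that assumes the column layout shown above, with the job ID in the first field and the state in the fifth; `job_state` and the "not listed" label are hypothetical, not part of Grid Engine.

```shell
#!/bin/sh
# job_state is a hypothetical helper, not an SGE command.
# It reads qstat-style output on stdin and prints the state column (field 5)
# for the given job ID, or "not listed" if no row has that ID.
job_state() {
    awk -v id="$1" '$1 == id { print $5; found = 1 }
                    END { if (!found) print "not listed" }'
}

# Sample text copied from the listing above; in practice pipe in the output of qstat.
sample='job-ID prior name user state submit/start at queue slots ja-task-ID
-------------------------------------------------------------------------------------------
7635 0.5404 trivial jdoe r 11/18/2011 23:16:10 short.q@cl061 1'

printf '%s\n' "$sample" | job_state 7635   # prints: r
printf '%s\n' "$sample" | job_state 9999   # prints: not listed
```

A result of "not listed" only says the job is no longer in the list; it does not distinguish a job that finished from one that crashed.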
When the job is done, the output appears in a file with a name like trivial.sh.o7635. The components of the output file name are the job name (trivial.sh in our example), the job id (7635), and between them ".o" for "output".
There are other utilities that can show you some useful information about your jobs:
- qsum, for a simplified view of the entire system load
- showq, for insight into when your job might begin running
- More on qstat, including the meaning of job status codes
Parameters
- Complete parameter list: Sun Grid Engine
Here are the most commonly used job parameters:
| Option | Description |
|---|---|
| -l h_rt=time | Run time limit, either in seconds or in hh:mm:ss format |
| -l h_vmem=mem | Hard virtual memory limit; mem specifier may include k, K, m, M, g, G; details at man queue_conf |
| -cwd | Start the job script in the same directory it was submitted from, the "current working directory". If absent, job will start in your home directory. |
| -j y | Join the stderr output stream to the stdout stream. Error messages will be mixed in with the job script standard output. If absent then standard error will go into job_name.ejob_id |
| -N name | Assigns a name to the job other than the name of the job script |
| -o file | Redirects the standard output to the named file |
| -S shell | Shell to interpret the job script: /bin/bash (default) or /bin/csh |
Every job must be submitted with a run time limit, h_rt. This is a hard limit, which means your job will be killed after it has been running for that length of time, so you should give yourself a margin of error. If you really don't know what run time to set, 48 hours is an acceptable choice. All other parameters are optional.
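Since h_rt accepts either seconds or hh:mm:ss, it can help to see how the two forms correspond. The sketch below converts one to the other; `hms_to_seconds` is a hypothetical helper for illustration, not a Grid Engine command.

```shell
#!/bin/sh
# hms_to_seconds is a hypothetical helper, not an SGE command.
# It converts a run time written as hh:mm:ss into a number of seconds.
hms_to_seconds() {
    # Split on ":" and let awk do the arithmetic; awk reads fields such
    # as "08" as decimal numbers.
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

hms_to_seconds 00:03:00   # prints: 180
hms_to_seconds 48:00:00   # prints: 172800
```

So the 48-hour fallback mentioned above can be written either as h_rt=48:00:00 or as h_rt=172800.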
There are three ways to set a parameter or supply an option to a job:
- With #$ directives inside the job script, as shown above
- With flags to qsub when the job is submitted
- With flags to qalter while the job is waiting to run
The second and third methods follow this pattern:
$ qsub -l h_rt=0:1:0 trivial.sh
Your job 7636 ("trivial.sh") has been submitted
$ qalter -l h_rt=0:2:0 7636
Options to the qsub command override any conflicting options set with directives inside the job script. So job 7636 in this example will initially have a run-time limit of one minute (0:1:0) regardless of what is given inside the script, and then two minutes after the qalter command. Also note that when using qsub the script name (and any arguments to the script) must appear after all the Grid Engine flags.
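Before overriding an option on the command line, it may be worth checking which directives a script already contains. The following sketch prints them; `list_directives` is a hypothetical helper, not a Grid Engine command, and it assumes each directive sits on its own line starting with #$.

```shell
#!/bin/sh
# list_directives is a hypothetical helper, not an SGE command.
# It prints the options found on "#$" lines of a job script, one per line.
list_directives() {
    sed -n 's/^#\$[[:space:]]*//p' "$1"
}

# Recreate part of the trivial example in a temporary file for demonstration.
script=$(mktemp)
cat > "$script" <<'EOF'
#$ -cwd
#$ -j y
#$ -l h_rt=00:03:00
echo Hello from inside a Grid Engine job
EOF

list_directives "$script"
rm -f "$script"
```

For this sample the helper prints -cwd, -j y and -l h_rt=00:03:00, each on its own line; any of these given as a flag to qsub would take precedence over the value in the script.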
Further reading
- Parallel Jobs
- Interactive Jobs
- Test Jobs
- Memory Management
- Managing Many Tasks
- Tips on writing Job Scripts
- Scheduling Policies and Mechanics, or, How It Works
- FAQ