Job Control

All production-class jobs must be submitted via the scheduler, which manages the available computing resources and assigns them to the waiting jobs. The scheduler used on all ACEnet clusters is Sun Grid Engine (SGE).

Main commands

The three most important SGE commands are:

qsub
Submits a batch job
qstat
Shows the status of jobs and queues
qdel
Deletes or kills a job

Submitting a simple job

Write a script that describes your job. Here's a trivial example:

 #$ -cwd
 #$ -j y
 #$ -l h_rt=00:03:00

 echo Hello from inside a Grid Engine job running on `hostname`
 echo Job beginning at `date`
 sleep 120
 echo Job ending at `date`

Note that this is just a shell script: a list of commands (echo, sleep) to be executed in order. The comment lines beginning with #$ pass extra information to the job scheduler; they are described under Parameters below. The default execution shell is bash unless you specify another with the -S option.

Save the script with some name, like trivial.sh. Then submit the script to the scheduler by typing

 $ qsub trivial.sh

The system will reply with something like

Your job 7635 ("trivial.sh") has been submitted

and queue up your job to wait its turn. The number is called the JOB_ID and will be different for every job.
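
If a wrapper script needs the JOB_ID later (for example, to hand it to qstat or qdel), one option is to pull it out of the reply text. This is a minimal sketch that assumes the reply has exactly the form shown above; the function name job_id_from_reply is hypothetical and not part of Grid Engine.

```shell
# Hypothetical helper: extract the numeric JOB_ID from qsub's reply text.
# Assumes the reply looks like:
#   Your job 7635 ("trivial.sh") has been submitted
job_id_from_reply() {
    # The job number is the third whitespace-separated field of the reply.
    printf '%s\n' "$1" | awk '/^Your job / { print $3 }'
}

# Typical use (needs a working qsub, so it is left commented out here):
#   reply=$(qsub trivial.sh)
#   job_id=$(job_id_from_reply "$reply")
```

If the reply does not match the expected form, the function prints nothing, so a caller can test for an empty result before going on.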

Monitoring jobs

How can you tell if your job has run? The usual way is with

$ qstat

The output from qstat is very wide and looks something like this:

job-ID  prior   name     user  state   submit/start at     queue          slots  ja-task-ID
-------------------------------------------------------------------------------------------
  7635  0.5404  trivial  jdoe    r     11/18/2011 23:16:10 short.q@cl061     1

While your job is waiting to run, the state column shows "qw". When it starts to run, the state changes to "r". When the job ends, whether it finished or crashed, it disappears from the list; qstat then returns nothing at all unless you have other jobs submitted.

When the job is done, the output appears in a file with a name like trivial.sh.o7635. The components of the output file name are the job name (trivial.sh in our example), the job id (7635), and between them ".o" for "output".
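
The state check described above can also be scripted. The sketch below reads qstat output on standard input and reports the state of one job; it assumes the column layout shown in the sample output (job-ID first, state fifth). The function name job_state and the word "gone" are hypothetical choices made for this illustration.

```shell
# Hypothetical helper: report the state of one job from qstat output on stdin.
# Assumes the layout shown in the sample: job-ID in column 1, state in column 5.
job_state() {
    awk -v id="$1" '
        $1 == id { print $5; found = 1 }       # job is listed: print its state
        END      { if (!found) print "gone" }  # not listed: finished or crashed
    '
}

# Typical use (needs qstat, so it is left commented out here):
#   qstat | job_state 7635
```

Once the function prints "gone", the job has left the queue and its output file (trivial.sh.o7635 in the example) is the place to look.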

There are other utilities that can show you some useful information about your jobs:

  • qsum, for a simplified view of the entire system load
  • showq, for insight into when your job might begin running
  • More on qstat, including the meaning of job status codes

Parameters

Complete parameter list: Sun Grid Engine

Here are the most commonly used job parameters, each option followed by its description:

-l h_rt=time
Run time limit, either in seconds or in hh:mm:ss format
-l h_vmem=mem
Hard virtual memory limit; the mem specifier may include k, K, m, M, g, G; details at man queue_conf
-cwd
Start the job script in the directory it was submitted from (the "current working directory"). If absent, the job starts in your home directory.
-j y
Join the standard error stream to the standard output stream, so error messages are mixed in with the job script's standard output. If absent, standard error goes to a separate file named like the output file, with ".e" in place of ".o" between the job name and the job id.
-N name
Assigns a name to the job other than the name of the job script
-o file
Redirects the standard output to the named file
-S shell
Shell to interpret the job script: /bin/bash (default) or /bin/csh

Every job must be submitted with a run time limit, h_rt. This is a hard limit, which means your job will be killed after it has been running for that length of time, so you should give yourself a margin of error. If you really don't know what run time to set, 48 hours is an acceptable choice. All other parameters are optional.
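
Since h_rt accepts either plain seconds or hh:mm:ss, it can help to check that the two forms agree before submitting. A minimal sketch; hms_to_seconds is a hypothetical helper written for this page, not a Grid Engine command.

```shell
# Hypothetical helper: convert an hh:mm:ss run time to a number of seconds.
# awk is used for the arithmetic so that fields with leading zeros
# (such as 08) are read as decimal numbers.
hms_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}
```

For example, hms_to_seconds 48:00:00 prints 172800, so -l h_rt=48:00:00 and -l h_rt=172800 request the same limit.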

There are three ways to set a parameter or supply an option to a job:

  1. With #$ directives inside the job script, as shown above
  2. With flags to qsub when the job is submitted
  3. With flags to qalter while the job is waiting to run

The second and third methods follow this pattern:

$ qsub -l h_rt=0:1:0 trivial.sh
Your job 7636 ("trivial.sh") has been submitted
$ qalter -l h_rt=0:2:0 7636

Options to the qsub command override any conflicting options set with directives inside the job script. So job 7636 in this example will initially have a run-time limit of one minute (0:1:0) regardless of what is given inside the script, and then two minutes after the qalter command. Also note that when using qsub the script name (and any arguments to the script) must appear after all the Grid Engine flags.
