Job Scripts

From ACENET
Main page: Job Control

A job script is merely a shell script with (optionally) some Grid Engine directives among the comment lines. It may also use certain environment variables which are set specially by Grid Engine. By default the script is interpreted with /bin/bash, but the user can change this with the -S directive.
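For illustration, a minimal job script might look like the following; the job name and resource request are placeholders, not site policy:

```shell
#!/bin/bash
# Grid Engine directives are comment lines beginning with #$
#$ -cwd                # run from the directory the job was submitted from
#$ -N demo             # job name (hypothetical)
#$ -l h_rt=1:0:0       # illustrative run-time request

# the body is ordinary shell
echo "Running on $(hostname)"
```

The `#$` lines are ignored by the shell but read by Grid Engine when the script is submitted with qsub.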

Below are comments on a few aspects of job scripts.

Troubleshooting

If you are writing a script any more complicated than the simple examples shown under Job Control, you may find it necessary to debug it. Here are a few hints:

  • Dump diagnostic information about the job environment by including commands like
which command      # confirm which executable 'command' would actually run
echo $JOB_ID       # numeric job ID assigned by Grid Engine
echo $PATH         # search path as the job sees it
hostname           # name of the execution host
cat $PE_HOSTFILE   # host list for parallel-environment (PE) jobs
  • Use the test queue to run tests with rapid turnaround.
  • Get an interactive session, perhaps in the test queue, and try stepping through your script manually.

Environment variables

The "-V" parameter is enforced for every job, meaning that all the current environment variables from your login shell will be exported to your jobs automatically.

If you are using bash as an execution shell in your submission script (the default) then all the environment variables set when you run qsub will be seen in the job context.

However, if you are using csh as the execution shell in your submission script (selected with the -S option), you should expect your ~/.cshrc to be read anew: by default csh sources it for both login and non-login shells (see man csh). This may overwrite some of the important variables you intended to export from your login shell via the enforced -V.
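One way to protect against this is to guard the environment-modifying parts of ~/.cshrc so they run only in interactive shells. This is a sketch only: the PATH line stands in for whatever your own ~/.cshrc sets, and it relies on $JOB_ID being defined by Grid Engine inside batch jobs:

```csh
# excerpt for ~/.cshrc -- csh syntax, sketch only
if (! $?JOB_ID) then
    # not inside a batch job: safe to (re)set the environment
    setenv PATH ${HOME}/bin:${PATH}
endif
# inside a batch job $JOB_ID is defined, so the block above is
# skipped and the variables exported via -V are left alone
```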

Checkpointing

Even though we endeavor to keep our systems running as smoothly as possible, unplanned outages occur. The weather and the power grid are beyond our control, and this is a shared resource for research computing. Somebody is always trying something new, and despite our best efforts occasionally it goes wrong in a way that impinges upon other users.

So you are advised never to let a job run for too long without "checkpointing", or saving state. In order not to waste your research time and the community resource, please ensure that every so often your program writes sufficient data to disk that you can restart lengthy calculations somewhere in the middle.

As an added incentive, the queue policies are intentionally biased in favour of short jobs. If you can break your work up into short-running chunks you will get faster and more predictable access to CPUs, as will all other researchers sharing the resource with you.

It is possible to write a job script that will submit its own followup job. This is a form of checkpointing that has certain advantages for both you the job owner and for other users on the system:

  • If the total computation will take longer than 48 hours but can be broken by checkpointing into individual jobs which take less, then you can get access to the generous short.q resources. This may result in faster time-to-results than you would get by waiting for space in medium.q or long.q.
  • Fewer distinct jobs sit in the system than if you submitted a chain of jobs with a dependency list for the same purpose, which reduces the load on the scheduler.

Such a self-propagating job script carries the risk of a runaway chain of jobs, which must be carefully guarded against. Please consult a Computational Research Consultant for help with this or other checkpointing strategies.
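Such a script might be sketched as follows. The names STEP and MAX_STEPS and the checkpoint logic are hypothetical; qsub -v, which passes an environment variable to the submitted job, is standard Grid Engine:

```shell
#!/bin/bash
#$ -cwd
# Self-resubmitting job sketch with a guard against a runaway chain.

STEP=${STEP:-1}   # chunk number; the follow-up job receives it via qsub -v
MAX_STEPS=10      # hard ceiling: the chain stops here no matter what

echo "Running chunk $STEP of at most $MAX_STEPS"
# ... do one chunk of work here, saving restart data to disk ...

if [ "$STEP" -lt "$MAX_STEPS" ]; then
    # submit the follow-up chunk before exiting
    qsub -v STEP=$((STEP + 1)) "$0" || echo "resubmission failed"
fi
```

The hard MAX_STEPS ceiling is the essential safeguard: without it, a bug in the "is the work finished?" test could resubmit forever.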