Managing Many Tasks

From ACEnet
Jump to: navigation, search
Main page: Job Control

Many interesting problems can be addressed with a large number of independent computations. Examples include:

  • parameter scanning
  • statistical or "Monte Carlo" sampling
  • image processing (multiple images)

This page talks about how to handle problems which can be broken into chunks which do not need to communicate with each other during the computation. Terms such as "perfectly parallel", "embarrassingly parallel", "parametric parallelism", "serial farming", and "high throughput computing" have been applied.

Array jobs

An arbitrary number of tasks which differ only in some parameter can be conveniently submitted as an array job. An array job has a single JOB_ID, but in addition also has a number of tasks numbered according to the range you supply with the -t parameter. For example:

$ qsub -t 1-10 runme
Your job-array 54321.1-10:1 ("runme") has been submitted

Job 54321 will be scheduled as 10 independent tasks which may start at different times on different hosts. The individual tasks are differentiated by the value of an environment variable $SGE_TASK_ID. The script can reference $SGE_TASK_ID to select an input file, for example, or to set a command-line argument for the application code:

my_app <input.$SGE_TASK_ID
my_app $SGE_TASK_ID some_arg another_arg 0

Click here for more on array jobs, with examples. Still more examples can be found in this presentation.

Using an array job instead of a large number of separate serial jobs has advantages for other users and the system as a whole. A waiting array job only produces one line of output in qstat or showq, making these tools easier to use. The scheduler does not have to analyze job requirements for all the array tasks separately, so it runs more efficiently too.

Job dependency list

You might have a series of jobs to be run in a specific order. For example, one program might generate a file which will be used as input to another program, or a later run of the same program. With Grid Engine you can define a job dependency list with the option -hold_jid. This option can be specified in the command line with the qsub command or in the submission script like so:

$ qsub -hold_jid job1 job2 

The job dependency list consists of comma-separated job IDs, job names, or job name patterns. You can get more information from man qsub - search for "hold_jid".

For example, assume that job3 should be run only after both job1 and job2 have completed. This is how it can be accomplished using command line options:

$ qsub job1
Your job 6789 ("job1") has been submitted
$ qsub job2
Your job 6790 ("job2") has been submitted
$ qsub -hold_jid job1,job2 job3   # or qsub -hold_jid 6789,6790 job3

Grid Engine will put job3 on hold, marking it with an "h" in the qstat output. Once the first two are finished, job3 can be scheduled.

One can use qstat -j to get detailed information on the job's predecessors required to complete.

IMPORTANT
If a referenced job fails while running (e.g. by exceeding its h_rt limit), it does not prevent the dependent jobs from executing. The eligibility criterion for the dependent jobs is the mere fact of completion of the referenced jobs — successful or not. However, if the referenced job failed because it could not get started, then Grid Engine will recognize this, and the job will be marked with the 'E' symbol indicating the error state. A job in the error state has exit_code 100, and according to the documentation (man qsub), "If any of the referenced jobs exits with exit code 100, the submitted job will remain ineligible for execution".
Resources
User Support
News and Events
Organization
About Us