Managing Many Tasks

Legacy documentation

This page describes a service provided by a retired ACENET system. Most ACENET services are now provided by national systems; please see https://docs.computecanada.ca.

Main page: Job Control

Many interesting problems can be addressed with a large number of independent computations. Examples include:

  • parameter scanning
  • statistical or "Monte Carlo" sampling
  • image processing (multiple images)

This page talks about how to handle problems which can be broken into chunks which do not need to communicate with each other during the computation. Terms such as "perfectly parallel", "embarrassingly parallel", "parametric parallelism", "serial farming", and "high throughput computing" have been applied.

In other words, this page is primarily for those users who are running serial jobs. It explains how to handle them efficiently without putting a great burden on the scheduler.

Array jobs

An arbitrary number of tasks which differ only in some parameter can be conveniently submitted as an array job. An array job has a single JOB_ID but also comprises a number of tasks, numbered according to the range you supply with the -t option. For example:

$ qsub -t 1-10 runme
Your job-array 54321.1-10:1 ("runme") has been submitted

Job 54321 will be scheduled as 10 independent tasks which may start at different times on different hosts. The individual tasks are differentiated by the value of the environment variable $SGE_TASK_ID. The script can reference $SGE_TASK_ID to select an input file, for example, or to set a command-line argument for the application code:

my_app < input.$SGE_TASK_ID                  # select an input file by task number
my_app $SGE_TASK_ID some_arg another_arg 0   # pass the task number as an argument
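
Putting this together, a complete submission script for the ten-task array above might look like the following minimal sketch; the program name my_app and the input files input.1 through input.10 are placeholders:

#$ -cwd
#$ -l h_rt=01:00:00
#$ -t 1-10

# Each task selects its own input and output file by its task number
./my_app < input.$SGE_TASK_ID > output.$SGE_TASK_ID

The -t range can be given as a directive in the script this way instead of on the qsub command line.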

See the separate page on array jobs for more details and examples. Still more examples can be found in this presentation.

Using an array job instead of a large number of separate serial jobs has advantages for other users and the system as a whole. A waiting array job only produces one line of output in qstat or showq, making these tools easier to use. The scheduler does not have to analyze job requirements for all the array tasks separately, so it runs more efficiently too.

The Grid Engine environment variable $TMPDIR will point to a local directory with a unique name for each task.
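
For example, each task could stage its files into that directory and copy its result back when done; this is a minimal sketch, with my_app and input.$SGE_TASK_ID again as placeholders:

# Copy this task's input and the program to the node-local scratch directory
cp input.$SGE_TASK_ID my_app $TMPDIR/
cd $TMPDIR
./my_app < input.$SGE_TASK_ID > output.$SGE_TASK_ID
# Copy the result back to the directory the job was submitted from
cp output.$SGE_TASK_ID $SGE_O_WORKDIR/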

Packing serial jobs

A number of serial jobs with the same expected run time can be packed together into one submission script and run concurrently on the same host. Here is a sample script that packs 3 serial jobs, starting them at 10-second intervals:

#$ -cwd
#$ -l h_rt=01:00:00
#$ -pe openmp 3

./serial_app1 < data1.dat &
sleep 10
./serial_app2 < data2.dat &
sleep 10
./serial_app3 < data3.dat &
wait

Note that every execution line ends with an ampersand (to run the binary in the background), which lets the shell continue parsing the script until it reaches the wait command; wait prevents the shell from exiting until all active child processes have completed. The result is concurrent execution of the three launched binaries.

We also recommend copying your input files to a local directory before launching your executables. Copying your data once before running multiple binaries is especially important if each run uses the same input files. This greatly reduces the load on the network and the storage system, and prevents your jobs from wasting wall-clock time waiting on I/O. Here is a sample script that you could adapt to your needs:

#$ -cwd
#$ -l h_rt=01:00:00
#$ -pe openmp 5

cd $TMPDIR/
cp $HOME/data*.dat ./
cp $HOME/serial_app* ./

./serial_app1 < data1.dat &
./serial_app2 < data2.dat &
./serial_app3 < data3.dat &
./serial_app4 < data4.dat &
./serial_app5 < data5.dat &
wait

It is also efficient to write results to $TMPDIR and copy them to network storage at the end of the job: a single large write across the network costs less than many small ones. But remember that Grid Engine automatically deletes the $TMPDIR directory once the job completes, so copy anything you need before the script exits.
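
For the packed job above, that pattern amounts to letting each program write its output in $TMPDIR and adding a copy step after wait. A sketch of the tail of such a script, where results*.dat is a placeholder for whatever your programs actually write:

# ... serial_app1 through serial_app5 running in $TMPDIR, writing results*.dat ...
wait
# Save the results to permanent storage before the job ends and $TMPDIR is removed
cp results*.dat $HOME/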

Job dependency list

You might have a series of jobs to be run in a specific order. For example, one program might generate a file which will be used as input to another program, or to a later run of the same program. With Grid Engine you can define a job dependency list with the option -hold_jid. This option can be specified on the command line with the qsub command, like so, or embedded in the submission script as a directive (see the sketch further below):

$ qsub -hold_jid job1 job2 

The job dependency list consists of comma-separated job IDs, job names, or job name patterns. In the example above, job2 will not start until job1 has completed. You can get more information from man qsub; search for "hold_jid".
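
The dependency can also be kept with the dependent job itself by embedding the option as a directive in its submission script; a minimal sketch, where post_process is a placeholder for the program that consumes the earlier jobs' output:

#$ -cwd
#$ -l h_rt=01:00:00
#$ -hold_jid job1,job2

./post_process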

For example, assume that job3 should be run only after both job1 and job2 have completed. This is how it can be accomplished using command line options:

$ qsub job1
Your job 6789 ("job1") has been submitted
$ qsub job2
Your job 6790 ("job2") has been submitted
$ qsub -hold_jid job1,job2 job3   # or qsub -hold_jid 6789,6790 job3

Grid Engine will put job3 on hold, marking it with an "h" in the qstat output. Once the first two are finished, job3 can be scheduled.

You can use qstat -j with the job ID to get detailed information on the predecessors that must complete before a held job becomes eligible (look for jid_predecessor_list in the output).
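
For example, with a hypothetical held job 6791:

$ qstat -j 6791 | grep jid_predecessor_list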

IMPORTANT
If a referenced job fails while running (e.g. by exceeding its h_rt limit), this does not prevent the dependent jobs from executing. The eligibility criterion for the dependent jobs is the mere fact that the referenced jobs have completed, successfully or not. However, if a referenced job failed because it could not be started at all, Grid Engine will recognize this and mark the job with an 'E' indicating the error state. A job in the error state has exit code 100, and according to the documentation (man qsub), "If any of the referenced jobs exits with exit code 100, the submitted job will remain ineligible for execution".