Parallel Jobs

From ACENET
Main page: Job Control

Parallel environments

A parallel environment is essentially a collection of CPU slots, along with suitable set-up and tear-down code, for running a certain class of parallel program. The parallel environment you request for a job dictates how Grid Engine will start and stop the job and how it will distribute the processes to nodes.

There are three principal groups of parallel environments offered at ACENET:

Parallel environment   Description
openmp                 For jobs assigned to a single host (including shared-memory OpenMP and MPI jobs)
ompi*                  For distributed jobs, e.g. MPI
4per*, 16per*          For distributed jobs that require a fixed number of cores per host

Other parallel environments may be available. These can be listed with qconf -spl, but we do not recommend you use these unless they are specifically mentioned on another wiki page (e.g. Gaussian) or you have consulted with a Computational Research Consultant.

OpenMP or threaded jobs

Since OpenMP is for shared memory programming, OpenMP programs must execute on only one node; they cannot span nodes like MPI programs do. Consequently, they need a distinct parallel environment under Grid Engine, requested with -pe openmp np, where np is the number of slots. The maximum number of slots you can request with this parallel environment on our clusters is 16.

#$ -cwd
#$ -l h_rt=01:00:00
#$ -pe openmp 4

export OMP_NUM_THREADS=$NSLOTS
./openmp_job

Grid Engine provides an environment variable, NSLOTS, which is always set to the number of parallel slots your job has been assigned --- four slots in this example. It's best practice always to use $NSLOTS to set OMP_NUM_THREADS as shown above.
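If the same script is sometimes run outside Grid Engine (for a quick interactive test, say), NSLOTS will not be set. Here is a minimal sketch of a fallback, assuming a default of one thread is acceptable in that case. The helper name threads_for_job is our own illustration and is not part of Grid Engine:

```shell
#$ -cwd
#$ -l h_rt=01:00:00
#$ -pe openmp 4

# threads_for_job is a hypothetical helper, not part of Grid Engine.
# It prints $NSLOTS when Grid Engine has set it, and 1 otherwise.
threads_for_job() {
    echo "${NSLOTS:-1}"
}

OMP_NUM_THREADS=$(threads_for_job)
export OMP_NUM_THREADS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

Under Grid Engine the behaviour is the same as the script above; the fallback only matters when the script is started by hand.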

Some applications may be multi-threaded but not use OpenMP. Typically such an application will let you control the number of threads with a command line parameter. Use such a parameter in combination with $NSLOTS, like this:

#$ -cwd
#$ -l h_rt=01:00:00
#$ -pe openmp 4

./threaded_application -threads $NSLOTS

Note
You can also run an MPI program with the openmp parallel environment if you want all processes to reside on the same host.

MPI jobs

Here is an example script that runs a 16-process MPI application mpi_app from the current directory:

#$ -cwd
#$ -l h_rt=01:00:00
#$ -pe ompi* 16

mpirun mpi_app

There is no need to specify the list of hosts and/or the number of processes for the mpirun command because Open MPI will obtain this information directly from Grid Engine.
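If you want to see which hosts and slots a job was actually given, Grid Engine writes the allocation to the file named by the PE_HOSTFILE environment variable, one host per line, with the host name in the first column and the slot count in the second. The following is only a sketch for logging that allocation from inside a job script; the function name summarise_allocation is our own, and the snippet simply reports when it is not running under a parallel environment:

```shell
# summarise_allocation is a hypothetical helper, not part of Grid Engine.
# It expects a PE hostfile whose lines look like:
#   <host> <slots> <queue> <processor range>
summarise_allocation() {
    awk '{ printf "%s: %s slot(s)\n", $1, $2; total += $2 }
         END { printf "total: %d slot(s)\n", total }' "$1"
}

if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    summarise_allocation "$PE_HOSTFILE"
else
    echo "PE_HOSTFILE is not set; not running under a parallel environment"
fi
```

Placing this before the mpirun line records the layout in the job's output file, which can help when diagnosing performance differences between runs.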

You can use the openmp environment if you want all the MPI processes to run on a single host.

If you submit an MPI job that requests a lot of slots (e.g. more than 16), you are encouraged to turn on the Reservation option:

#$ -R yes

For more on how Reservation works please see the page on Scheduling Policies and Mechanics.

IMPORTANT
In some cases (whether due to cluster design changes, reconfigurations, or hardware limitations at some sites) an MPI job cannot span all nodes, so there may be a separate parallel environment for each block of nodes. The names we assign to those parallel environments always start with ompi. That is why it is important to use a wildcard, as in ompi*, so that your job can run on any part of the cluster. On the command line, take care to escape the wildcard or quote it, like so:
$ qsub -pe 'ompi*' 16 jobscript
Please note that at Placentia, using ompi* may result in placing your job on the Green ACENET resources that are not equipped with InfiniBand. In order to avoid this, do not use the trailing wildcard character, e.g. -pe ompi 16.
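The reason for quoting on the command line is ordinary shell filename expansion: an unquoted ompi* is replaced by any matching file names in the current directory before qsub ever sees it. The sketch below illustrates this in a throwaway directory. The file name ompi_results.txt is made up for the demonstration:

```shell
# Create a scratch directory containing one file whose name matches ompi*.
demo_dir=$(mktemp -d)
touch "$demo_dir/ompi_results.txt"

# Each command substitution runs in its own subshell inside the scratch directory.
unquoted=$(cd "$demo_dir" && echo ompi*)    # the glob is expanded by the shell
quoted=$(cd "$demo_dir" && echo 'ompi*')    # single quotes keep it literal
escaped=$(cd "$demo_dir" && echo ompi\*)    # a backslash keeps it literal too

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
echo "escaped:  $escaped"

rm -rf "$demo_dir"
```

When no file matches, bash by default passes the pattern through unchanged, which is why an unquoted wildcard can appear to work until a matching file turns up in the submission directory.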

Large jobs and memory fragmentation

Large MPI jobs, typically those with 64 slots or more, may be prone to apparently random failures due to memory fragmentation. Read more about it on the Memory Management page.

4per* and 16per* parallel environments

$ qsub -pe 4per\* 32 jobscript
$ qsub -pe 16per\* 32 jobscript

These parallel environments will guarantee that your job receives exactly 4 or 16 slots per host, respectively. You must request an integral multiple of 4 or 16 slots.
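A quick way to check a slot request before submitting is to test that it divides evenly by the per-host count. This is a minimal sketch; the function name hosts_for_request is hypothetical and the check is plain shell arithmetic, not something Grid Engine provides:

```shell
# hosts_for_request PER_HOST SLOTS
# Hypothetical helper: prints the number of hosts the request implies,
# or fails if SLOTS is not an integral multiple of PER_HOST.
hosts_for_request() {
    per_host=$1
    slots=$2
    if [ $((slots % per_host)) -ne 0 ]; then
        echo "error: $slots is not a multiple of $per_host" >&2
        return 1
    fi
    echo $((slots / per_host))
}

hosts_for_request 4 32     # 32 slots at 4 per host
hosts_for_request 16 32    # 32 slots at 16 per host
```

For the two qsub commands above, the request for 32 slots corresponds to 8 hosts with 4per* and 2 hosts with 16per*.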

Using one of these parallel environments constrains the scheduler more strongly than using ompi* and so will typically result in a longer wait for the job to schedule. They are therefore recommended only if you have a good reason to use them. One such reason might be if you are running very large MPI jobs and per-connection memory overhead is a problem.

There is no general way to request the exclusive use of entire hosts from Grid Engine; -pe 16per* and -pe openmp are the best you can do:

  • At sites built entirely from 4-core machines (Mahone), 4per* will also give you exclusive use of complete hosts.
  • At sites where there are 4-core as well as 16-core machines (Placentia, Fundy, Glooscap), if you use 4per*, the scheduler may assign your job to a 16-core machine; if you use 16per*, the scheduler will not be able to assign your job to 4-core nodes.

Variable process counts

You can ask Grid Engine for "as many slots as are available" (within reason) by using a range descriptor instead of a simple number for the slot count.

#$ -pe ompi* 8,16

This example will schedule when the system can provide either 16 or 8 slots. Grid Engine tries the largest number first, but if it can schedule the job sooner with the smaller number of slots it will do that.

The range descriptor can also be a range with a hyphen rather than a list:

#$ -pe openmp 4-16

This example requests anywhere from 4 to 16 slots on a single host: as many slots as possible, but without waiting for a larger number to become available.
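With a range request the slot count is only known once the job starts, so anything derived from it should be computed from $NSLOTS at run time rather than hard-coded. Here is a sketch, assuming a hypothetical workload of 100 independent items to be divided among the slots. The names items_per_slot and TOTAL_ITEMS are illustrative only:

```shell
#$ -cwd
#$ -l h_rt=01:00:00
#$ -pe openmp 4-16

# items_per_slot TOTAL SLOTS
# Hypothetical helper: prints TOTAL divided by SLOTS, rounded up.
items_per_slot() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

TOTAL_ITEMS=100            # assumed workload size, for illustration
slots=${NSLOTS:-4}         # fall back to the range minimum outside Grid Engine

echo "Granted $slots slot(s); up to $(items_per_slot "$TOTAL_ITEMS" "$slots") item(s) per slot"
```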
