Local Scratch

From ACENET
Main page: Storage System

Each compute node has its own disk (or, in some cases, solid-state storage) which is not shared with other compute nodes. We refer to this as local disk. When it is used to store temporary files for an individual job, we call that "local scratch storage".

Grid Engine and Local Scratch

Local scratch is not organized consistently across all clusters and hosts. In most cases it is in /scratch/tmp, but there are some hosts where /scratch/tmp doesn't exist or where another location is preferred. Grid Engine provides an environment variable TMPDIR which points to a local disk location which always exists and is unique to a job or a task in a job array, hence

$ cd $TMPDIR

should always succeed inside your submission script.

However, the amount of space in $TMPDIR varies from cluster to cluster and from host to host, so while it always exists, it may or may not be large enough for your purposes. In particular, the X6440 "Blade Servers" introduced in late 2009 at Placentia and Glooscap have small local scratch (Placentia nodes cl135-cl266, Glooscap nodes cl059-cl097). To see how much space is available,

$ qstat -F localscratch

will give you a list of all hosts and the local scratch space on each.

You can request localscratch from Grid Engine as a custom resource, much like you request time or memory. For example, a job submitted with:

$ qsub -l localscratch=10G job.script

will only be assigned to hosts with at least 10 gigabytes in $TMPDIR.

  • Obviously, the more accurately you can predict how much space your code needs, the better this will work: Ask for too much and your job might not schedule quickly (or at all); ask for too little and the job could fail.
  • As with other requestable resources like h_vmem, the space is allocated per process for a parallel job. See below for more about parallel jobs.
  • Grid Engine only knows about the total disk space in the filesystem --- used plus unused --- and about any other jobs which explicitly request localscratch. If other jobs write to the same filesystem without requesting localscratch then there could be less free space than requested.

For this last reason you might also wish to have your job script check the available space in order to avoid "File system full" or "No space left on device" errors. Here's a script fragment that prints the available space in $TMPDIR in gigabytes:

$ df --block-size=1G $TMPDIR | awk 'END {print $4}'

This can be used in a conditional:

#$ -S /bin/bash
scratchdir=$TMPDIR
# Free space in $TMPDIR, in whole gigabytes (fourth column of df's last line).
freespace=$(df --block-size=1G "$scratchdir" | awk 'END {print $4}')
if (( freespace < 10 )); then
    echo "Not enough free space in TMPDIR $TMPDIR, using /nqs..."
    scratchdir=/nqs/$USER/$JOB_ID
    mkdir -p "$scratchdir"
fi

We strongly recommend that you use $TMPDIR and localscratch if you want to use node-local disk. $TMPDIR is unique to each job, and Grid Engine deletes the directory at the end of the job. Therefore your script should ensure that any output files written to $TMPDIR are copied to Main Storage before the end of the job script.
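The stage-in / compute / copy-back pattern might be sketched as below. The directory and file names (myproject, input.dat, output.dat) are hypothetical, and the /tmp fallback exists only so the fragment can run outside a Grid Engine job:

```shell
#$ -S /bin/bash
# Inside a job, $TMPDIR is set by Grid Engine; the fallback is for testing only.
scratchdir=${TMPDIR:-/tmp/scratch.$$}
mkdir -p "$scratchdir"

# Hypothetical project directory on Main Storage holding input.dat.
workdir=$HOME/myproject

# Stage input onto fast local disk (guarded: input.dat may not exist here).
cp "$workdir/input.dat" "$scratchdir/" 2>/dev/null || true
cd "$scratchdir"

# ... run your program here, writing output.dat to the current directory ...

# Copy results back BEFORE the script ends: Grid Engine deletes $TMPDIR.
cp output.dat "$workdir/" 2>/dev/null || true
```

The key point is the order of operations: the copy-back must happen inside the job script itself, since $TMPDIR is gone once the job ends.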

If you choose not to use $TMPDIR you must:

  1. check for the existence of /scratch/tmp;
  2. create a subdirectory with your username, /scratch/tmp/$USER, or Grid Engine job number, /scratch/tmp/$JOB_ID;
  3. ensure at the end of the job that the directory is cleaned up and deleted.
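Those three steps can be sketched as follows, assuming /scratch/tmp exists (a /tmp fallback is included only so the fragment can be exercised elsewhere). The trap runs the cleanup even if the job body fails:

```shell
#$ -S /bin/bash
# Step 1: check for /scratch/tmp; fall back to /tmp only for testing.
base=/scratch/tmp
[ -d "$base" ] || base=/tmp

# Step 2: a per-user, per-job subdirectory ($$ stands in for $JOB_ID
# when the script runs outside Grid Engine).
scratchdir=$base/${USER:-$(id -un)}/${JOB_ID:-manual.$$}
mkdir -p "$scratchdir"

# Step 3: delete the directory when the script exits, even after an error.
cleanup() { rm -rf "$scratchdir"; }
trap cleanup EXIT

# ... job body: read and write files under $scratchdir ...
```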

It can be tricky to write job scripts that clean up all files under every possible failure mode. You should therefore also check your local scratch directories by hand from time to time, to ensure that the space is not occupied by stale files from failed or finished jobs.

Multiple hosts

Grid Engine only creates $TMPDIR on the master or "shepherd" host for a job, so strictly speaking the comments above apply to serial jobs, or to jobs where all processes run on a single host (e.g. Gaussian). If you run your application with the Open MPI library, note that Open MPI uses $TMPDIR for its own purposes and therefore creates the directory on any assigned host where it is missing, effectively providing $TMPDIR on all hosts in the job. Nonetheless, if your application expects each process to read and write local disk on different hosts, you need to manage things much more carefully.

The hosts attached to a parallel job are described in the hostfile. The name of the hostfile is given to the job script in the environment variable PE_HOSTFILE. You can, for example, fill a shell variable with a list of hostnames with code like this:

hostlist=`awk '{print $1}' $PE_HOSTFILE`
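Each hostfile line holds a hostname, a slot count, a queue instance, and a processor range. To see the parsing in action you can feed awk a hand-written hostfile; the hostnames and queue below are illustrative, not taken from a real job:

```shell
# Simulate a Grid Engine hostfile (format: host slots queue processor-range).
PE_HOSTFILE=$(mktemp)
cat > "$PE_HOSTFILE" <<'EOF'
cl001 4 short.q@cl001 <NULL>
cl002 4 short.q@cl002 <NULL>
EOF

# The first column on each line is the hostname.
hostlist=$(awk '{print $1}' "$PE_HOSTFILE")
echo $hostlist        # prints: cl001 cl002
rm -f "$PE_HOSTFILE"
```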

The Grid Engine requestable resource localscratch can and should be used with parallel jobs. While $TMPDIR is only created on the master host, requesting localscratch makes the scheduler check for sufficient disk space on the same filesystem (typically /scratch/tmp) on every host before placing the job. There is no reason you couldn't create and destroy directories named $TMPDIR on the worker hosts to match the one on the master host:

#$ -l h_rt=0:1:0,test=true
#$ -pe ompi* 2

hostlist=$(awk '{print $1}' $PE_HOSTFILE)
echo "------ host ------- GB free"
for host in $hostlist; do
   ssh $host mkdir -p $TMPDIR
   freespace=$(ssh $host df --block-size=1G $TMPDIR | awk 'END {print $4}')
   echo "$host   $freespace"
done
# .... do some computing here ....
# then clean up:
for host in $hostlist; do
   ssh $host rm -r $TMPDIR
done

To further complicate matters, Open MPI does create $TMPDIR on each subordinate host when mpirun is invoked, but it also destroys those directories when the child MPI processes complete, so do not count on files surviving there after mpirun returns.