Storage System

Main page: User Guide

The storage implemented by ACEnet is Sun's SAM-QFS, a hierarchical storage system using RAID 5 technology. The objective of a hierarchical storage system is large storage volume at low cost, while not sacrificing speed of access. The objectives of the RAID 5 technology are speed and fault tolerance.

Although there are features available in SAM-QFS that allow for backup and recovery of user data, SAM-QFS is not configured as a backup system at ACEnet! ACEnet endeavours to protect users from the effects of hardware failures, but we do not protect from accidental overwriting or deletion.

Because ACEnet does not provide backup services, users are strongly encouraged to make off-site (or multi-site) copies of their critical data and to observe their home institution's data storage policies. Some institutions offer network backup facilities that you may be able to use.

Layout

There are four areas of disk space available to the user on most ACEnet clusters: two areas are part of the Permanent Storage, and two others are in Temporary Storage. The general outline of the ACEnet storage system is given below.

Permanent Storage system on every cluster

| Name | Location | Function | Resource type |
|---|---|---|---|
| Main (home) | /home/<username> | critical data and code | shared |
| Global Scratch (working dir) | /globalscratch/<username> | working data, large volumes of data | shared |

On Glooscap, Global Scratch does not exist.

Temporary Storage system on every cluster

| Name | Location | Function | Resource type |
|---|---|---|---|
| No-quota Scratch | /nqs/<username> | temporary data, large data | shared |
| Local Scratch | /scratch/tmp | temporary data, fast read/write access data | node-local |

Permanent Storage

Main Storage

Home Directory

Main storage is your personal and permanent space for research-critical data and code. This is where you should put your data prior to and after computations, and where you should keep source code and executables. It is located in /home/<username>, where your username will replace <username>. When you log in, this is your current working directory. You may create whatever subdirectories you like here. The Main storage is a networked storage shared among all compute nodes via NFS (Network File System).

How it works
Data is first written to SATA disk storage. Eight hours after a file's last modification, a copy is made to the tape library. After 2 months of inactivity (the file is neither modified nor read), the copy on SATA disk is released and the inode points directly to the tape cartridge; from that point the file resides only on tape. If usage of the SATA-based filesystem reaches the "high water mark", currently defined as 80% full, the system looks for files that have been copied to tape but not yet released and releases them from the SATA disks, continuing until usage drops to the "low water mark" of 50% full.

Global Scratch

Working Directory

Global Scratch is found in /globalscratch/<username>. It is designed as a storage volume for working data and data that is required by computations. The name 'scratch' is an unfortunate historical artifact and does not imply that this is temporary storage. We do not delete or "clean" any user data from this volume. It enjoys the same level of protection as Main Storage. Global Scratch is considerably larger than Main Storage and is shared among all compute nodes via NFS. You may already have a symlink named scratch in your home directory pointing to this location; if not, you can create one yourself with the following commands:

$ cd
$ ln -s /globalscratch/<username> scratch

How it works
Data is first written to the fast (FC) storage. Three days after a file's last modification, a copy is made to the slower SATA disk storage; six days after the last modification, a copy is made to the tape library. After two weeks of inactivity (the file is neither modified nor read), the copy on the fast (FC) disk is released and the inode points directly to the copy on the SATA disk. After 8-16 weeks of inactivity, the copy on the SATA disk is recycled and the inode points only to the copy on tape; from that point the file resides only on tape. If usage of the FC-based filesystem reaches the "high water mark", currently defined as 80% full, the system looks for files that have been copied but not yet released and releases them from the FC disks, continuing until usage drops to the "low water mark" of 50% full.

Quotas

Storage quotas are implemented at all clusters except Courtenay. These are the default values:

| Location | Limit type | Brasdor | Fundy | Mahone | Placentia | Glooscap |
|---|---|---|---|---|---|---|
| /home/<username> | per user quota | 47 GB | 47 GB | 13 GB | 13 GB | 61 GB |
| /globalscratch/<username> | per user quota | 238 GB | 238 GB | 238 GB | 238 GB | -- |

The quota covers both "online" (disk) and "offline" (tape) storage. Your usage and limit information can be found with the command quota. You can also use du to determine how much online space your files occupy:

$ du -h --max-depth=1 /home/<username>/
$ du -h --max-depth=1 /globalscratch/<username>/

Living within your quota:

  • The du command will report only those files that have not been released to the second level disks or tapes (see "How it works" above), while the quota is set for all of your files.
  • You can see your offline usage with quota, or with the web app https://webmo.ace-net.ca/uqs/login.pl.
  • The disk allocation unit (DAU) ranges from 32K to 2.7M at different sites and storage areas. The DAU is the smallest unit of disk that a file can occupy, so this number can affect your total storage usage strongly if you have a large number of files smaller than the DAU.
  • In current versions of SAM-QFS, a file that is newly created or appended to may have a much larger footprint than the DAU for about 30 seconds after the creation or extension. If your application creates a large number of small files very rapidly you might find that you have to introduce a delay into the process to avoid running over quota.
  • Where the table below gives two numbers for the DAU, in the form X (Y) KB, the first 8 blocks of a file are X KB each and all remaining blocks are Y KB each. For example, a 97 KB file in /home at Brasdor occupies 8*4+64+64=160 KB of disk space. This dual DAU saves disk space when working with many small files.

Disk Allocation Unit (DAU) sizes

| Location | Brasdor | Fundy | Mahone | Placentia | Glooscap |
|---|---|---|---|---|---|
| /home | 4 (64) KB | 2.7 MB | 4 (32) KB | 4 (64) KB | 4 (64) KB |
| /globalscratch | 1.5 MB | 1.6 MB | 192 KB | 2.6 MB | -- |
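
The dual-DAU arithmetic described above can be checked with a small shell function. This is only a sketch: dau_footprint_kb is a hypothetical helper name, not an ACEnet or SAM-QFS tool, and it assumes that a file smaller than eight small blocks occupies only as many small blocks as it needs.

```shell
# Estimate the on-disk footprint of one file under a dual-DAU layout.
# Arguments: file size in KB, small DAU in KB, large DAU in KB.
# dau_footprint_kb is a hypothetical helper, not an ACEnet or SAM-QFS tool.
dau_footprint_kb() {
    size_kb=$1; small_kb=$2; large_kb=$3
    small_total=$((8 * small_kb))      # capacity of the first 8 blocks
    if [ "$size_kb" -le "$small_total" ]; then
        # Assumption: only the small blocks actually needed are allocated.
        blocks=$(( (size_kb + small_kb - 1) / small_kb ))
        echo $(( blocks * small_kb ))
    else
        # All 8 small blocks, plus large blocks for the remainder (rounded up).
        rest_kb=$(( size_kb - small_total ))
        blocks=$(( (rest_kb + large_kb - 1) / large_kb ))
        echo $(( small_total + blocks * large_kb ))
    fi
}

dau_footprint_kb 97 4 64    # the 97 KB Brasdor /home example; prints 160
dau_footprint_kb 3 4 64     # a 3 KB file; prints 4
```

Use the quota command for your real usage; this estimate only shows why many small files can add up to more than their nominal size.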

Temporary Storage

No-quota Scratch

No-quota Scratch (NQS) is temporary network storage that has no per-user quota limit, but gets cleaned periodically to get rid of old files. It's available at /nqs/<username>/ on every cluster to users who have requested access to it.

Note
If you want to use NQS, contact support stating that you understand the terms of use and would like NQS turned on. If you also want to be notified before your files are deleted, ask for notification to be turned on and tell us where those emails should be sent.

NQS is designed to let you store large amounts of data on a temporary basis, for example files generated and consumed during a single job that cannot be stored on Main Storage or Global Scratch because of the per-user quotas. It is not a hierarchical storage system; it consists only of disk drives. Because no quotas are enforced, there is an irreducible risk that the filesystem will fill up. Should that occur, existing data on /nqs may be unrecoverable, which makes NQS unsuitable for storing critical data. Long-term storage of data (critical or not) is also inappropriate, since it increases the risk of the filesystem filling up during its intended use.

You are expected to delete your files from /nqs once the associated job or jobs are complete. Technical staff also reserve the right to delete files manually in the event of a manifest risk of a fill-up emergency.

To ensure that these guidelines are followed and /nqs/ stays usable for its intended purpose, files which have not been accessed for 31 days are automatically deleted. The deletion routine will notify you seven days before removing any of your files if you keep a file named /home/<username>/.nqs with these contents:

U_EMAIL=user@some.address.foo
U_QUIET=no

Size of /nqs at each site

| Mahone | Brasdor | Placentia | Glooscap | Courtenay | Fundy |
|---|---|---|---|---|---|
| 13T | 13T | 13T | 2.7T | 2T | 13T |

If you want to check how much space is used or available in NQS then use the following command:

$ df -h /nqs/$USER

To examine the last access time of your files:

$ ls -lu /nqs/$USER     # in the given directory
$ ls -luR /nqs/$USER    # in subdirectories too, recursively

To recursively find files that have not been accessed for more than, for example, 24 days:

$ find /nqs/$USER -type f -atime +24
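
As a sketch of how the find command above could be extended, the function below lists stale files, oldest access first, with a total at the end. The name nqs_stale is hypothetical, the 24-day default is only the example threshold used above, and -printf requires GNU find.

```shell
# List files under a directory that have not been accessed for more than
# a given number of days, oldest access first, with a total at the end.
# nqs_stale is a hypothetical helper; -printf requires GNU find.
nqs_stale() {
    dir=$1
    days=${2:-24}
    # One line per file: access date, size in bytes, path (tab-separated).
    find "$dir" -type f -atime +"$days" -printf '%AY-%Am-%Ad\t%s\t%p\n' |
        sort |
        awk -F'\t' '{ print; total += $2 }
                    END { printf "%d file(s), %d bytes\n", NR, total }'
}

# Example use (not run here):
#   nqs_stale "/nqs/$USER" 24
```

File names containing tabs or newlines would confuse the summary, so treat the totals as a guide only.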

Local Scratch

Main page: Local Scratch

Each compute node has its own disk (or in some cases, solid state memory) which is not shared with other compute nodes. We refer to this as local disk. If it is used to store temporary files for an individual job, then we refer to that as "local scratch storage".

Local scratch is not organized consistently across all clusters and hosts. In most cases it is found in /scratch/tmp, but there are some hosts where /scratch/tmp doesn't exist. Grid Engine provides an environment variable TMPDIR which points to a local disk location which always exists, hence

$ cd $TMPDIR

should always succeed inside your submission script.

The size of local scratch space varies from cluster to cluster and from host to host. In particular the X6440 "Blade Servers" introduced in late 2009 have small local scratch. You may wish to have your job script check the size before it decides where to write scratch files, in order to avoid "File system full" errors. Here's a script fragment that prints the available space in $TMPDIR in kilobytes:

$ df -P --block-size=1024 "$TMPDIR" | awk 'END {print $4}'

Parallel users will want to be even more careful, since available space may vary from host to host within a single job.
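
The size check described above could be wrapped in a job script roughly as follows. This is a sketch under stated assumptions: free_kb and pick_scratch are hypothetical helper names, and the check only covers the host the script runs on, not the other hosts of a parallel job.

```shell
# Print the free space, in kilobytes, on the filesystem holding a directory.
# -P keeps each filesystem on one output line so the column count is stable.
free_kb() {
    df -P -k "$1" | awk 'END { print $4 }'
}

# Print the first candidate directory with at least the requested free space.
# pick_scratch is a hypothetical helper; it returns non-zero if none qualifies.
pick_scratch() {
    need_kb=$1
    shift
    for dir in "$@"; do
        if [ -d "$dir" ] && [ "$(free_kb "$dir")" -ge "$need_kb" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}

# Example use in a job script (not run here), needing about 5 GB:
#   workdir=$(pick_scratch 5000000 "$TMPDIR" "/globalscratch/$USER") || exit 1
```

Listing $TMPDIR first keeps the job on fast local disk when there is room and falls back to shared storage only when there is not.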

$TMPDIR is unique to each job, and Grid Engine deletes the directory at the termination of a job. We strongly recommend that you use $TMPDIR if you want to use node-local disk. If you do not use $TMPDIR we recommend that you

  1. check for the existence of /scratch/tmp;
  2. create a subdirectory with your username, /scratch/tmp/$USER, or Grid Engine job number, /scratch/tmp/$JOB_ID;
  3. ensure at the end of the job that the directory is cleaned up and deleted.

For a parallel job you should do this for each host in $PE_HOSTFILE.
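
For a single host, the three steps above might look like the sketch below. The function name make_job_scratch is hypothetical, and falling back to the username or shell PID when $JOB_ID is unset is an assumption made so that the fragment also works outside Grid Engine.

```shell
# Create a per-job directory under a node-local scratch base and print its path.
# make_job_scratch is a hypothetical helper; the base defaults to /scratch/tmp.
make_job_scratch() {
    base=${1:-/scratch/tmp}
    # Step 1: check that the base exists and is writable on this host.
    if [ ! -d "$base" ] || [ ! -w "$base" ]; then
        echo "no usable local scratch at $base" >&2
        return 1
    fi
    # Step 2: name the subdirectory after the job number, else the username.
    jobdir="$base/${JOB_ID:-${USER:-job.$$}}"
    mkdir -p "$jobdir" || return 1
    echo "$jobdir"
}

# Step 3, in the job script (not run here): remove the directory on exit.
#   SCRATCH=$(make_job_scratch) || exit 1
#   trap 'rm -rf "$SCRATCH"' EXIT
```

A job that is killed outright may never run the trap, so leftover directories can still occur.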

If you write output files to Local Scratch, your script should ensure that they are copied to Main Storage at the end of the job. If you write temporary files to Local Scratch, please ensure that they are deleted at the end of the job. You should also manually patrol your Local Scratch directories to ensure that the space is not occupied by outdated files from failed or finished jobs.
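
Copying results back before the job ends could be done with a fragment like the one below. It is a sketch only: stage_out is a hypothetical helper name, and the results path and program name in the usage comment are placeholders.

```shell
# Copy the contents of a scratch directory into a results directory on
# permanent storage. stage_out is a hypothetical helper name.
stage_out() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # "src/." copies the directory contents, including hidden files;
    # -p preserves timestamps and -R descends into subdirectories.
    cp -pR "$src"/. "$dest"/
}

# Example use at the end of a job script (not run here):
#   cd "$TMPDIR"
#   ./my_program > output.dat
#   stage_out "$TMPDIR" "$HOME/results/$JOB_ID" || echo "copy failed" >&2
```

The copy has to finish before the script exits, because Grid Engine deletes $TMPDIR when the job terminates.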
