Memory Management

Legacy documentation

This page describes a service provided by a retired ACENET system. Most ACENET services are now provided by national systems; please see https://docs.computecanada.ca for current documentation.


Grid Engine

The memory usage of every job is managed by Grid Engine, which sets a hard limit on the maximum amount of memory the job can use. Grid Engine relies on the memory-resource controls provided by the shell. The limit is set on the job's shell and all processes started by it, including the Grid Engine infrastructure for starting the job, as well as any auxiliary processes of the job itself, for example those of an MPI job.

Users can request memory resources (i.e. set limits) via Grid Engine parameters in their submission scripts. Such a parameter specifies the amount of memory required per slot (or CPU core). Grid Engine uses it to calculate the actual value for every host assigned to a job, and enforces that value by setting the shell limits. So, if a parallel job is split across several nodes, the memory reserved on each node is the per-slot limit (requested in the submission script) times the number of slots assigned to that node (see Examples below). While users can set the former, they may not always be able to control the latter.

Virtual memory limit (h_vmem)

The amount of virtual memory available to a job is requested with the h_vmem parameter in a submission script. Using this parameter as described above, Grid Engine sets a hard limit on the maximum amount of virtual memory available on every host assigned to the job.

A job that exceeds its memory allocation on a node may produce no output, or its output may be silently truncated partway through, because the job is terminated by its shell. See the section below on how to use qacct to analyze virtual memory usage.

If a user does not specify h_vmem in their submission script, then the job receives a default allocation of memory per slot, which varies from cluster to cluster, and is given in the following table:

Cluster     Default h_vmem
fundy       2 GB
glooscap    1 GB
mahone      2 GB
placentia   2 GB
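
To request more (or less) than the cluster default, specify h_vmem explicitly in your submission script. As a minimal sketch, this hypothetical serial job asks for 4 GB of virtual memory (the 4G figure and the program name serial_code are illustrative):

#$ -cwd
#$ -l h_rt=1:0:0
#$ -l h_vmem=4G

./serial_code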

Choosing a value for h_vmem

If you request more than the default h_vmem, keep in mind that whatever you request is unavailable to other jobs while yours is running. It is not uncommon for there to be idle CPUs on a host because jobs using only some of the CPUs have reserved all the memory on the machine. You can see this, for instance, with the utility memslots:

$ memslots 5G
                            -- SLOTS --
QUEUE INSTANCE     MB FREE  FREE USABLE
   short.q@cl176         0    5     0
  medium.q@cl249     18432    3     3
       ...             ...    .     .
   short.q@cl180         0   11     0
   short.q@cl181         0   11     0
Total usable slots:                99
for h_vmem= 5120.0 MB

If a job uses nearly all of its requested memory before it finishes, this is not a problem. But if a job requests a great deal of memory that it doesn't need and never uses, it prevents other users from getting their research done in a timely fashion. Technical staff try to look out for these occurrences, and may take corrective action if they deem it necessary.

Estimating the memory required for your job is your responsibility. If you are the code developer, you can make a fair estimate by considering the dimensions, number and type of your large data structures, plus some extra for other variables and the code itself. If you are using a third-party program, read its documentation carefully: almost all major research software packages have a section dealing with memory requirements.

However, even these methods are not always successful. The remainder of this page discusses how to use Grid Engine itself, and some utility programs, to analyze the memory requirements of completed jobs and use that information to trim your memory requests.

There is some irreducible memory overhead for any job, so if you are too conservative with h_vmem your job may not run either. Empirically we find that h_vmem must be at least 25M, but the exact minimum depends on what else your job script contains. We recommend setting h_vmem no smaller than 100M.

Analyze virtual memory usage with qacct

The Grid Engine command qacct -j job_id can be used to examine the records of CPU time and memory usage of completed jobs. qacct returns a lot of data; the maximum amount of virtual memory used appears in the 'maxvmem' field. You can extract just maxvmem with grep, for example:

$ qacct -j 12345 | grep maxvmem
maxvmem      3.017G
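
To compare several completed jobs at once, a short shell loop works just as well (the job IDs below are placeholders):

for j in 12345 12346 12347; do
    echo -n "$j: "
    qacct -j $j | grep maxvmem
done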

If your job was killed for using too much memory, this number will probably be close to the memory limit that was assigned, rather than telling you how much the job actually needs. (This is, by the way, one way to determine whether a job was killed for exceeding its h_vmem.) You must use a successful run of the code to get a correct record of how much memory it needs.

For parallel jobs qacct normally emits one record per node. Each node may have one or more processes assigned to it, and it helps to know how many processes were assigned to each node in order to interpret the results. Unfortunately, qacct does not report that information. You can work around this by having your job script save the host list in its standard output:

cat $PE_HOSTFILE
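
For example, here is a hedged sketch of an MPI job script that records the slot distribution before launching the program (the slot count and the program name mpi_code are illustrative):

#$ -cwd
#$ -l h_rt=1:0:0
#$ -pe ompi* 16
#$ -l h_vmem=2G

# Record which hosts were assigned and how many slots each received,
# so the per-node qacct records can be interpreted later
cat $PE_HOSTFILE

mpirun ./mpi_code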

For jobs run with Open MPI, qacct also emits an additional record for the shepherd host, i.e. one record more than there are nodes. That record shows how much memory the shepherd process needed to start all the MPI processes, an overhead that you need to take into account when specifying h_vmem. As a result, you have to ask for more h_vmem than the MPI processes themselves need, in order to accommodate the shepherd host. To complicate matters further, the size of this overhead depends in a complicated way on the number of nodes assigned by Grid Engine, which is hard to predict.

A utility jobmem is available which tries to carry out this analysis as best it can:

$ jobmem job_id

Because of the missing slots-per-host information, jobmem should only be used on (i) serial jobs, (ii) shared-memory jobs, or (iii) message-passing parallel jobs that ran successfully and used their CPUs efficiently, i.e. with no idle CPUs waiting for others to catch up.

If you have more than one mpirun call in the job script there will be additional records in the qacct output. This will also confound jobmem.

Examples

Here are a couple of examples to illustrate how Grid Engine sets the memory limits according to the options in the submission script.

OpenMP job

With the following parameters, your five-thread OpenMP job will run in a shell that limits virtual memory to 10GB (5 slots × 2GB):

#$ -pe openmp 5
#$ -l h_vmem=2G

MPI job

With the following parameters:

#$ -pe ompi* 5
#$ -l h_vmem=2G

Assuming your job got 2 slots (or CPU cores) on one host and 3 slots on another, the virtual memory limits will be as follows:

  • Host #1 (two slots): two MPI processes will be launched in a shell with a virtual memory limit of 4 GB;
  • Host #2 (three slots): three MPI processes will be launched in a shell with a virtual memory limit of 6 GB.

Stack size limit (h_stack)

IMPORTANT
Most users do not need to set this parameter. We do not encourage changing h_stack unless you are an advanced user. If you change it, your jobs may start failing mysteriously.

The size of stack memory available to a job is requested with the h_stack parameter in a submission script (see the OpenMP section below for more on setting the stack size for OpenMP jobs). Using this parameter as described above, Grid Engine sets a hard limit on the maximum stack size available on every host assigned to the job. The default h_stack value is 10MB, the same as in your Linux login shell, and it is set for every job. Grid Engine will not allow you to set h_stack larger than h_vmem.
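
For example, to raise the stack limit to 512M alongside a 2G virtual memory limit (the values here are illustrative), you would add:

#$ -l h_vmem=2G,h_stack=512M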

A job that exceeds its stack size may crash with various error messages (including segmentation faults) that may not be traceable even with debugging options. Unlike h_vmem, Grid Engine cannot provide information on the stack memory used by your job.

Note
Advanced users should be aware that some applications spawn a sub-shell with a raised stack size limit, which is required for the application to run properly. Developers usually take this approach to make it easier for end users to run their applications. The downside is that users are often unaware of this hidden requirement on the stack size. As a result, when such a job is submitted through Grid Engine it may fail because it could not raise the stack size limit behind the scenes. The error message is usually very clear and points to a failed 'ulimit' operation.

One may wonder why there is an issue at all, if by default the stack size limit set by Grid Engine is the same as in the Linux login shell, namely 10MB. The answer is that while the default stack size limit in a Linux shell is 10MB, it is not explicitly set by the user. If a user were to set it explicitly to, say, the same value of 10MB, they would not be able to raise it anymore. In other words, a user is free to set an arbitrary stack size limit once; afterwards that limit can only be lowered, as the sketch below illustrates. When a job is submitted, Grid Engine does this initial setting of the stack size limit, and the user's application cannot raise it anymore.

The only solution is to supply an increased h_stack explicitly in the submission script. For applications that require an unlimited stack size, a limit of a few gigabytes on h_vmem and h_stack is usually enough.
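
This behaviour can be seen in an interactive bash shell (a minimal sketch; the exact numbers and messages may vary between systems):

$ ulimit -s          # current soft limit, typically 10240 kB (10 MB)
10240
$ ulimit -s 10240    # setting it explicitly also fixes the hard limit
$ ulimit -s 20480    # a later attempt to raise it is refused
bash: ulimit: stack size: cannot modify limit: Operation not permitted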

Core files size limit (h_core)

IMPORTANT
Most users do not need to set this parameter. We do not encourage changing h_core unless you are an advanced user.

The maximum size of core files that a failed job is allowed to create is requested with the h_core parameter in a submission script. By default, its value is set to zero for every job, which matches the login shell value. Here is an example of how to set h_core:

#$ -l h_core=10G

MPI job fragmentation

Sometimes the available CPU resources are highly fragmented among the compute nodes assigned to an MPI job (for example, 10 cores may be fragmented across 10 nodes at 1 core per node). When Grid Engine assigns such resources to a parallel job, the job may suffer from a lack of memory and fail. We usually observe this for jobs of 64 slots or more.

The problem is due to a memory overhead on the shepherd host, which has to launch connections to the other hosts assigned to the job. Those connections consume some of the h_vmem*slots available on the shepherd host. You may see seemingly random failures depending on how many slots, and hence how much memory, were available on the shepherd host itself when the job started. The problem most often occurs when just a single slot is assigned to the shepherd host. The more processes (slots) your job has to launch, the greater the chance it will experience this issue. It is also possible that multiple jobs submitted in a row will suffer in the same way, because Grid Engine is likely to assign the same set of fragmented resources to each of them.

Specifically, the memory required on the shepherd host to spawn connections is approximately 103M + 105M x number_of_assigned_hosts.

Let's consider an example. You submitted a 64-process job with h_vmem=2G. The job was highly fragmented and was assigned to 15 hosts, but the shepherd host had only one slot assigned to it, so the amount of memory available on it was h_vmem*1 = 2G. Establishing connections to the other hosts consumed 103M + 15*105M = 1.6G, leaving only 0.4G for the shepherd host to carry out computations that have 2G of space available on the other hosts. Had the shepherd host been assigned more than one slot, you would likely not have seen the failure. For example, if Grid Engine had assigned 2 slots to the shepherd host, it would have had 4G available there to spawn connections to the other hosts and to run two processes, and thus the job might not have failed.

The solution is either to increase h_vmem (you can safely go up to 2G, but increasing it past 2G will likely overestimate your real memory requirements and waste CPU cores to satisfy the memory request), or to allocate the resources by the node (recommended) or by groups of slots, in order to reduce the number of connections from the shepherd host.

Here is an example of how one could pack such a job onto four 16-core nodes. It is also recommended to turn on resource reservation:

#$ -pe 16per* 64
#$ -R y
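
Similarly, if your cluster defines a parallel environment that allocates slots in groups rather than whole nodes, requesting it will also reduce fragmentation. The PE name 4per* below is an assumption for illustration; check which parallel environments actually exist with qconf -spl:

#$ -pe 4per* 64
#$ -R y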

OpenMP

If your OpenMP code does not have enough stack memory, it might crash with a segmentation fault. Here is how you can adjust the limits. The main OpenMP thread gets its stack size from the shell (set with h_stack in a submission script; see the h_stack section above for details). The stack size for all additional threads is defined by the environment variable OMP_STACKSIZE. Therefore, if your code requires additional stack memory, you might need to set both parameters. Here is an example of a submission script:

#$ -cwd
#$ -l h_rt=1:0:0
#$ -pe openmp 4
#$ -l h_vmem=2G,h_stack=512M

export OMP_NUM_THREADS=$NSLOTS
export OMP_STACKSIZE=512M

./openmp_code

Java VM

Common issue

The most common issue with running the Java VM (or JVM) on a cluster is a crash due to lack of memory. Here is a typical error message you may see:

 Error occurred during initialization of VM
 Could not reserve enough space for object heap
 Could not create the Java virtual machine.

The main reason for this type of failure is that the Java virtual machine expects to have access to all the physical memory on a given compute node, while the scheduler imposes its own (often different) memory limits according to the submission script specifications. In our shared computing environment, these limits ensure that finite resources, such as memory and CPU cores, are not exhausted by one job at the expense of another.

Java memory options

When the Java VM starts, it sets two memory parameters according to the amount of physical memory on the node, rather than the memory actually available to the job, as follows:

  • Initial heap size of 1/64 of physical memory
  • Maximum heap size of 1/4 of physical memory

These two parameters can be explicitly controlled on the command line using either of the following forms:

 java -Xms256m -Xmx4g -version

or

 java -XX:InitialHeapSize=256m -XX:MaxHeapSize=4g -version

You can see all the command-line options the JVM is going to run with by specifying the flag -XX:+PrintCommandLineFlags, like so:

$ java -Xms256m -Xmx4g -XX:+PrintCommandLineFlags -version
-XX:InitialHeapSize=268435456 -XX:MaxHeapSize=4294967296 -XX:ParallelGCThreads=4 -XX:+PrintCommandLineFlags -XX:+UseCompressedOops -XX:+UseParallelGC

Typical failure

Let's consider an example in which the same job gets assigned to a compute node with 64G of RAM and to one with 16G. See the submission script below.

Note
For demonstration purposes we assume a threaded Java job. If yours is serial, do not specify the parallel environment option.

#$ -cwd
#$ -j y
#$ -l h_rt=1:0:0
#$ -pe openmp 4
#$ -l h_vmem=2G

module load java
java -XX:+PrintCommandLineFlags -version

The parallel environment requests four slots on a single host, so the amount of memory available to the job will be 4 × 2G = 8G (see the h_vmem section above for details).

a 16G node

When the job is run on a node with 16G of RAM, the JVM will set

  • Initial heap size 256M
  • Maximum heap size 4G

Here is the output:

-XX:InitialHeapSize=262706496 -XX:MaxHeapSize=4203303936 -XX:ParallelGCThreads=4 -XX:+PrintCommandLineFlags -XX:+UseCompressedOops -XX:+UseParallelGC
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)

a 64G node

When the job is run on a node with 64G of RAM, the JVM will set

  • Initial heap size 1G
  • Maximum heap size 16G

Here is the output:

-XX:InitialHeapSize=1054615744 -XX:MaxHeapSize=16873851904 -XX:ParallelGCThreads=4 -XX:+PrintCommandLineFlags -XX:+UseCompressedOops -XX:+UseParallelGC
Error occurred during initialization of VM
Could not reserve enough space for object heap
Could not create the Java virtual machine.

As you can see, the job has failed because the default maximum heap size (one quarter of 64G, i.e. 16G) exceeds the 8G memory limit imposed by the scheduler.

The result appears counter-intuitive, because the error message complains of a lack of memory on the node with more physical memory.

Solution

The solution to this problem is to launch the JVM with its memory parameters specified explicitly on the command line. Here is a submission script that will run correctly on any hardware assigned by the scheduler:

#$ -cwd
#$ -j y
#$ -l h_rt=1:0:0
#$ -pe openmp 4
#$ -l h_vmem=2G

module load java
java -Xms256m -Xmx4g -XX:+PrintCommandLineFlags -version

Here is the output, which is the same no matter what machine the job was assigned to:

-XX:InitialHeapSize=268435456 -XX:MaxHeapSize=4294967296 -XX:ParallelGCThreads=4 -XX:+PrintCommandLineFlags -XX:+UseCompressedOops -XX:+UseParallelGC
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)

Alternatively, you can use the _JAVA_OPTIONS environment variable to set the run-time options rather than passing them on the command line. This is especially convenient if you launch multiple Java calls, or call a Java program from another Java program. Here is an example of how to do it:

 export _JAVA_OPTIONS="-Xms256m -Xmx1g"

When your Java program runs, it will produce a diagnostic message like "Picked up _JAVA_OPTIONS", verifying that the options have been picked up.

Please remember that the Java virtual machine itself adds a memory overhead. We recommend requesting 1-2GB more h_vmem than the value you pass to the Java option -Xmx. In the example above, the scheduler memory limit of 8G is well above the 4G limit on the Java command line.
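
Putting these recommendations together, here is a minimal sketch of a serial submission script that uses _JAVA_OPTIONS (the 3G request and the program name MyProgram are illustrative; 3G leaves about 2G of headroom over the 1G heap):

#$ -cwd
#$ -j y
#$ -l h_rt=1:0:0
#$ -l h_vmem=3G

module load java
export _JAVA_OPTIONS="-Xms256m -Xmx1g"
java MyProgram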

Garbage collector

By default, the Java VM uses a parallel garbage collector (GC) and sets the number of GC threads equal to the number of CPU cores on the node, whether the Java job is threaded or not. Each GC thread consumes memory, and the amount it consumes is proportional to the amount of physical memory. Therefore, we highly recommend matching the number of GC threads to the number of slots you requested from the scheduler in your submission script, like so: -XX:ParallelGCThreads=$NSLOTS. You can also use the serial garbage collector by specifying -XX:+UseSerialGC, whether your job is parallel or not. Here is an example of a submission script demonstrating both options:

#$ -cwd
#$ -j y
#$ -l h_rt=1:0:0
#$ -pe openmp 4
#$ -l h_vmem=2G

module load java
java -Xms256m -Xmx4g -XX:ParallelGCThreads=$NSLOTS -version
# or 
java -Xms256m -Xmx4g -XX:+UseSerialGC -version