Memory Management
Overview
Grid Engine manages the memory usage of every job by setting a hard limit on the maximum amount of memory the job can use. It relies on the resource limits provided by the shell. The limit applies to the job's shell and all processes started by it, including the Grid Engine infrastructure for starting the job, as well as any auxiliary processes of the job itself, for example those of an MPI job.
Users can request memory resources (i.e. set limits) via the Grid Engine parameters in their submission scripts. Such a parameter specifies the amount of memory required per slot (or CPU core). Grid Engine uses it to calculate the actual value for every host assigned to a job, and enforces that value by setting the shell limits. So, if a parallel job is split across several nodes, the memory reserved on each node is the per-slot limit (requested in the submission script) times the number of slots assigned to that node (see Examples below). While users can set the per-slot limit, they may not always be able to control how many slots are assigned to each node.
Virtual memory limit (h_vmem)
The amount of virtual memory available to a job is requested with the h_vmem parameter in a submission script. Using this parameter as described above, Grid Engine sets a hard limit on the amount of virtual memory available on every host assigned to the job.
A job that exceeds its memory allocation on a node may produce no output, or the output may be silently truncated somewhere in the middle, because the job will be terminated by its shell. See the section below on how to use qacct to analyze virtual memory usage.
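When a job dies without explanation, it can help to have the job script print the limit it is actually running under. The following is a minimal sketch, assuming the job shell is bash and that the limit set by Grid Engine is visible through the shell's `ulimit -v` (which bash reports in KB). The helper names `kb_to_mb` and `report_vmem_limit` are hypothetical.

```shell
# Convert a value reported by ulimit (KB in bash) to MB,
# passing the word "unlimited" through unchanged.
kb_to_mb() {
    case "$1" in
        unlimited) echo "unlimited" ;;
        *) echo "$(( $1 / 1024 )) MB" ;;
    esac
}

# Print the hard virtual memory limit of the current shell.
report_vmem_limit() {
    echo "virtual memory limit: $(kb_to_mb "$(ulimit -Hv)")"
}

# In a job script, call this before starting the real work:
#   report_vmem_limit
```

The line ends up in the job's standard output, so it can be compared afterwards with the h_vmem that was requested.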
If a user does not specify h_vmem in their submission script, then the job receives a default allocation of memory per slot, which varies from cluster to cluster, and is given in the following table:
| Cluster | Default h_vmem |
|---|---|
| mahone | 2 GB |
| placentia | 2 GB |
| fundy | 2 GB |
| brasdor | 500 MB |
| glooscap | 500 MB |
| courtenay | no limit |
Choosing a value for h_vmem
If you request more than the default h_vmem, keep in mind that whatever you request is unavailable to other jobs while yours is running. It is not uncommon for there to be idle CPUs on a host because jobs using only some of the CPUs have reserved all the memory on the machine. You can see this, for instance, with the utility memslots:
$ memslots 5G
                                -- SLOTS --
QUEUE INSTANCE      MB FREE    FREE  USABLE
short.q@cl176             0       5       0
medium.q@cl249        18432       3       3
...                     ...       .       .
short.q@cl180             0      11       0
short.q@cl181             0      11       0
Total usable slots: 99
for h_vmem= 5120.0 MB
If a job uses nearly all its memory before it finishes, this is not a problem. But if a job requests a great deal of memory it doesn't need and never uses, then it prevents other users from getting their research done in a timely fashion. Technical staff try to look out for these occurrences, and may take corrective action if they deem it necessary.
Estimating the memory required for your job is your responsibility. If you are the code developer, you can make a fair estimate by considering the dimension, number and type of your large data structures, plus some extra for other variables and the code itself. If you are using a third-party program, you should read the program's documentation carefully; almost all major research software packages have a section dealing with memory requirements.
However, even these methods are not always successful. The remainder of this page discusses how to use Grid Engine itself, and some utility programs, to analyze the memory requirements of completed jobs and use that information to trim your memory requests.
There is some irreducible memory overhead for any job, so if you are too conservative with h_vmem your job may not run, either. Empirically we find h_vmem has to be at least 25M, but it seems to depend on what else your job script contains. We recommend setting h_vmem no smaller than 100M.
Analyze virtual memory usage with qacct
The Grid Engine command qacct -j job_id can be used to examine the records of CPU time and memory usage of completed jobs. qacct returns a lot of data; the maximum amount of virtual memory used appears in the 'maxvmem' field. You can extract just maxvmem with grep, for example:
$ qacct -j 12345 | grep maxvmem
maxvmem 3.017G
If your job was killed for using too much memory, this number will probably be close to the memory limit that was assigned rather than telling you how much it needs. This is, by the way, one way to determine if a job was killed for exceeding its h_vmem. But you must use a successful run of the code to get a correct record of how much memory it needs.
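For jobs that produce several accounting records, it may be convenient to reduce the qacct output to a single peak figure. The sketch below assumes that maxvmem values carry a K, M or G suffix as in the example above, and that a value without a suffix is in bytes; both are assumptions about the output format, and the helper names `to_mb` and `peak_maxvmem_mb` are hypothetical.

```shell
# Convert a qacct-style size (e.g. 3.017G, 512.000M, 2048K, or a plain
# number assumed to be bytes) to whole megabytes.
to_mb() {
    echo "$1" | awk '{
        v = $1 + 0                             # numeric prefix; awk ignores the suffix
        u = toupper(substr($1, length($1)))    # last character, possibly a unit
        if (u == "G") v *= 1024
        else if (u == "K") v /= 1024
        else if (u != "M") v /= 1048576        # no suffix: assume bytes
        printf "%.0f\n", v
    }'
}

# Read qacct output on standard input and print the largest maxvmem in MB.
peak_maxvmem_mb() {
    awk '$1 == "maxvmem" { print $2 }' |
        while read -r size; do to_mb "$size"; done |
        sort -n | tail -n 1
}

# Usage, with the job id from the example above:
#   qacct -j 12345 | peak_maxvmem_mb
```

The result is the peak over all records, so for a parallel job it still has to be interpreted per node, as discussed below.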
For parallel jobs qacct normally emits one record per node. Each node may have one or more processes assigned to it, and it helps to know how many processes were assigned to each node in order to interpret the results. Unfortunately that information is not given by qacct. You can work around this by having your job script save the host list in its standard output:
$ cat $PE_HOSTFILE
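If the raw host file is hard to read, the job script could print a one-line summary per host instead. This sketch assumes that each line of the host file starts with the host name followed by the number of slots assigned on that host; the function name `summarize_hostfile` is hypothetical.

```shell
# Print "host: N slot(s)" for every line of a Grid Engine host file.
# Assumes column 1 is the host name and column 2 the slot count.
summarize_hostfile() {
    awk '{ printf "%s: %d slot(s)\n", $1, $2 }' "$1"
}

# In a job script:
#   summarize_hostfile "$PE_HOSTFILE"
```

With this summary in the job's standard output, each qacct record can be matched to the number of processes that ran on that node.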
For jobs run with Open MPI, qacct also emits an additional record for the shepherd host, i.e. one record more than there are nodes. That record shows how much memory the shepherd process needed to start all MPI processes. This is overhead that you need to take into account when specifying h_vmem: you have to ask for more h_vmem than the MPI processes themselves need, in order to accommodate the shepherd host. To make matters worse, the size of this overhead depends in a complicated way on the number of nodes assigned by Grid Engine, which is hard to predict.
A utility jobmem is available which tries to carry out this analysis as best it can:
$ jobmem job_id
Because of the missing slots-per-host information, jobmem should only be used on (i) serial jobs, (ii) shared-memory jobs, or (iii) message-passing parallel jobs which ran successfully and which use CPU efficiently, i.e. no idle CPUs waiting for others to catch up.
If you have more than one mpirun call in the job script there will be additional records in the qacct output. This will also confound jobmem.
Examples
Here are a couple of examples to illustrate how Grid Engine sets the memory limits according to the options in the submission script.
- OpenMP job
With the following parameters, your five-thread OpenMP job will be run in a shell that limits virtual memory to 10 GB (5 slots times 2 GB per slot).
#$ -pe openmp 5
#$ -l h_vmem=2G
- MPI job
#$ -pe ompi* 5
#$ -l h_vmem=2G
Assuming your job got 2 slots (or CPU cores) on one host, and 3 slots on another, the virtual memory limit will be as follows:
- Host #1 (two slots): two MPI processes will be launched in a shell with a virtual memory limit of 4 GB;
- Host #2 (three slots): three MPI processes will be launched in a shell with a virtual memory limit of 6 GB.
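The arithmetic behind both examples can be sketched as a small shell function. It is only an illustration of the per-slot rule described in the Overview, not something Grid Engine provides; the name `vmem_limits` is hypothetical, and the per-slot value is given in whole GB for simplicity.

```shell
# vmem_limits PER_SLOT_GB SLOTS_ON_HOST1 [SLOTS_ON_HOST2 ...]
# Print the virtual memory limit that results on each host.
vmem_limits() {
    per_slot_gb=$1
    shift
    host=1
    for slots in "$@"; do
        echo "host $host: $slots slots x $per_slot_gb GB = $(( slots * per_slot_gb )) GB"
        host=$(( host + 1 ))
    done
}

# OpenMP example above:  vmem_limits 2 5
# MPI example above:     vmem_limits 2 2 3
```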
Stack size limit (h_stack)
- IMPORTANT
- Most users do not need to set this parameter. We do not encourage changing h_stack unless you are an advanced user. If you change it, your jobs may start failing mysteriously.
The size of stack memory available to a job is requested with the h_stack parameter in a submission script. By using this parameter as described in Overview, Grid Engine sets the hard limit of the maximum stack size available on every host assigned to the job. The default h_stack value is 10MB, which is the same value as in your login Linux shell, and it's set for every job. Grid Engine will not allow you to set h_stack larger than h_vmem.
A job that exceeds its stack size may crash with various error messages. Unlike with h_vmem, Grid Engine cannot provide information on your job's stack memory usage.
Advanced users should be aware that some applications spawn a sub-shell with a raised stack size limit, which they require for proper execution. Developers usually take this approach to make the application easier for an end user to run. The downside is that users are often unaware of this hidden requirement on the stack size. As a result, when such a job is submitted through Grid Engine it may fail because it cannot raise the stack size limit behind the scenes. The error message is usually very clear and points to a failed 'ulimit' operation.
One may wonder why there is an issue at all if the default stack size limit set by Grid Engine is the same 10MB as in the Linux login shell. The answer is that while the default stack size limit in a Linux shell is 10MB, it is not explicitly set by the user. If a user were to set it explicitly, even to the same value of 10MB, they would no longer be able to raise it. In other words, a user is free to set an arbitrary stack size limit once; afterwards that limit can only be lowered. When a job is submitted, Grid Engine performs this initial setting of the stack size limit, and the user's application cannot raise it afterwards.
The only solution is to supply an increased h_stack explicitly in the submission script. For applications that require an unlimited stack size, a limit of a few gigabytes on h_vmem and h_stack is usually enough.
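The "set once, then only lower" behaviour can be reproduced in an ordinary shell. The sketch below assumes bash, where `ulimit -s` without -H or -S sets both the soft and the hard limit (in KB). It runs in a sub-shell so the calling shell is not affected, and the function name `demo_stack_limit` is hypothetical.

```shell
# Lower the stack limit explicitly, then try to raise it again.
demo_stack_limit() {
    (
        ulimit -s 4096    # explicit setting: soft and hard limit become 4 MB
        echo "after lowering: $(ulimit -Hs) KB"
        if ulimit -s 8192 2>/dev/null; then
            # A privileged process may still be allowed to raise the limit.
            echo "raise succeeded (the process is probably privileged)"
        else
            echo "raise refused: the limit can only be lowered now"
        fi
    )
}
```

For an ordinary user the second ulimit call is expected to fail, which is the same failure an application hits when it tries to raise the stack size inside a Grid Engine job.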
Core files size limit (h_core)
- IMPORTANT
- Most users do not need to set this parameter. We do not encourage changing h_core unless you are an advanced user.
The maximum size of core files allowed to be created by a failed job is requested with the h_core parameter in a submission script. Usually, users who are interested in using this parameter set it to zero.