Tracking paid accounts

If your firm or organization has a paid contract with ACENET for compute time on Siku, then this page explains how your computing is measured and tracked. Submitting and troubleshooting jobs is discussed elsewhere.

Terminology

  • "User", that's obvious. One person, one login name.
  • "Account" is not the same as "user" in this context. Your firm or organization has a contract with ACENET (or you probably wouldn't be reading this). On Siku, that contract is represented by an "account" with a name like "pd-abc-123".
  • "QoS" stands for "Quality of Service", but it might be better to think of a "QoS" as a software object which remembers how many CPU hours etc. an account is allowed to use, and how much has been used already. Each paid account is associated with its own QoS, and the QoS has the same name as the account, like "pd-abc-123".
  • "Billing units" measure the use of the system. One CPU-minute, with its associated RAM, is worth one billing unit. A GPU-minute is worth 35 billing units. See "Billing units formula" below for the formula and examples.

How much computing can I do?

You can see the number of billing units available to you through your QoS by running the utility acct-tool. To make the numbers somewhat easier to interpret, they are also shown converted from minutes into hours and labelled "CPU hours". The output will look something like this:

Available QoSs: pd-abc-123
Default QoS:    pd-abc-123

For QoS 'pd-abc-123' this quarter:
 CPU hours cap:     2190000.0   (131400000 billing units)
 CPU hours used:         57.7   (3461 billing units)
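
The two columns in that output differ by a factor of 60, since billing units are counted in minutes. As a minimal sketch of the conversion (the function name is illustrative and not part of acct-tool), using the figures from the sample output above:

```python
MINUTES_PER_HOUR = 60

def billing_units_to_cpu_hours(units):
    """Convert billing units (CPU-minutes) to CPU hours."""
    return units / MINUTES_PER_HOUR

# Figures copied from the sample acct-tool output above.
cap_units = 131_400_000
used_units = 3_461

print(billing_units_to_cpu_hours(cap_units))             # 2190000.0
print(round(billing_units_to_cpu_hours(used_units), 1))  # 57.7

# Billing units still available this quarter.
print(cap_units - used_units)                            # 131396539
```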

When your team gets close to the limit, you may find that your jobs do not start but instead stay in the PD (pending) state, with "QOSGrpBillingMinutes" showing in the "Reason" field of squeue (or sq). Or you may receive this pair of messages when trying to submit a job with sbatch:

sbatch: error: QOSGrpBillingMinutes
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

In either case this is because the job would put you over your billing limit if it ran for the time requested. Have your Principal Investigator contact support@ace-net.ca to discuss raising your billing limit.
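
To estimate ahead of time whether a job is likely to hit this limit, you can compare its worst-case cost (billing rate times requested minutes) with what remains under the cap. The sketch below is a simplified assumption about the check, not Slurm's actual logic, which may also count time reserved by jobs that are already running. The helper job_fits and the usage figures are hypothetical.

```python
def job_fits(cap_units, used_units, rate_per_minute, requested_minutes):
    """Return True if the job's worst-case cost fits under the cap.

    Simplified assumption: the job is charged rate_per_minute for
    every minute of its requested time limit.
    """
    worst_case = rate_per_minute * requested_minutes
    return used_units + worst_case <= cap_units

cap = 131_400_000    # cap from the sample acct-tool output
used = 131_350_000   # hypothetical usage late in the quarter

# A 40-billing-unit-per-minute job requesting 24 hours costs up to 57600 units.
print(job_fits(cap, used, 40, 24 * 60))  # False
# The same job requesting 12 hours costs up to 28800 units.
print(job_fits(cap, used, 40, 12 * 60))  # True
```

Requesting a shorter time limit lowers the worst-case cost, which can let a job start when little budget remains.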

Who has been using our account, and how much?

To see the billing units charged against your account (your QoS) broken down by user, run acct-history:

$ acct-history -S 2019-11-01
Usage report on QoS 'pd-abc-123':
     Start time:  2019-11-01
       End time:  2019-11-14T20:27:50

CPU hours per user:
           198.4  alice
          9817.9  bob
  ------------------------
         10016.3  TOTAL

Breakdowns by individual jobs can be obtained from sacct with suitable options. See the sacct man page or contact support for help.

A report containing similar information will be sent at the beginning of each month to the senior contact in the organization (i.e., the principal investigator or contract signer).

Billing units formula

BillingUnits = MAX( CPUs, RAM_GB * 0.215, GPUs * 35.0 ) * minutes

A job that reserves one CPU and 4G of RAM and runs for one minute consumes 1 billing unit. One CPU and 10G of RAM for one minute? 2.15 billing units.

Our GPU-equipped nodes have 40 CPUs, 186G of RAM, and two GPUs. To use one of these nodes for 24 hours would cost MAX( 40, 186*0.215, 2*35 ) * 24 * 60 = 70 * 24 * 60 = 100800 billing units.

The rate per GB of RAM is chosen so that using either all the CPUs or all the RAM on a basic node costs the same 40 billing units per minute. Requesting all the memory on a high-memory node costs 80 billing units per minute.
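
The formula and the worked examples above can be reproduced with a few lines of Python. This is only an illustrative sketch. The function and constant names are made up here and do not correspond to any ACENET tool.

```python
RAM_RATE_PER_GB = 0.215  # billing units per GB of RAM per minute
GPU_RATE = 35.0          # billing units per GPU per minute

def billing_rate(cpus, ram_gb, gpus=0):
    """Billing units charged per minute for the requested resources."""
    return max(cpus, ram_gb * RAM_RATE_PER_GB, gpus * GPU_RATE)

def billing_units(cpus, ram_gb, gpus, minutes):
    """Total billing units for a job that runs for the given minutes."""
    return billing_rate(cpus, ram_gb, gpus) * minutes

# One CPU and 4G of RAM for one minute: the CPU term dominates.
print(billing_units(1, 4, 0, 1))             # 1
# One CPU and 10G of RAM for one minute: the RAM term dominates.
print(round(billing_units(1, 10, 0, 1), 2))  # 2.15
# A whole GPU node (40 CPUs, 186G, two GPUs) for 24 hours.
print(billing_units(40, 186, 2, 24 * 60))    # 100800.0
```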

You can see the billing rate for a live job like so:

[you@login1 ~]$ scontrol show job 7976 | grep billing
  TRES=cpu=12,mem=108000M,node=1,billing=22

This job will be billed at a rate of 22 billing units per minute of elapsed time. You can derive the billing units consumed by a completed job with

[you@login1 ~]$ sacct -X --format=AllocTRES%40,Elapsed --noheader -j 7402
       billing=40,cpu=40,mem=10G,node=1   00:02:09

...multiplying the billing rate (40) by the elapsed time in minutes (2 + 9/60 = 2.15), which gives 86 billing units.
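
If you need to do this for many jobs, the multiplication can be scripted. The sketch below parses one line of the sacct output shown above. It assumes the line holds exactly two whitespace-separated fields (AllocTRES, then Elapsed) and that Elapsed looks like HH:MM:SS or D-HH:MM:SS. The function names are illustrative.

```python
def elapsed_seconds(elapsed):
    """Convert an elapsed time like '00:02:09' or '1-00:02:09' to seconds."""
    days = 0
    if "-" in elapsed:
        day_part, elapsed = elapsed.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in elapsed.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def billing_units_from_sacct(line):
    """Billing units for one 'AllocTRES Elapsed' line of sacct output."""
    tres, elapsed = line.split()
    # Turn 'billing=40,cpu=40,...' into {'billing': '40', 'cpu': '40', ...}.
    fields = dict(item.split("=", 1) for item in tres.split(","))
    return int(fields["billing"]) * elapsed_seconds(elapsed) / 60

line = "       billing=40,cpu=40,mem=10G,node=1   00:02:09"
print(billing_units_from_sacct(line))  # 86.0
```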

Why the funny word, QoS?

You may well ask, "Why have QoSs at all, why not just use accounts?" That has to do with Slurm internals. We would like you to be able to think in terms of a "bank account" of computing time, but to implement that we had to use Slurm's QoS mechanism. In Slurm, the term "account" means something slightly different but closely related. If we called a QoS a "bank account", that would cause great confusion whenever you had to consult the generic Slurm documentation at https://slurm.schedmd.com.

What if I have more than one account or QoS?

A user typically only has access to one account and one QoS, and your jobs are automatically associated with that account and QoS. You could have more than one account if, for example, your firm made two separate contracts with ACENET for separate projects, and you are working on both, or if you were a freelancer and working for two different firms with ACENET contracts. In that case it will be up to you to assign each job you submit to the correct QoS using the --qos= option to sbatch, salloc, or srun.