Nefelibata

From ACENET

Nefelibata is a computer cluster located at Dalhousie University, managed by ACENET on behalf of researchers who have purchased the equipment.

The two machines known as Humus are integrated with Nefelibata.

Access

You may access Nefelibata only with the permission of one of the contributing Principal Investigators (PIs). If you believe you should have access but do not, please write to support@ace-net.ca. Copy the PI with whom you are associated, and include your Digital Research Alliance of Canada username in the email.

When your Alliance username has been added to the access control list for Nefelibata, you can log in like so:

ssh username@nefelibata.ace-net.ca

Supply your Alliance password when prompted. We strongly recommend installing an SSH public key and using SSH key-pair authentication after your first login.

Use the login node for data transfers.

Software

Nefelibata uses the same modules system as Alliance clusters, providing access to the same list of available software.

Job scheduling

Users normally access the individual machines in the cluster through the job scheduler, Slurm. If you have used Alliance resources, you may already be familiar with job scheduling with Slurm. Because it is a much smaller cluster, Nefelibata has a much simpler configuration than the large Alliance clusters, but the basic commands are the same. For example, to submit the job script jobscript.sh with a 24-hour time limit and 32 tasks:

sbatch --time=24:0:0 --ntasks=32 jobscript.sh
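The same options can also be placed inside the job script as #SBATCH directives, so that they do not have to be repeated on the command line. The following is a minimal sketch only: it writes a jobscript.sh whose body is an assumption, and ./my_program is a hypothetical placeholder for your own executable.

```shell
# Create jobscript.sh with the submission options embedded as #SBATCH directives.
# ./my_program is a hypothetical placeholder; replace it with your own executable.
cat > jobscript.sh <<'EOF'
#!/bin/bash
#SBATCH --time=24:0:0
#SBATCH --ntasks=32

# Record where and how the job ran (these variables are set by Slurm at run time).
echo "Job ${SLURM_JOB_ID} using ${SLURM_NTASKS} tasks on ${SLURM_JOB_NODELIST}"

./my_program
EOF

# With the options in the script, submission reduces to:
#   sbatch jobscript.sh
```

Options given on the sbatch command line take precedence over #SBATCH directives in the script, so a single run can still be adjusted without editing the file.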

Known issues

Interactive jobs fail with:

srun: error:  mpi/pmix_v3: init: (null) [0]: mpi_pmix.c:139: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix_v3: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix_v3
srun: error: invalid MPI type 'pmix_v3', --mpi=list for acceptable types

We are investigating causes and solutions. (2023-02-16)

To run a multi-host MPI calculation, add

export UCX_NET_DEVICES=ib0

before calling mpirun.
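In a job script, the setting above might be placed as in the following sketch. This is an illustration under assumptions, not a tested recipe for Nefelibata: the file name mpi_jobscript.sh and the program ./my_mpi_program are hypothetical, the time and task values are borrowed from the sbatch example earlier on this page, and the script assumes that mpirun picks up the task count from the Slurm allocation.

```shell
# Create a job script that sets UCX_NET_DEVICES before launching MPI.
# mpi_jobscript.sh and ./my_mpi_program are hypothetical names.
cat > mpi_jobscript.sh <<'EOF'
#!/bin/bash
#SBATCH --time=24:0:0
#SBATCH --ntasks=32

# Select the ib0 network device for UCX; this must come before mpirun.
export UCX_NET_DEVICES=ib0

# Assumes mpirun takes the number of tasks from the Slurm allocation.
mpirun ./my_mpi_program
EOF

# Submit with:
#   sbatch mpi_jobscript.sh
```

Because the variable is exported inside the script, it applies only to that job and does not need to be set in your login environment.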