Cluster Status

This page is maintained manually. It gets updated as soon as we learn new information.

Clusters

Cluster    | Status | Planned Outage | Notes
Siku       | Online | No outages     |
Placentia  | Online |                | Restricted since March 2019
Nefelibata | Online | -              | No scheduler
Argo       | Online | -              | still shaking out bugs

For national clusters (Arbutus, Beluga, Cedar, Graham, Narval, Niagara) see status.alliancecan.ca

Services

Service                | Status  | Planned Outage             | Notes
WebMO                  | Retired | End of service 2019 Mar 31 | Retired with Placentia
Account creation       | Manual  | No outages                 | Write support
PGI and Intel licenses | Online  | No outages                 |
Legend:
Online: cluster is up and running
Offline: no users can log in or submit jobs, or the service is not working
Online: some users can log in and/or there are problems affecting your work

Outage schedule

A job will not be started if its requested run time (--time=) would extend into a planned outage period. This prevents the job from being terminated prematurely when the system goes down.

  • There are currently no planned outages.
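
The scheduling rule above can be illustrated with a minimal sketch. This is not the scheduler's own code; the function names `parse_slurm_time` and `can_start_before_outage` are hypothetical, the timestamps are made up for illustration, and the sketch assumes that a job ending exactly when the outage begins is still allowed to start. The parser handles the run-time formats that the sbatch documentation lists for --time (minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, days-hours:minutes:seconds).

```python
from datetime import datetime, timedelta, timezone


def parse_slurm_time(spec: str) -> timedelta:
    """Convert a --time value such as '90', '03:00:00' or '1-12' to a timedelta."""
    days = 0
    if "-" in spec:
        # With a day prefix the remaining fields are hours[:minutes[:seconds]].
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in rest.split(":")]
        hours, minutes, seconds = (parts + [0, 0])[:3]
    else:
        parts = [int(p) for p in spec.split(":")]
        if len(parts) == 1:        # minutes
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:      # minutes:seconds
            hours, minutes, seconds = 0, parts[0], parts[1]
        else:                      # hours:minutes:seconds
            hours, minutes, seconds = parts
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def can_start_before_outage(start: datetime, time_limit: str,
                            outage_start: datetime) -> bool:
    """Return True if a job started at `start` would finish by `outage_start`.

    Assumption: finishing exactly at the outage start still counts as fitting.
    """
    return start + parse_slurm_time(time_limit) <= outage_start


if __name__ == "__main__":
    # Illustrative values only, not a real outage announcement.
    now = datetime(2024, 3, 19, 10, 0, tzinfo=timezone.utc)
    outage = datetime(2024, 3, 19, 13, 30, tzinfo=timezone.utc)
    for limit in ("03:00:00", "04:00:00", "1-12"):
        print(limit, can_start_before_outage(now, limit, outage))
```

In practice this means that, ahead of a planned outage, requesting a shorter --time can let a job start sooner, while a job with a longer request stays pending until the outage is over.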


Siku

2024

  • There was an unplanned power outage between 16h15 and 16h30 UTC (13h45 and 14h00 NDT), during which many but not all jobs were lost. Normal operation resumed at about 18h00 UTC (15h30 NDT).
15:38, March 26, 2024 (NDT)
  • The Slurm job scheduler was off-line on Monday, March 25, 2024, from 11h00 UTC (08h30 NDT) until 12h45 UTC (10h15 NDT) for a second urgent maintenance on the machine running the Slurm controller. This has now been completed and normal operation has resumed.
10:23, March 25, 2024 (NDT)
  • The Siku scheduler is available again.
    The emergency maintenance was completed and normal operation resumed at 11h50 NDT (14h20 UTC).
12:00, March 19, 2024 (NDT)
  • The Slurm job scheduler will be off-line on Tuesday, March 19, 2024, beginning at 13h30 UTC, for emergency maintenance on the machine running the Slurm controller. We anticipate an outage of approximately two hours. New jobs are being accepted but none will be launched until after the outage. Access to the cluster will still be permitted and storage will remain accessible.

For older outages see: Previous outages

  • Our newest cluster, Siku, is now in production. Access is currently restricted to invited users only. Access request form.
13:00, December 10, 2019 (NST)

Placentia

  • Placentia was retired from general service as of 2019 Mar 31. A reduced number of compute nodes remain in service, with access restricted to MUN users who have made suitable arrangements. Contact support@ace-net.ca if you believe you should have access.

Nefelibata

  • Nefelibata has had its shared storage replaced, but the Slurm scheduler service has not yet been restored. Restoration is waiting for personnel to become available from other work.
2024-03-18
  • Nefelibata will be unavailable on Tuesday, 2023 September 5, for operating system and driver updates. We expect return to service on Wednesday, September 6.
Update at 2023-09-07 12:00 NDT: Outage complete, Nefelibata back in service.