Cluster Status

From ACENET
This page is maintained manually. It gets updated as soon as we learn new information.

Clusters

Cluster    | Status | Planned Outage | Notes
Siku       | Online | 2024 Mar 19    | Slurm to be down a few hours
Placentia  | Online |                | Restricted since March 2019
Nefelibata | Online | -              | No scheduler

For national clusters (Arbutus, Beluga, Cedar, Graham, Narval, Niagara) see status.alliancecan.ca

Services

Service                | Status  | Planned Outage             | Notes
WebMO                  | Retired | End of service 2019 Mar 31 | Retired with Placentia
Account creation       | Manual  | No outages                 | Write support
PGI and Intel licenses | Online  | No outages                 |
Legend:
Online cluster is up and running
Offline no users can log in or submit jobs, or service is not working
Online some users can login and/or there are problems affecting your work

Outage schedule

A job whose requested run time (--time=) extends into the beginning of a planned outage period will not be started before the outage. This ensures that the job is not terminated prematurely when the system goes down.

  • Siku: The Slurm job scheduler will be offline on Tuesday, March 19, 2024, beginning at 13h30 UTC, for emergency maintenance on the machine running the Slurm controller. We anticipate an outage of approximately two hours. New jobs are being accepted, but none will be launched until after the outage. Access to the cluster will still be permitted and storage will remain accessible.
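
To have a job start before a planned outage, its requested run time has to end before the outage begins. The following is a minimal sketch, not an ACENET tool, of how a user might work out the largest --time= value that still fits. The function names, the five-minute safety margin, and the submission time are assumptions chosen for illustration; only the outage start (2024 Mar 19, 13h30 UTC) comes from the notice above.

```python
from datetime import datetime, timedelta, timezone


def max_time_limit(outage_start, now, margin=timedelta(minutes=5)):
    """Return the longest run time that ends before the outage, or None.

    The margin is a hypothetical safety buffer, not a Slurm setting.
    """
    remaining = outage_start - now - margin
    if remaining <= timedelta(0):
        return None  # nothing fits before the outage
    return remaining


def format_slurm_time(limit):
    """Format a timedelta as D-HH:MM:SS, one of the forms --time= accepts."""
    total = int(limit.total_seconds())
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


# Outage start taken from the Siku notice above.
outage = datetime(2024, 3, 19, 13, 30, tzinfo=timezone.utc)
# Hypothetical submission time, assumed for this example.
now = datetime(2024, 3, 18, 12, 0, tzinfo=timezone.utc)

limit = max_time_limit(outage, now)
print(format_slurm_time(limit))  # prints 1-01:25:00
```

A job submitted at that moment with --time=1-01:25:00 or less could still be started before the outage, subject to normal queue wait; the longer a job waits, the shorter the run time that still fits.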


Siku

2023

  • UPDATE: Siku is available again.
    The two V100-GPU nodes are known to boot very slowly and are still unavailable at this time. We will return them to service later today.
11:00, December 22, 2023 (NST)
  • Newfoundland Power has advised us of a planned power outage on Memorial University's south campus on Thursday, December 21st 2023, to allow relocation of an overhead power line and pole. We will start shutting down Siku at 1100h Nfld (1030h Atlantic) that day and plan to have Siku up and running again around noon on Friday, December 22nd.
15:30, December 18, 2023 (NST)
  • Last night there was a power outage in the data centre that hosts Siku. The whole system is currently unavailable; however, we are actively working on booting everything up again and expect Siku to be operational later today.
09:10, December 14, 2023 (NST)
UPDATE at 13:00, December 14, 2023 (NST): We have completed recovery after this unplanned power outage and Siku is operational again.
The two V100-GPU nodes are known to boot very slowly and are still unavailable at this time. We will return them to service later today.
  • Facilities Management has advised us of a short power outage on the morning of Thursday, August 3rd 2023, for which we need to shut down all compute-nodes.
    Access to the login-nodes and storage system is expected to be maintained throughout the outage, though a short (<5min) network outage may be experienced around 6 am NDT (8:30 am UTC).
    The scheduler won't start any jobs that won't finish by 4:30 am NDT (7 am UTC) on Thu Aug 3rd 2023.
    We expect to resume normal operations later the same day.
16:20, July 28, 2023 (NDT)
UPDATE at 09:45h, August 3, 2023 (NDT): The power outage was completed and Siku is operational again.
  • On the morning of July 17th we noticed that our air conditioning (A/C) unit was leaking water and had to be turned off. Without a working A/C unit we are now powering off all compute nodes in order to reduce heat in the data centre.
    We will post an update here as soon as we have a better estimate about when service can be resumed.
10:45, July 17, 2023 (NDT)
UPDATE at 2023-07-17 11:50 NDT: The issue (a clogged drain) has been resolved and we are in the process of powering up the compute nodes again. We will provide an update as soon as Siku is available again.
UPDATE at 2023-07-17 12:50 NDT: Siku is available again. Jobs resumed 30 minutes ago and users can log in again. Three of our GPU nodes are still offline, but we are working on putting them back into service later today.
  • On June 26 there was a brief power interruption in the MUN data centre that caused several compute nodes to reboot and interrupted the cluster's internal network. We are in the process of resolving the resulting issues and making all resources available again.
11:00, June 27, 2023 (NDT)
UPDATE at 2023-06-27 12:00 NDT: Siku is fully available again. Unfortunately we had to reboot all compute nodes to resolve filesystem issues that were caused by the power-event.
  • The Siku outage started at 07:30 NDT (10h00 UTC). We anticipate restoring service by Wednesday May 10 at 20:00 UTC, sooner if possible.
7:47, May 8, 2023 (NDT)
UPDATE at 2023-05-10 19:00 NDT: We are still experiencing several issues with Siku. Expected return to service is now Thursday, 11 May 2023.
UPDATE #2 at 2023-05-12 09:12 NDT: The outage was successfully completed. We informed all Siku users via email.

For older outages see: Previous outages

  • Our newest cluster, Siku, is now in production. Access is currently restricted to invited users only. Access request form.
13:00, December 10, 2019 (NST)

Placentia

  • Placentia was retired from general service as of 2019 Mar 31. A reduced number of compute nodes remain in service, with access restricted to MUN users who have made suitable arrangements. Contact support@ace-net.ca if you believe you should have access.

Nefelibata

  • Nefelibata has had its shared storage replaced, but the Slurm scheduler service has not yet been restored. Restoration is waiting for personnel to become available from other work.
2024-03-18
  • Nefelibata will be unavailable on 2023 September 5, Tuesday, for operating system and driver updates. We expect return-to-service on Wednesday Sept 6.
Update at 2023-09-07 12:00 NDT: Outage complete, Nefelibata back in service.