Cluster Status

From ACENET
This page is maintained manually. It gets updated as soon as we learn new information.

Clusters

Click the name of a cluster in the table below to jump to the corresponding section of this page. The Outage schedule section lists all scheduled outages in one place.

Cluster     Status   Planned Outage   Notes
Mahone      Online   No outages
Placentia   Online   No outages
Fundy       Online   No outages
Glooscap    Online   20-23 Feb 2018

Services

Service                            Status   Planned Outage   Notes
WebMO                              Online   No outages
Account creation                   Online   No outages
PGI and Intel licenses             Online   No outages
Videoconferencing (IOCOM Server)   Online   No outages
Legend:
Online: cluster is up and running
Offline: all users cannot log in or submit jobs, or service is not working
Online: some users can log in and/or there are problems affecting your work

Outage schedule

Grid Engine will not schedule any job whose requested run time (h_rt) would extend into a planned outage period. This prevents the job from being terminated prematurely when the system goes down.

  • ALL LOGIN NODES will be rebooted at various times on Monday, January 8, 2018, to update operating systems in response to recent security developments. Compute nodes and running jobs will be unaffected. Downtime is expected to be less than an hour in each case.
  • Groups which have not registered their "Transition Ready" status are blocked from submitting new jobs at Mahone and Fundy. See New Systems Migration for more information.
  • Glooscap will be off-line for electrical power work beginning 07h00 Tuesday, February 20, 2018. Return to service is expected early Friday, February 23.
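
Because of the h_rt rule described above, a job submitted shortly before an outage will only be scheduled if its requested run time ends before the outage begins. The following is a minimal sketch, not an official ACENET tool, of how one might work out the largest h_rt that still fits. The function name max_h_rt, the 10-minute safety margin and the submission time are assumptions made for illustration; the outage start is the Glooscap date listed above. The sketch also assumes the job starts the moment it is submitted, so any time spent waiting in the queue reduces the usable window further.

```python
from datetime import datetime, timedelta


def max_h_rt(submit_time, outage_start, margin=timedelta(minutes=10)):
    """Return the longest run time, as H:MM:SS, that ends before the outage.

    `margin` is an assumed safety buffer, not a Grid Engine setting.
    """
    available = outage_start - submit_time - margin
    total_seconds = int(available.total_seconds())
    if total_seconds <= 0:
        raise ValueError("no run time is available before the outage")
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Glooscap outage start taken from the schedule above (07h00, February 20, 2018).
outage_start = datetime(2018, 2, 20, 7, 0)
# Hypothetical submission time, chosen only for this example.
submit_time = datetime(2018, 2, 19, 9, 0)

print(max_h_rt(submit_time, outage_start))  # prints 21:50:00
```

The result can then be passed as the run time request at submission, for example qsub -l h_rt=21:50:00 job.sh, where job.sh stands in for your own job script.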

Mahone

  • Mahone is back in service after this weekend's electrical power event. Some compute nodes must remain off-line due to the lack of a power distribution bar. This represents a reduction in capacity of about 80 cores.
13:15, December 4, 2017 (AST)
  • A power distribution bar shorted out in one of the racks, which tripped the 150 A breaker in the UPS and took out one entire panel. We are working on bringing the servers up.
08:48, December 4, 2017 (AST)
  • Mahone has been returned to service.
15:47, November 8, 2017 (AST)
  • An unplanned overnight power outage at the SMU has caused all nodes - including the storage system - to crash. The sysadmins are in the process of powering everything up again and assessing any damage.
08:56, November 7, 2017 (AST)

Placentia

  • Placentia is back online after a planned power outage in the data centre that required shutting the cluster down between Nov 23 and Nov 27. Waiting jobs have been restarted.
13:19, November 27, 2017 (AST)
  • The cooling problems in the Placentia machine room have been resolved and the associated outage ended without loss of jobs.
12:01, October 23, 2017 (ADT)

Fundy

  • No recent issues

Glooscap

  • Interactive response of the head node is very slow for many operations. Technical staff are investigating.
16:36, December 6, 2017 (AST)
  • The metadata server was hung all night March 7-8. It was rebooted this morning and Glooscap is operating once again, although technical staff continue to be cautious about its future behaviour. To try to alleviate the load on the metadata server we are withdrawing compute nodes cl002 through cl058 from service. This represents a reduction of 188 cores in the capacity of the cluster.
11:24, March 8, 2017 (AST)