[Resolved] Unexpected downtime on compute nodes

[update] We think we’re back up at this point. If you see odd behavior, please email a support request to the PACE team at pace-support@oit.gatech.edu.

The cause appears to have been an electrician inadvertently switching off a circuit breaker, and we do not expect it to recur.

====================

A power problem in the data center this afternoon cut power to three of our racks. This has affected part or all of each of the following clusters:

Apurimac

Prometheus

Cygnus

Granulous

ECE

Monkeys

Isabella

CEE

Aryabhata

Optimus

Atlas

BioCluster

We’re looking into the cause of the problem, and have already started bringing up compute nodes.