Phoenix Project & Scratch Storage Cable Replacement

[Update 09/16/2022 2:18 PM]

Work was completed on September 15 as scheduled in the original post.

[Original post: 09/12/2022 3:40PM]

Summary: Phoenix project & scratch storage cable replacement, with a potential outage and temporarily decreased performance afterward

Details: A cable connecting one enclosure of the Phoenix Lustre device, hosting project and scratch storage, to one of its controllers needs to be replaced, beginning around 1PM Thursday, September 15th, 2022. After the replacement, pools will need to rebuild over the course of about a day.

Impact: Since there is a redundant controller, there should not be an outage during the cable replacement. However, a similar replacement in the past caused storage to become unavailable, so an outage remains possible. If this happens, your job may fail or run without making progress; please cancel such a job and resubmit it once storage availability is restored (see the example commands below). In addition, performance will be slower than usual for about a day following the repair as the pools rebuild, so jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it. PACE will monitor Phoenix Lustre storage throughout this procedure and will update you if a loss of availability occurs.
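
For reference, a minimal sketch of cancelling and resubmitting a job from the command line, assuming the standard Torque client tools referenced elsewhere on this page; the job ID 12345 and script name my_job.pbs are placeholders:

    # List your jobs and check their state.
    qstat -u $USER

    # Cancel a job that has failed or stopped making progress (12345 is a placeholder job ID).
    qdel 12345

    # Resubmit once storage availability is restored (my_job.pbs is a placeholder script).
    qsub my_job.pbs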

Please accept our sincere apologies for any inconvenience this temporary limitation may cause. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Hive Project & Scratch Storage Cable Replacement

[Update 09/16/2022 2:18 PM]

Work was completed on September 15 as scheduled in the original post.

[Original post: 09/12/2022 3:40PM]

Summary: Hive project & scratch storage cable replacement and potential for an outage.

Details: Two cables connecting the Hive GPFS device, hosting project (data) and scratch storage, to one of its controllers need to be replaced, beginning around 10AM Thursday, September 15th, 2022.

Impact: Since there is a redundant controller, no impact is expected. However, a similar replacement in the past caused storage to become unavailable, so an outage remains possible. If the redundant controller fails, your job may fail or run without making progress; please cancel such a job and resubmit it once storage availability is restored. PACE will monitor Hive GPFS storage throughout this procedure and will update you if a loss of availability occurs.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Data Center Power Outage

[Update 09/07/2022 10:23 AM]

Cooling has been fully restored to the datacenter, and operations have resumed on all PACE clusters, including Phoenix, Hive, Firebird, Buzzard, PACE-ICE, and COC-ICE. Jobs are now running, and new jobs may be submitted via the command line or Open OnDemand.

Any jobs that were running at the time of the outage have been cancelled and must be resubmitted by the researcher.

Refunds will be issued for jobs that were cancelled on clusters with charge accounts (Phoenix and Firebird).

Thank you for your patience as emergency repairs were completed. Please contact us at pace-support@oit.gatech.edu with any questions.


[Update 09/06/2022 4:51 PM]

Summary: Unfortunately, the same issue that occurred yesterday with the primary cooling loop has now occurred with the secondary cooling loop. OIT operations and Databank have requested that we power off all compute nodes so that the secondary cooling loop can be repaired. We have captured all job information for Phoenix, Hive, COC-ICE, PACE-ICE, and Firebird, and we will refund all running jobs on Firebird and Phoenix.

Impact: Researchers will not be able to submit new jobs to the clusters, either from the command line or from the Open OnDemand web server; existing running jobs will be killed, and computing credits will be refunded where applicable. Partially completed jobs will lose their progress unless checkpointing is supported in the job.
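
As an illustration of checkpointing, the job script below is a minimal sketch of one common pattern: record progress to a file after each completed unit of work so that a resubmitted job resumes where the previous one stopped. The script is hypothetical; compute_step stands in for whatever command performs one step of your calculation.

    #!/bin/bash
    #PBS -N resumable-job
    #PBS -l walltime=12:00:00

    cd "$PBS_O_WORKDIR"

    CKPT=progress.ckpt
    START=0
    # Resume from the last recorded step if a checkpoint file exists.
    [ -f "$CKPT" ] && START=$(cat "$CKPT")

    for ((i = START; i < 1000; i++)); do
        ./compute_step "$i"        # hypothetical command for one unit of work
        echo $((i + 1)) > "$CKPT"  # record progress so a resubmitted job resumes here
    done

With this pattern, a job that is killed mid-run and resubmitted repeats at most one step of work.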

We are sorry for the inconvenience and thank you for your patience; please contact us at pace-support@oit.gatech.edu with any questions. We will keep you posted with any updates.


[Update 09/06/2022 11:10 AM]

Cooling has been restored to the datacenter, and operations have resumed on all PACE clusters, including Phoenix, Hive, Firebird, Buzzard, PACE-ICE, and COC-ICE. Jobs are now running, and new jobs may be submitted via the command line or Open OnDemand. Any jobs that were running at the time of the outage have been cancelled and must be resubmitted by the researcher. Refunds will be issued for jobs that were cancelled on clusters with charge accounts (Phoenix and Firebird).

Thank you for your patience as emergency repairs were completed to avoid damage to the datacenter. Please contact us at pace-support@oit.gatech.edu with any questions.


[Update 09/06/2022 9:58 AM]

Around 8:00 PM on 09/05/2022, OIT operations and Databank requested that PACE power off all compute nodes to avoid additional issues. PACE captured all job information for Phoenix, Hive, COC-ICE, PACE-ICE, and Firebird, and we will refund all running jobs on Firebird and Phoenix after the clusters have been brought back online.

Cooling has now been restored in the Coda datacenter, and as of about 7:00 AM (09/06/22), PACE has been cleared to bring the clusters online and test them before releasing them to users.


[Update 09/05/2022 8:45 PM]

Summary: Unfortunately, cooling tower issues continue. OIT operations and Databank requested that we power off all compute nodes to avoid additional issues. We have captured all job information for Phoenix, Hive, COC-ICE, PACE-ICE, and Firebird, and we will refund all running jobs on Firebird and Phoenix.

Impact: Researchers will not be able to submit new jobs to the clusters, either from the command line or from the Open OnDemand web server; existing running jobs will be killed, and computing credits will be refunded where applicable. Partially completed jobs will lose their progress unless checkpointing is supported in the job.

We are sorry for the inconvenience and thank you for your patience; please contact us at pace-support@oit.gatech.edu with any questions. We will keep you posted with any updates.

[Original post]

Summary: One of the cooling towers in the CODA data center is experiencing issues, and the temperature is rising. We need to pause all PACE cluster schedulers and possibly power down all compute nodes.

Impact: Researchers won’t be able to submit new jobs to the clusters either from the command line or the Open OnDemand web server; the existing jobs should continue running.

Thank you for your patience this afternoon; please contact us at pace-support@oit.gatech.edu with any questions. We will keep you posted with any updates.

Phoenix Scheduler Outage

[Update 9/2/22 5:23 PM]

The PACE team has identified an issue with the Phoenix scheduler and restored functionality. The scheduler is currently back up and running and new jobs can be submitted. We will continue to monitor scheduler performance and we appreciate your patience as we work through this. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Post 9/2/22 4:33 PM]

Summary: The Phoenix scheduler became unresponsive this afternoon around 4 PM.

Details: The Torque resource manager on the Phoenix scheduler shut down unexpectedly around 4:00 PM. The PACE team is working on a resolution.

Impact: Commands such as “qsub” and “qstat” are affected: no new jobs can be submitted, including via Phoenix Open OnDemand, and currently running jobs cannot be queried. Running jobs are not expected to be interrupted.
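
Once service is restored, a read-only query is a harmless way to confirm that the resource manager is responding again; a minimal sketch using the standard Torque client tools:

    # A normal response from either command indicates the scheduler is back.
    qstat -q          # list the configured queues
    qstat -u $USER    # confirm your running jobs can be queried again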

Thank you for your patience this afternoon. Please contact us at pace-support@oit.gatech.edu with any questions.