[Complete] Hive Project & Scratch Storage Cable Replacement – Partnership for an Advanced Computing Environment

[Update 11/12/21 11:30 AM]

The pool rebuilding has completed on Hive GPFS storage, and normal performance has returned.

[Update 11/10/21 11:30 AM]

The cable replacement has been completed without interruption to the storage system. Rebuilding of the pools is now in progress.

[Original Post 11/9/21 5:00 PM]

Summary: A cable replacement on Hive project & scratch storage may cause an outage and will be followed by temporarily decreased performance.

What’s happening and what are we doing: A cable connecting one enclosure of the Hive GPFS device, hosting project (data) and scratch storage, to one of its controllers has failed and needs to be replaced. The replacement will begin around 10:00 AM tomorrow (Wednesday). After the replacement, pools will need to rebuild over the course of about a day.

How does this impact me: Since there is a redundant controller, there should not be an outage during the cable replacement. However, a similar previous replacement caused storage to become unavailable, so an outage remains possible. If this happens, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.
In addition, performance will be slower than usual for a day following the repair as pools rebuild. Jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again.

What we will continue to do: PACE will monitor Hive GPFS storage throughout this procedure. If a loss of availability occurs, we will update you.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.