Resolved – Scratch Space Outage on the Phoenix Cluster

[Update 11/6/2023 at 12:26 pm]

Dear Phoenix users,

Summary: The Phoenix cluster is back online. The scheduler has been unpaused, and the jobs that were put on hold have resumed.

Details: The PACE support team has upgraded the components of the scratch storage system (controller software, disk firmware) according to the plan provided by the hardware vendor (DDN). We then tested the performance of the file system, and the tests passed.

Impact: Please continue using the Phoenix cluster as usual. If you encounter any issues, please contact us at pace-support@oit.gatech.edu. Please also keep in mind that the cluster will be offline tomorrow (November 7) from 8am until 8pm so that the PACE team can work on fixing the project storage, which is an unrelated issue.

Thank you and have a great day!

The PACE Team

[Update 11/6/2023 at 9:27 am]

Dear Phoenix users, 

Summary: Storage performance on Phoenix scratch space is degraded. 

Details: Around 11pm on Saturday (November 4, 2023), the scratch space on the Phoenix cluster became unresponsive. The scratch space is currently inaccessible to users. The PACE team is investigating the situation and applying an upgrade recommended by the vendor to improve stability. The PACE team paused the scheduler on Phoenix at 8:13am on Monday, November 6, to prevent additional job failures. The upgrade is estimated to take until 12pm on Monday. After the upgrade is installed, the scheduler will be released, and the paused jobs will resume executing. This issue is unrelated to the slowness of the Phoenix project storage that was reported last week, which will be addressed during the Phoenix outage tomorrow (November 7).

Impact: Phoenix users are currently unable to access the scratch storage. Jobs on the Phoenix cluster have been paused, and new jobs will not start until the scheduler is resumed. Other PACE clusters (ICE, Hive, Firebird, Buzzard) are not affected.

We apologize for the multiple storage access issues that have been observed on the Phoenix cluster. We are continuing to engage with the storage vendor to improve the performance of our system. The recommended upgrade is in progress, and the cluster will be offline tomorrow to address the project filesystem issue.

Thank you for your patience!

The PACE Team