[Update – 10/05/2020 8:18]
Thank you for your patience as we worked through this emergency to restore cooling in the CODA datacenter’s Research Hall. At this time, the Hive, COC-ICE, PACE-ICE, Testflight-CODA, and Phoenix clusters are back online, and users’ previously queued jobs have started.
What has happened and what we did: At 4:30pm today, the main chiller for research computing failed completely on the Research Hall side of the CODA datacenter. PACE urgently shut down the compute nodes for the Hive, COC-ICE, PACE-ICE, Testflight-CODA, and Phoenix clusters. Storage and login nodes were not impacted during this outage. Working with DataBank, we were able to restore sufficient cooling using the economizer module, which can handle all cooling in the Research Hall. At 6:30pm, we brought the Hive cluster back online, and since then we have continued to bring up the compute nodes of the remaining clusters (COC-ICE, PACE-ICE, Testflight-CODA, and Phoenix) while maintaining normal operating temperatures. The vendor arrived at about 7:00pm and is working on the chiller; no interruption should occur when the repaired chiller is brought online. Our storage did not experience data loss, but users’ running jobs were interrupted by this emergency shutdown. We encourage users to check on their jobs and resubmit any that may have been interrupted. Currently, previously queued user jobs are running on the clusters.
What we will continue to do: The PACE team will continue to monitor the situation and will report updates as needed.
For your reference, we are including links to OIT’s status page and our blog post:
Status page: https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/5f7b9062cb294e04bbe8cbda
Blog post: http://blog.pace.gatech.edu/?p=6931
If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.
Thank you for your patience and attention to this emergency.
[Original Post – 10/05/2020 6:16]
Cooling has failed in the CODA datacenter’s Research Hall. We have initiated and completed an emergency shutdown of all resources in the CODA Research Hall, which includes the Hive, COC-ICE, PACE-ICE, Testflight-CODA, and Phoenix clusters.
What is happening and what we have done: We have completed an emergency shutdown of all the clusters in the CODA datacenter. Research data and cluster headnodes are fine, but all running user jobs will have been interrupted by this outage. At this time, we are using the economizer module to provide some cooling, and we are beginning to bring the Hive cluster back up while closely monitoring temperatures.
What we will continue to do: This is an active situation, and we will follow up with updates as they become available.
Also, please follow the updates on OIT’s status page: https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/5f7b9062cb294e04bbe8cbda
Additionally, we are tracking updates on our blog at: http://blog.pace.gatech.edu/?p=6931
This emergency work does not impact any of the resources in the Rich datacenter.
Thank you for your attention to this urgent message.