Cooling Failure in Coda Datacenter

[Update 4/3/25 9:55 AM]
Our vendors are working to restore cooling to the datacenter by fully replacing the cooling system controller, and they expect to complete the work by 7:00 PM ET.
 
We hope to return all systems to service by tomorrow (Friday) evening, provided that all repairs to the cooling system are complete and each system passes stability testing following the shutdown. Clusters will be released tomorrow as testing is completed for each system.
 
We will provide updates on progress via status.gatech.edu and share announcements via specific mailing lists as clusters become available or the situation changes significantly.

[Update 4/2/25 5:50 PM]

Due to continued high temperatures, all Phoenix and Firebird compute nodes have been turned off, and all running jobs have been cancelled. Impacted jobs will be refunded at the end of April.

[Original Post 4/2/25 5:20 PM]

Summary: The controller for the cooling system in the Coda Datacenter has failed. Many PACE nodes have been turned off given the significantly reduced cooling capacity in the datacenter. No jobs can start on research clusters.

Details: The controller for the system providing cooling to nodes in the Coda Research Hall has failed. To avoid damage, PACE has urgently shut down many compute nodes to reduce heat.

Impact: No new jobs can start on PACE’s research clusters (Phoenix, Hive, Buzzard, and Firebird). All Hive and Buzzard compute nodes have been turned off, and running jobs were cancelled. ICE is not yet impacted, but we are monitoring temperatures and may need to shut down ICE nodes as well.

Please visit https://status.gatech.edu for ongoing updates as the situation evolves. Please contact pace-support@oit.gatech.edu with any questions.