PACE Maintenance Complete

Maintenance on the Phoenix, Hive, ICE, and Firebird clusters is complete.  Some maintenance work is ongoing for the OSG Buzzard cluster, but jobs are running. The physical datacenter work to allow for installation of a 2nd cooling pump in the research hall was successfully completed, and we expect the new pump to be brought online in October during our next Maintenance Period, which will be October 6-8th, 2025.  

The Phoenix, Hive, ICE, and Firebird clusters are back in production and ready for research and instruction; all jobs that were held by the scheduler have been released, Globus and Open OnDemand services have resumed, and access to login nodes is restored.  
 
Potential Issues 

  1. The TensorFlow 2.16 module is now incompatible with the up-to-date CUDA drivers on GPU nodes. An updated TensorFlow 2.17 module is targeted for release next week.  We have not observed issues with other CUDA-dependent modules such as PyTorch and CUDA C/C++ apps.
  2. Phoenix users of the Ansys Fluent GUI should use the dedicated Ansys Workbench application in Open OnDemand, which was introduced to increase the stability and usability of Ansys products on Phoenix. Ansys Fluent 2025R1 in Interactive Desktop may produce MPI errors while 2024R2 works as expected. Hive users must continue using the Interactive Desktop to run Ansys Fluent.
  3. The OSG Buzzard cluster is expected to resume full functionality midway through next week, though jobs scheduled through the OSPool and project-specific pools are being accepted and run.

Thank you and happy computing! 

The PACE Team 

[Maintenance] PACE Maintenance August 5th-8th

WHEN IS IT HAPPENING?  

PACE’s next Maintenance Period starts at 6:00AM on Tuesday, August 5th (08/05/2025), and is tentatively scheduled to conclude by 11:59PM on Friday, August 8th (08/08/2025). The additional day is needed to accommodate physical work being done in the datacenter to allow for installation of an additional pump into the research hall cooling system. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work and testing are completed.

 
WHAT DO YOU NEED TO DO?   

As usual, the scheduler will hold any job whose resource request would have it running during the Maintenance Period, and will release it after maintenance is complete. During this Maintenance Period, all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. CEDAR storage will not be affected.

WHAT IS HAPPENING?   

ITEMS REQUIRING USER ACTION: 

  • [Phoenix] We will be decommissioning the login-phoenix-slurm alias, which may cause errors if you haven’t moved to using the login-phoenix alias (which points to the same login nodes)
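
If you are unsure whether any of your scripts or SSH configuration files still reference the old alias, the check can be sketched in Python as below. This is a minimal sketch and not a PACE-provided tool: the function names are hypothetical, and the domain example.edu in the usage note is a placeholder, so substitute the hostname you actually use.

```python
import re

# Alias names are taken from the announcement above.
OLD_ALIAS = "login-phoenix-slurm"
NEW_ALIAS = "login-phoenix"

# Match the old alias only as a whole host label, so that a longer
# name such as "login-phoenix-slurm2" is left untouched.
_PATTERN = re.compile(r"(?<![\w-])" + re.escape(OLD_ALIAS) + r"(?![\w-])")


def find_alias_lines(text):
    """Return (line_number, line) pairs that still mention the old alias."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if _PATTERN.search(line)
    ]


def replace_alias(text):
    """Return a copy of text with the old alias replaced by the new one."""
    return _PATTERN.sub(NEW_ALIAS, text)
```

For example, calling replace_alias on the contents of a script that runs "ssh user@login-phoenix-slurm.example.edu" returns the same text pointing at "login-phoenix.example.edu", while lines that already use login-phoenix are unchanged. Review the output of find_alias_lines before rewriting any file.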

ITEMS NOT REQUIRING USER ACTION: 

  • [all] DataBank will perform cooling tower maintenance requiring all machines in the research hall to be powered off 
  • [all] DataBank will install piping to prepare to add a spare pump to the cooling system; all cooling to the research hall will be interrupted
  • [all] Upgrade system-wide monitoring software 
  • [all] Apply maintenance patches to all compute nodes 
  • [all] Upgrade firmware on all ethernet switches 
  • [Phoenix, Hive, ICE] Upgrade Open OnDemand to 3.1.14 
  • [Phoenix, ICE] Upgrade storage system VMs for Scratch storage
  • [ICE] Enable use of Globus for ICE data 
  • [Phoenix, ICE] Upgrade Lustre versions to latest available 
  • [Phoenix] Run data consistency checks on Phoenix Lustre project file system 

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.  

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you,  

-The PACE Team