Cooling Failure in Coda Datacenter

[Update 4/3/25 9:55 AM]
Our vendors are working to restore cooling capability to the datacenter by fully replacing the cooling system controller and expect to have the work completed by 7:00 PM ET.
 
We hope to return all systems to service by tomorrow (Friday) evening, provided that all repairs to the cooling system are complete and systems prove stable in testing after the shutdown. Clusters will be released tomorrow as testing is completed for each system.
 
We will provide updates on progress via status.gatech.edu and share announcements via specific mailing lists as clusters become available or the situation changes significantly.

[Update 4/2/25 5:50 PM]

Due to continued high temperatures, all Phoenix and Firebird compute nodes have been turned off, and all running jobs were cancelled. Impacted jobs will be refunded at the end of April.

[Original Post 4/2/25 5:20 PM]

Summary: The controller for the cooling system in the Coda Datacenter has failed. Many PACE nodes have been turned off given the significantly reduced cooling capacity in the datacenter. No jobs can start on research clusters.

Details: The controller for the system providing cooling to nodes in the Coda Research Hall has failed. To avoid damage, PACE has urgently shut down many compute nodes to reduce heat.

Impact: No new jobs can start on PACE’s research clusters (Phoenix, Hive, Buzzard, and Firebird). All Hive and Buzzard compute nodes have been turned off, and running jobs were cancelled. There is not yet an impact to ICE, but we may need to shut down ICE nodes as well while we monitor temperatures.

Please visit https://status.gatech.edu for ongoing updates as the situation evolves. Please contact pace-support@oit.gatech.edu with any questions.

Phoenix storage performance degraded

[Update 3/21/25 12:30 PM]

Following the completion of the rebuild and copyback processes on the impacted redundant storage pool, Phoenix project storage performance has returned to normal. Please contact pace-support@oit.gatech.edu if you encounter any further issues.

[Original post 3/19/25 5:00 PM]

Summary: Performance of Phoenix project storage is currently degraded.

Details: Multiple disks in a redundant storage pool failed yesterday and today, and storage performance is degraded while the pool rebuilds.

Impact: Researchers may experience significant slowness in read & write performance on Phoenix project storage until the process is complete. Conda environments located in project storage may be very slow to load (even if the Python script to be run is located elsewhere) or may fail to activate, and attempts to view project storage files via the OnDemand web portal may time out.
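
If you are unsure whether a conda environment lives in project storage, one quick check from a login node is to list your environments and their paths. This is a minimal sketch; the module name is an assumption, so load conda however you usually do:

    module load anaconda3     # module name is an assumption; adjust to your setup
    conda env list            # environments whose paths sit under project storage
                              # will be slow to activate until the rebuild completes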

Please visit https://status.gatech.edu for updates and contact pace-support@oit.gatech.edu with any questions.

New GPUs for Phoenix, V100s being Replaced 

[Additional Message 11/7/24]

As we prepare to remove 12 of the V100 servers from Phoenix next week ahead of the arrival of new GPU nodes in December, we would like to inform you of another set of new GPUs already available on the cluster through the embers backfill QOS.

There are 8 nodes, each with 8 L40S GPUs, for a total of 64 GPUs. These have been available in the Phoenix RHEL9 environment since late September, exclusively on embers (due to the ownership of this equipment).

Visit our Phoenix Slurm guide on GPU requests to learn how to request them; a sketch follows below. Be sure to include a request for the embers QOS when requesting the L40S architecture, at least until the additional L40S nodes for general use become available on inferno in December. You must make the request from the RHEL9 environment. Access via Phoenix OnDemand is not yet available.
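
As a sketch, a batch script requesting one L40S GPU through embers might look like the following; the gres type name, account placeholder, core count, and walltime are assumptions to adapt from the Phoenix Slurm guide:

    #!/bin/bash
    #SBATCH -A <your-charge-account>   # hypothetical placeholder; use your own account
    #SBATCH -q embers                  # embers QOS is required for L40S for now
    #SBATCH --gres=gpu:L40S:1          # gres type name is an assumption; confirm in the guide
    #SBATCH -N 1
    #SBATCH --ntasks-per-node=4
    #SBATCH -t 1:00:00

    nvidia-smi                         # report the allocated GPU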

Please contact pace-support@oit.gatech.edu with any questions.

[Original Post 10/31/24]

We’re happy to announce that there will be 6 new H200 machines coming to Phoenix for general use, each with 8x NVIDIA H200 GPUs, along with 2x L40S machines, each with 8x NVIDIA L40S GPUs. These will be available on the RHEL9 operating system on Phoenix, which is required to support the new hardware.

12 of the existing V100 servers will be REMOVED from the Phoenix RHEL7 environment to make room for the new L40S hardware, as they have reached end-of-life for vendor support. The overall impact will be to greatly increase both the number and power of GPUs available on Phoenix: 24 V100 GPUs will be replaced with 16 L40S and 48 H200 GPUs.
 
This change will begin on Nov. 11th, when the V100 machines will be removed and we will begin installing the new servers, which we hope to release by December 6th.
 
The new machines will be available via both the inferno and embers QOSs on RHEL9. Jobs using the new H200 machines will be charged at a rate of $0.673 per GPU-hour ($1.4571 for GTRI), matching the current H100 rate. The rate for the new L40S GPUs will be shared prior to their release, as we’re working through approvals.
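
As a worked example using the rates above: a 10-hour job occupying all 8 GPUs on one H200 node consumes 80 GPU-hours, for a charge of 80 x $0.673 = $53.84 (or 80 x $1.4571 = $116.57 for GTRI).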

Phoenix project storage outage

[Update 7/9/24 12:00 PM]

Phoenix project storage has been repaired, and the scheduler has resumed. All Phoenix services are now functioning.

We have updated a parameter to throttle the number of operations on the metadata servers to improve stability.

Please contact us at pace-support@oit.gatech.edu if you encounter any remaining issues.

[Original Post 7/8/24 4:40 PM]

Summary: Phoenix project storage is currently inaccessible. We have paused the Phoenix scheduler, so no new jobs will start.

Details: Phoenix Lustre project storage has experienced slowness and has been intermittently unresponsive throughout the day today. The PACE team identified a few user jobs causing a high workload on the storage system, but the load remained high on one metadata server, which eventually stopped responding. Our storage vendor recommended a failover to a different metadata server as part of a repair, but the failover left the system fully unresponsive. PACE and our storage vendor continue to work on restoring full access to project storage.

Impact: The Phoenix scheduler has been paused to prevent new jobs from hanging, so no new jobs can start. Currently running jobs may not make progress and should be cancelled if stuck; see the sketch below. Home and scratch directories remain accessible, but an ls of the full home directory may hang due to the symbolic link to project storage.
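
As a minimal sketch using standard Slurm commands, you can identify your running jobs from a login node and cancel any that are stuck (the job ID below is a placeholder):

    squeue -u $USER -t RUNNING    # list your currently running jobs
    scancel <jobid>               # cancel a job that is hung on project storage I/O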

Thank you for your patience as we work to restore Phoenix project storage. Please contact us at pace-support@oit.gatech.edu with any questions. You may visit https://status.gatech.edu/ for additional updates.

IDEaS Storage Outage Resolved

Summary: PACE’s IDEaS storage was unreachable early this morning. Access was restored at approximately 9:00 AM.

Details: One controller on the IDEaS IntelliFlash storage became unresponsive, and the resource could not switch to the redundant controller. Rebooting both controllers restored access. PACE is working with our storage vendor to identify the cause.

Impact: IDEaS storage could not be reached from PACE or from external mounts during the outage. Any jobs on Phoenix or Hive using IDEaS storage would have failed. If you had a job on Phoenix using IDEaS storage that failed, please email pace-support@oit.gatech.edu to request a refund.

Thank you for your patience as we resolved the issue this morning. Please contact us at pace-support@oit.gatech.edu with any questions.

Firebird scheduler outage resolved

Summary: A configuration issue with the Firebird scheduler caused Firebird jobs to fail over the weekend and this morning because storage was not accessible on compute nodes. The issue was resolved by 2:00 PM today.

Details: Changes to the Firebird scheduler configuration were made during last week’s maintenance period (May 7-9) to facilitate future updates to Firebird. A repair was made on Friday, after which jobs were running successfully. Over the weekend, a different issue occurred, and jobs were launched on compute nodes without the proper storage being mounted. We have fully reverted the Firebird configuration changes to their state prior to the maintenance period, and jobs should no longer encounter these errors.

Impact: Some jobs launched on Firebird over the last three days may have failed due to missing home and project storage on the compute nodes, with messages like “no such file or directory” or an absent output file. Jobs submitted mid-day on Monday, May 13, may have been queued for an extended period while repairs were made to the scheduler configuration.
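
If you would like to guard against this failure mode, one simple hedge (our own sketch, not a PACE requirement) is to verify that storage is visible at the top of a job script before doing any work:

    # Abort early if the home directory is not accessible on the allocated node
    if ! ls "$HOME" >/dev/null 2>&1; then
        echo "home storage not mounted on $(hostname)" >&2
        exit 1
    fi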

Thank you for your patience as we resolved this issue. Please contact us at pace-support@oit.gatech.edu with questions or if you continue to experience errors.

Phoenix A100 CPU:GPU Ratio Change

On Phoenix, the default number of CPUs assigned to jobs requesting an NVIDIA A100 Tensor Core GPU has recently changed. Jobs requesting one or more A100 GPUs are now assigned 8 cores per GPU by default, rather than 32 cores per GPU. You may still request up to 32 cores per GPU by using the --ntasks-per-node flag in your SBATCH script or salloc command to specify the number of CPUs per node your job requires (see the sketch below). Any request with a CPU:GPU ratio of at most 32 will be honored.
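
For example, a job that wants the old ratio can request it explicitly. This is a sketch; the gres type name, QOS, and account placeholder are assumptions to adapt from our documentation:

    #!/bin/bash
    #SBATCH -A <your-charge-account>   # hypothetical placeholder; use your own account
    #SBATCH -q inferno                 # or embers
    #SBATCH --gres=gpu:A100:1          # one A100 GPU; gres type name is an assumption
    #SBATCH --ntasks-per-node=32       # opt back in to 32 CPUs per GPU (default is now 8)
    #SBATCH -N 1
    #SBATCH -t 1:00:00

    nvidia-smi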

12 of our Phoenix A100 nodes host 2 GPUs and 64 CPUs (AMD EPYC 7513), supporting a CPU:GPU ratio of up to 32, and can be allocated through both the inferno (default priority) and embers (free backfill) QOSs. We have recently added 1 more A100 node with 8 GPUs and 64 CPUs (AMD EPYC 7543); since 8 GPUs sharing 64 CPUs cannot support the old 32-core default, this node required the change to the default ratio. This new node is available only to jobs using the embers QOS due to the funding for its purchase.

Please visit our documentation to learn more about GPU requests and QOS or about compute resources on Phoenix and contact us with any questions about this change.

PACE Clusters Unreachable

[3/18/24 10:00 AM]

Full functionality of all PACE clusters has been restored, and the schedulers have resumed launching queued jobs. Please resubmit any jobs that may have failed over the weekend.

A migration of GT’s DNS services on Saturday from BlueCat to EfficientIP caused widespread outages over the weekend to PACE and other campus services. DNS records began to disappear at 5 PM on Saturday and were patched late Saturday night, with PACE login access returning on Sunday morning as changes propagated.

All jobs running on Phoenix and Firebird between 5:30 PM on Saturday, March 16, and 9:00 AM on Monday, March 18, will be refunded.

Thank you for your patience as we recovered from the DNS outage.

[3/16/24 7:15 PM]

Summary: All PACE clusters (Phoenix, Hive, ICE, Firebird, and Buzzard) are currently unreachable due to a domain name resolution (DNS) issue.

Details: We are investigating a DNS issue that has left all PACE clusters unreachable. No further information is known at this time. We are pausing the scheduler on all clusters to prevent additional jobs from starting.

Impact: It is not possible to access any PACE cluster via ssh or OnDemand at this time. On all clusters except Firebird, running jobs may be impacted, and, if you are already connected to a PACE cluster, scheduler and other commands may fail with address resolution errors.
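
For reference, a quick way to test name resolution from a session that is still connected is standard Linux tooling (the hostname below is a placeholder):

    getent hosts <pace-login-hostname>    # prints an address when DNS is resolving;
                                          # returns nothing while resolution is broken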

Thank you for your patience as we work to restore access to PACE clusters. Please contact us at pace-support@oit.gatech.edu with any questions. Please visit status.gatech.edu for updates.

PACE Spending Deadlines for FY24

As you plan your spending on the PACE Phoenix and Firebird clusters for the end of Georgia Tech’s FY24 on June 30, 2024, we would like to alert you to several deadlines:

  1. Due to the time it takes to process purchase requests, we would like to receive all prepaid compute and lump-sum storage purchase requests exceeding $5,000 by April 19, 2024. Please contact us if you know there will be any purchases exceeding that amount so that we may help you with planning.
    1. Purchases under $5,000 can continue without restrictions.
  2. All spending after May 31, 2024, will be held for processing in July, in FY25. This includes postpaid compute jobs run in June, monthly storage payments for June, and new prepaid purchases requested after May 31, 2024.
    1. State funds (DE worktags) expiring on June 30, 2024, may not be used for June spending.
    2. Grant funds (GR worktags) expiring June 30, 2024, may be used for postpaid compute and monthly storage in June.
  3. Existing refresh (CODA20), FY20, and prepaid compute are not impacted, nor is existing prepaid storage.
  4. For worktags that are not expiring, your normal monthly limits on postpaid compute (if selected) will apply in May and June. Monthly storage will continue to be billed as normal.

Find out more about paid compute and storage services available on PACE on our website. If you have any questions or would like to meet with us to discuss the best options for your specific needs, please email us.

Intermittent Scratch Access from Phoenix OnDemand File Browser

Summary: Phoenix scratch storage may not be accessible from the OnDemand file browser. There is no impact to scratch access or performance from login nodes, running jobs (including those launched via OnDemand apps), or Globus. The Globus File Manager may serve as an alternative.

Details: Over the past several weeks, researchers and the PACE team have identified intermittent failures in accessing Phoenix scratch directories from the “Files” tab in Phoenix OnDemand. “Permission denied” or other error messages may be displayed. The PACE team is working to restore reliable access. The issue has been isolated to the way the OnDemand web server accesses scratch storage and therefore does not have a wider impact.

Researchers wishing to use a graphical web-based file browser to manage files in their Phoenix scratch directories are encouraged to use the File Manager in Globus, which has similar capabilities. It is not necessary to install the Globus Connect Personal client on a local computer if you only wish to manage files on Phoenix rather than transfer them. Visit KB0041890 for more information about using Globus. KB0042390 provides information about using the Globus File Manager.

Impact: The impact is only to the file browser in Phoenix OnDemand. There is no impact to scratch access for jobs launched via the “Interactive Apps” or “IDEs” in OnDemand, which run on compute nodes. Similarly, access to scratch from login nodes, jobs on compute nodes, and Globus is normal. There is no performance impact.

Thank you for your patience as we continue investigating this issue. Please contact us at pace-support@oit.gatech.edu with questions or concerns.