Phoenix Login Outages

Summary: Beginning late Thursday evening, a DNS issue caused researchers to receive error messages when attempting to ssh to Phoenix or to open a shell in Phoenix OnDemand. A workaround has been activated to restore access, but researchers may still encounter intermittent issues.

Details: Late Thursday evening, the load balancer receiving ssh requests for the Phoenix login node began routing them to incorrect servers. The PACE team deployed a workaround at approximately 10:15 AM on Friday that is still propagating across DNS servers.

Impact: Researchers may receive “man-in-the-middle” warnings and be presented with ssh fingerprints that do not match those published by PACE for verification. Overriding the warning may lead to further errors, since an incorrect server was reached. Researchers using the cluster shell access in Phoenix OnDemand may receive a connection-closed error.

It is possible to work around this outage by connecting via ssh directly to a specific Phoenix login node (-1 through -6). There is no specific workaround for the OnDemand shell, though it is possible to request an Interactive Desktop job and use the terminal within it.
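For example, a direct connection to the fourth login node might look like the sketch below, assuming the standard login-phoenix-N hostname pattern (substitute your own GT username for the placeholder shown):

    ssh gburdell3@login-phoenix-4.pace.gatech.edu

The fingerprint presented should then match the one published by PACE for that specific node.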

Thank you for your patience as we identified the cause and are working to resolve the issue. Please email pace-support@oit.gatech.edu with any questions or concerns. You may visit status.gatech.edu for ongoing updates.

Phoenix project storage outage, impacting login

Summary: An outage of the metadata servers on Phoenix project storage (Lustre) is preventing access to that storage and may also prevent login by ssh, access to Phoenix OnDemand, and some Globus access on Phoenix. The PACE team is working to repair the system.

Details: During the afternoon of Saturday, July 19, one of the metadata servers for Phoenix Lustre project storage stopped responding. The failover to the other metadata server was not successful. The PACE team has not yet been able to restore access and has engaged our storage vendor.

Impact: Files on the Phoenix Lustre project storage system are not accessible, and researchers may not be able to log in to Phoenix by ssh nor via the OnDemand web interface. Globus on Phoenix may time out, but researchers can type another path into the Path box to bypass the home directory and enter a subdirectory directly (e.g., typing ~/scratch will allow access to scratch storage); VAST project, scratch, and CEDAR storage may still be reachable this way. Research groups that have already migrated to VAST project storage may not be impacted.

Thank you for your patience as we work to restore access to Phoenix project storage. Please contact us at pace-support@oit.gatech.edu with any questions.

[Update 7/21/25 11:00 AM]

Phoenix project storage outage is over

The outage on the Lustre project storage is over; the scheduler has been released and is accepting jobs. Access through the head nodes, Globus, and Open OnDemand has been restored.

The diagnostic check of the metadata volumes, performed over the weekend, completed successfully. As a precaution, we are running a thorough check of the data volumes to verify there are no other issues. In the unlikely event of data loss, affected files will be restored from backups. Scratch, home, VAST, and CEDAR storage systems were not affected by the outage. The cost of jobs that were terminated due to the outage will be refunded.

We are continuing to work with our storage vendors to prevent future project storage outages. The ongoing migration of project storage from Lustre to VAST will reduce the impact when one of the shared file systems has issues.

Cooling Failure in Coda Datacenter

[Update 4/3/25 9:55 AM]
Our vendors are working to restore cooling capability to the datacenter by fully replacing the cooling system controller and expect to have the work completed by 7:00 PM ET.
 
We hope to return all systems to service by tomorrow (Friday) evening, provided that all repairs to the cooling system are complete and each system passes stability testing after the shutdown. Clusters will be released tomorrow as testing is completed for each system.
 
We will provide updates on progress via status.gatech.edu and share announcements via specific mailing lists as clusters become available or the situation changes significantly.

[Update 4/2/25 5:50 PM]

Due to continued high temperatures, all Phoenix and Firebird compute nodes have been turned off, and all running jobs were cancelled. Impacted jobs will be refunded at the end of April.

[Original Post 4/2/25 5:20 PM]

Summary: The controller for the cooling system in the Coda Datacenter has failed. Many PACE nodes have been turned off given the significantly reduced cooling capacity in the datacenter. No jobs can start on research clusters.

Details: The controller for the system providing cooling to nodes in the Coda Research Hall has failed. To avoid damage, PACE has urgently shut down many compute nodes to reduce heat.

Impact: No new jobs can start on PACE’s research clusters (Phoenix, Hive, Buzzard, and Firebird). All Hive and Buzzard compute nodes have been turned off, and running jobs were cancelled. There is not yet an impact to ICE, but we may need to shut down ICE nodes as well while we monitor temperatures.

Please visit https://status.gatech.edu for ongoing updates as the situation evolves. Please contact pace-support@oit.gatech.edu with any questions.

Phoenix storage performance degraded

[Update 3/21/25 12:30 PM]

Following the completion of the rebuild and copyback processes on the impacted redundant storage pool, Phoenix project storage performance has returned to normal. Please contact pace-support@oit.gatech.edu if you encounter any further issues.

[Original post 3/19/25 5:00 PM]

Summary: Performance of Phoenix project storage is currently degraded.

Details: Multiple disks in a redundant storage pool failed yesterday and today, and storage performance is degraded while the pool rebuilds.

Impact: Researchers may experience significant slowness in read & write performance on Phoenix project storage until the process is complete. Conda environments located in project storage may be very slow to load (even if the Python script being run is located elsewhere) or may fail to activate, and attempts to view project storage files via the OnDemand web portal may time out.

Please visit https://status.gatech.edu for updates and contact pace-support@oit.gatech.edu with any questions.

New GPUs for Phoenix, V100s Being Replaced

[Additional Message 11/7/24]

As we prepare to remove 12 of the V100 servers from Phoenix next week, ahead of the arrival of new GPU nodes in December, we would like to inform you of another set of new GPUs available on the cluster through the embers backfill QOS.

There are 8 nodes, each with 8 L40S GPUs, providing 64 GPUs in the Phoenix RHEL9 environment. Due to the ownership of this equipment, these GPUs have been available exclusively on embers since late September.

Visit our Phoenix Slurm guide on GPU requests to learn how to request them. Be sure to include a request for the embers QOS when requesting the L40S architecture, at least until the additional L40S nodes for general use become available on inferno in December. You must make the request from the RHEL9 environment. Access via Phoenix OnDemand is not yet available.
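As a minimal sketch of such a request (the GPU type string and charge-account placeholder are assumptions; see the Slurm guide for the exact values on Phoenix):

    #!/bin/bash
    #SBATCH -J l40s-test                 # job name
    #SBATCH -q embers                    # embers QOS is required for L40S for now
    #SBATCH --gres=gpu:L40S:1            # one L40S GPU (type string assumed)
    #SBATCH -N 1 --ntasks-per-node=4     # one node, four tasks
    #SBATCH -t 15                        # 15-minute walltime
    #SBATCH -A gts-<pi-username>         # replace with your charge account

    nvidia-smi                           # confirm the allocated GPU is visible

Submit the script with sbatch from a RHEL9 login session, since the L40S nodes exist only in that environment.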

Please contact pace-support@oit.gatech.edu with any questions.

[Original Post 10/31/24]

We’re happy to announce that there will be 6 new H200 machines coming to Phoenix for general use, each with 8x NVIDIA H200 GPUs, along with 2x L40S machines, each with 8x NVIDIA L40S GPUs. These will be available on the RHEL9 operating system on Phoenix, which is required to support the new hardware.

12 of the existing V100 servers will be REMOVED from the Phoenix RHEL7 environment to make room for the new L40S hardware, as they have reached end-of-life for vendor support. The overall impact will be a large increase in both the number and power of GPUs available on Phoenix: 24 V100 GPUs will be replaced with 16 L40S and 48 H200 GPUs.
 
This change will begin on Nov. 11th, when the V100 machines will be removed and we will begin installing the new servers, which we hope to release by December 6th.
 
The new machines will be available via both the inferno and embers QOSs on RHEL9. Jobs using the new H200 machines will be charged at a rate of $0.673 per GPU-hour ($1.4571 for GTRI), matching the current H100 rate. The rate for the new L40S GPUs will be shared prior to their release, as we’re working through approvals.
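As a worked example at this rate, a job running for 10 hours on 4 H200 GPUs accrues 4 × 10 = 40 GPU-hours, billed at 40 × $0.673 = $26.92 (or 40 × $1.4571 = $58.28 for GTRI).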

Phoenix project storage outage

[Update 7/9/24 12:00 PM]

Phoenix project storage has been repaired, and the scheduler has resumed. All Phoenix services are now functioning.

We have updated a parameter to throttle the number of operations on the metadata servers to improve stability.

Please contact us at pace-support@oit.gatech.edu if you encounter any remaining issues.

[Original Post 7/8/24 4:40 PM]

Summary: Phoenix project storage is currently inaccessible. We have paused the Phoenix scheduler, so no new jobs will start.

Details: Phoenix Lustre project storage has experienced slowness and has been intermittently unresponsive throughout the day. The PACE team identified a few user jobs causing a high workload on the storage system, but the load remained high on one metadata server, which eventually stopped responding. Our storage vendor recommended a failover to a different metadata server as part of a repair, but the failover left the system fully unresponsive. PACE and our storage vendor continue to work on restoring full access to project storage.

Impact: The Phoenix scheduler has been paused to prevent new jobs from hanging, so no new jobs can start. Currently running jobs may not make progress and should be cancelled if stuck. Home and scratch directories remain accessible, but an ls of the full home directory may hang due to the symbolic link to project storage.

Thank you for your patience as we work to restore Phoenix project storage. Please contact us at pace-support@oit.gatech.edu with any questions. You may visit https://status.gatech.edu/ for additional updates.

IDEaS Storage Outage Resolved

Summary: PACE’s IDEaS storage was unreachable early this morning. Access was restored at approximately 9:00 AM.

Details: One controller on the IDEaS IntelliFlash storage became unresponsive, and the system could not fail over to the redundant controller. Rebooting both controllers restored access. PACE is working with our storage vendor to identify the cause.

Impact: IDEaS storage could not be reached during the outage, whether from PACE systems or external mounts. Any jobs on Phoenix or Hive using IDEaS storage would have failed. If you had a job on Phoenix using IDEaS storage that failed, please email pace-support@oit.gatech.edu to request a refund.

Thank you for your patience as we resolved the issue this morning. Please contact us at pace-support@oit.gatech.edu with any questions.

Firebird scheduler outage resolved

Summary: A configuration issue with the Firebird scheduler caused Firebird jobs to fail over the weekend and this morning, as storage was not accessible on compute nodes. The issue was resolved by 2:00 PM today.

Details: Changes to the Firebird scheduler configuration were made during last week’s maintenance period (May 7-9) to facilitate future updates to Firebird. A repair was made on Friday, after which jobs ran successfully. Over the weekend, a different issue occurred, and jobs were launched on compute nodes without the proper storage mounted. We have fully reverted the Firebird configuration changes to their state prior to the maintenance period, and jobs should no longer encounter errors.

Impact: Some jobs launched on Firebird over the last three days may have failed due to missing home and project storage on the compute nodes, with messages like “no such file or directory” or an absent output file. Jobs submitted mid-day on Monday, May 13, may have been queued for an extended period while repairs were made to the scheduler configuration.

Thank you for your patience as we resolved this issue. Please contact us at pace-support@oit.gatech.edu with questions or if you continue to experience errors.

Phoenix A100 CPU:GPU Ratio Change

On Phoenix, the default number of CPUs assigned to jobs requesting an NVIDIA Tensor Core A100 GPU has recently changed. Jobs requesting one or more A100 GPUs will now be assigned 8 cores per GPU by default, rather than 32. You may still request up to 32 cores per GPU by using the --ntasks-per-node flag in your SBATCH script or salloc command to specify the number of CPUs per node your job requires. Any request with a CPU:GPU ratio of at most 32 will be honored.
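For instance, an interactive request restoring the previous 32-cores-per-GPU allocation might look like the sketch below (the GPU type string and charge-account placeholder are assumptions; see our documentation for the exact values):

    salloc -N 1 --gres=gpu:A100:1 --ntasks-per-node=32 -t 1:00:00 -q inferno -A gts-<pi-username>

Omitting --ntasks-per-node would instead yield the new default of 8 cores for the single requested GPU.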

12 of our Phoenix A100 nodes host 2 GPUs and 64 CPUs (AMD Epyc 7513), supporting a CPU:GPU ratio of up to 32, and can be allocated through both the inferno (default priority) and embers (free backfill) QOSs. We have recently added 1 more A100 node with 8 GPUs and 64 CPUs (AMD Epyc 7543); because that node provides only 8 CPUs per GPU, the default ratio had to change. This new node is available only to jobs using the embers QOS due to the funding for its purchase.

Please visit our documentation to learn more about GPU requests and QOS or about compute resources on Phoenix, and contact us with any questions about this change.

PACE Clusters Unreachable

[3/18/24 10:00 AM]

Full functionality of all PACE clusters has been restored, and the schedulers have resumed launching queued jobs. Please resubmit any jobs that may have failed over the weekend.

A migration of GT’s DNS services on Saturday from BlueCat to Efficient IP caused widespread outages over the weekend to PACE and other campus services. DNS records began to disappear at 5 PM on Saturday and were patched late Saturday night, with PACE login access reappearing on Sunday morning as changes propagated.

All jobs running on Phoenix and Firebird between 5:30 PM on Saturday, March 16, and 9:00 AM on Monday, March 18, will be refunded.

Thank you for your patience as we recovered from the DNS outage.

[3/16/24 7:15 PM]

Summary: All PACE clusters (Phoenix, Hive, ICE, Firebird, and Buzzard) are currently unreachable due to a Domain Name System (DNS) resolution issue.

Details: We are investigating a DNS issue that has left all PACE clusters unreachable. No further information is known at this time. We are pausing the scheduler on all clusters to prevent additional jobs from starting.

Impact: It is not possible to access any PACE cluster via ssh or OnDemand at this time. On all clusters except Firebird, running jobs may be impacted, and scheduler and other commands may fail with address resolution errors for researchers who are already connected.

Thank you for your patience as we work to restore access to PACE clusters. Please contact us at pace-support@oit.gatech.edu with any questions. Please visit status.gatech.edu for updates.