Posts

Phoenix Login Outages

Summary: Beginning late Thursday evening, a DNS issue caused researchers to receive error messages when attempting to ssh to Phoenix or to open a shell in Phoenix OnDemand. A workaround has been activated to restore access, but researchers may still encounter intermittent issues.

Details: The load balancer receiving ssh requests to the Phoenix login node began routing to incorrect servers late Thursday evening. The PACE team deployed a workaround at approximately 10:15 AM on Friday that is still propagating to DNS servers.

Impact: Researchers may receive “man-in-the-middle” warnings and be presented with ssh fingerprints that do not match those published by PACE for verification. Overriding the warning may lead to further errors because an incorrect server was reached. Researchers using the cluster shell access in Phoenix OnDemand may receive a connection closed error.

It is possible to work around this outage by connecting via ssh directly to a specific Phoenix login node (numbered -1 through -6). There is no specific workaround for the OnDemand shell, though it is possible to request an Interactive Desktop job and use the terminal within it.
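For example, assuming the standard PACE hostname pattern for the numbered login nodes (please verify the exact hostnames against PACE documentation), a direct connection from a terminal would look like:

  ssh gburdell3@login-phoenix-1.pace.gatech.edu

Here gburdell3 is a placeholder for your GT username, and the -1 suffix may be replaced with any node number from -1 through -6.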

Thank you for your patience as we identified the cause and are working to resolve the issue. Please email pace-support@oit.gatech.edu with any questions or concerns. You may visit status.gatech.edu for ongoing updates.

[Maintenance] PACE Maintenance August 5th-8th

WHEN IS IT HAPPENING?  

PACE’s next Maintenance Period starts at 6:00AM on Tuesday, August 5th, 08/05/2025, and is tentatively scheduled to conclude by 11:59PM on Friday, August 8th, 08/08/2025. The additional day is needed to accommodate physical work being done in the datacenter to allow for installation of an additional pump into the research hall cooling system. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work and testing are completed.

 
WHAT DO YOU NEED TO DO?   

As usual, jobs whose resource requests would cause them to run into the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. CEDAR storage will not be affected.
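For illustration, with the Slurm scheduler used on PACE clusters, a job submitted shortly before the window can still start if its requested walltime ends before maintenance begins; the walltime value and script name below are placeholders only:

  sbatch --time=24:00:00 my_job.sbatch

A job whose requested walltime would overlap the Maintenance Period will simply remain pending until the clusters are released.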

WHAT IS HAPPENING?   

ITEMS REQUIRING USER ACTION: 

  • [Phoenix] We will be decommissioning the login-phoenix-slurm alias, which may cause errors if you have not yet switched to the login-phoenix alias (which points to the same login nodes); see the example below 
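For example, if your ~/.ssh/config still references the old alias, updating the HostName line is sufficient. The entry below is a sketch: the Host label and username are placeholders, and the hostnames assume the pace.gatech.edu domain.

  Host phoenix
      HostName login-phoenix.pace.gatech.edu    # previously login-phoenix-slurm.pace.gatech.edu
      User gburdell3

Connections made directly, e.g. ssh gburdell3@login-phoenix.pace.gatech.edu, are unaffected.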

ITEMS NOT REQUIRING USER ACTION: 

  • [all] DataBank will perform cooling tower maintenance requiring all machines in the research hall to be powered off 
  • [all] DataBank will install piping to prepare to add a spare pump to the cooling system; all cooling to the research hall will be interrupted. 
  • [all] Upgrade system-wide monitoring software 
  • [all] Apply maintenance patches to all compute nodes 
  • [all] Upgrade firmware on all ethernet switches 
  • [Phoenix, Hive, ICE] Upgrade Open OnDemand to 3.1.14 
  • [Phoenix, ICE] Upgrade storage system VMs for Scratch storage.  
  • [ICE] Enable use of Globus for ICE data 
  • [Phoenix, ICE] Upgrade Lustre versions to latest available 
  • [Phoenix] Run data consistency checks on Phoenix Lustre project file system 

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.  

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you,  

-The PACE Team 

Phoenix project storage outage, impacting login

Summary: An outage of the metadata servers on Phoenix project storage (Lustre) is preventing access to that storage and may also prevent login by ssh, access to Phoenix OnDemand, and some Globus access on Phoenix. The PACE team is working to repair the system.

Details: During the afternoon of Saturday, July 19, one of the metadata servers for Phoenix Lustre project storage stopped responding. The failover to the other metadata server was not successful. The PACE team has not yet been able to restore access and has engaged our storage vendor.

Impact: Files on the Phoenix Lustre project storage system are not accessible, and researchers may not be able to log in to Phoenix by ssh or via the OnDemand web interface. Globus on Phoenix may time out, but researchers can type another path into the Path box to bypass the home directory and enter a subdirectory directly (e.g., typing ~/scratch will allow access to scratch storage); VAST project, scratch, and CEDAR storage may still be reachable this way. Research groups that have already migrated to VAST project storage may not be impacted.

Thank you for your patience as we work to restore access to Phoenix project storage. Please contact us at pace-support@oit.gatech.edu with any questions.

[UPDATE Mon 21 Jul, 11:00]

Phoenix project storage outage is over

The outage on the Lustre project storage is over; the scheduler has been released and is accepting jobs. Access through the head nodes, Globus, and Open OnDemand has been restored.

The diagnostic check of the metadata volumes, performed over the weekend, completed successfully. As a precaution, we are running a thorough check of the data volumes to verify there are no other issues. In the unlikely event of data loss, affected data will be restored from backups. Scratch, home, VAST, and CEDAR storage systems were not affected by the outage. The cost of jobs that were terminated due to the outage will be refunded.

We are continuing to work with the storage vendors to prevent project storage outages. The ongoing migration of project storage from Lustre to VAST systems will reduce the impact when one of the shared file systems has issues.  

Degraded performance on Phoenix storage

Dear Phoenix users,

Summary: The project storage system on Phoenix (/storage/coda1) is slower than normal due to heavy use and hard drive failures. The rebuild onto spare hard drives is ongoing; until it is complete, some users might experience slower file access on project storage.

Details: Two hard drives that support the /storage/coda1 project storage failed on July 1 at 3:30am and 9:20am, forcing a rebuild of the data onto spare drives. This rebuild usually takes 24-30 hours to complete. We are closely monitoring the rebuild process, which we expect to complete on July 2 around noon. In addition, we are temporarily moving file services from one metadata server to another and back to rebalance the load across all available systems.

Impact: Access to files is slower than usual during the drive rebuild and metadata server migration process. There is no data loss for any users. For the affected users, the degradation of performance can be observed on the login as well as compute nodes. The file system will continue to be operational while the rebuilds are running in the background. Access to the Phoenix cluster is not impacted, and the scheduler is processing jobs at a normal rate.

We thank you for your patience as we work to resolve the problem.

[Maintenance] Reminder – May 5th-May 9th 2025

[Update] May 9, 2025 at 5:34 pm

Dear PACE Community, 
 
While all PACE clusters are up, have passed tests, and are accepting jobs, you may encounter errors due to inconsistencies in the packages installed across our systems. We are aware of minor differences in the packages installed on compute nodes of the same type and are working to address this as quickly as possible.

Please let us know via email to pace-support@oit.gatech.edu if you encounter any unusual job errors.   

We will continue working to resolve the situation and provide updates as we learn more.  
 
The PACE Team

[Update] May 9, 2025 at 5:16 pm

Dear Firebird users,  

The Firebird cluster is back in production and has resumed running jobs.  

As previously mentioned, this cluster is now only running the RHEL9 operating system. Please reference our prior emails about SSH keys on Firebird if you experience any trouble logging in!  

One RTX6000 GPU node is currently unavailable, but all other GPU types (A100 and H200) are available; we will work to repair this node next week.

Thank you for your patience as we continue to work on the Firebird cluster.  

Best, 

The PACE Team 

[Update] May 9, 2025 at 12:15 pm

Dear PACE users,   

Maintenance on the Hive, Buzzard, ICE and Phoenix clusters is complete. These clusters are back in production, and all jobs held by the scheduler have been released.  

The Firebird cluster is still under maintenance; these users will be notified separately once work is complete.  

We are happy to share that all PACE clusters are now running the RHEL9 operating system and that other important security updates are complete.  

The update to IDEaS storage is ongoing. The storage is currently accessible, but it is still necessary to use the `newgrp` command to set the order of your group membership, just as before maintenance; see the example below.
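For reference, a typical invocation looks like the following, where my-ideas-group is a placeholder for the name of your IDEaS storage group:

  newgrp my-ideas-group

This starts a new shell with the specified group as your primary (first-listed) group, which IDEaS storage access currently requires.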

If you are building or running MPI applications on Phoenix’s H100/H200 nodes, please be aware that the MVAPICH2 and OpenMPI modules are no longer compatible with system updates to the H100/H200 nodes. We highly recommend using HPC-X for MPI, as it provides numerous benefits for MPI + GPU workloads. To use it, load the nvhpc/24.5 and hpcx/2.19-cuda modules. This will not affect the vast majority of single-node Python workflows, which typically do not use MPI. 
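As a starting point, a minimal sketch for building and launching an MPI program with HPC-X on those nodes might look like the following; the source file, binary name, and task count are placeholders, and mpicc/mpirun are the standard wrappers shipped with HPC-X's Open MPI:

  module load nvhpc/24.5 hpcx/2.19-cuda
  mpicc my_app.c -o my_app
  mpirun -np 4 ./my_app

Please check the PACE documentation for the recommended launch method inside Slurm batch jobs.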

Another goal for this maintenance period was the replacement of the problematic cooling system pump. While this system was rigorously tested and calibrated prior to installation, the DataBank datacenter staff were required to remove the new pump and reinstall the original because the new pump did not pass inspection upon installation. We share your frustration in this matter. However, operating a safe and reliable datacenter is of the utmost priority, and we will continue doing our best to keep PACE resources stable until DataBank is able to successfully replace the cooling pump. We are continuing to work with Georgia Tech leadership on long-term solutions to improve overall reliability and meet the expectations of our users.

At this time, we have extended the next maintenance period (August 5-8, 2025) to allow for installation of a new cooling pump. We will share additional information as it becomes available.

Thank you, 

The PACE Team 

[Maintenance] April 28, 2025 at 9:42am

WHEN IS IT HAPPENING?  

PACE’s next Maintenance Period starts at 6:00AM on Monday, May 5th, 05/05/2025, and is tentatively scheduled to conclude by 11:59PM on Friday, May 9th, 05/09/2025. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work and testing are completed.
 
WHAT DO YOU NEED TO DO?   

As usual, jobs whose resource requests would cause them to run into the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. CEDAR storage will not be affected.

WHAT IS HAPPENING?   

ITEMS REQUIRING USER ACTION: 

  • [Firebird] The Firebird system will completely migrate to the RHEL9 operating system 

ITEMS NOT REQUIRING USER ACTION: 

  • Change IDEaS storage user authentication from AD to LDAP
  • Run filesystem checks on all Lustre filesystems
  • Upgrade IDEaS storage
  • Upgrade Phoenix project storage servers and controllers
  • Upgrade Phoenix scratch storage servers and controllers
  • Upgrade ICE scratch storage servers and controllers
  • Move ice-shared from NetApp to VAST storage
  • Rebuild ondemand-ice on physical hardware to handle increased usage
  • Move ICE pace-apps to a separate storage volume
  • Firebird storage and scheduler improvements
  • Upgrade DDN Insight (for monitoring storage system performance)
  • DataBank: replace cooling pump assembly
  • DataBank: cooling tower cleanup

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system. This particular instance is allowing for the complete replacement of a problematic cooling system pump in the datacenter. 

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you,  

-The PACE Team 

PACE Spending Deadlines for FY25

As you plan your spending on the PACE Phoenix and Firebird clusters for the end of Georgia Tech’s FY25 on June 30, 2025, we would like to alert you to several deadlines:

  1. Due to the time it takes to process purchase requests, we would like to receive all prepaid compute and lump-sum storage purchase requests exceeding $5,000 by April 30, 2025. Please contact us if you know there will be any purchases exceeding that amount so that we may help you with planning.
    1. Purchases under $5,000 can continue without restrictions.
  2. All spending after May 31, 2025, will be held for processing in July, in FY26. This includes postpaid compute jobs run in June, monthly storage payments for June, and new prepaid purchases requested after May 31, 2025.
    1. State funds (DE worktags) expiring on June 30, 2025, may not be used for June spending.
    2. Grant funds (GR worktags) expiring June 30, 2025, may be used for postpaid compute and monthly storage in June.
  3. Existing refresh (CODA20), FY20, and prepaid compute are not impacted, nor is existing prepaid storage.
  4. For worktags that are not expiring, your normal monthly limits on postpaid compute (if selected) will apply in May and June. Monthly storage will continue to be billed as normal.

Find out more about paid compute and storage services available on PACE on our website. If you have any questions or would like to meet with us to discuss the best options for your specific needs, please email us.

[Maintenance] Maintenance window EXTENDED – May 5th-9th

As part of the work needed to mitigate cooling issues in the Coda datacenter, there will be a full replacement of the cooling system water pump in the research hall of the datacenter. While we previously hoped to handle the maintenance from May 6-9th, we are now planning to start one day earlier on May 5th at 6am ET due to the volume of physical work that must be carried out in the datacenter.

Because this is the final day for instructors to submit grades, we will ensure that the ICE system remains available to instructors. As usual, reservations have been set on all clusters to prevent jobs from running into the maintenance window.

We will follow up with a full list of the planned activities during this maintenance window in our two-week reminder.

Cooling Failure in Coda Datacenter

[Update 4/3/25 9:55 AM]
Our vendors are working to restore cooling capabilities to the datacenter by fully replacing the cooling system controller and expect to have the work completed by 7:00pm ET.  
 
We hope to return all systems to service by tomorrow (Friday) evening, provided that all repairs to the cooling system are complete and systems have passed stability testing after the shutdown. Clusters will be released tomorrow as testing is completed for each system.
 
We will provide updates on progress via status.gatech.edu and share announcements via specific mailing lists as clusters become available or the situation changes significantly.

[Update 4/2/25 5:50 PM]

Due to continued high temperatures, all Phoenix and Firebird compute nodes have been turned off, and all running jobs were cancelled. Impacted jobs will be refunded at the end of April.

[Original Post 4/2/25 5:20 PM]

Summary: The controller for the cooling system in the Coda Datacenter has failed. Many PACE nodes have been turned off given the significantly reduced cooling capacity in the datacenter. No jobs can start on research clusters.

Details: The controller for the system providing cooling to nodes in the Coda Research Hall has failed. To avoid damage, PACE has urgently shut down many compute nodes to reduce heat.

Impact: No new jobs can start on PACE’s research clusters (Phoenix, Hive, Buzzard, and Firebird). All Hive and Buzzard compute nodes have been turned off, and running jobs were cancelled. There is not yet an impact to ICE, but we may need to shut down ICE nodes as well as we monitor temperatures.

Please visit https://status.gatech.edu for ongoing updates as the situation evolves. Please contact pace-support@oit.gatech.edu with any questions.

[Update] [storage] Phoenix Project storage degraded performance

[Updated March 31, 2025 at 4:14pm]

Dear Phoenix researchers,

As the Phoenix project storage system has stabilized, we have restored login access via ssh and resumed starting jobs.

The cost for the jobs running during the performance degradation will not count towards the March usage.

The Phoenix OnDemand portal can again be used to access project and scratch space. Any user still receiving a “Proxy Error” should contact pace-support@oit.gatech.edu for an individual reset of their OnDemand session.

Globus file transfers have resumed. We have determined that transfers to/from home, scratch, and CEDAR storage were inadvertently paused, and we apologize for any confusion. Any paused transfer should have automatically resumed.

The PACE team continues to monitor the storage system for any further issues. We are working with the vendor to identify the root cause and prevent future performance degradation.

Please contact us at pace-support@oit.gatech.edu with any questions. We appreciate your patience during this unexpected outage.

Best,

The PACE Team

[Updated March 31, 2025 at 12:41pm]

Dear Phoenix Users,

To limit the impact of the current Phoenix project filesystem issues, we have implemented the following changes to expedite troubleshooting and protect currently running jobs:

New Logins to Phoenix Login Nodes are Paused

We have prevented new login attempts to the Phoenix login nodes. Users who are currently logged in will be able to stay logged in to the system.

Phoenix Jobs Prevented from Starting

Jobs that are in the queue but have not yet started have been held to prevent them from starting. These submitted jobs will remain in the queue.

Jobs that are currently running may experience decreased performance if using project storage. We are doing our best to prioritize the successful completion of these jobs.

Open OnDemand (OOD)

Users of Phoenix OOD can log in and interact with only their home directory. Project and scratch space are not available.

Some users of Open OnDemand may be unable to reach this service and are experiencing “Proxy Error” messages. We are investigating the root cause of this issue.

Globus File Transfer Paused for Project Space

File transfers to/from project storage on Globus have been paused. Other Globus transfers (Box, DropBox, and OneDrive cloud connectors; scratch; home; and CEDAR) will continue.

The PACE team is working to diagnose the current issues with support from our filesystem vendor. We will continue to share updates as we have them and apologize for this unexpected service outage.

Best,

The PACE Team