Posts

Phoenix Login Outages

Summary: Beginning late Thursday evening, a DNS issue caused researchers to receive error messages when attempting to ssh to Phoenix or to open a shell in Phoenix OnDemand. A workaround has been activated to restore access, but researchers may still encounter intermittent issues.

Details: The load balancer receiving ssh requests to the Phoenix login node began routing to incorrect servers late Thursday evening. The PACE team deployed a workaround at approximately 10:15 AM on Friday that is still propagating to DNS servers.

Impact: Researchers may receive “man-in-the-middle” warnings and be presented with ssh fingerprints that do not match those published by PACE for verification. Overriding the warning may lead to further errors because an incorrect server was reached. Researchers using the cluster shell access in Phoenix OnDemand may receive a connection closed error.

It is possible to work around this outage by connecting via ssh directly to a specific Phoenix login node (numbered -1 through -6). There is no specific workaround for the OnDemand shell, though it is possible to request an Interactive Desktop job and use the terminal within it.
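For example, assuming the standard PACE hostname pattern for the numbered login nodes (please verify the exact hostnames against PACE documentation), a direct connection from a terminal would look like:

  ssh gburdell3@login-phoenix-1.pace.gatech.edu

Here gburdell3 is a placeholder for your GT username, and the -1 suffix may be replaced with any node number from -1 through -6.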

Thank you for your patience as we identified the cause and are working to resolve the issue. Please email pace-support@oit.gatech.edu with any questions or concerns. You may visit status.gatech.edu for ongoing updates.

[Maintenance] PACE Maintenance August 5th-8th

WHEN IS IT HAPPENING?  

PACE’s next Maintenance Period starts at 6:00AM on Tuesday, August 5th, 08/05/2025, and is tentatively scheduled to conclude by 11:59PM on Friday, August 8th, 08/08/2025. The additional day is needed to accommodate physical work being done in the datacenter to allow for installation of an additional pump into the research hall cooling system. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work and testing are completed.

 
WHAT DO YOU NEED TO DO?   

As usual, jobs whose resource requests would cause them to run into the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. CEDAR storage will not be affected.
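For illustration, with the Slurm scheduler used on PACE clusters, a job submitted shortly before the window can still start if its requested walltime ends before maintenance begins; the walltime value and script name below are placeholders only:

  sbatch --time=24:00:00 my_job.sbatch

A job whose requested walltime would overlap the Maintenance Period will simply remain pending until the clusters are released.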

WHAT IS HAPPENING?   

ITEMS REQUIRING USER ACTION: 

  • [Phoenix] We will be decommissioning the login-phoenix-slurm alias, which may cause errors if you have not yet switched to the login-phoenix alias (which points to the same login nodes); see the example below 
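For example, if your ~/.ssh/config still references the old alias, updating the HostName line is sufficient. The entry below is a sketch: the Host label and username are placeholders, and the hostnames assume the pace.gatech.edu domain.

  Host phoenix
      HostName login-phoenix.pace.gatech.edu    # previously login-phoenix-slurm.pace.gatech.edu
      User gburdell3

Connections made directly, e.g. ssh gburdell3@login-phoenix.pace.gatech.edu, are unaffected.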

ITEMS NOT REQUIRING USER ACTION: 

  • [all] DataBank will perform cooling tower maintenance requiring all machines in the research hall to be powered off 
  • [all] DataBank will install piping to prepare to add a spare pump to the cooling system; all cooling to the research hall will be interrupted. 
  • [all] Upgrade system-wide monitoring software 
  • [all] Apply maintenance patches to all compute nodes 
  • [all] Upgrade firmware on all ethernet switches 
  • [Phoenix, Hive, ICE] Upgrade Open OnDemand to 3.1.14 
  • [Phoenix, ICE] Upgrade storage system VMs for Scratch storage.  
  • [ICE] Enable use of Globus for ICE data 
  • [Phoenix, ICE] Upgrade Lustre versions to latest available 
  • [Phoenix] Run data consistency checks on Phoenix Lustre project file system 

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.  

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you,  

-The PACE Team 

Phoenix project storage outage, impacting login

Summary: An outage of the metadata servers on Phoenix project storage (Lustre) is preventing access to that storage and may also prevent login by ssh, access to Phoenix OnDemand, and some Globus access on Phoenix. The PACE team is working to repair the system.

Details: During the afternoon of Saturday, July 19, one of the metadata servers for Phoenix Lustre project storage stopped responding. The failover to the other metadata server was not successful. The PACE team has not yet been able to restore access and has engaged our storage vendor.

Impact: Files on the Phoenix Lustre project storage system are not accessible, and researchers may not be able to log in to Phoenix by ssh or via the OnDemand web interface. Globus on Phoenix may time out, but researchers can type another path into the Path box to bypass the home directory and enter a subdirectory directly (e.g., typing ~/scratch will allow access to scratch storage); VAST project, scratch, and CEDAR storage may still be reachable this way. Research groups that have already migrated to VAST project storage may not be impacted.

Thank you for your patience as we work to restore access to Phoenix project storage. Please contact us at pace-support@oit.gatech.edu with any questions.

[UPDATE Mon 21 Jul, 11:00]

Phoenix project storage outage is over

The outage on the Lustre project storage is over; the scheduler has been released and is accepting jobs. Access through the head nodes, Globus, and Open OnDemand has been restored.

The diagnostic check of the metadata volumes, performed over the weekend, completed successfully. As a precaution, we are running a thorough check of the data volumes to verify there are no other issues. In the unlikely event of data loss, affected data will be restored from backups. Scratch, home, VAST, and CEDAR storage systems were not affected by the outage. The cost of jobs that were terminated due to the outage will be refunded.

We are continuing to work with the storage vendors to prevent project storage outages. The ongoing migration of project storage from Lustre to VAST systems will reduce the impact when one of the shared file systems has issues.  

Degraded performance on Phoenix storage

Dear Phoenix users,

Summary: The project storage system on Phoenix (/storage/coda1) is slower than normal due to heavy use and hard drive failures. The rebuild onto spare hard drives is ongoing; until it is complete, some users might experience slower file access on project storage.

Details: Two hard drives that support the /storage/coda1 project storage failed on July 1 at 3:30am and 9:20am, forcing a rebuild of the data onto spare drives. This rebuild usually takes 24-30 hours to complete. We are closely monitoring the rebuild process, which we expect to complete on July 2 around noon. In addition, we are temporarily moving file services from one metadata server to another and back to rebalance the load across all available systems.

Impact: Access to files is slower than usual during the drive rebuild and metadata server migration process. There is no data loss for any users. For the affected users, the degradation of performance can be observed on the login as well as compute nodes. The file system will continue to be operational while the rebuilds are running in the background. Access to the Phoenix cluster is not impacted, and the scheduler is processing jobs at a normal rate.

We thank you for your patience as we work to resolve the problem.

[Maintenance] Reminder – May 5th-May 9th 2025

[Update] May 9, 2025 at 5:34 pm

Dear PACE Community, 
 
While all PACE clusters are up, have passed tests, and are accepting jobs, you may encounter errors due to inconsistencies in the packages installed across our systems. We are aware of minor differences in the packages installed on compute nodes of the same type and are working to address this as quickly as possible.

Please let us know via email to pace-support@oit.gatech.edu if you encounter any unusual job errors.   

We will continue working to resolve the situation and provide updates as we learn more.  
 
The PACE Team

[Update] May 9, 2025 at 5:16 pm

Dear Firebird users,  

The Firebird cluster is back in production and has resumed running jobs.  

As previously mentioned, this cluster is now only running the RHEL9 operating system. Please reference our prior emails about SSH keys on Firebird if you experience any trouble logging in!  

One RTX6000 GPU node is currently unavailable, but all other GPU types (A100 and H200) are available; we will work to repair this node next week.

Thank you for your patience as we continue to work on the Firebird cluster.  

Best, 

The PACE Team 

[Update] May 9, 2025 at 12:15 pm

Dear PACE users,   

Maintenance on the Hive, Buzzard, ICE and Phoenix clusters is complete. These clusters are back in production, and all jobs held by the scheduler have been released.  

The Firebird cluster is still under maintenance; these users will be notified separately once work is complete.  

We are happy to share that all PACE clusters are now running the RHEL9 operating system and that other important security updates are complete.  

The update to IDEaS storage is ongoing. The storage is currently accessible, but it is still necessary to use the `newgrp` command to set the order of your group membership, just as before maintenance; see the example below.
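For reference, a typical invocation looks like the following, where my-ideas-group is a placeholder for the name of your IDEaS storage group:

  newgrp my-ideas-group

This starts a new shell with the specified group as your primary (first-listed) group, which IDEaS storage access currently requires.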

If you are building or running MPI applications on Phoenix’s H100/H200 nodes, please be aware that the MVAPICH2 and OpenMPI modules are no longer compatible with system updates to the H100/H200 nodes. We highly recommend using HPC-X for MPI, as it provides numerous benefits for MPI + GPU workloads. To use it, load the nvhpc/24.5 and hpcx/2.19-cuda modules. This will not affect the vast majority of single-node Python workflows, which typically do not use MPI. 
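As a starting point, a minimal sketch for building and launching an MPI program with HPC-X on those nodes might look like the following; the source file, binary name, and task count are placeholders, and mpicc/mpirun are the standard wrappers shipped with HPC-X's Open MPI:

  module load nvhpc/24.5 hpcx/2.19-cuda
  mpicc my_app.c -o my_app
  mpirun -np 4 ./my_app

Please check the PACE documentation for the recommended launch method inside Slurm batch jobs.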

Another goal for this maintenance period was the replacement of the problematic cooling system pump. While this system was rigorously tested and calibrated prior to installation, the DataBank datacenter staff were required to remove the new pump and reinstall the original because the new pump did not pass inspection upon installation. We share your frustration in this matter. However, operating a safe and reliable datacenter is of the utmost priority, and we will continue doing our best to keep PACE resources stable until DataBank is able to successfully replace the cooling pump. We are continuing to work with Georgia Tech leadership on long-term solutions to improve overall reliability and meet the expectations of our users.

At this time, we have extended the next maintenance period (August 5-8, 2025) to allow for installation of a new cooling pump. We will share additional information as it becomes available.

Thank you, 

The PACE Team 

[Maintenance] April 28, 2025 at 9:42am

WHEN IS IT HAPPENING?  

PACE’s next Maintenance Period starts at 6:00AM on Monday, May 5th, 05/05/2025, and is tentatively scheduled to conclude by 11:59PM on Friday, May 9th, 05/09/2025. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work and testing are completed.
 
WHAT DO YOU NEED TO DO?   

As usual, jobs whose resource requests would cause them to run into the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. CEDAR storage will not be affected.

WHAT IS HAPPENING?   

ITEMS REQUIRING USER ACTION: 

  • [Firebird] The Firebird system will completely migrate to the RHEL9 operating system 

ITEMS NOT REQUIRING USER ACTION: 

  • Change IDEaS storage user authentication from AD to LDAP
  • Run filesystem checks on all Lustre filesystems
  • Upgrade IDEaS storage
  • Upgrade Phoenix project storage servers and controllers
  • Upgrade Phoenix scratch storage servers and controllers
  • Upgrade ICE scratch storage servers and controllers
  • Move ice-shared from NetApp to VAST storage
  • Rebuild ondemand-ice on physical hardware to handle increased usage
  • Move ICE pace-apps to a separate storage volume
  • Firebird storage and scheduler improvements
  • Upgrade DDN Insight (for monitoring storage system performance)
  • DataBank: replace cooling pump assembly
  • DataBank: cooling tower cleanup

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system. This particular instance is allowing for the complete replacement of a problematic cooling system pump in the datacenter. 

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you,  

-The PACE Team 

PACE Spending Deadlines for FY25

As you plan your spending on the PACE Phoenix and Firebird clusters for the end of Georgia Tech’s FY25 on June 30, 2025, we would like to alert you to several deadlines:

  1. Due to the time it takes to process purchase requests, we would like to receive all prepaid compute and lump-sum storage purchase requests exceeding $5,000 by April 30, 2025. Please contact us if you know there will be any purchases exceeding that amount so that we may help you with planning.
    1. Purchases under $5,000 can continue without restrictions.
  2. All spending after May 31, 2025, will be held for processing in July, in FY26. This includes postpaid compute jobs run in June, monthly storage payments for June, and new prepaid purchases requested after May 31, 2025.
    1. State funds (DE worktags) expiring on June 30, 2025, may not be used for June spending.
    2. Grant funds (GR worktags) expiring June 30, 2025, may be used for postpaid compute and monthly storage in June.
  3. Existing refresh (CODA20), FY20, and prepaid compute are not impacted, nor is existing prepaid storage.
  4. For worktags that are not expiring, your normal monthly limits on postpaid compute (if selected) will apply in May and June. Monthly storage will continue to be billed as normal.

Find out more about paid compute and storage services available on PACE on our website. If you have any questions or would like to meet with us to discuss the best options for your specific needs, please email us.

[Maintenance] Maintenance window EXTENDED – May 5th-9th

As part of the work needed to mitigate cooling issues in the Coda datacenter, there will be a full replacement of the cooling system water pump in the research hall of the datacenter. While we previously hoped to handle the maintenance from May 6-9th, we are now planning to start one day earlier on May 5th at 6am ET due to the volume of physical work that must be carried out in the datacenter.

Because this is the final day for instructors to submit grades, we will ensure that the ICE system remains available to instructors. As usual, reservations have been set on all clusters to prevent jobs from running into the maintenance window.

We will follow up with a full list of the planned activities during this maintenance window in our two-week reminder.

Cooling Failure in Coda Datacenter

[Update 4/3/25 9:55 AM]
Our vendors are working to restore cooling capabilities to the datacenter by fully replacing the cooling system controller and expect to have the work completed by 7:00pm ET.  
 
We hope to return all systems to service by tomorrow (Friday) evening, provided that all repairs to the cooling system are complete and systems have passed stability testing after the shutdown. Clusters will be released tomorrow as testing is completed for each system.
 
We will provide updates on progress via status.gatech.edu and share announcements via specific mailing lists as clusters become available or the situation changes significantly.

[Update 4/2/25 5:50 PM]

Due to continued high temperatures, all Phoenix and Firebird compute nodes have been turned off, and all running jobs were cancelled. Impacted jobs will be refunded at the end of April.

[Original Post 4/2/25 5:20 PM]

Summary: The controller for the cooling system in the Coda Datacenter has failed. Many PACE nodes have been turned off given the significantly reduced cooling capacity in the datacenter. No jobs can start on research clusters.

Details: The controller for the system providing cooling to nodes in the Coda Research Hall has failed. To avoid damage, PACE has urgently shut down many compute nodes to reduce heat.

Impact: No new jobs can start on PACE’s research clusters (Phoenix, Hive, Buzzard, and Firebird). All Hive and Buzzard compute nodes have been turned off, and running jobs were cancelled. There is not yet an impact to ICE, but we may need to shut down ICE nodes as well as we monitor temperatures.

Please visit https://status.gatech.edu for ongoing updates as the situation evolves. Please contact pace-support@oit.gatech.edu with any questions.

[Update] [storage] Phoenix Project storage degraded performance

[Updated March 31, 2025 at 4:14pm]

Dear Phoenix researchers,

As the Phoenix project storage system has stabilized, we have restored login access via ssh and resumed starting jobs.

The cost for the jobs running during the performance degradation will not count towards the March usage.

The Phoenix OnDemand portal can again be used to access project and scratch space. Any user still receiving a “Proxy Error” should contact pace-support@oit.gatech.edu for an individual reset of their OnDemand session.

Globus file transfers have resumed. We have determined that transfers to/from home, scratch, and CEDAR storage were inadvertently paused, and we apologize for any confusion. Any paused transfer should have automatically resumed.

The PACE team continues to monitor the storage system for any further issues. We are working with the vendor to identify the root cause and prevent future performance degradation.

Please contact us at pace-support@oit.gatech.edu with any questions. We appreciate your patience during this unexpected outage.

Best,

The PACE Team

[Updated March 31, 2025 at 12:41pm]

Dear Phoenix Users,

To limit the impact of the current Phoenix project filesystem issues, we have implemented the following changes to expedite troubleshooting and protect currently running jobs:

New Logins to Phoenix Login Nodes are Paused

We have prevented new login attempts to the Phoenix login nodes. Users who are currently logged in will be able to stay logged in to the system.

Phoenix Jobs Prevented from Starting

Jobs that are in the queue but have not yet started have been held to prevent them from starting. These submitted jobs will remain in the queue.

Jobs that are currently running may experience decreased performance if using project storage. We are doing our best to prioritize the successful completion of these jobs.

Open OnDemand (OOD)

Users of Phoenix OOD can log in and interact with only their home directory. Project and scratch space are not available.

Some users of Open OnDemand may be unable to reach this service and are experiencing “Proxy Error” messages. We are investigating the root cause of this issue.

Globus File Transfer Paused for Project Space

File transfers to/from project storage on Globus have been paused. Other Globus transfers (Box, DropBox, and OneDrive cloud connectors; scratch; home; and CEDAR) will continue.

The PACE team is working to diagnose the current issues with support from our filesystem vendor. We will continue to share updates as we have them and apologize for this unexpected service outage.

Best,

The PACE Team