During work to support the migration of data from Lustre to VAST, a temporary error occurred which broke symlinks to project storage between 12:37:21 and 13:08:46. Refunds are being issued for all failed jobs in that timeframe.
Degraded performance on Phoenix storage
UPDATE – July 07, 2025
An additional two drives failed overnight, adding rebuild tasks in separate pools; we now expect the process to complete early on July 4th.
Dear Phoenix users,
Summary: The project storage system on Phoenix (/storage/coda1) is slower than normal due to heavy use and hard drive failures. The process of rebuilding data onto spare hard drives is ongoing; until it is complete, some users might experience slower file access on project storage.
Details: Two hard drives that support the /storage/coda1 project storage failed on July 1, at 3:30 AM and 9:20 AM, forcing a rebuild of the data onto spare drives. This rebuild usually takes 24-30 hours to complete. We are closely monitoring the rebuild process, which we expect to complete around noon on July 2. In addition, we are temporarily moving file services from one metadata server to another and back to rebalance the load across all available systems.
Impact: Access to files is slower than usual during the drive rebuild and metadata server migration. There is no data loss for any users. For the affected users, performance degradation can be observed on both the login and compute nodes. The file system will remain operational while the rebuilds run in the background. Access to the Phoenix cluster is not impacted, and the scheduler is processing jobs at a normal rate.
We thank you for your patience as we work to resolve the problem.
Data Center Power Outage – May 12, 2025
On May 12, there was a power outage on rack 02-012 from 9:35 AM until 3:35 PM. Other racks were not affected by the outage. Jobs running on this rack that were terminated have been refunded.
PACE Spending Deadlines for FY25
As you plan your spending on the PACE Phoenix and Firebird clusters for the end of Georgia Tech’s FY25 on June 30, 2025, we would like to alert you to several deadlines:
- Due to the time it takes to process purchase requests, we would like to receive all prepaid compute and lump-sum storage purchase requests exceeding $5,000 by April 30, 2025. Please contact us if you know there will be any purchases exceeding that amount so that we may help you with planning.
- Purchases under $5,000 can continue without restrictions.
- All spending after May 31, 2025, will be held for processing in July, in FY26. This includes postpaid compute jobs run in June, monthly storage payments for June, and new prepaid purchases requested after May 31, 2025.
- State funds (DE worktags) expiring on June 30, 2025, may not be used for June spending.
- Grant funds (GR worktags) expiring June 30, 2025, may be used for postpaid compute and monthly storage in June.
- Existing refresh (CODA20), FY20, and prepaid compute are not impacted, nor is existing prepaid storage.
- For worktags that are not expiring, your normal monthly limits on postpaid compute (if selected) will apply in May and June. Monthly storage will continue to be billed as normal.
Find out more about paid compute and storage services available on PACE on our website. If you have any questions or would like to meet with us to discuss the best options for your specific needs, please email us.
[Resolved] Phoenix login nodes outage on Dec 5, 2024
On the morning of December 5, 2024, the RHEL9 login nodes of the Phoenix cluster became unresponsive. The problems started at 4:37 AM, when one login node (out of two) had a memory problem; at 6:27 AM, it crashed. The other login node crashed at 9:37 AM, rendering the RHEL9 environment on Phoenix inaccessible. Both login nodes were restarted at 11:30 AM, which resolved the issue. The jobs that crashed between 4:37 and 11:30 AM have been refunded.
[Resolved] Firebird ASDL Outage
On Oct 30, 2024, at 9:20 PM, there was a drive failure on the Firebird ASDL servers (on the ZFS pool dedicated to the ASDL project). The ASDL login nodes were taken offline. Several jobs failed, and no new jobs were accepted after 10:09 AM on Oct 31. The NFS server was restarted and tested, and the ASDL nodes were back online at 12:38 PM on Oct 31.
PACE-Wide Emergency Shutdown – September 8, 2024
[Update 9/11/24 2:51 PM]
Dear Hive community,
The emergency maintenance on the Coda datacenter has been completed and the Hive cluster has passed our tests. The cluster is back in production and is accepting jobs on BOTH the RHEL7 and RHEL9 environments; all jobs that were held by the scheduler have been released.
[Update 9/11/24 10:52 AM]
Dear Firebird users,
The emergency maintenance on the Coda datacenter has been completed and the Firebird cluster has passed our tests. The cluster is back in production and is accepting jobs on BOTH the RHEL7 and RHEL9 environments; all jobs that have been held by the scheduler have been released.
As a reminder:
RHEL7 Firebird nodes are accessible at the usual address login-<project>.pace.gatech.edu. RHEL9 Firebird nodes can be accessed via ssh at login-<project>-rh9.pace.gatech.edu for testing new software. The majority of our software stack has been rebuilt for the RHEL9 environment. We strongly encourage you to test your software on RHEL9, and please let us know if anything is missing! For more information, please see our Firebird RHEL9 documentation page.
Please take the time to test your software and workflows on the RHEL9 Firebird Environment (accessible via login-<project>-rh9.pace.gatech.edu) and let us know if anything is missing!
The next Maintenance Period will be January 13-16, 2025.
[Update 9/9/24 6:00 PM]
Due to an emergency with a cooling system at the Research Hall, all PACE clusters have been shut down since the morning of Sunday, September 8, 2024. The datacenter provider, Data Bank, has identified an alternate replacement part, which has been brought onsite and is being deployed and tested. At this time, we estimate that Data Bank will have restored cooling to the Research Hall by close of business on Tuesday, September 10, 2024. At that point, PACE will begin powering up and testing infrastructure and will start bringing services back online. We plan to provide additional updates on the restoration of services by the evening of Wednesday, September 11, 2024.
Please visit https://status.gatech.edu for updates.
Access to head nodes and file systems is available.
[Update 9/9/24 9:00 AM]
Due to an emergency with a cooling system at the Research Hall, all PACE clusters have been shut down since the morning of Sunday, September 8, 2024. While a time frame for resolution is currently unknown, we are actively working with the vendor, Data Bank, to resolve the issue and restore service to the data center as soon as possible. We will provide updates as they are available. Please visit https://status.gatech.edu for updates.
Access to login nodes and filesystems (via Globus, OpenOnDemand or direct connection to login nodes) is still available.
[Original Post 9/8/24]
WHAT’S HAPPENING?
Due to an emergency with a cooling system at the Research Hall, all PACE clusters had to be shut down on the morning of Sunday, September 8, 2024.
WHEN IS IT HAPPENING?
Sunday, September 8, 2024, starting at 7:30 AM EDT.
WHY IS IT HAPPENING?
PACE has been notified by IOC that temperatures in the CODA building Research Hall are rising due to the failure of a water pump in the cooling system. An emergency shutdown had to be executed to protect the equipment. The physical infrastructure provider for our datacenter is evaluating the situation.
WHO IS AFFECTED?
All PACE users. Any running jobs on ALL PACE clusters (Phoenix, Hive, Firebird, ICE, and Buzzard) had to be stopped at 7:30 AM. For Phoenix and Firebird, we will by default provide refunds for interrupted jobs on paid accounts only. Please let us know if this causes a significant loss of funds that leaves you unable to continue work on your free-tier Phoenix allocation!
WHAT DO YOU NEED TO DO?
Wait patiently; we will communicate as soon as the clusters are ready to resume work.
WHO SHOULD YOU CONTACT FOR QUESTIONS?
For any questions, please contact PACE at pace-support@oit.gatech.edu.
PACE clusters unreachable on the morning of April 4, 2024
The PACE clusters were not accepting new connections from 4 AM until 10 AM today (April 4, 2024). As part of the preparations to migrate the clusters to a new version of the operating system (Red Hat Enterprise Linux 9), an entry in the configuration management system from the development environment was accidentally applied to production, including the /etc/nologin file on the head nodes. This has been fixed, and additional controls are in place to prevent a recurrence.
The jobs and the data transfers running during that period were not affected. The interactive sessions that started before the configuration change were not affected either.
The clusters are now back online, and the scheduler is accepting jobs. We sincerely apologize for this accidental disruption.
PACE Maintenance Period (Jan 23 – Jan 25, 2024) is over
Dear PACE users,
The maintenance on the Phoenix, Hive, Firebird, and ICE clusters has been completed; the OSG Buzzard cluster is still under maintenance, and we expect it to be ready next week. The Phoenix, Hive, Firebird, and ICE clusters are back in production and ready for research; all jobs that have been held by the scheduler have been released.
The POSIX group names on the Phoenix, Hive, Firebird, and ICE clusters have not been updated, due to factors within the IAM team. This update is now scheduled to take place during our next maintenance period, May 7-9, 2024.
Thank you for your patience!
The PACE Team
PACE Maintenance Period (Jan 23 – Jan 25, 2024)
WHEN IS IT HAPPENING?
PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 01/23/2024, and is tentatively scheduled to conclude by 11:59PM on Thursday, 01/25/2024. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete.
WHAT DO YOU NEED TO DO?
As usual, the scheduler will hold any jobs whose resource requests would have them running during the Maintenance Period until after the maintenance is complete. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, CEDAR, and Buzzard. Please plan accordingly for the projected downtime.
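To gauge whether a pending job will be held, here is a minimal sketch of the overlap check (in Python, using an illustrative earliest start time and a hypothetical requested walltime; the actual hold decision is made by the scheduler):

```python
# Minimal sketch (illustrative values only): a job whose requested walltime
# would carry it into the Maintenance Period will be held by the scheduler.
from datetime import datetime, timedelta

maintenance_start = datetime(2024, 1, 23, 6, 0)    # 6:00 AM, Tuesday 01/23/2024
maintenance_end = datetime(2024, 1, 25, 23, 59)    # 11:59 PM, Thursday 01/25/2024

earliest_start = datetime(2024, 1, 21, 9, 0)       # hypothetical earliest start time
walltime = timedelta(hours=72)                      # hypothetical requested walltime

if earliest_start + walltime > maintenance_start:
    print(f"Requested walltime overlaps the Maintenance Period; "
          f"the job will be held until after {maintenance_end}.")
else:
    print("Job would finish before the Maintenance Period and can run normally.")
```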
WHAT IS HAPPENING?
ITEMS REQUIRING USER ACTION:
- [all] During the maintenance period, the PACE team will rename all POSIX user groups so that names will start with the “pace-” prefix.
- This will NOT affect numerical GIDs, but if your scripts or workflows rely on group names, they will need to be updated (see the sketch after this list).
- If you don’t use POSIX user group names in your scripts or workflows, no action is required on your part.
- This is a step towards tighter integration of PACE systems with central IAM tools, which will lead to improvements across the board in the PACE user experience.
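For users who do reference group names, the following is a minimal sketch (in Python, using a hypothetical project path and hypothetical group names, not a PACE-provided tool) of how the rename plays out: lookups by numeric GID keep working unchanged, while comparisons against a hard-coded group name need the new "pace-" prefix.

```python
# Minimal sketch (hypothetical path and group names, for illustration only):
# the numeric GID of a directory is NOT changed by the rename, so lookups by
# GID keep working, while hard-coded group-name comparisons need updating.
import grp
import os

project_dir = "/storage/coda1/p-myproject"   # hypothetical project directory

gid = os.stat(project_dir).st_gid            # numeric GID: unchanged by the rename
group_name = grp.getgrgid(gid).gr_name       # current name registered for that GID

# Before the maintenance this might be e.g. "myproject";
# afterwards it would become "pace-myproject".
if group_name == "myproject":                # old hard-coded check: breaks after rename
    print("matched by old name")
if group_name == "pace-myproject":           # updated check with the new prefix
    print("matched by new name")
print(f"GID {gid} -> {group_name}")
```

The same consideration applies to shell scripts that compare against the output of commands such as `id -gn`.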
ITEMS NOT REQUIRING USER ACTION:
- [datacenter] Databank maintenance: Replace pump impeller, cooling tower maintenance
- [storage] Install NFS over RDMA kernel module to enable pNFS for access to VAST storage test machine
- Replace two UPS for SFA14KXE controllers
- [storage] upgrade DDN SFA14KXE controllers FW
- [storage] upgrade DDN 400NV ICE storage controllers and servers
- [Phoenix, Hive, ICE, Firebird] Upgrade all clusters to Slurm version 23.11.X
WHY IS IT HAPPENING?
Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.
WHO IS AFFECTED?
All users across all PACE clusters.
WHO SHOULD YOU CONTACT FOR QUESTIONS?
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.
Thank you,
-The PACE Team