PACE Spending Deadlines for FY25

As you plan your spending on the PACE Phoenix and Firebird clusters for the end of Georgia Tech’s FY25 on June 30, 2025, we would like to alert you to several deadlines:

  1. Due to the time it takes to process purchase requests, we would like to receive all prepaid compute and lump-sum storage purchase requests exceeding $5,000 by April 30, 2025. Please contact us if you know there will be any purchases exceeding that amount so that we may help you with planning.
    1. Purchases under $5,000 can continue without restrictions.
  2. All spending after May 31, 2025, will be held for processing in July, in FY26. This includes postpaid compute jobs run in June, monthly storage payments for June, and new prepaid purchases requested after May 31, 2025.
    1. State funds (DE worktags) expiring on June 30, 2025, may not be used for June spending.
    2. Grant funds (GR worktags) expiring June 30, 2025, may be used for postpaid compute and monthly storage in June.
  3. Existing refresh (CODA20), FY20, and prepaid compute allocations are not impacted, nor is existing prepaid storage.
  4. For worktags that are not expiring, your normal monthly limits on postpaid compute (if selected) will apply in May and June. Monthly storage will continue to be billed as normal.

Find out more about paid compute and storage services available on PACE on our website. If you have any questions or would like to meet with us to discuss the best options for your specific needs, please email us.

[Resolved] Phoenix login nodes outage on Dec 5, 2024

On the morning of December 5, 2024, the RHEL9 login nodes of the Phoenix cluster became unresponsive. The problems started at 4:37 AM, when one login node (out of two) had a memory problem; at 6:27 AM, it crashed. The other login node crashed at 9:37 AM, rendering the RHEL9 environment on Phoenix inaccessible. Both login nodes were restarted at 11:30 AM, which resolved the issue. The jobs that crashed between 4:37 and 11:30 AM have been refunded.

[Resolved] Firebird ASDL Outage

On Oct 30, 2024, at 9:20 PM, a drive failed on the Firebird ASDL servers (on the ZFS pool dedicated to the ASDL project), and the ASDL login nodes were taken offline. Several jobs failed, and no new jobs were accepted after 10:09 AM on Oct 31. The NFS server was restarted and tested, and the ASDL nodes were back online at 12:38 PM on Oct 31.

PACE-Wide Emergency Shutdown – September 8, 2024

[Update 9/11/24 2:51 PM]

Dear Hive community, 

The emergency maintenance on the Coda datacenter has been completed and the Hive cluster has passed our tests. The cluster is back in production and is accepting jobs on BOTH the RHEL7 and RHEL9 environments; all jobs that were held by the scheduler have been released. 

[Update 9/11/24 10:52 AM]

Dear Firebird users,

The emergency maintenance on the Coda datacenter has been completed and the Firebird cluster has passed our tests. The cluster is back in production and is accepting jobs on BOTH the RHEL7 and RHEL9 environments; all jobs that have been held by the scheduler have been released.

As a reminder:

  • RHEL7 Firebird nodes are accessible at the usual address, login-<project>.pace.gatech.edu.
  • RHEL9 Firebird nodes can be accessed via ssh at login-<project>-rh9.pace.gatech.edu for testing new software.
  • The majority of our software stack has been rebuilt for the RHEL9 environment. We strongly encourage you to test your software on RHEL9, and please let us know if anything is missing! For more information, please see our Firebird RHEL9 documentation page.

Please take the time to test your software and workflows on the RHEL9 Firebird Environment (accessible via login-<project>-rh9.pace.gatech.edu) and let us know if anything is missing!
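For example, you can reach the RHEL9 test environment from a terminal as follows (replace <project> with your project name; gburdell3 below is only a placeholder GT username):

    ssh gburdell3@login-<project>-rh9.pace.gatech.edu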

The next Maintenance Period will be January 13-16, 2025.

[Update 9/9/24 6:00 PM]

Due to an emergency with a cooling system at the Research Hall, all PACE clusters have been shut down since the morning of Sunday, September 8, 2024. The datacenter provider, DataBank, has identified an alternate replacement part, which has been brought onsite and is being deployed and tested. At this time, we estimate that DataBank will have restored cooling to the Research Hall by close of business on Tuesday, September 10, 2024, at which point PACE will begin powering up, testing infrastructure, and bringing services back online. We plan to provide additional updates on the restoration of services by the evening of Wednesday, September 11, 2024.

Please visit https://status.gatech.edu for updates.

Access to head nodes and file systems is available.

[Update 9/9/24 9:00 AM]

Due to an emergency with a cooling system at the Research Hall, all PACE clusters have been shut down since the morning of Sunday, September 8, 2024. While a time frame for resolution is currently unknown, we are actively working with the vendor, DataBank, to resolve the issue and restore service to the data center as soon as possible. We will provide updates as they are available. Please visit https://status.gatech.edu for updates. 

Access to login nodes and filesystems (via Globus, OpenOnDemand or direct connection to login nodes) is still available.

[Original Post 9/8/24]

WHAT’S HAPPENING?  

Due to an emergency with a cooling system at the Research Hall, all PACE clusters had to be shut down on the morning of Sunday, September 8, 2024. 

WHEN IS IT HAPPENING?  

Sunday, September 8, 2024, starting at 7:30 AM EDT.  

WHY IS IT HAPPENING?  

PACE was notified by the IOC that temperatures in the CODA building's Research Hall were rising due to the failure of a water pump in the cooling system. An emergency shutdown had to be executed to protect equipment. The physical infrastructure provider for our datacenter is evaluating the situation.  

WHO IS AFFECTED?  

All PACE users. Any running jobs on ALL PACE clusters (Phoenix, Hive, Firebird, ICE, and Buzzard) had to be stopped at 7:30 AM. For Phoenix and Firebird, we will by default provide refunds for interrupted jobs on paid accounts only. Please let us know if this causes a significant loss of funds that prevents you from continuing work on your free-tier Phoenix allocation!   

WHAT DO YOU NEED TO DO?  

Wait patiently; we will communicate as soon as the clusters are ready to resume work.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?  

For any questions, please contact PACE at pace-support@oit.gatech.edu.  

PACE clusters unreachable on the morning of April 4, 2024

The PACE clusters were not accepting new connections from 4 AM until 10 AM today (April 4, 2024). As part of the preparations to migrate the clusters to a new version of the operating system (Red Hat Enterprise Linux 9), an entry in the configuration management system from the development environment was accidentally applied to production, including the /etc/nologin file on the head nodes (when /etc/nologin is present, the system refuses new logins by non-root users). This has been fixed, and additional controls are in place to prevent a recurrence. 

The jobs and the data transfers running during that period were not affected. The interactive sessions that started before the configuration change were not affected either. 

The clusters are now back online, and the scheduler is accepting jobs. We sincerely apologize for this accidental disruption. 

PACE Maintenance Period (Jan 23 – Jan 25, 2024) is over 

Dear PACE users,  

The maintenance on the Phoenix, Hive, Firebird, and ICE clusters has been completed; the OSG Buzzard cluster is still under maintenance, and we expect it to be ready next week. The Phoenix, Hive, Firebird, and ICE clusters are back in production and ready for research; all jobs that have been held by the scheduler have been released.   

  

The POSIX group names on the Phoenix, Hive, Firebird, and ICE clusters have not been updated, due to factors on the IAM team's side. This update is now scheduled for our next maintenance period, May 7-9, 2024.  

Thank you for your patience!   

  

The PACE Team 

PACE Maintenance Period (Jan 23 – Jan 25, 2024) 

WHEN IS IT HAPPENING?  

PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 01/23/2024, and is tentatively scheduled to conclude by 11:59PM on Thursday, 01/25/2024. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete. 

WHAT DO YOU NEED TO DO?   

As usual, jobs with resource requests that would be running during the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, CEDAR, and Buzzard. Please plan accordingly for the projected downtime. 
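If you would like a job to start before the maintenance window rather than being held, one option is to request a walltime short enough for the job to finish before the window opens. A minimal sketch (the submission time, time limits, and script name below are hypothetical):

    # Submitted on Monday 01/22 at 9:00 AM, a 12-hour request can finish before
    # maintenance begins at 6:00 AM on 01/23, so the job remains eligible to start:
    sbatch -t 12:00:00 myjob.sbatch

    # A 48-hour request would overlap the maintenance window and will be held:
    sbatch -t 48:00:00 myjob.sbatch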

WHAT IS HAPPENING?   

ITEMS REQUIRING USER ACTION: 

  • [all] During the maintenance period, the PACE team will rename all POSIX user groups so that names will start with the “pace-” prefix.
    • This will NOT affect numerical GIDs, but if your scripts or workflows rely on group names, they will need to be updated (see the sketch after this list). 
    • If you don’t use POSIX user group names in your scripts or workflows, no action is required on your part. 
    • This is a step towards tighter integration of PACE systems with central IAM tools, which will lead to improvements across the board in the PACE user experience.  
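For illustration, a quick way to check the rename and update scripts is sketched below; "myproject" is a hypothetical group name used only as an example, and the numeric GID it maps to does not change:

    # Before the maintenance, look up your group's name and GID:
    getent group myproject
    # After the rename, the same GID resolves to the prefixed name:
    getent group pace-myproject
    # Update any hard-coded group names in your scripts, for example:
    chgrp -R pace-myproject /path/to/shared/dir    # was: chgrp -R myproject /path/to/shared/dir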

ITEMS NOT REQUIRING USER ACTION: 

  • [datacenter] DataBank maintenance: replace pump impeller, perform cooling tower maintenance 
  • [storage] Install NFS over RDMA kernel module to enable pNFS for access to VAST storage test machine 
  • Replace two UPS for SFA14KXE controllers 
  • [storage] upgrade DDN SFA14KXE controllers FW 
  • [storage] upgrade DDN 400NV ICE storage controllers and servers 
  • [Phoenix, Hive, ICE, Firebird] Upgrade all clusters to Slurm version 23.11.X 

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.  

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you,  

-The PACE Team 

PACE Maintenance Period (Oct 24 – Oct 30, 2023) is over

The maintenance on the Phoenix, Hive, Buzzard, Firebird, and ICE clusters has been completed. All clusters are back in production and ready for research; all jobs that were held by the scheduler have been released. The Firebird cluster was released at 12:30 PM on October 30, and the other clusters were released at 2:45 PM on October 27.  

Update on the current cooling situation: DataBank performed a temporary repair to restore cooling to the research hosting environment. Cooling capacity in the research hall is at less than 100% and is being actively monitored, but we are currently able to run the clusters at full capacity. The plan is for DataBank to install new parts during the next maintenance window, scheduled for January 23-25, 2024. Should the situation worsen and a full repair be required sooner, we will do our best to provide at least one week of notice. At this time, we do not expect additional downtime to be needed.  

Update on Firebird: We are happy to announce that the Firebird cluster is ready to use after migration to the Slurm scheduler! Again, we greatly appreciate your patience during this extended maintenance period. Over the weekend we investigated a few lingering issues with MPI and the user environment on the cluster and have implemented and tested corrections.  
 

Firebird users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows run on the Slurm-based cluster. Please contact us if you need additional help moving your workflows to Slurm. PACE provides the Firebird Migration Guide and an additional Firebird-specific Slurm training session [register here] to support a smooth transition of your workflows to Slurm. You are also welcome to join our PACE Consulting Sessions or to email us for support.  
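As a rough illustration of the kind of changes involved (the Firebird Migration Guide remains the authoritative reference; job names, file names, and account strings below are placeholders), common Torque-to-Slurm equivalents look like this:

    # Submitting and monitoring jobs
    qsub myjob.pbs          ->  sbatch myjob.sbatch
    qstat -u gburdell3      ->  squeue -u gburdell3
    qdel <jobid>            ->  scancel <jobid>

    # Typical directive changes inside a batch script
    #PBS -N myjob                ->  #SBATCH -J myjob
    #PBS -l nodes=2:ppn=24       ->  #SBATCH -N 2 --ntasks-per-node=24
    #PBS -l walltime=04:00:00    ->  #SBATCH -t 04:00:00
    #PBS -A <old GT- account>    ->  #SBATCH -A cgts-<PI username>-<project>-<account>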

 
[Changes to Note] 

  • New Hardware: There are 12 new 32-core Intel Cascade Lake CPU nodes with 384 GB of RAM available, in addition to new GPU nodes with 4x NVIDIA A100 GPUs, 48-core Intel Xeon Gold CPUs, and 512 GB of RAM.  
  • Account names: Under Slurm, charge accounts will follow the format “cgts-<PI username>-<project>-<account>” rather than using the “GT-” prefix (see the example batch script after this list). 
  • Default GPU: If you do not specify a GPU type in your job script, Slurm will default to using an NVIDIA A100 node, rather than an NVIDIA RTX6000 node; the A100 nodes are more expensive but more performant.  
  • SSH Keys: When you log in for the first time, you may receive a warning about new host keys, similar to the following: 
    Warning: the ECDSA host key for 'login-.pace.gatech.edu' differs from the key for the IP address 'xxx.xx.xx.xx' 
    Offending key for IP in /home/gbrudell3/.ssh/known_hosts:1 
    Are you sure you want to continue connecting (yes/no)? 
    This is expected! Simply type "yes" to continue.
    • Depending on your local SSH client settings, you may also be prevented from logging in and have to edit your ~/.ssh/known_hosts file to remove the old key (see the sketch after this list). 
  • Jupyter and VNC: We do not currently have a replacement for Jupyter or VNC scripts for the new Slurm environment; we will be working on a solution to these needs over the coming weeks. 
  • MPI: For researchers using mvapich2 under the Slurm environment, specifying the additional --constraint=core24 or --constraint=core32 flag is necessary to ensure a homogeneous node allocation for the job (these reflect the number of CPUs per node).  
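Putting the account, GPU, and MPI notes above together, a minimal example batch script might look like the sketch below; the account string, module name, node counts, and GPU type are placeholders to be adapted to your allocation (see the Firebird Migration Guide for exact values):

    #!/bin/bash
    # Job name
    #SBATCH -J mpi-test
    # New cgts- style charge account (placeholder; use your own)
    #SBATCH -A cgts-<PI username>-<project>-<account>
    # Two 24-core nodes; with mvapich2, the core24 constraint keeps the allocation homogeneous
    #SBATCH -N 2
    #SBATCH --ntasks-per-node=24
    #SBATCH --constraint=core24
    # Walltime
    #SBATCH -t 02:00:00
    # To request a GPU type explicitly (omitting the type defaults to an A100 node), add e.g.:
    # #SBATCH --gres=gpu:<gpu_type>:1

    module load mvapich2    # module name assumed; adjust to your environment
    srun ./my_mpi_program

If your SSH client refuses to connect because of the changed host key, one way to remove the stale entry from ~/.ssh/known_hosts is:

    ssh-keygen -R login-<project>.pace.gatech.edu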

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu

Thank you for your patience during this extended outage!

The PACE Team

PACE Maintenance Period (Oct 24 – Oct 26, 2023) 

WHEN IS IT HAPPENING?

This is a reminder that PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 10/24/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 10/26/2023. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete.

WHAT DO YOU NEED TO DO?

As usual, jobs with resource requests that would be running during the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime.

WHAT IS HAPPENING?

ITEMS REQUIRING USER ACTION:

•     [Firebird] Migrate from the Moab/Torque scheduler to the Slurm scheduler. If you are a Firebird user, we will get in touch with you and provide assistance with rewriting your batch scripts and adjusting your workflow to Slurm.

ITEMS NOT REQUIRING USER ACTION:

•     [Network] Upgrade network switches

•     [Network][Hive] Configure redundancy on Hive racks

•     [Network] Upgrade firmware on InfiniBand network switches

•     [Storage][Phoenix] Reconfigure old scratch storage

•     [Storage][Phoenix] Upgrade Lustre controller and disk firmware, apply patches

•     [Datacenter] Datacenter cooling tower cleaning

WHY IS IT HAPPENING?

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?

All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.