NetApp Storage Outage

[Update 1/18/24 6:30 PM]

Access to storage has been restored, and all systems have full functionality. The Phoenix and ICE schedulers have been resumed, and queued jobs will now start.

Please resubmit any jobs that may have failed. If a running job is no longer progressing, please cancel and resubmit.
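
If it is helpful, below is a minimal sketch of how you might check your jobs and resubmit from a login node; the job ID and script name are placeholders.

    # List your jobs and their states (R = running, PD = pending)
    squeue -u $USER

    # Cancel a running job that is no longer progressing (replace <jobid>)
    scancel <jobid>

    # Resubmit the batch script (placeholder file name)
    sbatch my_job.sbatch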

The outage was traced to an update applied this afternoon to resolve a permissions issue affecting some users of the ICE shared directories. That update has been reverted.

Thank you for your patience as we resolved this issue.

[Original Post 1/18/24 5:20 PM]

Summary: An outage on PACE NetApp storage devices is affecting the Phoenix and ICE clusters. Home directories and software are not accessible.

Details: At approximately 5:00 PM, an issue began affecting access to NetApp storage devices on PACE. The PACE team is investigating at this time.

Impact: All storage devices provided by NetApp services are currently unreachable. This includes home directories on Phoenix and ICE, the pace-apps software repository on Phoenix and ICE, and course shared directories on ICE. Users may encounter errors upon login due to inaccessible home directories. We have paused the schedulers on Phoenix and ICE, so no new jobs will start. The Hive and Firebird clusters are not affected.

Please contact us at pace-support@oit.gatech.edu with any questions.

PACE Maintenance Period (Jan 23 – Jan 25, 2024) 

WHEN IS IT HAPPENING?  

PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 01/23/2024, and is tentatively scheduled to conclude by 11:59PM on Thursday, 01/25/2024. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete. 

WHAT DO YOU NEED TO DO?   

As usual, jobs whose requested wall times would overlap the Maintenance Period will be held by the scheduler until the maintenance is complete. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable, including Phoenix, Hive, Firebird, ICE, CEDAR, and Buzzard. Please plan accordingly for the projected downtime. 

WHAT IS HAPPENING?   

ITEMS REQUIRING USER ACTION: 

  • [all] During the maintenance period, the PACE team will rename all POSIX user groups so that names will start with the “pace-” prefix.
    • This will NOT affect numerical GIDs, but if your scripts or workflows rely on group names, they will need to be updated (see the sketch after this list). 
    • If you don’t use POSIX user group names in your scripts or workflows, no action is required on your part. 
    • This is a step towards tighter integration of PACE systems with central IAM tools, which will lead to improvements across the board in the PACE user experience.  
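
As a minimal sketch of the checks mentioned above (the group name "pace-mylab" and the ~/job-scripts directory are hypothetical placeholders), you can verify that GIDs are unchanged and look for hard-coded group names in your scripts:

    # Show your group memberships by name and numeric GID (GIDs will not change)
    id

    # Look up a renamed group to confirm its GID ("pace-mylab" is a hypothetical name)
    getent group pace-mylab

    # Search your scripts for commands that reference group names and may need the new "pace-" prefix
    grep -rn -e 'chgrp' -e 'chown' -e 'newgrp' ~/job-scripts/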

ITEMS NOT REQUIRING USER ACTION: 

  • [datacenter] Databank maintenance: Replace pump impeller, cooling tower maintenance 
  • [storage] Install NFS over RDMA kernel module to enable pNFS for access to VAST storage test machine 
  • [storage] Replace two UPS units for the SFA14KXE controllers 
  • [storage] Upgrade DDN SFA14KXE controller firmware 
  • [storage] Upgrade DDN 400NV ICE storage controllers and servers 
  • [Phoenix, Hive, ICE, Firebird] Upgrade all clusters to Slurm version 23.11.X 

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.  

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you,  

-The PACE Team 

PACE Winter Break Schedule

Thank you for being a PACE user. Please be mindful that we are closed during the official GT Winter Break, providing only emergency services, and will have limited availability the week of Dec 18th-22nd. If you have an urgent incident, be specific about the request, including deadlines. While we cannot make any guarantees, we will do our best. We hope you enjoy your holiday, stay safe, and best wishes for the new year!

Resolved – Scratch Space Outage on the Phoenix Cluster

[Update 11/6/2023 at 12:26 pm]

Dear Phoenix users,

Summary: The Phoenix cluster is back online. The scheduler is unpaused, and jobs that were held have now resumed.  

Details: The PACE support team has upgraded several components of the scratch storage system (controller software, disk firmware) according to the plan provided by the hardware vendor (DDN). We have tested the performance of the file system, and the tests have passed.  

Impact: Please continue using the Phoenix cluster as usual. In case of issues, please contact us at pace-support@oit.gatech.edu. Also, please keep in mind that the cluster will be offline tomorrow (November 7) from 8am until 8pm so the PACE team can work on fixing the project storage (which is an unrelated issue). 

Thank you and have a great day!

The PACE Team

[Update 11/6/2023 at 9:27 am]

Dear Phoenix users, 

Summary: Storage performance on Phoenix scratch space is degraded. 

Details: Around 11pm on Saturday (November 4, 2023), the scratch space on the Phoenix cluster became unresponsive, and it is currently inaccessible to users. The PACE team is investigating the situation and applying an upgrade recommended by the vendor to improve stability. To prevent additional job failures, the PACE team paused the scheduler on Phoenix at 8:13am on Monday, November 6. The upgrade is estimated to take until 12pm on Monday; once it is installed, the scheduler will be released and the paused jobs will resume. This issue is unrelated to the Phoenix project storage slowness reported last week, which will be addressed during the Phoenix outage tomorrow (November 7). 

Impact: Phoenix users are currently unable to access scratch storage. Jobs on the Phoenix cluster have been paused, and new jobs will not start until the scheduler is resumed. Other PACE clusters (ICE, Hive, Firebird, Buzzard) are not affected. 

We apologize for the multiple storage-related issues observed on the Phoenix cluster. We are continuing to engage with the storage vendor to improve the performance of our system. The recommended upgrade is in progress, and the cluster will be offline tomorrow to address the project filesystem issue. 

Thank you for your patience!

The PACE Team

Degraded Phoenix Project storage performance

[Update 11/12/2023 11:15 PM]

The rebuild process completed on Sunday afternoon, and the system has returned to normal performance.

[Update 11/11/2023 6:40 PM]

Unfortunately, the rebuild is still in progress. Another drive has failed, which is slowing the rebuild down. We continue to monitor the situation closely.

[Update 11/10/2023 4:30 PM]

Summary: The project storage on Phoenix (/storage/coda1) is degraded due to hard drive failures. Access to the data is not affected, and the scheduler continues to accept and process jobs. 

Details: Two hard drives that are part of the Phoenix storage space failed on the morning of Friday, November 10, 2023 (the first drive failed at 8:05 am, and the second drive failed at 11:10 am). The operating system automatically activated some spare drives and started rebuilding the pool. During this process, file read and write operations by Phoenix users will take longer than usual. The rebuild is expected to end around 3 am on Saturday, November 11, 2023 (our original estimate of 7pm, Nov 10 2023, was too optimistic).  

Impact: During the rebuild, file input/output operations are slower than usual. Access to the Phoenix cluster is not impacted, and the scheduler is processing jobs at a normal rate. 

We thank you for your patience as we work to resolve the problem. 

[Update 11/8/2023 at 10:00 am]

Summary: Phoenix project storage experienced degraded performance overnight. PACE and our storage vendor made an additional configuration change this morning to restore performance.  

Details: Following yesterday’s upgrade, Phoenix project storage became degraded overnight, though to a lesser extent than prior to the upgrade. Early this morning, the PACE team found that performance was slower than normal and began working with our storage vendor to identify the cause. We adjusted a parameter that handles migration of data between disk pools, and performance was restored.  

Impact: Reading or writing files on the Phoenix project filesystem (coda1) may have been slower than usual last night and this morning. The prior upgrade mitigated the problem, so the slowdown was less severe than before. Home and scratch directories were not affected. 

Thank you for your patience as we completed this repair.

[Update 11/7/2023 at 2:53 pm]

Dear Phoenix users, 

Summary: The hardware upgrade of the Phoenix cluster storage was completed successfully, and the cluster is back in operation. The scheduler is unpaused, and jobs that were held have now resumed. Globus transfer jobs have also been resumed.

Details: To fix the slow response of the project storage, we brought the Phoenix cluster offline from 8am until 2:50pm and upgraded several hardware components and firmware libraries on the /storage/coda1 file system. Engineers from the storage vendor worked with us throughout the upgrade and helped us ensure that the storage is operating correctly.  

Impact: Storage on the Phoenix cluster can be accessed as usual, and jobs can be submitted. The Globus and OpenOnDemand services are working as expected. If you have any issues, please contact us at pace-support@oit.gatech.edu.

Thank you for your patience! 

The PACE team

[Update 11/3/2023 at 3:53 pm]

Dear Phoenix users,  
 

Summary: Given the significant impact the storage issues are having on the community, the Phoenix cluster will be taken offline on Tuesday, November 7, 2023, to fix problems with the project storage. The offline period will start at 8am and is expected to end by 8pm.   

Details: To implement the fixes to the firmware and software libraries for the storage appliance controllers, we need to pause the Phoenix cluster for 12 hours, starting at 8am. Access to the file system /storage/coda1 will be interrupted while the work is in progress; Globus transfer jobs will also be paused while the fix is implemented. These fixes are expected to help improve the performance of the project storage, which has been below the normal baseline since Monday, October 30.

The date of Tuesday, November 7, 2023 was selected to ensure that an engineer from our storage vendor will be available to assist our team in performing the upgrade tasks and monitoring the health of the storage. 

Impact: On November 7, 2023, no new jobs will start on the Phoenix cluster from 8 am until 8 pm. The job queue will resume after 8 pm. If your job fails after the cluster is released at 8 pm, please resubmit it. This only affects the Phoenix cluster; the other PACE clusters (Firebird, Hive, ICE, and Buzzard) will remain online and operate as usual. 

Again, we greatly appreciate your patience as we work to resolve the issue. Please contact us at pace-support@oit.gatech.edu with any questions!

[Update 11/3/2023 1:53 pm]

Dear Phoenix users,  

Summary: Storage performance on Phoenix coda1 project space continues to be degraded. 

Details: Intermittent performance issues continue on the project file system on the Phoenix cluster, /storage/coda1. This was first observed on the afternoon of Monday, October 30. 

Our storage vendor found versioning issues with firmware and software libraries on the storage appliance controllers that might be causing additional delays with data retransmissions. The mismatch was created when a hardware component was replaced during a scheduled maintenance period; the replacement required the rest of the system to be upgraded to matching versions, but that step was omitted from the installation and upgrade instructions. 

We continue to work with the vendor to define a proper plan to update all the components and correct this situation. It is possible we will need to pause cluster operations to avoid any issues while the fix is implemented; during this pause, jobs will be held and will resume when the cluster is released. We are working with the vendor to make sure we have all the details before scheduling the implementation, and we will provide information on when the fix will be applied and what to expect of cluster performance and operations. 

Impact: Simple file operations, including listing the files in a directory, reading from a file, and saving a file, are intermittently taking longer than usual (at worst, an operation that should take a few milliseconds runs in about 10 seconds). This affects the /storage/coda1/ project storage directories, but not scratch storage or any of the other PACE clusters.    

We greatly appreciate your patience as we work to resolve the issue. Please contact us at pace-support@oit.gatech.edu with any questions!  

Thank you and have a great day!  

-The PACE Team 

PACE Maintenance Period (Oct 24 – Oct 30, 2023) is over

The maintenance on the Phoenix, Hive, Buzzard, Firebird, and ICE clusters has been completed. All clusters are back in production and ready for research, and all jobs held by the scheduler have been released. The Firebird cluster was released at 12:30 pm on October 30, and the other clusters were released at 2:45 pm on October 27.  

Update on the current cooling situation: DataBank performed a temporary repair to restore cooling to the research hosting environment. Cooling capacity in the research hall is below 100% and is being actively monitored, but we are currently able to run the clusters at full capacity. DataBank plans to install new parts during the next maintenance window, scheduled for Jan 23rd-25th, 2024. Should the situation worsen and a full repair be required sooner, we will do our best to provide at least one week's notice. At this time, we do not expect the need for additional downtime.  

Update on Firebird: We are happy to announce that the Firebird cluster is ready to use after migration to the Slurm scheduler! Again, we greatly appreciate your patience during this extended maintenance period. Over the weekend, we investigated a few lingering issues with MPI and the user environment on the cluster and have implemented and tested corrections.  
 

Firebird users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows run on the Slurm-based cluster. Please contact us if you need additional help making the transition. PACE provides the Firebird Migration Guide and an additional Firebird-specific Slurm training session [register here] to support a smooth transition of your workflows to Slurm. You are also welcome to join our PACE Consulting Sessions or to email us for support.  

 
[Changes to Note] 

  • New Hardware: There are 12 new 32-core Intel Cascade Lake CPU nodes with 384 GB of RAM available, in addition to new GPU nodes with 4x NVIDIA A100 GPUs, 48-core Intel Xeon Gold CPUs, and 512 GB of RAM.  
  • Account names: Under Slurm, charge accounts will follow the naming pattern “cgts-<PI username>-<project>-<account>” rather than starting with the “GT-” prefix. 
  • Default GPU: If you do not specify a GPU type in your job script, Slurm will default to an NVIDIA A100 node rather than an NVIDIA RTX6000 node; the A100 nodes are more expensive but more performant. (An example batch script covering the account, GPU, and MPI notes follows this list.)  
  • SSH Keys: When you log in for the first time, you may receive a warning about new host keys, similar to the following: 
    Warning: the ECDSA host key for ‘login-.pace.gatech.edu’ differs from the key for the IP address ‘xxx.xx.xx.xx’ 
    Offending key for IP in /home/gbrudell3/.ssh/known_hosts:1 
    Are you sure you want to continue connecting (yes/no)? 
    This is expected! Simply type “yes” to continue!
    • Depending on your local SSH client settings, you may instead be prevented from logging in and need to edit your ~/.ssh/known_hosts file to remove the old key. 
  • Jupyter and VNC: We do not currently have a replacement for Jupyter or VNC scripts for the new Slurm environment; we will be working on a solution to these needs over the coming weeks. 
  • MPI: For researchers using mvapich2 under the Slurm environment, specifying the additional --constraint=core24 or --constraint=core32 option is necessary to ensure a homogeneous node allocation for the job (these reflect the number of CPUs per node).  
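
To tie the scheduler-related notes above together, here is a minimal example batch script; the account name, module name, and executable are hypothetical placeholders, so adjust them to your own allocation and software.

    #!/bin/bash
    #SBATCH --job-name=example
    #SBATCH --account=cgts-gburdell3-proj1-paid   # placeholder following the cgts-<PI username>-<project>-<account> pattern
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=24
    #SBATCH --constraint=core24                   # keep mvapich2 jobs on homogeneous 24-core nodes
    #SBATCH --time=01:00:00
    # For GPU jobs, name the GPU type explicitly to avoid the A100 default, e.g.:
    # #SBATCH --gres=gpu:RTX6000:1

    module load mvapich2                          # module name is an assumption; check "module avail"
    srun ./my_mpi_app                             # placeholder executable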

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu

Thank you for your patience during this extended outage!

The PACE Team

PACE Maintenance Period (Oct 24 – Oct 26, 2023) 

WHEN IS IT HAPPENING?

This is a reminder that PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 10/24/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 10/26/2023. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete.

WHAT DO YOU NEED TO DO?

As usual, jobs whose requested wall times would overlap the Maintenance Period will be held by the scheduler until the maintenance is complete. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable, including Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime.

WHAT IS HAPPENING?

ITEMS REQUIRING USER ACTION:

•     [Firebird] Migrate from the Moab/Torque scheduler to the Slurm scheduler. If you are a Firebird user, we will get in touch with you and provide assistance with rewriting your batch scripts and adjusting your workflow to Slurm.

ITEMS NOT REQUIRING USER ACTION:

•     [Network] Upgrade network switches

•     [Network][Hive] Configure redundancy on Hive racks

•     [Network] Upgrade firmware on InfiniBand network switches

•     [Storage][Phoenix] Reconfigure old scratch storage

•     [Storage][Phoenix] Upgrade Lustre controller and disk firmware, apply patches

•     [Datacenter] Datacenter cooling tower cleaning

WHY IS IT HAPPENING?

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?

All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Phoenix Storage and Scheduler Outage

[Update 10/13/2023 10:35am] 

Dear Phoenix Users,  

The Lustre scratch storage and the Slurm scheduler on Phoenix went down late yesterday evening (starting around 11pm) and are now back up and available. We have run tests to confirm that the Lustre storage and Slurm scheduler are running correctly, and we will continue to monitor both for any other issues. 

Preliminary analysis by the storage vendor indicates that the outage was caused by a kernel bug we thought had previously been addressed. As an immediate fix, we have disabled features on the Lustre storage appliance that should avoid triggering another outage; a long-term patch is planned for our upcoming Maintenance Period (October 24-26). 

Existing jobs that were queued have already started or will start soon. You should be able to submit new jobs on the scheduler without issue. Again, we strongly recommend reviewing the output of your jobs that use the Lustre storage (project and scratch directories) as there may be unexpected errors.  We will refund any jobs that failed due to the outage. 
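
If it is useful, here is a minimal sketch for listing your jobs that ended in a failed state since the outage began (around 11pm on October 12); adjust the time window and states as needed.

    # Show your jobs (allocations only) that failed since the outage window opened
    sacct -X --starttime=2023-10-12T23:00 \
          --state=FAILED,NODE_FAIL,TIMEOUT \
          --format=JobID,JobName%20,State,End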

We apologize for the inconvenience. Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.

Thank you, 

-The PACE Team 

[Update 10/13/2023 9:48am] 

Dear Phoenix Users, 

Unfortunately, the Lustre storage on Phoenix became unresponsive late yesterday evening (starting around 11pm). As a result of the Lustre storage outage, the Slurm scheduler was also impacted and became unresponsive. 

We have restarted the Lustre storage appliance, and the file system is now available. The vendor is currently running tests to make sure the Lustre storage is healthy, and we will also be running checks on the Slurm scheduler.

Jobs currently running will likely continue running, but we strongly recommend reviewing the output of your jobs that use the Lustre storage (project and scratch) as there may be unexpected errors. Jobs waiting in-queue will stay in-queue until the scheduler has resumed. 

We will provide further updates as soon as we complete testing of the Lustre storage and Slurm scheduler. 

Thank you, 

-The PACE Team  

Phoenix Storage Cables and Hard Drive Replacement

[Update 9/14/2023 1:02pm]

The cables on the Phoenix and Hive storage systems have been replaced with no interruption to production.

[Update 9/14/2023 5:54pm]

WHAT’S HAPPENING?

Two cables on Phoenix’s Lustre storage and one cable on Hive’s storage need to be replaced. The cable replacement will take about 2 hours to complete.

WHEN IS IT HAPPENING?

Thursday, September 14th, 2023 starting at 10 AM EDT.

WHY IS IT HAPPENING?

Required maintenance.

WHO IS AFFECTED?

All users. There is a potential for a storage access outage and a subsequent temporary decrease in performance.

WHAT DO YOU NEED TO DO?

During the cable replacement, one controller on each of the Phoenix and Hive storage systems will be shut down, and the redundant controller will take all the traffic. Data access should be preserved, but there have been cases where storage has become inaccessible. If storage becomes unavailable during the replacement, your jobs may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.

WHO SHOULD YOU CONTACT FOR QUESTIONS?

For any questions, please contact PACE at pace-support@oit.gatech.edu.

Upcoming Firebird Slurm Migration Announcement

The Firebird cluster will be migrating to the Slurm scheduler on October 24-26, 2023. PACE has developed a plan to transition researchers’ workflows smoothly. As you may be aware, PACE began the Slurm migration in July 2022 and has already migrated the Hive, Phoenix, and ICE clusters. Firebird is the last cluster in PACE’s transition from Torque/Moab to Slurm, which brings increased job throughput and better scheduling policy enforcement. The new scheduler will also better support the new hardware to be added to Firebird soon. We will be updating our software stack at the same time and offering orientation and consulting sessions to support this migration. 

Software Stack 

In addition to the scheduler migration, the PACE Apps central software stack will also be updated. This software stack supports the Slurm scheduler and already runs successfully on Phoenix, Hive, and ICE. The Firebird cluster will feature the provided applications listed in our documentation. Please review this list of non-CUI software we will offer on Firebird post-migration and let us know via email (pace-support@oit.gatech.edu) if any PACE-installed software you are currently using on Firebird is missing from the list. If you already submitted a reply to the application survey sent to Firebird PIs, there is no need to repeat requests. Researchers installing or writing custom software will need to recompile applications to reflect the new MPI and other libraries once the new system is ready.   
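
For example, recompiling a simple MPI application against the new stack might look like the minimal sketch below; the module names and source file are placeholders, so check "module avail" for the exact modules on the new system.

    # Start from a clean environment, then load the new toolchain (module names are assumptions)
    module purge
    module load gcc mvapich2

    # Recompile the application against the new MPI libraries (placeholder source file)
    mpicc -O2 -o my_app my_app.c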
 
We will freeze new software installations in the Torque-based PACE central software stack starting Sep 1st, 2023. You can continue installing software in your local/shared space without interruption. 

No Test Environment 

Due to security and capacity constraints, it is infeasible to use a progressive rollout approach as we did for Phoenix and Hive, so there will not be a test environment. For researchers installing or writing their own software, we highly recommend the following: 

  • For those with access to Phoenix, compile non-CUI software on Phoenix now and report any issues you encounter so that we can help you before the migration. 
  • Please report any self-installed CUI software you need that cannot be tested on Phoenix. We will do our best to have all dependent libraries ready and will give higher priority to assisting with reinstallation immediately after the Slurm migration.  

Support 

PACE will provide documentation, training sessions [register here], and support (consulting sessions and 1-1 sessions) to aid your workflow transitions to Slurm. Documentation and a guide for converting job scripts from PBS to Slurm-based commands will be ready before the migration. We will offer Slurm training right after the migration; future communications will provide the schedule. You are also welcome to join our PACE Consulting Sessions or to email us for support.  
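
As a preview of the kind of conversion the guide will cover, here is a minimal sketch mapping common PBS directives to their Slurm equivalents; the job name, resource values, and account are placeholders rather than Firebird-specific settings.

    # Torque/Moab (PBS)                        # Slurm
    #PBS -N myjob                              #SBATCH --job-name=myjob
    #PBS -l nodes=2:ppn=24                     #SBATCH --nodes=2 --ntasks-per-node=24
    #PBS -l walltime=04:00:00                  #SBATCH --time=04:00:00
    #PBS -A GT-gburdell3                       #SBATCH --account=cgts-gburdell3-proj1-paid
    # Submit with: qsub myjob.pbs              # Submit with: sbatch myjob.sbatch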

We are excited to launch Slurm on Firebird to improve Georgia Tech’s research computing infrastructure! Please contact us with any questions or concerns about this transition.