
Phoenix Storage and Scheduler Outage

[Update 10/13/2023 10:35am] 

Dear Phoenix Users,  

The Lustre scratch storage and the Slurm scheduler on Phoenix went down late yesterday evening (starting around 11pm) and are now back up and available. We have run tests to confirm that the Lustre storage and the Slurm scheduler are running correctly. We will continue to monitor the storage and scheduler for any other issues. 

Preliminary analysis by the storage vendor indicates that the outage was caused by a kernel bug we thought had previously been addressed. As an immediate fix, we have disabled features on the Lustre storage appliance that should avoid triggering another outage; a long-term patch is planned for our upcoming Maintenance Period (October 24-26). 

Existing jobs that were queued have already started or will start soon. You should be able to submit new jobs on the scheduler without issue. Again, we strongly recommend reviewing the output of your jobs that use the Lustre storage (project and scratch directories), as there may be unexpected errors. We will refund any jobs that failed due to the outage. 
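
If you want to check which of your jobs may have been affected, one option (a rough sketch only; adjust the time window and output fields to your needs) is to query Slurm's accounting records for jobs that ended in a failure state during the outage:

    # List your jobs that ended in a failure state during the outage window
    sacct -u $USER -S 2023-10-12T23:00 -E 2023-10-13T11:00 \
          --state=FAILED,NODE_FAIL,TIMEOUT -o JobID,JobName%20,State,ExitCode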

We apologize for the inconvenience. Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.

Thank you, 

-The PACE Team 

[Update 10/13/2023 9:48am] 

Dear Phoenix Users, 

Unfortunately, the Lustre storage system on Phoenix became unresponsive late yesterday evening (starting around 11pm). As a result of the Lustre storage outage, the Slurm scheduler was also impacted and became unresponsive. 

We have restarted the Lustre storage appliance, and the file system is now available. The vendor is currently running tests to make sure the Lustre storage is healthy. We will also be running checks on the Slurm scheduler.

Jobs currently running will likely continue running, but we strongly recommend reviewing the output of your jobs that use the Lustre storage (project and scratch), as there may be unexpected errors. Jobs waiting in the queue will stay there until the scheduler has resumed. 

We will continue to provide updates as we complete testing of the Lustre storage and Slurm scheduler. 

Thank you, 

-The PACE Team  

Phoenix Storage Cables and Hard Drive Replacement

[Update 9/14/2023 1:02pm]
The cables have been replaced on Phoenix and Hive storage with no interruption to production.

[Update 9/14/2023 5:54pm]

WHAT’S HAPPENING?

Two cables on Phoenix's Lustre storage and one cable on Hive's storage need to be replaced. Cable replacement will take about 2 hours to complete.

WHEN IS IT HAPPENING?

Thursday, September 14th, 2023 starting at 10 AM EDT.

WHY IS IT HAPPENING?

Required maintenance.

WHO IS AFFECTED?

All users may experience a potential storage access outage and a subsequent temporary decrease in performance.

WHAT DO YOU NEED TO DO?

During cable replacement on the Phoenix and Hive storage systems, one controller on each system will be shut down and the redundant controller will take all of the traffic. Data access should be preserved, but there have been cases where storage has become inaccessible. If storage becomes unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.
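
As a minimal sketch of that cancel-and-resubmit step (the job ID and script name below are placeholders), assuming a standard Slurm workflow:

    # Cancel the affected job (replace 1234567 with the job ID shown by squeue)
    scancel 1234567
    # After storage access is restored, resubmit the batch script
    sbatch my_job.sbatch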

WHO SHOULD YOU CONTACT FOR QUESTIONS?

For any questions, please contact PACE at pace-support@oit.gatech.edu.

Upcoming Firebird Slurm Migration Announcement

The Firebird cluster will be migrating to the Slurm scheduler on October 24-26, 2023. PACE has developed a plan to transition researchers' workflows smoothly. As you may be aware, PACE began the Slurm migration in July 2022, and we have already successfully migrated the Hive, Phoenix, and ICE clusters. Firebird is the last cluster in PACE's transition from Torque/Moab to Slurm, bringing increased job throughput and better scheduling policy enforcement. The new scheduler will better support the new hardware to be added soon to Firebird. We will be updating our software stack at the same time and offering support with orientation and consulting sessions to facilitate this migration. 

Software Stack 

In addition to the scheduler migration, the PACE Apps central software stack will also be updated. This software stack supports the Slurm scheduler and already runs successfully on Phoenix, Hive, and ICE. The Firebird cluster will feature the provided applications listed in our documentation. Please review this list of non-CUI software we will offer on Firebird post-migration and let us know via email (pace-support@oit.gatech.edu) if any PACE-installed software you currently use on Firebird is missing from the list. If you have already submitted a reply to the application survey sent to Firebird PIs, there is no need to repeat your requests. Researchers installing or writing custom software will need to recompile their applications against the new MPI and other libraries once the new system is ready. 
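
As a rough illustration of what that rebuild might look like (the module names below are placeholders; the actual compiler and MPI modules will depend on the post-migration stack):

    # Load the post-migration compiler and MPI modules (names are illustrative)
    module load gcc mvapich2
    # Recompile against the new MPI and supporting libraries
    mpicc -O2 -o my_app my_app.c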
 
We will freeze new software installations in the PACE central software stack (the Torque-based stack) starting September 1st, 2023. You can continue installing software in your local or shared space without interruption. 

No Test Environment 

Due to security and capacity constraints, it is infeasible to use the progressive rollout approach we used for Phoenix and Hive, so there will not be a test environment. For researchers installing or writing their own software, we highly recommend the following: 

  • For those with access to Phoenix, compile non-CUI software on Phoenix now and report any issues you encounter so that we can help you before the migration. 
  • Please report any self-installed CUI software you need which cannot be tested on Phoenix. We will try our best to make all dependent libraries ready and give higher priority to assisting with reinstallation immediately after the Slurm migration.  

Support 

PACE will provide documentation, training sessions [register here], and support (consulting sessions and 1-1 sessions) to aid your workflow transition to Slurm. Documentation and a guide for converting job scripts from PBS to Slurm commands will be ready before the migration. We will offer Slurm training right after the migration; future communications will provide the schedule. You are welcome to join our PACE Consulting Sessions or to email us for support.  
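
As a preview of the kind of changes the conversion guide will cover, here is a rough sketch of common PBS-to-Slurm mappings (job names, resource values, and queue/partition names are placeholders; Firebird-specific partition names will be covered in the migration documentation):

    # PBS (Torque/Moab)                  # Slurm equivalent
    #PBS -N myjob                        #SBATCH -J myjob
    #PBS -l nodes=1:ppn=8                #SBATCH --nodes=1 --ntasks-per-node=8
    #PBS -l walltime=02:00:00            #SBATCH --time=02:00:00
    #PBS -q <queue>                      #SBATCH -p <partition>
    qsub myjob.pbs                       sbatch myjob.sbatch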

We are excited to launch Slurm on Firebird to improve Georgia Tech’s research computing infrastructure! Please contact us with any questions or concerns about this transition. 

All PACE Clusters Down Due to Cooling Failure

[Update 8/27/23 11:10 AM]

All PACE clusters have returned to service.

The datacenter cooling pump was replaced early this morning. After powering on compute nodes and testing, PACE resumed jobs on all clusters. On clusters which charge for use (Phoenix and Firebird), jobs that were cancelled yesterday evening when compute nodes were turned off will be refunded. Please submit new jobs to resume your work.

Thank you for your patience during this emergency repair.

[Original Post 8/26/23 9:40 PM]

Summary: A pump in the Coda Datacenter cooling system overheated on Saturday evening. All PACE compute nodes across all clusters (Phoenix, Hive, Firebird, Buzzard, and ICE) have been shut down until cooling is restored, stopping all compute jobs.

Details: Databank is reporting an issue with the high-temperature condenser pump in the Research Hall of the Coda data center, which hosts PACE compute nodes. The Research Hall is being powered off so that Databank facilities can replace the pump.

Impact: All PACE compute nodes are unavailable. Running jobs have been cancelled, and no new jobs can start. Login nodes and storage systems remain available. Compute nodes will remain off until the cooling system is repaired.

Phoenix Storage Cables and Hard Drive Replacement

WHAT’S HAPPENING?

Two SAS cables and one hard drive for Phoenix's Lustre storage need to be replaced. Cable and hard drive replacement will take about 2 hours to complete.

WHEN IS IT HAPPENING?

Tuesday, August 22nd, 2023 starting at 10AM EDT.

WHY IS IT HAPPENING?

Required maintenance.

WHO IS AFFECTED?

All users may experience a potential storage access outage and a subsequent temporary decrease in performance.

WHAT DO YOU NEED TO DO?

During cable replacement, one of the controllers will be shut down and the redundant controller will take all of the traffic. Data access should be preserved, but there have been cases where storage has become inaccessible. If storage becomes unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.

WHO SHOULD YOU CONTACT FOR QUESTIONS?

For any questions, please contact PACE at pace-support@oit.gatech.edu.

Phoenix Slurm Scheduler Outage

[Update 8/21/23 5:02 PM]

Dear Phoenix Users, 

The Slurm scheduler on Phoenix is back up and available. We have applied the patch that was recommended by SchedMD, the developer of Slurm; cleaned the database; and run tests to confirm that the scheduler is running correctly. We will continue to monitor the scheduler database for any other issues.

Existing jobs that have been queued should have already started or will start soon. You should be able to submit new jobs on the scheduler without issue. We will refund any jobs that failed due to the scheduler outage.

Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.

Thank you,

-The PACE Team 

[Update 8/21/23 3:20 PM]

Dear Phoenix Users, 

We have been working with the Slurm scheduler vendor, SchedMD, to identify and fix a corrupted association in the scheduler database and to provide a patch. During troubleshooting of the scheduler this afternoon, some jobs were able to be scheduled. We are going to pause the scheduler again to make sure the database cleanup can be completed without disruption from new jobs. 

Based on our estimates, we expect to restore the scheduler later tonight. We will provide an update as soon as the scheduler is released.

Thank you, 

-The PACE Team 

[Update 8/21/23 11:17 AM]

Dear Phoenix Users, 

Unfortunately, the Slurm scheduler controller is down due to issues with Slurm's database, and jobs cannot be scheduled. We have submitted a high-priority service request to SchedMD, the developer of Slurm, and should be able to provide an update soon. 

Jobs currently running will likely continue running, but we recommend reviewing their output, as there may be unexpected errors. Jobs waiting in the queue will stay there until the scheduler is fixed. 

The rest of the Phoenix cluster infrastructure (e.g., login nodes, storage) outside of the scheduler should be working. We recommend not running commands that interact with Slurm (e.g., scheduler commands such as 'sbatch', 'srun', 'sacct', or 'pace-quota'), because they will not work at this time. 

We will provide updates as we work on fixing the scheduler. 

Thank you, 

-The PACE Team 

Phoenix Scratch Storage Outage

[Update 8/7/23 9:34 PM]

Access to Phoenix scratch continued to have issues as of 10:19 PM last night (Sunday). We paused the scheduler and restarted the controller around 6 AM this morning (Monday).

Access to Phoenix scratch has been restored, and the scheduler has resumed allowing new jobs to begin. Jobs that failed due to the scratch outage, which began at 10:19 PM Sunday and ended this morning at 9:24 AM Monday, will be refunded. We continue to work with our storage vendor to identify what caused the controller to freeze.

Thank you for your patience as we restored scratch storage access today. Please contact us at pace-support@oit.gatech.edu with any questions.

[Update 8/6/23 2:25 PM]

Access to Phoenix scratch has been restored, and the scheduler has resumed allowing new jobs to begin. Jobs that failed due to the scratch outage, which began at 9:30 PM Saturday, will be refunded. We continue to work with our storage vendor to identify what caused the controller to freeze.

Thank you for your patience as we restored scratch storage access today. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Post 8/6/23 1:30 PM]

Summary: Phoenix scratch storage is currently unavailable, which may impact access to directories on other Phoenix storage systems. The Phoenix scheduler is paused, so no new jobs can start.

Details: A storage target controller on the Phoenix scratch system became unresponsive just before midnight on Saturday evening. The Phoenix scheduler crashed shortly before 7 AM Sunday morning due to the number of failures to reach scratch directories. PACE restarted the scheduler around 1 PM today (Sunday), restoring access, while also pausing it to prevent new jobs from starting.

Impact: The network scratch filesystem on Phoenix is inaccessible. Due to the symbolic link to scratch, an ls of Phoenix home directories may also hang. Access via Globus may also time out. Individual directories on the home storage device may be reachable if an ls of the main home directory is not performed. Scheduler commands, such as squeue, were not available this morning but have now been restored. As the scheduler is paused, any new jobs submitted will not start at this time. There is no impact to project storage.
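
If you need to reach files on home storage in the meantime, one possible workaround (a sketch only; the directory name is a placeholder) is to address a known subdirectory directly rather than listing the top level of your home directory:

    # Avoid 'ls ~', which may hang on the scratch symlink;
    # go straight to a known subdirectory instead
    cd ~/data_analysis
    ls ~/data_analysis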

Thank you for your patience as we investigate this issue. Please contact us at pace-support@oit.gatech.edu with any questions.

PACE Maintenance Period (Aug 8 – Aug 10, 2023) 

[Update 8/11/2023 8:33pm]

The controller replacement on the scratch storage system successfully passed four rounds of testing. Phoenix is back in production and is ready for research. We have released all jobs that were held by the scheduler. Please let us know if you have any problems.

I apologize for the inconvenience, but I believe this delayed return to production will help decrease future downtime.

The next planned maintenance period for all PACE clusters is October 24, 2023, at 6:00 AM through October 26, 2023, at 11:59 PM. Additional maintenance periods are tentatively scheduled for January 23-25, 2024, and May 7-9, 2024.

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you, 

Pam Buffington 

PACE Director 

[Update 8/10/2023 5:00pm]

The Hive, ICE, Firebird, and Buzzard clusters are now ready for research. We have released all jobs that were held by the scheduler. 

Unfortunately, Phoenix storage issues continue. All work was completed, but the scratch storage failed initial stress tests. The vendor is sending us a replacement controller, which will arrive and be installed early tomorrow. We will then stress-test the storage again. If it passes, Phoenix will be brought into production. If it fails, we will revert to the old scratch infrastructure in use prior to May 2023 while we look for a new solution. We have already begun syncing data, but this will take time: Phoenix would be brought into production with a still-syncing scratch file system while roughly 800TB is transferred, which may take approximately 1 week. Not all files will be there right away, but if you wait, they will come back. In the meantime, you may encounter files that were present in your scratch directory prior to the May maintenance period but have since been deleted; these will disappear again as the sync completes.  

The monthly deletion of old scratch directories scheduled for next week is canceled. Please disregard the notification you may have received last week.  

I apologize for the inconvenience, but I believe this delay will help decrease future downtime.  

The next planned maintenance period for all PACE clusters is October 24, 2023, at 6:00 AM through October 26, 2023, at 11:59 PM. 

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you, 

Pam Buffington 

PACE Director 

[Update 8/8/2023 6:00am]

The PACE Maintenance Period starts now, at 6:00AM on Tuesday, 08/08/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 08/10/2023.

[Update 8/7/2023 12:00pm]

This is a reminder that PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 08/08/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 08/10/2023.

[Update 8/2/2023 1:43pm]

WHEN IS IT HAPPENING?

This is a reminder that PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 08/08/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 08/10/2023. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete.

WHAT DO YOU NEED TO DO?  

As usual, jobs whose resource requests would have them running during the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime.
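
For example (a rough sketch with placeholder values), a job whose requested walltime ends before 6:00AM on 08/08 can still run, while longer requests will be held; squeue shows the state and pending reason of your queued jobs:

    # Request a walltime short enough to finish before the maintenance window begins
    sbatch --time=12:00:00 my_job.sbatch
    # Check the state and pending reason of your queued jobs
    squeue -u $USER -o "%.10i %.20j %.10T %R"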

WHAT IS HAPPENING? 

ITEMS REQUIRING USER ACTION: 

  • [Phoenix] Create Interactive CPU and GPU partitions on Phoenix 

ITEMS NOT REQUIRING USER ACTION: 

  • [Phoenix, Hive, ICE] Re-image login nodes to increase /tmp to 20GB 
  • [Phoenix, Hive, ICE] Open XDMoD to campus 
  • [Phoenix] Replace Phoenix project storage controller 
  • [Firebird] Upgrade firewall device firmware supporting CUI 
  • [Firebird] Add additional InfiniBand switches and cables to increase redundancy and capacity 
  • [OSG][Network] Move ScienceDMZ VRF to new network fabric 
  • [Network] Install leaf module to InfiniBand director switch 
  • [Network] Configure VPC pair redundancy to Research hall network switches 
  • [Network][Firebird] Install high-speed IB NIC on storage appliance for improved performance and capacity 
  • [Storage] DDN Controller firmware & Disk firmware upgrade
  • [Storage] Reboot the backup controller to synchronize with the main controller 
  • [Storage] Increase storage capacity for PACE backup servers 
  • [Storage] Increase storage capacity for EAS group storage servers 
  • [Storage] Replace cables on storage controller
  • [Software] Move pace-apps to Slurm on admin nodes 
  • [Datacenter] Datacenter cooling maintenance

WHY IS IT HAPPENING? 

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED? 

All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you, 

-The PACE Team 

[Update 7/26/2023 4:39pm]

WHEN IS IT HAPPENING? 

PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 08/08/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 08/10/2023. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete. 

WHAT DO YOU NEED TO DO?  

As usual, jobs whose resource requests would have them running during the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. 

WHAT IS HAPPENING? 

ITEMS NOT REQUIRING USER ACTION: 

  • [Phoenix, Hive, ICE] Re-image login nodes to increase /tmp to 20GB 
  • [Phoenix, Hive, ICE] Open XDMoD to campus 
  • [Phoenix] Replace Phoenix project storage controller 
  • [Firebird] Upgrade firewall device firmware supporting CUI 
  • [Firebird] Add additional InfiniBand switches and cables to increase redundancy and capacity 
  • [OSG][Network] Move ScienceDMZ VRF to new network fabric 
  • [Network] Install leaf module to InfiniBand director switch 
  • [Network] Configure VPC pair redundancy to Research hall network switches 
  • [Network][Firebird] Install high-speed IB NIC on storage appliance for improved performance and capacity 
  • [Storage] Reboot the backup controller to synchronize with the main controller 
  • [Storage] Increase storage capacity for PACE backup servers 
  • [Storage] Increase storage capacity for EAS group storage servers 
  • [Storage] Replace cables on storage controller 
  • [Datacenter] Datacenter cooling maintenance 

WHY IS IT HAPPENING? 

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system. 

WHO IS AFFECTED? 

All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you, 

-The PACE Team 

Hive Storage SAS Cable Replacement

[Update 7/25/2023 1:04pm]
The SAS cable has been replaced with no interruption to production.

[Update 7/24/2023 3:13pm]

WHAT’S HAPPENING?

One SAS cable between the enclosure and the controller of the Hive storage system needs to be replaced. Cable replacement will take about 2 hours to complete.

WHEN IS IT HAPPENING?

Tuesday, July 25th, 2023 starting at 10AM EDT.

WHY IS IT HAPPENING?

Required maintenance.

WHO IS AFFECTED?

All users may experience a potential storage access outage and a subsequent temporary decrease in performance.

WHAT DO YOU NEED TO DO?

During cable replacement, one of the controllers will be shut down and the redundant controller will take all of the traffic. Data access should be preserved, but there have been cases where storage has become inaccessible. If storage becomes unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.

WHO SHOULD YOU CONTACT FOR QUESTIONS?

For any questions, please contact PACE at pace-support@oit.gatech.edu.

Phoenix Project Storage & Login Node Outage

[Update 7/21/2023 3:30pm]

Dear Phoenix Users,

The Lustre project storage filesystem on Phoenix is back up and available. We have completed the cable replacements, reseated and replaced a couple of hard drives, and restarted the controller. We have run tests to confirm that the storage is running correctly. Performance may still be degraded as redundant drives rebuild, but it is better than it has been over the last few days.

Phoenix’s head nodes, which were unresponsive earlier this morning, are available again without issue. We will continue to monitor the login nodes for any other issues.

You should be able to start jobs on the scheduler without issue. We will refund any job that failed after 8:00 AM this morning due to the outage.

Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.

[Original Post 7/21/2023 9:46 am]

Summary: The Lustre project storage filesystem on Phoenix became unresponsive this morning. Researchers may be unable to access data in their project storage. Multiple Phoenix login nodes have also become unresponsive, which may also prevent logins. We have paused the scheduler, preventing new jobs from starting, while we investigate.

Details: The PACE team is currently investigating an outage on the Lustre project storage filesystem for Phoenix. The cause is not yet known, but PACE is working with the vendor to find a resolution.

Impact: The project storage filesystem may not be reachable at this time, so read, write, or ls attempts on project storage may fail, including via Globus. This may impact logins as well. Job scheduling is now paused, so jobs can be submitted, but no new jobs will start. Jobs that were already running will continue, though those using project storage may not progress.

Thank you for your patience as we continue investigating this issue. Please contact us at pace-support@oit.gatech.edu with any questions.