Data Center Power Outage – 10/2/2024

[Update – 10/4/2024 – 12:43am]

The Buzzard cluster has been tested and confirmed functional, and all nodes are back in service.

All PACE clusters are back in service, the impacts of the power outage have been remediated – this outage is over.

[Update – 10/3/2024 – 11:59am]

The Firebird cluster has been fully tested and is available for use. The scheduler has been un-paused and all queued jobs have resumed. Both RHEL7 and RHEL9 environments on Firebird are available for use. 
All Firebird nodes are back in service. 
 
Reimbursements will be provided for all paid jobs impacted by the power and cooling outages this week. 

 [Update – 10/3/2024 – 11:55am]

The Phoenix cluster has been fully tested and is available for use. The scheduler has been un-paused and all queued jobs have resumed. Both RHEL7 and RHEL9 environments on Phoenix are available for use. 
PACE continues to investigate 54 nodes that we were unable to power on remotely after the outage, including 19 NVIDIA V100 GPU nodes.

Reimbursements will be provided for all paid jobs impacted by the power and cooling outages this week. We will provide the details for reimbursement of paid storage to affected users later this week.  

We are also doubling the amount of credits for ALL free-tier accounts on Phoenix for the month of October to offset the impacts of these outages. All Georgia Tech free-tier accounts (starting with gts-) will have a balance of $136 for the month of October; all GTRI free-tier accounts (starting with gtris-) will have a balance of $504.

[Update – 10/3/2024 – 9:58am]

The Hive cluster has been fully tested and is available for use. The scheduler has been un-paused and all queued jobs have resumed. Both RHEL7 and RHEL9 environments on Hive are available for use.  
PACE continues to investigate 21 CPU nodes, 10 “nvme” nodes, and 4 “himem” nodes on Hive for errors and will return those to service as soon as possible.  
 
The PACE Team is continuing to test the Phoenix, Firebird, and Buzzard clusters, in that order of priority.

[Update – 10/3/2024 – 9:00am]

PACE and the OIT Datacenter teams have brought up the vast majority of machines making up the PACE clusters. Roughly 100 of our 2,100 nodes remain in a state requiring manual intervention. The PACE team is working to confirm hardware readiness and is beginning to carry out test procedures prior to releasing the clusters. Further updates will be provided as clusters become available for use.

The PACE team is prioritizing the Phoenix and Hive clusters, followed by Firebird and Buzzard. We hope to have the full suite of systems released by mid-afternoon.

[Update – 10/2/2024 – 5:01pm]

The ICE Cluster has been fully powered on, tested, and released for access in order to prioritize educational resources.

PACE and the OIT Datacenter teams are in the process of bringing up the machines that make up the research clusters. Due to the sudden nature of the outage, the usual recovery mechanisms for rapid power-up are not available, which is considerably slowing recovery efforts compared to previous outages. The PACE and OIT Datacenter teams are continuing to check, manually reset, power on, and then test the hundreds of nodes left in a bad state by this power outage. Our tests have so far covered slightly over one fifth of our 2,100 machines. We expect to continue working to bring all machines online through the following day and will provide updates as we are able to release clusters.

[Initial Post – 10/2/2024 – 12:55pm]

Dear PACE users,

A power outage (related to Georgia Power) impacted Tech Square, including the CODA Datacenter. Due to a secondary failure of the UPS system, all PACE clusters (Phoenix, Hive, Firebird, ICE, and Buzzard) were impacted. Currently, most of the nodes on all clusters are powered off, and the schedulers on all clusters have been paused. The outage started at approximately 11:37 am this morning. At the moment, no new jobs can start, and a large number of jobs that were running when the outage started have been terminated. Access to login nodes and storage remains available due to backup power. We are actively monitoring the situation and will keep you updated on the progress of the restoration of services.

Thank you for your patience,

– The PACE team 

Degraded Phoenix Project Storage Performance

Summary: The metadata servers for the Phoenix project storage file system, /storage/coda1, restarted unexpectedly, and one of them is not responding, leading to degraded performance on the project storage file system.

Details: We have restarted the servers in order to restore access. Testing performance of the file system is ongoing. We will continue to monitor performance and work with the vendor to find the cause.

Impact: We have paused the scheduler for now, so you will not be able to start jobs on Phoenix. We will release the scheduler once we have verified that storage is stable. Access to project storage (/storage/coda1) might have been interrupted for some users. If you are running jobs on Phoenix and using project storage, please verify that your jobs have not run into any issues. Only storage on Phoenix should be affected; storage on Hive, ICE, Buzzard, and Firebird works without issues.

Hive Storage Maintenance

WHAT’S HAPPENING?

One of the storage controllers in use for Hive requires a hard drive replacement to restore the high availability of the device. The activity takes about 2 hours to complete. 

WHEN IS IT HAPPENING?

Tuesday, June 11th, 2024, starting at 10 AM EDT.

WHY IS IT HAPPENING?

The failed drive limits the high availability of the controller.

WHO IS AFFECTED?

Users of the Hive storage system will notice decreased performance since all services will be switched over to a single controller. It is possible that access will be interrupted while the switch happens. 

WHAT DO YOU NEED TO DO?

During the hard drive replacement for the Hive cluster, one of the controllers will be shut down, and the redundant controller will take all the traffic. Data access should be preserved, and we do not expect downtime, but there have been cases in the past where storage has become inaccessible. If storage does become unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage can be accessed again.
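
If you do need to cancel and resubmit a job, the following is a minimal sketch using standard Slurm commands; the job ID and batch script name below are placeholders for your own.

    # Find the ID of the affected job among your queued or running jobs
    squeue -u $USER
    # Cancel the affected job (replace 12345678 with your job ID)
    scancel 12345678
    # Resubmit once storage access is restored (replace my_job.sbatch with your batch script)
    sbatch my_job.sbatch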

WHO SHOULD YOU CONTACT FOR QUESTIONS?

For any questions, please contact PACE at pace-support@oit.gatech.edu.

Firebird Firewall Update

Summary: The firewall protecting access to Firebird needs to be updated to avoid certificate expiration at the end of the month. 

Details: The network team needs to update the code on the firewalls protecting access to Firebird. As the connections are switched over to the High Availability (HA) pair, users might experience disconnections. The upgrade is needed to avoid certificate expiration at the end of the month; it was not done during the last maintenance day due to delays in the release of the production version of the code, and it cannot wait until the next maintenance day.

The update will be completed during tomorrow’s network change window, Thursday, March 14, starting at 8 PM EDT, and finishing no later than 11:59 PM EDT. The upgrade itself will take about 30 minutes to complete within that time frame.

Impact: Access to Firebird head nodes will be impacted. Running batch jobs on the Slurm scheduler will continue without issues, but interactive jobs may be disrupted.

Thank you for your patience as we complete this update. Please contact us at pace-support@oit.gatech.edu with any questions. 

Resolved – Scratch Space Outage on the Phoenix Cluster

[Update 11/6/2023 at 12:26 pm]

Dear Phoenix users,

Summary: The Phoenix cluster is back online. The scheduler is unpaused, and the jobs that were put on hold have now resumed.

Details: The PACE support team has upgraded components of the scratch storage system (controller software, disk firmware) according to the plan provided by the hardware vendor (DDN). We have tested the performance of the file system, and the tests have passed.

Impact: Please continue using the Phoenix cluster as usual. In case of issues, please contact us at pace-support@oit.gatech.edu. Also, please keep in mind that the cluster will be offline tomorrow (November 7) from 8am until 8pm so the PACE team can work on fixing the project storage (which is an unrelated issue). 

Thank you and have a great day!

The PACE Team

[Update 11/6/2023 at 9:27 am]

Dear Phoenix users, 

Summary: Storage performance on Phoenix scratch space is degraded. 

Details: Around 11pm on Saturday (November 4, 2023), the scratch space on the Phoenix cluster became unresponsive. Currently, the scratch space is inaccessible to users. The PACE team is investigating the situation and applying an upgrade recommended by the vendor to improve stability. The PACE team paused the scheduler on Phoenix at 8:13am on Monday, November 6, to prevent additional job failures. The upgrade is estimated to take until 12pm on Monday. After the upgrade is installed, the scheduler will be released, and the paused jobs will resume. This issue is unrelated to the Phoenix project storage slowness reported last week, which will be addressed during the Phoenix outage tomorrow (November 7).

Impact: Users of the Phoenix cluster are currently unable to access scratch storage. Jobs on the Phoenix cluster have been paused, and new jobs will not start until the scheduler is resumed. Other PACE clusters (ICE, Hive, Firebird, Buzzard) are not affected.

We apologize for the multiple issues that have been observed on the Phoenix cluster related to storage access. We are continuing to engage with the storage vendor to improve the performance of our system. The recommended upgrade is in process, and the cluster will be offline tomorrow to address the project filesystem issue. 

Thank you for your patience!

The PACE Team

Degraded Phoenix Project Storage Performance

[Update 11/12/2023 11:15 PM]

The rebuild process completed on Sunday afternoon, and the system has returned to normal performance.

[Update 11/11/2023 6:40 PM]

Unfortunately, the rebuild is still in progress. Another drive has failed, which is slowing down the rebuild. We are continuing to monitor the situation closely.

[Update 11/10/2023 4:30 PM]

Summary: The project storage on Phoenix (/storage/coda1) is degraded due to hard drive failures. Access to the data is not affected; the scheduler continues to accept and process jobs.

Details: Two hard drives that are part of the Phoenix storage space failed on the morning of Friday, November 10, 2023 (the first drive failed at 8:05 am, and the second drive failed at 11:10 am). The operating system automatically activated some spare drives and started rebuilding the pool. During this process, file read and write operations by Phoenix users will take longer than usual. The rebuild is expected to end around 3 am on Saturday, November 11, 2023 (our original estimate of 7pm, Nov 10 2023, was too optimistic).  

Impact: During the rebuilding process, file input/output operations are slower than usual. Access to the Phoenix cluster is not impacted, and the scheduler is processing jobs at a normal rate.

We thank you for your patience as we work to resolve the problem.

[Update 11/8/2023 at 10:00 am]

Summary: Phoenix project storage experienced degraded performance overnight. PACE and our storage vendor made an additional configuration change this morning to restore performance.  

Details: Following yesterday’s upgrade, Phoenix project storage became degraded overnight, though to a lesser extent than prior to the upgrade. Early this morning, the PACE team found that performance was slower than normal and began working with our storage vendor to identify the cause. We adjusted a parameter that handles migration of data between disk pools, and performance was restored.  

Impact: Reading or writing files on the Phoenix project filesystem (coda1) may have been slower than usual last night and this morning. The prior upgrade mitigated the impact, so performance was less severely impacted. Home and scratch directories were not affected. 

Thank you for your patience as we completed this repair.

[Update 11/7/2023 at 2:53 pm]

Dear Phoenix users, 

Summary: The hardware upgrade of the Phoenix cluster storage was completed successfully. The cluster is back in operation. The scheduler is unpaused, and the jobs that were put on hold have now resumed. Globus transfer jobs have also resumed.

Details: In order to fix the issue with the slow response of the project storage, we had to bring the Phoenix cluster offline from 8am until 2:50pm and upgrade several firmware and software libraries on the /storage/coda1 file system. Engineers from the storage vendor worked with us throughout the upgrade and helped us ensure that the storage is operating correctly.

Impact: Storage on the Phoenix cluster can be accessed as usual, and jobs can be submitted. The Globus and Open OnDemand services are working as expected. If you have any issues, please contact us at pace-support@oit.gatech.edu.

Thank you for your patience! 

The PACE team

[Update 11/3/2023 at 3:53 pm]

Dear Phoenix users,  

Summary: Based on the significant impact the storage issues are causing to the community, the Phoenix cluster will be taken offline on Tuesday, November 7, 2023, to fix problems with the project storage. The offline period will start at 8am and is planned to be over at 8pm.   

Details: To implement the fixes to the firmware and software libraries for the storage appliance controllers, we need to pause the Phoenix cluster for 12 hours, starting at 8am. Access to the file system /storage/coda1 will be interrupted while the work is in progress; Globus transfer jobs will also be paused while the fix is implemented. These fixes are expected to help improve the performance of the project storage, which has been below the normal baseline since Monday, October 30.

The date of Tuesday, November 7, 2023, was selected to ensure that an engineer from our storage vendor will be available to assist our team in performing the upgrade tasks and to monitor the health of the storage.

Impact: On November 7, 2023, no new jobs will start on the Phoenix cluster from 8 am until 8 pm. The job queue will be resumed after 8 pm. In case your job fails after the cluster is released at 8 pm, please resubmit it. This only affects the Phoenix cluster; the other PACE clusters (Firebird, Hive, ICE, and Buzzard) will be online and operate as usual. 

Again, we greatly appreciate your patience as we are trying to resolve the issue. Please contact us at pace-support@oit.gatech.edu with any questions!

[Update 11/3/2023 1:53 pm]

Dear Phoenix users,  

Summary: Storage performance on Phoenix coda1 project space continues to be degraded. 

Details: Intermittent performance issues continue on the project file system on the Phoenix cluster, /storage/coda1. This was first observed on the afternoon of Monday, October 30. 

Our storage vendor found versioning issues with firmware and software libraries on the storage appliance controllers that might be causing additional delays with data retransmissions. The mismatch was created when a hardware component was replaced during a scheduled maintenance period; the replacement required the rest of the system to be upgraded to matching versions, but unfortunately that step was omitted from the installation and upgrade instructions.

We continue to work with the vendor to define a proper plan to update all the components and correct this situation. It is possible we will need to pause cluster operations to avoid any issues while the fix is implemented; during this pause, jobs will be put on hold and will resume when the cluster is released. Again, we are working with the vendor to make sure we have all the details before scheduling the implementation. We will provide information on when the fix will be applied and what to expect of cluster performance and operations.

Impact: Simple file operations, including listing the files in a directory, reading from a file, and saving a file, are intermittently taking longer than usual (at its worst, an operation that should take a few milliseconds runs in about 10 seconds). This affects the /storage/coda1 project storage directories, but not scratch storage or any of the other PACE clusters.
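
If you would like to gauge whether your own directories are affected, one simple check is to time a metadata operation in your project space; the path below is a placeholder for your own directory under /storage/coda1.

    # Time a simple listing of your project directory (replace the path with your own)
    time ls -l /storage/coda1/<your-project-directory> > /dev/null

Under normal conditions this should return well under a second; multi-second times are consistent with the degradation described above.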

We greatly appreciate your patience as we are trying to resolve the issue. Please contact us at pace-support@oit.gatech.edu with any questions!  

Thank you and have a great day!  

-The PACE Team 

Phoenix Storage and Scheduler Outage

[Update 10/13/2023 10:35am] 

Dear Phoenix Users,  

The Lustre scratch storage and the Slurm scheduler on Phoenix went down late yesterday evening (starting around 11pm) and are now back up and available. We have run tests to confirm that the Lustre storage and Slurm scheduler are running correctly. We will continue to monitor the storage and scheduler for any other issues.

Preliminary analysis by the storage vendor indicates that the outage was caused by a kernel bug that we thought had previously been addressed. As an immediate fix, we have disabled features on the Lustre storage appliance that should avoid triggering another outage, with a long-term patch planned for our upcoming Maintenance Period (October 24-26).

Existing jobs that were queued have already started or will start soon. You should be able to submit new jobs on the scheduler without issue. Again, we strongly recommend reviewing the output of your jobs that use the Lustre storage (project and scratch directories), as there may be unexpected errors. We will refund any jobs that failed due to the outage.
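
To review which of your jobs ran during the outage window and how they finished, a Slurm accounting query along the following lines may help; the time range shown matches the approximate outage window and can be adjusted as needed.

    # List your jobs active during the outage window, with their final states and exit codes
    sacct -u $USER --starttime 2023-10-12T23:00:00 --endtime 2023-10-13T11:00:00 \
          --format=JobID,JobName,Partition,State,ExitCode,Elapsed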

We apologize for the inconvenience. Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.

Thank you, 

-The PACE Team 

[Update 10/13/2023 9:48am] 

Dear Phoenix Users, 

Unfortunately, the Lustre storage on Phoenix became unresponsive late yesterday evening (starting around 11pm). As a result of the Lustre storage outage, the Slurm scheduler was also impacted and became unresponsive.

We have restarted the Lustre storage appliance, and the file system is now available. The vendor is currently running tests to make sure the Lustre storage is healthy. We will also be running checks on the Slurm scheduler as well.

Jobs currently running will likely continue running, but we strongly recommend reviewing the output of your jobs that use the Lustre storage (project and scratch) as there may be unexpected errors. Jobs waiting in-queue will stay in-queue until the scheduler has resumed. 

We will provide further updates as soon as we complete testing of the Lustre storage and Slurm scheduler.

Thank you, 

-The PACE Team  

Phoenix Storage Cables and Hard Drive Replacement

[Update 9/14/2023 1:02pm]
The cables have been replaced on the Phoenix and Hive storage with no interruption to production.

[Update 9/13/2023 5:54pm]

WHAT’S HAPPENING?

Two cables on Phoenix’s Lustre storage and one cable on Hive’s storage need to be replaced. The cable replacement will take about 2 hours to complete.

WHEN IS IT HAPPENING?

Thursday, September 14th, 2023 starting at 10 AM EDT.

WHY IS IT HAPPENING?

Required maintenance.

WHO IS AFFECTED?

All users may experience a potential storage access outage and subsequent temporarily decreased performance.

WHAT DO YOU NEED TO DO?

During the cable replacement for the Phoenix and Hive clusters, one of the controllers will be shut down and the redundant controller will take all the traffic. Data access should be preserved, but there have been cases where storage has become inaccessible. If storage does become unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.

WHO SHOULD YOU CONTACT FOR QUESTIONS?

For any questions, please contact PACE at pace-support@oit.gatech.edu.

Phoenix Storage Cables and Hard Drive Replacement

WHAT’S HAPPENING?

Two SAS cables and one hard drive for Phoenix’s Lustre storage need to be replaced. The cable and hard drive replacement will take about 2 hours to complete.

WHEN IS IT HAPPENING?

Tuesday, August 22nd, 2023, starting at 10 AM EDT.

WHY IS IT HAPPENING?

Required maintenance.

WHO IS AFFECTED?

All users may experience a potential storage access outage and subsequent temporarily decreased performance.

WHAT DO YOU NEED TO DO?

During the cable replacement, one of the controllers will be shut down and the redundant controller will take all the traffic. Data access should be preserved, but there have been cases where storage has become inaccessible. If storage does become unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.

WHO SHOULD YOU CONTACT FOR QUESTIONS?

For any questions, please contact PACE at pace-support@oit.gatech.edu.

Phoenix Slurm Scheduler Outage

[Update 8/21/23 5:02 PM]

Dear Phoenix Users, 

The Slurm scheduler on Phoenix is back up and available. We have applied the patch that was recommended by SchedMD, the developer of Slurm; cleaned the database; and run tests to confirm that the scheduler is running correctly. We will continue to monitor the scheduler database for any other issues.

Existing jobs that have been queued should have already started or will start soon. You should be able to submit new jobs on the scheduler without issue. We will refund any jobs that failed due to the scheduler outage.

Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.

Thank you,

-The PACE Team 

[Update 8/21/23 3:20 PM]

Dear Phoenix Users, 

We have been working with the Slurm scheduler vendor, SchedMD, to identify and fix a corrupted association in the scheduler database and to provide a patch. While troubleshooting the scheduler this afternoon, some jobs were able to be scheduled. We are going to pause the scheduler again to make sure the database cleanup can be completed without disruption from new jobs.

Based on our estimates, we expect to restore the scheduler later tonight. We will provide an update as soon as the scheduler is released.

Thank you, 

-The PACE Team 

[Update 8/21/23 11:17 AM]

Dear Phoenix Users, 

Unfortunately, the Slurm scheduler controller is down due to issues with Slurm’s database and jobs are not able to be scheduled. We have submitted a high-priority service request to SchedMD, the developer of Slurm, and should be able to provide an update soon. 

Jobs currently running will likely continue to run, but we recommend reviewing their output as there may be unexpected errors. Jobs waiting in-queue will stay in-queue until the scheduler is fixed.

The rest of the Phoenix cluster infrastructure (e.g., login nodes and storage) outside of the scheduler should be working. We recommend not running commands that require interaction with Slurm (e.g., scheduler commands such as ‘sbatch’, ‘srun’, ‘sacct’, or ‘pace-quota’) because they will not work at this time.
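
Once we announce that the scheduler has been restored, a quick way to confirm that it is responding from your session is a controller ping followed by a listing of your own jobs; this is a minimal sketch using standard Slurm commands.

    # Check that the Slurm controller is responding
    scontrol ping
    # List your pending and running jobs once the scheduler is back
    squeue -u $USER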

We will provide further updates as we work on fixing the scheduler.

Thank you, 

-The PACE Team