Outage on Scratch Storage on the Phoenix Cluster

[Update 02/19/24 10:47 AM]

Summary: The Phoenix /storage/scratch1 file system is operational, and performance is stable. The scheduler has been unpaused, current jobs continue to run, and new jobs are being accepted.

Details: The storage vendor provided us with a hot fix late Friday evening, which was installed this morning on the Lustre appliance supporting /storage/scratch1. Performance testing of the scratch file system after the upgrade was stable. We are releasing the cluster and resuming the Slurm scheduler. The Open OnDemand services are back to normal.

The cost of all jobs running between 6 PM on Wednesday, February 14, and 10 AM on Monday, February 19, will be refunded to the PIs' accounts.

During the weekend, an automatic process accidentally resumed the scheduler, and some jobs started to run. If you have a job that ran during the outage and used scratch, please consider re-running it from the beginning: if the job ran before the hot fix was applied, some of its processes may have failed while trying to access the scratch file system. The cost of the jobs that were accidentally restarted during the outage will be refunded.

Impact: The storage on the Phoenix cluster can be accessed as usual, and jobs can be submitted. The Globus and Open OnDemand services are working as expected. In case you have any issues, please contact us at pace-support@oit.gatech.edu.   

Thank you for your patience! 

[Update 02/16/24 05:58 PM]

PACE has decided to leave the Slurm scheduler paused, and no jobs will be accepted over the weekend. We will allow jobs that are currently running to continue, but those utilizing scratch may fail.

While keeping job scheduling paused over the weekend was a difficult call, we want to ensure that the issues with scratch storage will not impact the integrity of other components on Phoenix.

We are not confident that functionality can be restored without further input from the storage vendor. As part of continuing the diagnostic process, we expect we will have no other option but to reboot the scratch storage system on Monday morning. As a result, any jobs still running at that point that utilize scratch storage will likely fail. We have continued to provide diagnostic data that the vendor will analyze during the weekend. We plan to provide an update on the state of the scratch storage by next Monday (2/19) at noon.

We will refund all jobs that ran from the start of the outage at 6:00 PM on Wednesday evening until performance is restored.

Monthly deletion of old files in scratch, scheduled for Tuesday, February 20, has been canceled. All researchers who have received notifications for February will be given a one-month extension automatically. 

Finally, while you cannot schedule jobs, you may be able to log on to Phoenix to view or copy files. However, please be aware that you may see long delays with simple commands (ls/cd), creating new files/folders, and editing existing files. We recommend avoiding file-listing commands such as "ls" on your home (~) or scratch (~/scratch) directories, as they may cause your command prompt to stall.
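
As an illustration, here is a minimal shell sketch of ways to reach files without listing the affected directories; the subdirectory and file names below are hypothetical:

    # Avoid a plain "ls" of ~ or ~/scratch; work with explicit paths instead.
    cd ~/data/run42                     # changing into a known subdirectory does not list the parent
    ls -d ~/data                        # -d prints the directory entry itself, not its contents
    cp ~/data/run42/output.log /tmp/    # copy a known file by its full path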

You may follow updates to this incident on the GT OIT Status page.  

We recognize the negative impact this storage disruption has on the research community, especially given that some of you may have research deadlines. Thank you for your patience as we continue working to fully restore scratch storage system performance. If you have additional concerns, please email ART Executive Director Didier Contis directly at didier.contis@gatech.edu.

[Update 02/16/24 02:59 PM]

Unfortunately, the scratch storage on the Phoenix cluster remains unstable. You may see long delays with simple commands (ls/cd), creating new files/folders, and editing existing files. Jobs that are currently running from scratch might be experiencing delays. We are continuing to work on resolving the issue and are in close communication with the storage vendor. The scheduler remains paused, and no new jobs are being accepted. We will provide an update on the state of the scratch storage by this evening. We sincerely apologize for the inconvenience this outage is causing.

Thank you for your patience.

[Update 02/16/24 09:15 AM]

Summary: The Phoenix /storage/scratch1 file system continues to have issues for some users. The recommended procedure is to fail over the storage services to the high-availability pair and reboot the affected component. This will require pausing the Phoenix scheduler.

Details: After analyzing the storage logs, the vendor recommended that the affected component be rebooted, moving all services and connections to the high-availability pair. While the device restarts, the Phoenix scheduler will be paused. Running jobs will see a momentary pause accessing the /storage/scratch1 file system while the connections are moved to the redundant device. Once the primary device is up and running and all the errors have cleared, the services will be switched back, and job scheduling will resume.

We will start this procedure at 10:00 AM EST. Please wait for the all-clear message before starting additional jobs on the Phoenix cluster.

Impact: Jobs on Phoenix will be paused during the appliance restart procedure; running jobs should continue with some delays while the connections are switched over. There is no impact to the Hive, ICE, Firebird, or Buzzard clusters. You may follow updates to this incident on the GT OIT Status page. 

Thank you for your patience as we investigate this issue. Please contact us at pace-support@oit.gatech.edu with any questions. 

[Update 02/15/24 04:56 PM]

Summary: Phoenix /storage/scratch1 file system is now stable for most users. A small number of users are still experiencing issues. 

Details: While we continue working with the vendor to get to the root cause of the issue, all diagnostic tests executed throughout the day have been successful. However, a small number of users with jobs running from their scratch folders continue to notice slowness accessing their files.

Please inform us if you are seeing degraded performance on our file systems. As mentioned, we are continuing our efforts to find a permanent solution.

Impact: Access to /storage/scratch1 is normal for the majority of users; please let us know if you are still experiencing issues by emailing us at pace-support@oit.gatech.edu. OnDemand-Phoenix and the scheduler are working fine. There is no impact to the Hive, ICE, Firebird, or Buzzard clusters. You may follow updates to this incident on the GT OIT Status page. 

Thank you for your patience as we investigate this issue. Please contact us at pace-support@oit.gatech.edu with any questions. 

[Update 02/15/24 11:07 AM]

Summary: Phoenix /storage/scratch1 file system has intermittent issues. Jobs running from the scratch storage might be stuck. 

Details: Around 5:00 PM yesterday (February 14, 2024), the Lustre filesystem hosting /storage/scratch1 on the Phoenix cluster became inaccessible. We restarted the services at 8 AM today (February 15, 2024), but some accessibility issues remain. The PACE team is investigating the cause, and the storage vendor has been contacted. This may cause delays and timeouts on interactive sessions and running jobs.

Impact: Access to /storage/scratch1 might be interrupted for some users. Running ‘ls’ on Phoenix home directories may hang as it attempts to resolve the symbolic link to the scratch directory. OnDemand-Phoenix was also affected; as of this writing, it is stable, and we continue to monitor it. Jobs using /storage/scratch1 may be stuck. The output of the `pace-quota` command might hang as scratch utilization is checked and might show an incorrect balance. There is no impact to the Hive, ICE, Firebird, or Buzzard clusters. You may follow updates to this incident on the GT OIT Status page.

Thank you for your patience as we investigate this issue. Please contact us at pace-support@oit.gatech.edu with any questions.

Phoenix Scheduler Outage

Summary: The Slurm scheduler on Phoenix is experiencing an intermittent outage.

Details: The scheduler is repeatedly freezing due to a problematic input. The PACE team has identified the likely cause and is attempting to restore functionality.

Impact: Commands like squeue and sinfo may report errors, and new jobs may not start on Phoenix. Already-running jobs are not impacted. Other clusters (Hive, ICE, Firebird, Buzzard) are not impacted.

Thank you for your patience as we work to restore Phoenix to full functionality. Please contact us at pace-support@oit.gatech.edu with any questions. You may track the status of this outage on the GT Status page.

NetApp Storage Outage

[Update 1/18/24 6:30 PM]

Access to storage has been restored, and all systems have full functionality. The Phoenix and ICE schedulers have been resumed, and queued jobs will now start.

Please resubmit any jobs that may have failed. If a running job is no longer progressing, please cancel and resubmit.
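
For reference, a minimal Slurm sketch for checking on and resubmitting work; the job ID and script name below are placeholders:

    # List your jobs and spot any that are stuck or no longer progressing.
    squeue -u $USER
    # Cancel a job that is no longer making progress (12345 is a placeholder job ID).
    scancel 12345
    # Resubmit the batch script now that storage access has been restored.
    sbatch my_job.sbatch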

The cause of the outage was identified as an update made this afternoon to resolve a specific permissions issue affecting some users on the ICE shared directories. The update has been reverted.

Thank you for your patience as we resolved this issue.

[Original Post 1/18/24 5:20 PM]

Summary: An outage on PACE NetApp storage devices is affecting the Phoenix and ICE clusters. Home directories and software are not accessible.

Details: At approximately 5:00 PM, an issue began affecting access to NetApp storage devices on PACE. The PACE team is investigating at this time.

Impact: All storage devices provided by NetApp services are currently unreachable. This includes home directories on Phoenix and ICE, the pace-apps software repository on Phoenix and ICE, and course shared directories on ICE. Users may encounter errors upon login due to inaccessible home directories. We have paused the schedulers on Phoenix and ICE, so no new jobs will start. The Hive and Firebird clusters are not affected.

Please contact us at pace-support@oit.gatech.edu with any questions.

All PACE Clusters Down Due to Cooling Failure

[Update 8/27/23 11:10 AM]

All PACE clusters have returned to service.

The datacenter cooling pump was replaced early this morning. After powering on compute nodes and testing, PACE resumed jobs on all clusters. On clusters that charge for use (Phoenix and Firebird), jobs that were cancelled yesterday evening when compute nodes were turned off will be refunded. Please submit new jobs to resume your work.

Thank you for your patience during this emergency repair.

[Original Post 8/26/23 9:40 PM]

Summary: A pump in the Coda Datacenter cooling system overheated on Saturday evening. All PACE compute nodes across all clusters (Phoenix, Hive, Firebird, Buzzard, and ICE) have been shut down until cooling is restored, stopping all compute jobs.

Details: Databank is reporting an issue with the high-temperature condenser pump in the Research Hall of the Coda datacenter, which hosts PACE compute nodes. The Research Hall is being powered off so that Databank facilities can replace the pump.

Impact: All PACE compute nodes are unavailable. Running jobs have been cancelled, and no new jobs can start. Login nodes and storage systems remain available. Compute nodes will remain off until the cooling system is repaired.

Phoenix Scratch Storage Outage

[Update 8/7/23 9:34 PM]

Access to Phoenix scratch began experiencing issues again at 10:19 PM last night (Sunday). We paused the scheduler and restarted the controller around 6 AM this morning (Monday).

Access to Phoenix scratch has been restored, and the scheduler has resumed allowing new jobs to begin. Jobs that failed due to the scratch outage, which began at 10:19 PM Sunday and ended this morning at 9:24 AM Monday, will be refunded. We continue to work with our storage vendor to identify what caused the controller to freeze.

Thank you for your patience as we restored scratch storage access today. Please contact us at pace-support@oit.gatech.edu with any questions.

[Update 8/6/23 2:25 PM]

Access to Phoenix scratch has been restored, and the scheduler has resumed allowing new jobs to begin. Jobs that failed due to the scratch outage, which began at 9:30 PM Saturday, will be refunded. We continue to work with our storage vendor to identify what caused the controller to freeze.

Thank you for your patience as we restored scratch storage access today. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Post 8/6/23 1:30 PM]

Summary: Phoenix scratch storage is currently unavailable, which may impact access to directories on other Phoenix storage systems. The Phoenix scheduler is paused, so no new jobs can start.

Details: A storage target controller on the Phoenix scratch system became unresponsive just before midnight on Saturday evening. The Phoenix scheduler crashed shortly before 7 AM Sunday morning due to the number of failures to reach scratch directories. PACE restarted the scheduler around 1 PM today (Sunday), restoring access, while also pausing it to prevent new jobs from starting.

Impact: The network scratch filesystem on Phoenix is inaccessible. Due to the symbolic link to scratch, an ls of Phoenix home directories may also hang. Access via Globus may also time out. Individual directories on the home storage device may be reachable if an ls of the main home directory is not performed. Scheduler commands, such as squeue, were not available this morning but have now been restored. As the scheduler is paused, any new jobs submitted will not start at this time. There is no impact to project storage.

Thank you for your patience as we investigate this issue. Please contact us at pace-support@oit.gatech.edu with any questions.

Phoenix Project Storage & Login Node Outage

[ Update 7/18/2023 4:00 PM]

Summary: Phoenix project storage performance is degraded as redundant drives rebuild. The process may continue for several more days. Scratch storage is not impacted, so tasks may proceed more quickly if run on the scratch filesystem.

Details: During and after the storage outage last week, several redundant drives on the Phoenix project storage filesystem failed. The system is rebuilding the redundant array across additional disks, which is expected to take several more days. Researchers may wish to copy necessary files to their scratch directories or to local disk and run jobs from there for faster performance. In addition, we continue working with our storage vendor to identify the cause of last week's outage.
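
For example, a minimal sketch of staging inputs to scratch and submitting from there; the project path and script name are hypothetical:

    # Stage input data from the degraded project filesystem onto scratch.
    mkdir -p ~/scratch/myrun
    cp -r ~/p-mypi-0/inputs ~/scratch/myrun/
    # Submit the job with the scratch copy as its working directory.
    cd ~/scratch/myrun
    sbatch my_job.sbatch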

Impact: Phoenix project storage performance is degraded for both read & write, which may continue for several days. Home and scratch storage are not impacted. All data on project storage is accessible.

Thank you for your patience as the process continues. Please contact us at pace-support@oit.gatech.edu with any questions.

[Update 7/13/2023 2:57 PM]

Phoenix’s head nodes, which were unresponsive earlier this morning, have been rebooted and are available again without issue. We will continue to monitor the login nodes for any other issues.

Regarding the failed redundant drives, we have replaced the control cables and reseated a few hard drives. We have run tests to confirm that the storage is running correctly.

You should be able to start jobs on the scheduler without issue. We will refund any job that failed after 8:00 AM due to the outage.

[Update 7/13/2023 12:20 PM]

Failed redundant drives led an object storage target to become unreachable. We are working to replace controller cables to restore access.

[Original Post 7/13/2023 10:20 AM]

Summary: The Phoenix project storage filesystem became unresponsive this morning. Researchers may be unable to access data in their project storage. We have paused the scheduler, preventing new jobs from starting, while we investigate. Multiple Phoenix login nodes have also become unresponsive, which may have prevented logins.

Details: The PACE team is currently investigating an outage on the Lustre project storage filesystem for Phoenix. The cause is not yet known. We have also rebooted several Phoenix login nodes that had become unresponsive to restore ssh access.

Impact: The project storage filesystem may not be reachable at this time, so read, write, or ls attempts on project storage may fail, including via Globus. Job scheduling is now paused, so jobs can be submitted, but no new jobs will start. Jobs that were already running will continue, though those on project storage may not progress. Some login attempts this morning may have hung.

Thank you for your patience as we continue investigating this issue. Please contact us at pace-support@oit.gatech.edu with any questions.

Phoenix filesystem intermittent slowness

Summary: The Phoenix filesystem's response has been inconsistent starting today. We are noticing high utilization on all of the head nodes.

Details: File access is intermittently slow on home storage, project storage, and scratch. Executing any command such as ‘ls’ on the head nodes can be slow to respond. Slowness in file access was first detected by a couple of users around 3 PM yesterday, and we have received more reports this afternoon. The PACE team is actively working to identify the root cause and resolve the issue as soon as possible.

Impact: Users may continue to experience intermittent slowness when using the head nodes, submitting jobs, compiling code, using interactive sessions, and reading/writing files.

Thank you for your patience. Please contact us at pace-support@oit.gatech.edu with any questions. We will continue to monitor performance and follow up with another status message tomorrow morning.

[Update 06/08/2023]

Phoenix home, project storage and scratch are all fully functional. The filesystem performance has been normal for the last 12 hours. We will continue our investigation on the root cause and continue to monitor the performance.

As of now, the utilization on our servers has stabilized. The issue has not impacted any jobs running or waiting in queue. Users can resume using Phoenix as usual.

For questions, please contact PACE at pace-support@oit.gatech.edu.

PACE Maintenance Period, May 9-11, 2023

[Update 5/11/23]

The Phoenix, Hive, Firebird, and Buzzard clusters are now ready for research. We have released all jobs that were held by the scheduler. 

The ICE instructional cluster remains under maintenance until tomorrow. Summer instructors will be notified when the upgraded ICE is ready for use.

The next maintenance period for all PACE clusters is August 8, 2023, at 6:00 AM through August 10, 2023, at 11:59 PM. An additional maintenance period for 2023 is tentatively scheduled for October 24-26, 2023 (note revised date).  

Status of activities:

  • [Complete][Login nodes] Implement enforcement of usage limits on the login nodes, limiting each individual to 1 CPU, 4 GB memory, and 5000 open files. These limits should reduce the possibility of one individual’s processes causing a login node outage. Researchers are reminded to use interactive jobs for resource-intensive activities via OnDemand Interactive Shell or the command line (Phoenix, Hive, and Firebird instructions); see the command-line sketch after this list.
  • [In progress][ICE] The instructional cluster will be migrated to the Slurm scheduler; new Lustre-based scratch storage will be added; and home directories will be migrated. PACE-ICE and COC-ICE will be merged. Additional information will be available for instructors on ICE.
  • [Complete][Phoenix Storage] Phoenix scratch will be migrated to a new Lustre device, which will result in fully independent project & scratch filesystems. Researchers will find their scratch data remains accessible at the same path via symbolic link or directly via the same mount location.
  • [Complete][Datacenter] Connect new cooling doors to power for datacenter expansion
  • [Complete][Datacenter] High-temperature loop pump maintenance
  • [Complete][Storage] Replace cables on Hive and Phoenix parallel filesystems
  • [Complete][Network] Upgrade ethernet switch code in Enterprise Hall
  • [Complete][Network] Configure virtual pair between ethernet switches in Research Hall
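
As an illustration of the command-line route mentioned in the login-node item above, here is a minimal Slurm sketch for starting an interactive job; the charge account, node count, memory, and walltime are placeholders, so please consult the Phoenix, Hive, and Firebird instructions for the exact options on each cluster:

    # Request an interactive allocation on a compute node rather than working on the login node.
    salloc -A gts-mypi -N1 --ntasks-per-node=4 --mem=16G -t 2:00:00
    # Once the allocation is granted, open a shell on the allocated node.
    srun --pty bash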

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[Update 5/2/23]

This is a reminder that the next PACE Maintenance Period starts at 6:00AM on Tuesday, 05/09/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 05/11/2023.

Maintenance on the ICE instructional cluster is expected to continue through Friday, 05/12/2023.

Updated planned activities:

WHAT IS HAPPENING?  

ITEMS NOT REQUIRING USER ACTION: 

  • [Login nodes] Implement enforcement of usage limits on the login nodes, limiting each individual to 1 CPU, 4 GB memory, and 5000 open files. These limits should reduce the possibility of one individual’s processes causing a login node outage. Researchers are reminded to use interactive jobs for resource-intensive activities via OnDemand Interactive Shell or the command line (Phoenix, Hive, and Firebird instructions).
  • [ICE] The instructional cluster will be migrated to the Slurm scheduler; new Lustre-based scratch storage will be added; and home directories will be migrated. PACE-ICE and COC-ICE will be merged. Additional information will be available for instructors on ICE.
  • [Phoenix Storage] Phoenix scratch will be migrated to a new Lustre device, which will result in fully independent project & scratch filesystems. Researchers will find their scratch data remains accessible at the same path via symbolic link or directly via the same mount location.
  • [Datacenter] Connect new cooling doors to power for datacenter expansion
  • [Datacenter] High-temperature loop pump maintenance
  • [Storage] Replace cables on Hive and Phoenix parallel filesystems
  • [Network] Upgrade ethernet switch code in Enterprise Hall
  • [Network] Configure virtual pair between ethernet switches in Research Hall

[Original Announcement 4/24/23]

WHEN IS IT HAPPENING?
The next PACE Maintenance Period starts at 6:00AM on Tuesday, 05/09/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 05/11/2023.

Maintenance on the ICE instructional cluster is expected to continue through Friday, 05/12/2023.

WHAT DO YOU NEED TO DO?
As usual, jobs with resource requests that would be running during the Maintenance Period will be held until after the maintenance by the scheduler. During the Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.
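
For example, a minimal sketch of how the requested walltime interacts with the maintenance window; the script name and walltimes are illustrative:

    # A job whose requested walltime would overlap the maintenance window is held until afterwards.
    sbatch --time=120:00:00 my_job.sbatch
    # A job short enough to complete before 6:00 AM on 05/09 can still be scheduled beforehand.
    sbatch --time=8:00:00 my_job.sbatch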

WHAT IS HAPPENING?  

ITEMS NOT REQUIRING USER ACTION: 

  • [Login nodes] Implement enforcement of usage limits on the login nodes, limiting each individual to 1 CPU, 4 GB memory, and 5000 open files. These limits should reduce the possibility of one individual’s processes causing a login node outage. Researchers are reminded to use interactive jobs for resource-intensive activities via OnDemand Interactive Shell or the command line (Phoenix, Hive, and Firebird instructions).
  • [ICE] The instructional cluster will be migrated to the Slurm scheduler; new Lustre-based scratch storage will be added; and home directories will be migrated. PACE-ICE and COC-ICE will be merged. Additional information will be available for instructors on ICE.
  • [Datacenter] Connect new cooling doors to power for datacenter expansion
  • [Datacenter] High-temperature loop pump maintenance
  • [Storage] Replace Input/Output Modules on two storage devices

WHY IS IT HAPPENING? 
Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system. Future maintenance dates may be found on our homepage.

WHO IS AFFECTED?
All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Phoenix Scratch Storage & Scheduler Outages

[Update 4/3/23 5:30 PM]

Phoenix’s scratch storage & scheduler are again fully functional.

The scratch storage system was repaired by 3 PM. We rebooted one of the storage servers, with the redundant controllers taking over the load, and brought it back online to restore responsiveness.

The scheduler outage was caused by a number of communication timeouts, later exacerbated by stuck jobs on scratch storage. After processing the backlog, the scheduler began allowing jobs to start around 4:20 PM this afternoon. We have been monitoring it since then. At this time, due to high utilization, the Phoenix CPU nodes are nearly completely occupied.

We will refund any job that failed after 10:30 AM today due to the outage.

Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.

[Original Post 4/3/23 2:30 PM]

Summary: Scratch storage is currently inaccessible on Phoenix. In addition, jobs are not able to start. The login nodes experienced high load earlier today, rendering them non-responsive, which was resolved through a reboot.

Details: Phoenix is currently experiencing multiple issues, and the PACE team is investigating. The scratch storage system is inaccessible, as the Lustre service has been timing out since approximately 11:30 AM today. The scheduler is also failing to launch jobs, an issue that began by 10:30 AM today. Finally, we experienced high load on all four Phoenix login nodes around 1:00 PM today. The login nodes were repaired through a reboot. All issues, including any potential root cause, are being investigated by the PACE team today.

Impact: Researchers on login nodes may have been disconnected during the reboots required to restore functionality. Scratch storage is unreachable at this time. Home and project storage are not impacted, and already-running jobs on these directories should continue. Those jobs running in scratch storage may not be working. New jobs are not launching and will remain in queue.

Thank you for your patience as we investigate these issues and restore Phoenix to full functionality. For questions, please contact PACE at pace-support@oit.gatech.edu.

Phoenix project storage outage

[Updated 2023/03/17 3:30 PM]

Phoenix project storage is again available, and we have resumed the scheduler, allowing new jobs to begin. Queued jobs will begin as resources are available.

The storage issue arose when one metadata server rebooted shortly after 1:00 PM yesterday, and the high-availability configuration automatically switched to the secondary server, which became overloaded. After extensive investigation yesterday evening and today, in collaboration with our storage vendor, we identified and stopped a specific series of jobs that were heavily taxing storage and also replaced several cables, fully restoring Phoenix project storage availability.

Jobs that were running as of 1:00 PM yesterday and have failed (or will fail) due to the project storage outage will be refunded to the charge account provided. Please resubmit these failed jobs to Slurm to continue research.

Thank you for your patience as we repaired project storage. Please contact us with any questions.

[Updated 2023/03/16, 11:55PM ET]

We're still experiencing significant slowness of the filesystem. We're going to keep job scheduling paused tonight, and the PACE team will resume troubleshooting as early as possible in the morning.

[Updated 2023/03/16, 6:50PM ET]

Troubleshooting continues with the vendor's assistance. The file system is currently stable, but one of the metadata servers continues to show an abnormal workload. We are working to resolve this issue to avoid additional file system failures.

[Original post 2023/03/16, 2:48PM ET]

Summary: Phoenix project storage is currently unavailable. The scheduler is paused, preventing any additional jobs from starting until the issue is resolved.

Details: A metadata server (MDS) for the Phoenix Lustre parallel filesystem for project storage has encountered errors and rebooted. The PACE team is investigating at this time and working to restore project storage availability.

Impact: Project storage is slow or unreachable at this time. Home and scratch storage are not impacted, and already-running jobs on these directories should continue. Those jobs running in project storage may not be working. To avoid further job failures, we have paused the scheduler, so no new jobs will start on Phoenix, regardless of the storage used.

Thank you for your patience as we investigate this issue and restore Phoenix storage to full functionality.

For questions, please contact PACE at pace-support@oit.gatech.edu.