Phoenix cluster outage

Summary: The Phoenix cluster is currently inaccessible. The status of running jobs cannot be determined at this time. 

Details: Efforts are under way to identify the extent and root cause of the issue.

Impact: Users are unable to access the Phoenix cluster at this time. It is not yet known whether running compute jobs are affected.

Thank you for your patience. Please contact us at pace-support@oit.gatech.edu with any questions. We will continue to investigate and follow up with another status message tomorrow morning.

Phoenix scheduler outage

Summary: The Phoenix scheduler became unresponsive yesterday evening and was restored at approximately 8:50 AM today.

Details: The Torque resource manager on the Phoenix scheduler shut down unexpectedly around 6:00 PM yesterday. The PACE team restarted the scheduler and restored its function around 8:50 AM, and is continuing to engage with the vendor to identify the cause of the crash.

Impact: Commands such as “qsub” and “qstat” were unavailable during the outage, so new jobs could not be submitted, including via Phoenix Open OnDemand. Running jobs were not interrupted.
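
For users who want to confirm that the scheduler is responding before submitting new work, a quick check along the following lines can help (a minimal sketch; my_job.pbs is a placeholder for your own job script):

    # Query the Torque server for its queues; a hang or a "cannot connect"
    # error means the scheduler is still unreachable
    qstat -q

    # Once qstat responds, submission should work again
    qsub my_job.pbs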

Thank you for your patience during this outage. Please contact us at pace-support@oit.gatech.edu with any questions.

[Resolved] Coda Datacenter Cooling Issue

[Update – 02/04/2022 10:24AM]

Dear PACE Researchers,

We are following up to inform you that all PACE clusters have resumed normal operations and are accepting new user jobs. After the cooling loop was restored last night, the datacenter’s operating temperatures returned to normal and have remained stable.

As previously mentioned, this outage should not have impacted any running jobs as PACE had only powered off idle compute nodes, so there is no user action required. Thank you for your patience as we worked through this emergency outage in coordination with Databank. If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

Best,
The PACE Team


[Original Post]

Dear PACE Researchers,

Due to a cooling issue in the Coda datacenter, we were asked to power off as many nodes as possible to control temperature in the research hall. At this time, Databank has recovered the cooling loop, and temperatures have stabilized. However, all PACE job schedulers will remain paused to help expedite the return to normal operating temperatures in the datacenter.

These events should have had no impact on running jobs, so no action is required at this time. We expect normal operation to resume in the morning. As always, if you have any questions, please contact us at pace-support@oit.gatech.edu.

Best,
The PACE Team

[RESOLVED] Phoenix Scratch Outage

Starting around 4 PM Sunday, the Phoenix scratch filesystem became unresponsive, causing issues with access to files and directories stored in ~/scratch. Functionality was restored Monday morning, and at this time, all systems are performing as expected. If you were running jobs that used scratch storage during this outage, they may have been negatively impacted; please reach out to pace-support@oit.gatech.edu with the IDs of any such jobs.
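
If you are unsure which of your jobs overlapped the outage, the scheduler may still have records of them; a minimal sketch, assuming the Torque server retains recently completed jobs (123456 is a placeholder job ID):

    # List your jobs known to the scheduler; recently completed jobs may
    # still appear if the server retains their records
    qstat -u $USER

    # For a specific job, check the recorded exit status and walltime used
    qstat -f 123456 | grep -E "exit_status|resources_used.walltime"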

[UPDATE] shared-scheduler Degraded Performance

7/31/2020 UPDATE

Dear Researchers,

In addition to the previously announced maintenance day activities, we will be migrating the Torque component of shared-sched to a dedicated server to address the recent performance issues. This move should improve the scheduler’s response time to client queries such as qstat, and decrease job submission and start times when compute resources are available. While you do not need to do anything to prepare for this migration, we advise that you make note of any jobs queued at the start of maintenance just in case. As always, please direct any questions or concerns to pace-support@oit.gatech.edu. We thank you for your patience.
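
For those who would like a record of their queued jobs going into the maintenance window, a one-liner such as the following is enough (a minimal sketch; the output filename is arbitrary):

    # Save a snapshot of your current jobs and their states (Q = queued,
    # R = running) before maintenance begins
    qstat -u $USER > my_queued_jobs_$(date +%Y%m%d).txt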

The PACE Team


7/29/2020 UPDATE

Dear Researchers,

At this time the scheduler is functional, although some commands may be slow to respond. We will continue investigating to ascertain the source of these problems, and will update accordingly. Thank you.

[ORIGINAL MESSAGE]

We are aware of a significant slowdown in the performance of the shared-scheduler since last week. Initial attempts to resolve the issue towards the end of the week appeared successful, but the problems have restarted and we are continuing our investigation along with scheduler support. We appreciate your patience as we work to restore full functionality to shared-scheduler.

The PACE Team

[RESOLVED] Rich Data/Project and Scratch Storage Slow Performance

[RESOLVED]:
At this time, the issues experienced earlier have been resolved, and all schedulers have been restored to full functionality. The problems addressed over the course of this fix include:

  • A partial failure of the InfiniBand Subnet Manager, corrected by a manual swap to the backup. This failure caused trouble with the GPFS storage volumes, most notably scratch due to its high activity.
  • Lost access to drives on TrueNAS hpctn1, caused by a jostled SAS cable on a drive that had been replaced as part of a CAB “standard change” of which we were not informed. The cable was reseated to restore connectivity.
  • A missing license file on unit 1a of TrueNAS hpctn1, which was restored.
  • Failed mounts on compute nodes, which were cleaned up after the aforementioned storage volumes were brought back online.

Please be advised that, due to the failed storage access, jobs running during this outage may have failed. Please inspect the results of recently completed jobs to ensure correctness; if an unexplained failure occurred (e.g., a job was terminated for a wallclock violation when previous iterations ran without issue), please resubmit the job. If you have any questions or concerns, please contact pace-support@oit.gatech.edu.
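
One quick way to spot wallclock kills is to scan your job logs for the scheduler’s walltime notice; a minimal sketch, assuming Torque’s usual behavior of appending a “job killed: walltime ... exceeded limit” message to the job’s output/error file (the log path is a placeholder):

    # List job logs that contain the walltime-kill notice
    grep -l "exceeded limit" ~/my_jobs/*.[oe]* 2>/dev/null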

Thank you.

[UPDATE]:
This morning’s storage problems are still ongoing. At this point, we have paused all schedulers for Rich-based resources; with the exception of Hive and Testflight-Coda, no new jobs will be started until the situation is resolved. Additionally, due to the slowness in accessing data and software repositories, running jobs may fail by reaching wallclock limits or encountering other errors. Updates will continue to be posted on the blog as they become available. We appreciate your patience as we work to resolve the underlying cause. Again, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

[Original Post]:
We have identified slow performance in the Rich data/project and scratch storage volumes. Jobs utilizing these volumes may experience problems, so please verify results accordingly. We are actively working to resolve the issue.

PACE License Manager and Server Issues

Overnight, we experienced issues with several of our servers, including our license manager, the GTLib server, and the Testflight and Novazohar queues. We are actively addressing the problem and have restored functionality to the license manager and Novazohar. We are still working on Testflight and will provide updates as they become available. As always, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

[RESOLVED] RHEL7 Dedicated Scheduler Down

[RESOLVED] We have restored functionality to the RHEL7 dedicated scheduler. Thank you for your patience.

[UPDATE] The RHEL7 dedicated scheduler, accessed via login7-d, is again down. We are actively working to resolve the issue at this time, and we will update you when the scheduler is restored. Please follow the same blog post (http://blog.pace.gatech.edu/?p=6715) for updates. If you have any questions, please contact pace-support@oit.gatech.edu.

[RESOLVED] We have rebooted the RHEL7 Dedicated scheduler, and functionality has been restored. Thank you for your patience.

[ORIGINAL MESSAGE] Roughly 30 minutes ago, we identified an issue with the scheduler for dedicated RHEL7 clusters; this scheduler handles all jobs submitted from the dedicated RHEL7 headnode, login7-d. All other schedulers are operating as expected. We are actively working to resolve the problem; in the meantime, you will be unable to submit new jobs or query the status of queued or running jobs.

If you have any questions, please contact pace-support@oit.gatech.edu.

[Resolved] Rich InfiniBand Switch Power Failure

This morning, we discovered a power failure in an InfiniBand switch in the Rich Datacenter that resulted in GPFS mount failures on a number of compute resources. Power was restored at 9:10 AM, and connectivity across the switch has been confirmed. However, prior to the fix, jobs may have experienced problems (including failure to produce results or exiting with errors) due to GPFS access time-outs. Please review the status of any recently run jobs by checking the output/error logs or, if a job is still running, the timestamps of its output files; a sketch of one way to do this follows. If an issue appears (e.g., previously successful code exceeded the wallclock limit with no output, or file creation occurred much later than the start of the job), please resubmit the job.
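
For jobs still running, one way to check for a stall is to compare output-file timestamps against the job’s start time; a minimal sketch (the directory paths are placeholders):

    # Check when output files were last written; a long gap after the job's
    # start time may indicate a stall during the outage window
    ls -lt --time-style=long-iso ~/data/myrun/output/

    # Or list only files modified within the last 2 hours
    find ~/data/myrun/output/ -type f -mmin -120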

As always, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

RESOLVED [Hive and Testflight-CODA Clusters] Connectivity Issue to All CODA Resources

RESOLVED [1:44 PM]:

The network engineers report that they have fixed the issues and are continuing to monitor the situation, although the cause remains unknown. Jobs appear to have continued uninterrupted on the Hive and Testflight-CODA clusters, but we encourage users to verify their results.
https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/5e46cb01fa0e5304bc04ecb5
Any residual issues should be reported to pace-support@oit.gatech.edu. Thank you.

UPDATE [11:33 AM]:

Georgia Tech IT is aware of the situation and is investigating as well.

Original Message:

Around 11:00 AM, we noticed that we could not connect to any resources housed in CODA, including the Hive and Testflight-CODA clusters. The source of the problem is being investigated; in the meantime, access to these resources will be disrupted. Jobs on these clusters should, in principle, continue to run. Further details will be provided as they become available.

As always, if you have any questions or concerns, please contact pace-support@oit.gatech.edu. Thank you.