mweiner3 – Page 4 – Partnership for an Advanced Computing Environment

Action Required: Globus Certificate Authority Update

Globus is updating the Certificate Authority (CA) used for its transfer service, and action is required to continue using existing Globus endpoints. PACE updated the Phoenix, Hive, and Vapor server endpoints during the recent maintenance period. To continue using Globus Connect Personal to transfer files to/from your own computers, please update your Globus client to version 3.2.0 by December 12, 2022. Full details are available on the Globus website. This update is required to continue transferring data between your local computer and PACE or other computing sites.

Please contact us at pace-support@oit.gatech.edu with any questions.

Firebird inaccessible

[Update 10/3/22 10:45 AM]

Access to Firebird and the PACE VPN has been restored, and all systems should be functioning normally. If you do not see the PACE VPN as an option in the GlobalProtect client, please disconnect from the GT VPN and reconnect for it to appear again.

Urgent maintenance on the GlobalProtect VPN device on Thursday night inadvertently led to the loss of PACE VPN access, which was restored this morning.

Please contact us at pace-support@oit.gatech.edu with questions, or if you are still unable to access Firebird.

[Original Message 10/3/22 9:40 AM]

Summary: The Firebird cluster and PACE VPN are currently inaccessible. OIT is working to restore access.

Details: The Firebird cluster was found to be inaccessible over the weekend. PACE is working with OIT colleagues to identify the cause and restore access.

Impact: Researchers are unable to connect to the PACE VPN or access the Firebird cluster.

Thank you for your patience as we work to restore access. Please contact us at pace-support@oit.gatech.edu with questions.

Hive scheduler outage

Summary: The Hive scheduler became nonresponsive yesterday evening and was restored at approximately 8:30 AM today.

Details: The Torque resource manager on the Hive scheduler stopped responding around 7:00 PM yesterday. The PACE team restored its function around 8:30 AM this morning and is continuing to monitor its status. The scheduler was fully functional for some time after the system utility repair yesterday afternoon, and it is not clear whether the two issues are connected.

Impact: Commands such as “qsub” and “qstat” would not have worked, so new jobs could not be submitted, including via Hive Open OnDemand. Running jobs were not interrupted.
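For researchers who script their job submissions, a quick check that the scheduler is answering can save a stalled workflow. The following is a minimal sketch, not a PACE-provided tool: the function name `scheduler_responds` and the 10-second timeout are assumptions chosen for illustration. It simply runs a command such as “qstat” with a time limit and reports whether it returned successfully.

```python
import subprocess


def scheduler_responds(command=("qstat",), timeout_s=10.0):
    """Return True if `command` exits successfully within `timeout_s` seconds.

    Hypothetical helper for illustration only. A hang, a missing
    executable, or a non-zero exit status all count as "not responding".
    """
    try:
        result = subprocess.run(
            list(command),
            capture_output=True,  # keep the command's output off the terminal
            timeout=timeout_s,    # give up if the command hangs
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if scheduler_responds():
        print("Scheduler answered; safe to submit.")
    else:
        print("Scheduler did not answer; try again later.")
```

A script could call this before “qsub” and retry later instead of hanging while the scheduler is unavailable.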

Thank you for your patience last night. Please contact us at pace-support@oit.gatech.edu with any questions.

[Resolved] Utility error prevented new jobs starting on Hive, Phoenix, PACE-ICE, and COC-ICE

(updated to reflect that Hive was impacted as well)

Summary: An error in a system utility resulted in the Hive, Phoenix, PACE-ICE, and COC-ICE clusters temporarily not launching new jobs. It has been repaired, and jobs have resumed launching.

Details: An unintended update to the system utility that checks the health of compute nodes resulted in all Hive, Phoenix, PACE-ICE, and COC-ICE compute nodes being recorded as down shortly before 4:00 PM today, even though there was in fact no issue with them. The scheduler will not launch new jobs on nodes marked down. After correcting the issue, all nodes are again correctly reporting their status, and jobs have resumed launching on all four clusters as of 6:30 PM.

Impact: Because all nodes appeared down, no new jobs could launch; submitted jobs instead remained in the queue. Running jobs were not impacted. Interactive jobs waiting to start might have been cancelled, in which case the researcher should re-submit.

Thank you for your patience this afternoon. Please contact us at pace-support@oit.gatech.edu with any questions.

Phoenix scheduler outage

Summary: The Phoenix scheduler became nonresponsive this afternoon and was restored at approximately 4:50 PM today.

Details: The Torque resource manager on the Phoenix scheduler became overloaded, likely around 2:45 PM. The PACE team restarted the scheduler and restored its function around 4:50 PM.

Impact: Commands such as “qsub” and “qstat” would not have worked, so new jobs could not be submitted, including via Phoenix Open OnDemand. Running jobs were not interrupted.

Thank you for your patience this afternoon. Please contact us at pace-support@oit.gatech.edu with any questions.

Campus ESX Incident Impacting PACE services

[Update 6/28/22 2:00 PM]

The ESX host issue is resolved, and all PACE services are fully restored. Please contact pace-support@oit.gatech.edu with any questions, or if you encounter further issues.

[Original Post 6/28/22 12:55 PM]

Summary: An issue with an ESX host is affecting multiple campus services, including several PACE services. Open OnDemand and some PACE utilities are currently unavailable. OIT is working to resolve the issue.

Details: The ESX issue affects campus virtual machines, hosting both PACE and other services. Visit https://status.gatech.edu for details.

Impact:

– Open OnDemand websites for all PACE clusters may not load.

– Some PACE utilities may hang, including pace-quota, pace-whoami, and pace-check-queue.

– There may be intermittent unavailability of software licenses.

Thank you for your patience as OIT works to resolve this outage. Please contact us at pace-support@oit.gatech.edu with any questions about the impacted PACE services.

Hive scheduler degraded state

[Update 6/3/22 4:55 PM]

After the full restart of scheduler services across Hive this afternoon, we have returned to full production status on the cluster. Thank you for your patience this week as we investigated the issue. Please contact us at pace-support@oit.gatech.edu with any questions.

[Update 6/3/22 2:25 PM]

The PACE team is continuing to investigate the partial disruption of the Hive scheduler. We are currently performing a full restart of all scheduler services across the Hive cluster. While this cluster-wide service restart is in progress this afternoon, it is not possible to submit, start, or check the status of any jobs on Hive. Commands such as qsub, qstat, and showq are unavailable. Running jobs are not impacted.

We appreciate your patience during this process. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Message 5/31/22 5:30 PM]

Summary: The Hive scheduler is currently in a degraded state, and many waiting jobs will not start.

Details: The Torque resource manager and the Moab workload manager, the two components of the Hive scheduler, are currently reporting conflicting information about resources allocated to running jobs. This causes failed attempts to schedule waiting jobs on resources that are already allocated, which prevents the jobs from starting. The PACE team is actively investigating this situation and working to resolve it.
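The kind of disagreement described above can be illustrated with a small sketch. It does not reflect how Torque or Moab work internally; the function name, the node names, and the core counts are all hypothetical. It compares two views of per-node core allocation and lists the nodes where they differ, which are the nodes where a scheduling attempt based on one view could fail against the other.

```python
def find_allocation_conflicts(resource_manager_view, workload_manager_view):
    """Return {node: (rm_cores, wm_cores)} for nodes where the two views differ.

    Both arguments map a node name to the number of cores each component
    believes is allocated. A node missing from a view counts as 0 cores.
    All names and numbers here are hypothetical, for illustration only.
    """
    conflicts = {}
    all_nodes = set(resource_manager_view) | set(workload_manager_view)
    for node in sorted(all_nodes):
        rm_cores = resource_manager_view.get(node, 0)
        wm_cores = workload_manager_view.get(node, 0)
        if rm_cores != wm_cores:
            conflicts[node] = (rm_cores, wm_cores)
    return conflicts


# Hypothetical example: the resource manager sees node-b as fully
# allocated, while the workload manager believes it is idle.
rm_view = {"node-a": 24, "node-b": 24, "node-c": 0}
wm_view = {"node-a": 24, "node-b": 0, "node-c": 0}
print(find_allocation_conflicts(rm_view, wm_view))
```

In this made-up case, a waiting job placed on node-b according to the second view would fail to start, because the first view reports the node's cores as already in use.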

Impact: Some queued jobs, especially those requesting a larger number of resources, may remain in the queue even though resources may appear to be available via tools such as pace-check-queue. Interactive jobs may be cancelled by the scheduler while waiting to start. Running jobs are not impacted.

Please contact us at pace-support@oit.gatech.edu with any questions.

Hive scheduler outage

Summary: The Hive scheduler stopped launching new jobs on Monday afternoon and was restored at approximately 10:00 AM on Tuesday.

Details: At approximately 12:35 PM on Monday, during the Memorial Day holiday, the Torque resource manager on Hive became nonresponsive due to an error. The PACE team restarted the scheduler and restored its function at 10:00 this morning.

Impact: Commands such as “qsub” and “qstat” would not have worked, so new jobs could not be submitted. Running jobs were not interrupted. Moab commands such as “showq” were not impacted.

Thank you for your patience during the holiday weekend. Please contact us at pace-support@oit.gatech.edu with any questions.

[Resolved] Phoenix scheduler timeout

Summary: A timeout on the Phoenix scheduler prevented new jobs from beginning earlier today.

Details: A setting caused a timeout issue in the communication between the Torque and Moab portions of the Phoenix scheduler this morning, beginning at 10:20 AM. The PACE team restored communication between the services before 12:20 PM today.

Impact: During the intervening period, no new jobs could start. Running jobs were not interrupted, and job submission remained functional: commands such as “qsub” and “qstat” continued to work.

Thank you for your patience this morning. Please contact us at pace-support@oit.gatech.edu with any questions.

PACE Firebird Login Node Outages

[Update 4/27/22 5:45 PM]

The remaining headnode has been repaired, and service is restored. Thank you for your patience.

[Original Post 4/27/22 5:20 PM]

Summary: A storage server issue made headnodes for two projects on Firebird inaccessible. One has been recovered, while repairs are in progress on the second one.

Details: The storage server housing two Firebird projects had an NFS issue earlier today, which impacted the projects’ login nodes. The PACE team has repaired one project’s login node and is currently repairing the second, which has a more complex issue.

Impact: Researchers on impacted projects are/were not able to log into Firebird today. Running jobs were not impacted, as only the login node is/was affected.

We apologize for the disruption. Please email us at pace-support@oit.gatech.edu with any questions.

Partnership for an Advanced Computing Environment

Author: mweiner3

Georgia Institute of Technology