Phoenix scheduler outage

Summary: The Phoenix scheduler became non-responsive last evening and was restored at approximately 8:50 AM today.

Details: The Torque resource manager on the Phoenix scheduler shut down unexpectedly around 6:00 PM yesterday. The PACE team restarted the scheduler and restored its function around 8:50 AM, and is continuing to engage with the vendor to identify the cause of the crash.

Impact: Commands such as “qsub” and “qstat” would not have worked, so new jobs could not be submitted, including via Phoenix Open OnDemand. Running jobs were not interrupted.
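
For reference, one informal way to tell when the scheduler is back is simply to see whether “qstat” responds. The Python sketch below wraps that check with a timeout; it is purely illustrative, not a PACE-provided utility, and it assumes the Torque client commands are available on the login node’s PATH.

    # Illustrative only: report whether the Torque scheduler responds to "qstat".
    # Assumes the Torque client tools (qstat) are on PATH, as on a PACE login node.
    import subprocess

    def scheduler_responds(timeout_seconds=30):
        """Return True if "qstat" completes successfully within the timeout."""
        try:
            result = subprocess.run(["qstat"], capture_output=True, timeout=timeout_seconds)
        except (subprocess.TimeoutExpired, FileNotFoundError):
            return False
        return result.returncode == 0

    if __name__ == "__main__":
        print("scheduler responding" if scheduler_responds() else "scheduler not responding")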

Thank you for your patience during this outage. Please contact us at pace-support@oit.gatech.edu with any questions.

Phoenix scheduler outage

Summary: The Phoenix scheduler became nonresponsive this afternoon and was restored at approximately 4:50 PM today.

Details: The Torque resource manager on the Phoenix scheduler became overloaded, likely around 2:45 PM. The PACE team restarted the scheduler and restored its function around 4:50 PM.

Impact: Commands such as “qsub” and “qstat” would not have worked, so new jobs could not be submitted, including via Phoenix Open OnDemand. Running jobs were not interrupted.

Thank you for your patience this afternoon. Please contact us at pace-support@oit.gatech.edu with any questions.

Campus ESX Incident Impacting PACE services

[Update 6/28/22 2:00 PM]

The ESX host issue is resolved, and all PACE services are fully restored. Please contact pace-support@oit.gatech.edu with any questions, or if you encounter further issues.

[Original Post 6/28/22 12:55 PM]

Summary: An issue with an ESX host is affecting multiple campus services, including several PACE services. Open OnDemand and some PACE utilities are currently unavailable. OIT is working to resolve the issue.

Details: The ESX issue affects campus virtual machines hosting both PACE and other services. Visit https://status.gatech.edu for details.

Impact:

– Open OnDemand websites for all PACE clusters may not load.

– Some PACE utilities may hang, including pace-quota, pace-whoami, and pace-check-queue.

– There may be intermittent unavailability of software licenses.

Thank you for your patience as OIT works to resolve this outage. Please contact us at pace-support@oit.gatech.edu with any questions about the impacted PACE services.

Hive scheduler degraded state

[Update 6/3/22 4:55 PM]

After the full restart of scheduler services across Hive this afternoon, we have returned to full production status on the cluster. Thank you for your patience this week as we investigated the issue. Please contact us at pace-support@oit.gatech.edu with any questions.

[Update 6/3/22 2:25 PM]

The PACE team is continuing to investigate the partial disruption of the Hive scheduler. We are currently performing a full restart of all scheduler services across the Hive cluster. While this cluster-wide service restart is in progress this afternoon, it is not possible to submit, start, or check the status of any jobs on Hive. Commands such as qsub, qstat, and showq are unavailable. Running jobs are not impacted.

We appreciate your patience during this process. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Message 5/31/22 5:30 PM]

Summary: The Hive scheduler is currently in a degraded state, and many waiting jobs will not start.

Details: The Torque resource manager and the Moab workload manager, the two components of the Hive scheduler, are currently reporting conflicting information about resources allocated to running jobs. This causes failed attempts to schedule waiting jobs on resources that are already allocated, which prevents the jobs from starting. The PACE team is actively investigating this situation and working to resolve it.
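
As a rough, purely hypothetical illustration of that failure mode (not PACE’s actual diagnostic code), the Python sketch below shows how a workload manager working from a stale picture of per-node allocations keeps picking nodes the resource manager knows are already full, so the waiting job never starts. Node names and core counts are invented.

    # Toy model of the mismatch described above; all node names and numbers are invented.
    # moab_view:   the workload manager's (stale) count of cores in use per node
    # torque_view: the cores the resource manager has actually allocated per node
    NODE_CORES = 24
    moab_view   = {"node01": 4,  "node02": 8}    # looks like plenty of cores are free
    torque_view = {"node01": 24, "node02": 20}   # nodes are actually nearly full

    def try_to_start(job_cores):
        for node, used in moab_view.items():
            if NODE_CORES - used >= job_cores:            # Moab thinks this node fits the job...
                if NODE_CORES - torque_view[node] >= job_cores:
                    return "job started on " + node
                return "placement on " + node + " rejected; job stays queued"  # ...but Torque disagrees
        return "no node looks free; job stays queued"

    print(try_to_start(16))   # -> placement on node01 rejected; job stays queued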

Impact: Some queued jobs, especially those requesting a larger number of resources, may remain in the queue even though resources may appear to be available via tools such as pace-check-queue. Interactive jobs may be cancelled by the scheduler while waiting to start. Running jobs are not impacted.

Please contact us at pace-support@oit.gatech.edu with any questions.

Hive scheduler outage

Summary: The Hive scheduler stopped launching new jobs on Monday afternoon and was restored at approximately 10:00 AM on Tuesday.

Details: At approximately 12:35 PM on Monday, during the Memorial Day holiday, the Torque resource manager on Hive became nonresponsive due to an error. The PACE team restarted the scheduler and restored its function at 10:00 this morning.

Impact: Commands such as “qsub” and “qstat” would not have worked, so new jobs could not be submitted. Running jobs were not interrupted. Moab commands such as “showq” were not impacted.

Thank you for your patience during the holiday weekend. Please contact us at pace-support@oit.gatech.edu with any questions.

 

[Resolved] Phoenix scheduler timeout

Summary: A timeout on the Phoenix scheduler prevented new jobs from beginning earlier today.

Details: A configuration setting caused a timeout in communication between the Torque and Moab components of the Phoenix scheduler this morning, beginning at 10:20 AM. The PACE team restored communication between the services before 12:20 PM today.

Impact: During this period, no new jobs could start. Running jobs were not interrupted, and submitting new jobs to the queue remained functional. Commands such as “qsub” and “qstat” continued to work.

Thank you for your patience this morning. Please contact us at pace-support@oit.gatech.edu with any questions.

[Complete] PACE Maintenance Period: May 11 – 13, 2022

[Update 5/16/22 9:20 AM]

All PACE clusters, including Phoenix, are now ready for research and learning. We have restored stability of the Phoenix Lustre storage system and released jobs on Phoenix.

Thank you for your patience as we worked to restore Lustre project & scratch storage on the Phoenix cluster. In working with our support vendor, we identified a scanning tool that was causing instability on the scratch filesystem and impacting the entire storage system. This has been disabled pending further investigation.

Due to the complications, we will not proceed with monthly deletions of old files on the Phoenix & Hive scratch filesystems tomorrow. Although only Phoenix was impacted, we will also delay Hive to avoid confusion. Files for which researchers were notified this month will not be deleted at this time, and you will receive another notification prior to any future deletion. Researchers are still encouraged to delete unneeded scratch files to preserve space on the system.

Campus network disaster recovery testing will disable Phoenix, Hive, PACE-ICE, and COC-ICE from 5:00 PM on Friday, June 10, through 12:00 noon on Monday, June 13. The next maintenance period for all PACE clusters is August 10, 2022, at 6:00 AM through August 12, 2022, at 11:59 PM. An additional maintenance period is tentatively scheduled for November 2-4.

Status of activities:

ITEMS REQUIRING USER ACTION:

  • None expected on research clusters

ITEMS NOT REQUIRING USER ACTION:

  • [Complete][ICE only][System] PACE-ICE and COC-ICE instructional clusters will receive an operating system upgrade to RHEL7.9, to match the research clusters. Visit our documentation for a guide on potential impacts. A testflight environment is not available for ICE.
  • [Postponed][Phoenix, Hive][Open OnDemand] Deploy R 8.3 on Open OnDemand
  • [Complete][Phoenix][Storage] multiple upgrades to Lustre project and scratch storage
  • [Complete][Hive][Storage] replace cable connecting GPFS project and scratch storage
  • [Complete][Network] Upgrade interfaces to 100 GbE on Globus Vapor endpoint and border storage
  • [Complete][Network] Add redundant 100GbE switch to storage servers, increasing capacity
  • [Complete][System] Install operating system patches
  • [Complete][System] Update operating system on administrative servers
  • [Complete][Network] Move BCDC DNS appliance to new IP address
  • [Complete][Hive][System] Upgrade CUDA and NVIDIA drivers on Hive to CUDA 11.5 to match other clusters
  • [Complete][System] Remove unused nouveau graphics kernel module from GPU nodes
  • [Complete][Network] Set static IP addresses on schedulers to improve reliability
  • [Complete][Datacenter] Cooling loop maintenance
  • [Complete][Datacenter] Georgia Power Microgrid testing

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[Update 5/13/22 3:25 PM]

The PACE team and our support vendor’s engineers continue working to restore functionality of the Phoenix Lustre filesystem following the upgrade. Testing and remediation will continue today and through the weekend. At this time, we hope to be able to open Phoenix for research on Monday. We appreciate your patience as our maintenance period is extended. If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[Update 5/13/22 2:00 PM]

PACE maintenance continues on Phoenix, while the Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard clusters are now ready for research and learning.

Phoenix remains under maintenance, as complications arose following the upgrade of Lustre project and scratch storage. PACE and our storage vendor are working to resolve the issue at this time. We will update you when Phoenix is ready for research.

Jobs on the Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard clusters have been released.

Campus network disaster recovery testing will disable Phoenix, Hive, PACE-ICE, and COC-ICE from 5:00 PM on Friday, June 10, through 12:00 noon on Monday, June 13. The next maintenance period for all PACE clusters is August 10, 2022, at 6:00 AM through August 12, 2022, at 11:59 PM. An additional maintenance period is tentatively scheduled for November 2-4.

Status of activities:

ITEMS REQUIRING USER ACTION:

  • None expected on research clusters

ITEMS NOT REQUIRING USER ACTION:

  • [Complete][ICE only][System] PACE-ICE and COC-ICE instructional clusters will receive an operating system upgrade to RHEL7.9, to match the research clusters. Visit our documentation for a guide on potential impacts. A testflight environment is not available for ICE.
  • [Postponed][Phoenix, Hive][Open OnDemand] Deploy R 8.3 on Open OnDemand
  • [In progress][Phoenix][Storage] multiple upgrades to Lustre project and scratch storage
  • [Complete][Hive][Storage] replace cable connecting GPFS project and scratch storage
  • [Complete][Network] Upgrade interfaces to 100 GbE on Globus Vapor endpoint and border storage
  • [Complete][Network] Add redundant 100GbE switch to storage servers, increasing capacity
  • [Complete][System] Install operating system patches
  • [Complete][System] Update operating system on administrative servers
  • [Complete][Network] Move BCDC DNS appliance to new IP address
  • [Complete][Hive][System] Upgrade CUDA and NVIDIA drivers on Hive to CUDA 11.5 to match other clusters
  • [Complete][System] Remove unused nouveau graphics kernel module from GPU nodes
  • [Complete][Network] Set static IP addresses on schedulers to improve reliability
  • [Complete][Datacenter] Cooling loop maintenance
  • [Complete][Datacenter] Georgia Power Microgrid testing

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[Detailed announcement 5/3/22]

As previously announced, our next PACE maintenance period is scheduled to begin at 6:00 AM on Wednesday, May 11, and end at 11:59 PM on Friday, May 13. As usual, jobs that request durations that would extend into the maintenance period will be held by the scheduler to run after maintenance is complete. During the maintenance window, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.

Tentative list of activities:

ITEMS REQUIRING USER ACTION:

  • None expected on research clusters

ITEMS NOT REQUIRING USER ACTION:

  • [ICE only][System] PACE-ICE and COC-ICE instructional clusters will receive an operating system upgrade to RHEL7.9, to match the research clusters. Visit our documentation for a guide on potential impacts. A testflight environment is not available for ICE.
  • [Phoenix, Hive][Open OnDemand] Deploy R 8.3 on Open OnDemand
  • [Phoenix][Storage] multiple upgrades to Lustre project and scratch storage
  • [Hive][Storage] replace cable connecting GPFS project and scratch storage
  • [Network] Upgrade interfaces to 100 GbE on Globus Vapor endpoint and border storage
  • [Network] Add redundant 100GbE switch to storage servers, increasing capacity
  • [System] Install operating system patches
  • [System] Update operating system on administrative servers
  • [Network] Move BCDC DNS appliance to new IP address
  • [Hive][System] Upgrade CUDA and NVIDIA drivers on Hive to CUDA 11.5 to match other clusters
  • [System] Remove unused nouveau graphics kernel module from GPU nodes
  • [Network] Set static IP addresses on schedulers to improve reliability
  • [Datacenter] Cooling loop maintenance
  • [Datacenter] Georgia Power Microgrid testing

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

[Early announcement]

Dear PACE Users,

This is a friendly reminder that our next maintenance period is scheduled to begin at 6:00 AM on Wednesday, 05/11/2022, and is tentatively scheduled to conclude by 11:59 PM on Friday, 05/13/2022. As usual, jobs whose resource requests would have them running during the maintenance period will be held by the scheduler until after the maintenance. During the maintenance period, access to all PACE-managed computational and storage resources will be unavailable.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

PACE Firebird Login Node Outages

[Update 4/27/22 5:45 PM]

The remaining headnode has been repaired, and service is restored. Thank you for your patience.

[Original Post 4/27/22 5:20 PM]

Summary: A storage server issue made headnodes for two projects on Firebird inaccessible. One has been recovered, while repairs are in progress on the second one.

Details: The storage server housing two Firebird projects experienced an NFS issue earlier today, impacting the login nodes. The PACE team has repaired one project’s login node and is currently repairing the second, which has a more complex issue.

Impact: Researchers on the impacted projects were unable to log into Firebird today; access has been restored for one project while repairs continue on the other. Running jobs were not impacted, as only the login nodes were affected.

We apologize for the disruption. Please email us at pace-support@oit.gatech.edu with any questions.

Campus network disaster recovery testing June 10-13

[Update 6/6/22 11:20 AM]

Summary: Revised plans for OIT’s network disaster recovery test remove all expected impact to PACE.

Details: Changes to the disaster recovery test mean that we no longer expect any impact to PACE this weekend, and all PACE clusters should operate normally, including OnDemand and other PACE services. Campus license servers should also remain reachable from PACE. For additional details about the disaster recovery scope, please see https://oit.gatech.edu/recoveryexercisejun22.

Impact: We have removed the scheduler reservations on all PACE clusters, so longer jobs that have been held can now begin. No impact is expected.

Please contact us at pace-support@oit.gatech.edu with any questions or concerns. 

[Update 5/24/22 10:00 AM]

Summary: Updated information reduces the expected impact to Hive and introduces a new, partial impact to Firebird during disaster recovery testing (June 10-13).

Details: As additional details about the disaster recovery testing have been clarified, we have determined that Hive can remain in production throughout the testing, with limited disruptions that will also affect Firebird. We will remove the reservation currently in place on Hive for these dates.

Impact:

  • Phoenix, PACE-ICE, and COC-ICE will be disabled from 5:00 PM on Friday, June 10, through the morning of Monday, June 13.
  • Hive and Firebird will remain in production, but some services will be unavailable for much of the weekend:
    • Hive OnDemand will be unavailable.
    • PACE license servers will be unavailable, so the Intel compilers cannot be used to compile code, though previously compiled binaries can still be executed.
    • License servers from the College of Engineering, providing access to MATLAB, Ansys, Abaqus, and Comsol for the entire campus, will not be reachable. Any batch or interactive jobs that attempt to check out a license for these applications will fail. Researchers are encouraged to avoid such jobs just before the outage and to wait until it is complete before submitting them.
    • A number of PACE utilities, such as pace-quota and pace-check-queue, will not function.
    • Other intermittent disruptions are possible.
  • Buzzard will not be impacted.

Thank you for your understanding and cooperation during this campus network testing. Please contact us at pace-support@oit.gatech.edu with any questions or concerns. 

 

[Original announcement 4/27/22 11:45 AM]

Summary: Campus network disaster recovery testing will disable Phoenix, Hive, PACE-ICE, and COC-ICE from 5:00 PM on Friday, June 10, through 12:00 noon on Monday, June 13.  

Details: In accordance with USG security requirements, OIT will be conducting disaster recovery testing on the Georgia Tech campus network during the weekend of June 11, which will close access to most of PACE’s clusters as well as some other campus resources.  PACE’s Phoenix, Hive, PACE-ICE, and COC-ICE clusters will be impacted. Firebird and Buzzard will remain in production.  

Impact: PACE will set a reservation to prevent any jobs from running during the downtime. You will not be able to log in, access your data, nor run jobs during the outage.  

Longer jobs whose walltime requests would not allow them to finish before the outage will be held until the testing is complete, just as they are during quarterly maintenance periods. Researchers who run long jobs should note the limited duration between PACE’s May maintenance period (May 11-13) and the testing period beginning June 10. In particular, Hive researchers who submit 30-day jobs to the hive-nvme, hive-sas, or hive-nvme-sas queues should note that any 30-day job submitted after April 12 will not begin until at least June 13. Researchers are encouraged to submit jobs with reduced walltimes whenever feasible to make use of the cluster between maintenance and disaster recovery testing.  
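
To make the walltime arithmetic concrete, here is a small illustrative Python sketch using the dates from this announcement: it reports the earliest a job of a given walltime could start, given that it must either finish before the disaster recovery window opens or wait until the window closes. The helper is hypothetical, ignores queue wait time, and is not a PACE tool.

    # Illustrative only: walltime check against the disaster recovery window in this announcement.
    from datetime import datetime, timedelta

    DR_TEST_START = datetime(2022, 6, 10, 17, 0)   # 5:00 PM Friday, June 10
    DR_TEST_END   = datetime(2022, 6, 13, 12, 0)   # 12:00 noon Monday, June 13

    def earliest_start(ready_time, walltime):
        """Earliest a job could begin: right away if it can finish before the
        outage starts, otherwise not until the outage ends."""
        if ready_time + walltime <= DR_TEST_START:
            return ready_time
        return DR_TEST_END

    # Example: a 30-day job that becomes eligible on May 16 (after the May maintenance
    # period) cannot finish before June 10, so it is held until at least June 13.
    print(earliest_start(datetime(2022, 5, 16, 9, 0), timedelta(days=30)))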

Thank you for your understanding and cooperation during this campus network testing. Please contact us at pace-support@oit.gatech.edu with any questions or concerns. 

Hive Gateway Resource Now Available to Campus Champions

Dear Campus Champion Community,

We are pleased to announce the official release of the Hive Gateway at Georgia Tech’s Partnership for an Advanced Computing Environment (PACE) to the Campus Champion community. The Hive Gateway is powered by Apache Airavata and provides access to a portion of the Hive cluster at GT, an NSF MRI-funded supercomputer that delivers nearly 1 petaflop of Linpack performance. For more hardware details, see https://docs.pace.gatech.edu/hive/resources/.

The Hive Gateway is available to *any* XSEDE researcher via federated login (i.e., CILogon) and offers a variety of applications, including Abinit, Psi4, NAMD, and a Python environment with TensorFlow and Keras, among others.

The Hive Gateway is accessible at https://gateway.hive.pace.gatech.edu

Our user guide, available at https://docs.pace.gatech.edu/hiveGateway/gettingStarted/, contains details on the process of getting access. Briefly, go to “Log In” on the site and select XSEDE credentials via CILogon; this should allow you to log into the gateway and will generate a request for our team to approve your gateway access and enable job submissions on the resource.

Please feel free to stop by the Hive gateway site, try it out, and/or direct your researchers to it.

Cheers!

– The PACE Team