Phoenix scheduler outage

Summary: The Phoenix scheduler became non-responsive last evening and was restored at approximately 8:50 AM today.

Details: The Torque resource manager on the Phoenix scheduler shut down unexpectedly around 6:00 PM yesterday. The PACE team restarted the scheduler and restored its function around 8:50 AM, and is continuing to engage with the vendor to identify the cause of the crash.

Impact: Commands such as “qsub” and “qstat” would not have worked, so new jobs could not be submitted, including via Phoenix Open OnDemand. Running jobs were not interrupted.
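
For reference, one informal way to tell when the scheduler is back is simply to see whether “qstat” responds. The Python sketch below wraps that check with a timeout; it is purely illustrative, not a PACE-provided utility, and it assumes the Torque client commands are available on the login node’s PATH.

    # Illustrative only: report whether the Torque scheduler responds to "qstat".
    # Assumes the Torque client tools (qstat) are on PATH, as on a PACE login node.
    import subprocess

    def scheduler_responds(timeout_seconds=30):
        """Return True if "qstat" completes successfully within the timeout."""
        try:
            result = subprocess.run(["qstat"], capture_output=True, timeout=timeout_seconds)
        except (subprocess.TimeoutExpired, FileNotFoundError):
            return False
        return result.returncode == 0

    if __name__ == "__main__":
        print("scheduler responding" if scheduler_responds() else "scheduler not responding")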

Thank you for your patience during this outage. Please contact us at pace-support@oit.gatech.edu with any questions.

Phoenix scheduler outage

Summary: The Phoenix scheduler became nonresponsive this afternoon and was restored at approximately 4:50 PM today.

Details: The Torque resource manager on the Phoenix scheduler became overloaded, likely around 2:45 PM. The PACE team restarted the scheduler and restored its function around 4:50 PM.

Impact: Commands such as “qsub” and “qstat” would not have worked, so new jobs could not be submitted, including via Phoenix Open OnDemand. Running jobs were not interrupted.

Thank you for your patience this afternoon. Please contact us at pace-support@oit.gatech.edu with any questions.

Campus ESX Incident Impacting PACE services

[Update 6/28/22 2:00 PM]

The ESX host issue is resolved, and all PACE services are fully restored. Please contact pace-support@oit.gatech.edu with any questions, or if you encounter further issues.

[Original Post 6/28/22 12:55 PM]

Summary: An issue with an ESX host is affecting multiple campus services, including several PACE services. Open OnDemand and some PACE utilities are currently unavailable. OIT is working to resolve the issue.

Details: The ESX issue affects campus virtual machines hosting both PACE and other services. Visit https://status.gatech.edu for details.

Impact:

– Open OnDemand websites for all PACE clusters may not load.

– Some PACE utilities may hang, including pace-quota, pace-whoami, and pace-check-queue.

– There may be intermittent unavailability of software licenses.

Thank you for your patience as OIT works to resolve this outage. Please contact us at pace-support@oit.gatech.edu with any questions about the impacted PACE services.

Hive scheduler degraded state

[Update 6/3/22 4:55 PM]

After the full restart of scheduler services across Hive this afternoon, we have returned to full production status on the cluster. Thank you for your patience this week as we investigated the issue. Please contact us at pace-support@oit.gatech.edu with any questions.

[Update 6/3/22 2:25 PM]

The PACE team is continuing to investigate the partial disruption of the Hive scheduler. We are currently performing a full restart of all scheduler services across the Hive cluster. While this cluster-wide service restart is in progress this afternoon, it is not possible to submit, start, or check the status of any jobs on Hive. Commands such as qsub, qstat, and showq are unavailable. Running jobs are not impacted.

We appreciate your patience during this process. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Message 5/31/22 5:30 PM]

Summary: The Hive scheduler is currently in a degraded state, and many waiting jobs will not start.

Details: The Torque resource manager and the Moab workload manager, the two components of the Hive scheduler, are currently reporting conflicting information about resources allocated to running jobs. This causes failed attempts to schedule waiting jobs on resources that are already allocated, which prevents the jobs from starting. The PACE team is actively investigating this situation and working to resolve it.
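
As a rough, purely hypothetical illustration of that failure mode (not PACE’s actual diagnostic code), the Python sketch below shows how a workload manager working from a stale picture of per-node allocations keeps picking nodes the resource manager knows are already full, so the waiting job never starts. Node names and core counts are invented.

    # Toy model of the mismatch described above; all node names and numbers are invented.
    # moab_view:   the workload manager's (stale) count of cores in use per node
    # torque_view: the cores the resource manager has actually allocated per node
    NODE_CORES = 24
    moab_view   = {"node01": 4,  "node02": 8}    # looks like plenty of cores are free
    torque_view = {"node01": 24, "node02": 20}   # nodes are actually nearly full

    def try_to_start(job_cores):
        for node, used in moab_view.items():
            if NODE_CORES - used >= job_cores:            # Moab thinks this node fits the job...
                if NODE_CORES - torque_view[node] >= job_cores:
                    return "job started on " + node
                return "placement on " + node + " rejected; job stays queued"  # ...but Torque disagrees
        return "no node looks free; job stays queued"

    print(try_to_start(16))   # -> placement on node01 rejected; job stays queued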

Impact: Some queued jobs, especially those requesting a larger number of resources, may remain in the queue even though resources may appear to be available via tools such as pace-check-queue. Interactive jobs may be cancelled by the scheduler while waiting to start. Running jobs are not impacted.

Please contact us at pace-support@oit.gatech.edu with any questions.

Hive scheduler outage

Summary: The Hive scheduler stopped launching new jobs on Monday afternoon and was restored at approximately 10:00 AM on Tuesday.

Details: At approximately 12:35 PM on Monday, during the Memorial Day holiday, the Torque resource manager on Hive became nonresponsive due to an error. The PACE team restarted the scheduler and restored its function at 10:00 this morning.

Impact: Commands such as “qsub” and “qstat” would not have worked, so new jobs could not be submitted. Running jobs were not interrupted. Moab commands such as “showq” were not impacted.

Thank you for your patience during the holiday weekend. Please contact us at pace-support@oit.gatech.edu with any questions.

 

[Resolved] Phoenix scheduler timeout

Summary: A timeout on the Phoenix scheduler prevented new jobs from beginning earlier today.

Details: A configuration setting caused a timeout in communication between the Torque and Moab components of the Phoenix scheduler this morning, beginning at 10:20 AM. The PACE team restored communication between the services before 12:20 PM today.

Impact: During this period, no new jobs could start. Running jobs were not interrupted, and submitting new jobs to the queue remained functional. Commands such as “qsub” and “qstat” continued to work.

Thank you for your patience this morning. Please contact us at pace-support@oit.gatech.edu with any questions.

[Complete] PACE Maintenance Period: May 11 – 13, 2022

[Update 5/16/22 9:20 AM]

All PACE clusters, including Phoenix, are now ready for research and learning. We have restored stability of the Phoenix Lustre storage system and released jobs on Phoenix.

Thank you for your patience as we worked to restore Lustre project & scratch storage on the Phoenix cluster. In working with our support vendor, we identified a scanning tool that was causing instability on the scratch filesystem and impacting the entire storage system. This has been disabled pending further investigation.

Due to the complications, we will not proceed with monthly deletions of old files on the Phoenix & Hive scratch filesystems tomorrow. Although only Phoenix was impacted, we will also delay Hive to avoid confusion. Files for which researchers were notified this month will not be deleted at this time, and you will receive another notification prior to any future deletion. Researchers are still encouraged to delete unneeded scratch files to preserve space on the system.

Campus network disaster recovery testing will disable Phoenix, Hive, PACE-ICE, and COC-ICE from 5:00 PM on Friday, June 10, through 12:00 noon on Monday, June 13. The next maintenance period for all PACE clusters is August 10, 2022, at 6:00 AM through August 12, 2022, at 11:59 PM. An additional maintenance period is tentatively scheduled for November 2-4.

Status of activities:

ITEMS REQUIRING USER ACTION:

  • None expected on research clusters

ITEMS NOT REQUIRING USER ACTION:

  • [Complete][ICE only][System] PACE-ICE and COC-ICE instructional clusters will receive an operating system upgrade to RHEL7.9, to match the research clusters. Visit our documentation for a guide on potential impacts. A testflight environment is not available for ICE.
  • [Postponed][Phoenix, Hive][Open OnDemand] Deploy R 8.3 on Open OnDemand
  • [Complete][Phoenix][Storage] multiple upgrades to Lustre project and scratch storage
  • [Complete][Hive][Storage] replace cable connecting GPFS project and scratch storage
  • [Complete][Network] Upgrade interfaces to 100 GbE on Globus Vapor endpoint and border storage
  • [Complete][Network] Add redundant 100GbE switch to storage servers, increasing capacity
  • [Complete][System] Install operating system patches
  • [Complete][System] Update operating system on administrative servers
  • [Complete][Network] Move BCDC DNS appliance to new IP address
  • [Complete][Hive][System] Upgrade CUDA and NVIDIA drivers on Hive to CUDA 11.5 to match other clusters
  • [Complete][System] Remove unused nouveau graphics kernel module from GPU nodes
  • [Complete][Network] Set static IP addresses on schedulers to improve reliability
  • [Complete][Datacenter] Cooling loop maintenance
  • [Complete][Datacenter] Georgia Power Microgrid testing

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[Update 5/13/22 3:25 PM]

The PACE team and our support vendor’s engineers continue working to restore functionality of the Phoenix Lustre filesystem following the upgrade. Testing and remediation will continue today and through the weekend. At this time, we hope to be able to open Phoenix for research on Monday. We appreciate your patience as our maintenance period is extended. If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[Update 5/13/22 2:00 PM]

PACE maintenance continues on Phoenix, while the Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard clusters are now ready for research and learning.

Phoenix remains under maintenance, as complications arose following the upgrade of Lustre project and scratch storage. PACE and our storage vendor are working to resolve the issue at this time. We will update you when Phoenix is ready for research.

Jobs on the Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard clusters have been released.

Campus network disaster recovery testing will disable Phoenix, Hive, PACE-ICE, and COC-ICE from 5:00 PM on Friday, June 10, through 12:00 noon on Monday, June 13. The next maintenance period for all PACE clusters is August 10, 2022, at 6:00 AM through August 12, 2022, at 11:59 PM. An additional maintenance period is tentatively scheduled for November 2-4.

Status of activities:

ITEMS REQUIRING USER ACTION:

  • None expected on research clusters

ITEMS NOT REQUIRING USER ACTION:

  • [Complete][ICE only][System] PACE-ICE and COC-ICE instructional clusters will receive an operating system upgrade to RHEL7.9, to match the research clusters. Visit our documentation for a guide on potential impacts. A testflight environment is not available for ICE.
  • [Postponed][Phoenix, Hive][Open OnDemand] Deploy R 8.3 on Open OnDemand
  • [In progress][Phoenix][Storage] multiple upgrades to Lustre project and scratch storage
  • [Complete][Hive][Storage] replace cable connecting GPFS project and scratch storage
  • [Complete][Network] Upgrade interfaces to 100 GbE on Globus Vapor endpoint and border storage
  • [Complete][Network] Add redundant 100GbE switch to storage servers, increasing capacity
  • [Complete][System] Install operating system patches
  • [Complete][System] Update operating system on administrative servers
  • [Complete][Network] Move BCDC DNS appliance to new IP address
  • [Complete][Hive][System] Upgrade CUDA and NVIDIA drivers on Hive to CUDA 11.5 to match other clusters
  • [Complete][System] Remove unused nouveau graphics kernel module from GPU nodes
  • [Complete][Network] Set static IP addresses on schedulers to improve reliability
  • [Complete][Datacenter] Cooling loop maintenance
  • [Complete][Datacenter] Georgia Power Microgrid testing

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[Detailed announcement 5/3/22]

As previously announced, our next PACE maintenance period is scheduled to begin at 6:00 AM on Wednesday, May 11, and end at 11:59 PM on Friday, May 13. As usual, jobs that request durations that would extend into the maintenance period will be held by the scheduler to run after maintenance is complete. During the maintenance window, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.

Tentative list of activities:

ITEMS REQUIRING USER ACTION:

  • None expected on research clusters

ITEMS NOT REQUIRING USER ACTION:

  • [ICE only][System] PACE-ICE and COC-ICE instructional clusters will receive an operating system upgrade to RHEL7.9, to match the research clusters. Visit our documentation for a guide on potential impacts. A testflight environment is not available for ICE.
  • [Phoenix, Hive][Open OnDemand] Deploy R 8.3 on Open OnDemand
  • [Phoenix][Storage] multiple upgrades to Lustre project and scratch storage
  • [Hive][Storage] replace cable connecting GPFS project and scratch storage
  • [Network] Upgrade interfaces to 100 GbE on Globus Vapor endpoint and border storage
  • [Network] Add redundant 100GbE switch to storage servers, increasing capacity
  • [System] Install operating system patches
  • [System] Update operating system on administrative servers
  • [Network] Move BCDC DNS appliance to new IP address
  • [Hive][System] Upgrade CUDA and NVIDIA drivers on Hive to CUDA 11.5 to match other clusters
  • [System] Remove unused nouveau graphics kernel module from GPU nodes
  • [Network] Set static IP addresses on schedulers to improve reliability
  • [Datacenter] Cooling loop maintenance
  • [Datacenter] Georgia Power Microgrid testing

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

[Early announcement]

Dear PACE Users,

This is a friendly reminder that our next maintenance period is scheduled to begin at 6:00 AM on Wednesday, 05/11/2022, and is tentatively scheduled to conclude by 11:59 PM on Friday, 05/13/2022. As usual, jobs whose resource requests would have them running during the maintenance period will be held by the scheduler until after the maintenance. During the maintenance period, access to all PACE-managed computational and storage resources will be unavailable.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

PACE Firebird Login Node Outages

[Update 4/27/22 5:45 PM]

The remaining headnode has been repaired, and service is restored. Thank you for your patience.

[Original Post 4/27/22 5:20 PM]

Summary: A storage server issue made headnodes for two projects on Firebird inaccessible. One has been recovered, while repairs are in progress on the second one.

Details: The storage server housing two Firebird projects experienced an NFS issue earlier today, impacting the login nodes. The PACE team has repaired one project’s login node and is currently repairing the second, which has a more complex issue.

Impact: Researchers on the impacted projects were unable to log into Firebird today; access has been restored for one project while repairs continue on the other. Running jobs were not impacted, as only the login nodes were affected.

We apologize for the disruption. Please email us at pace-support@oit.gatech.edu with any questions.

Campus network disaster recovery testing June 10-13

[Update 6/6/22 11:20 AM]

Summary: Revised plans for OIT’s network disaster recovery test remove all expected impact to PACE.

Details: Changes to the disaster recovery test mean that we no longer expect any impact to PACE this weekend, and all PACE clusters should operate normally, including OnDemand and other PACE services. Campus license servers should also remain reachable from PACE. For additional details about the disaster recovery scope, please see https://oit.gatech.edu/recoveryexercisejun22.

Impact: We have removed the scheduler reservations on all PACE clusters, so longer jobs that have been held can now begin. No impact is expected.

Please contact us at pace-support@oit.gatech.edu with any questions or concerns. 

[Update 5/24/22 10:00 AM]

Summary: Updated information reduces the expected impact to Hive and introduces a new, partial impact to Firebird during disaster recovery testing (June 10-13).

Details: As additional details about the disaster recovery testing have been clarified, we have determined that Hive can remain in production throughout the testing, with limited disruptions that will also affect Firebird. We will remove the reservation currently in place on Hive for these dates.

Impact:

  • Phoenix, PACE-ICE, and COC-ICE will be disabled from 5:00 PM on Friday, June 10, through the morning of Monday, June 13.
  • Hive and Firebird will remain in production, but some services will be unavailable for much of the weekend:
    • Hive OnDemand will be unavailable.
    • PACE license servers will be unavailable, so the Intel compilers cannot be used to compile code, though previously compiled binaries can still be executed.
    • License servers from the College of Engineering, providing access to MATLAB, Ansys, Abaqus, and Comsol for the entire campus, will not be reachable. Any batch or interactive jobs that attempt to check out a license for these applications will fail. Researchers are encouraged to avoid such jobs just before the outage and to wait until it is complete before submitting them.
    • A number of PACE utilities, such as pace-quota and pace-check-queue, will not function.
    • Other intermittent disruptions are possible.
  • Buzzard will not be impacted.

Thank you for your understanding and cooperation during this campus network testing. Please contact us at pace-support@oit.gatech.edu with any questions or concerns. 

 

[Original announcement 4/27/22 11:45 AM]

Summary: Campus network disaster recovery testing will disable Phoenix, Hive, PACE-ICE, and COC-ICE from 5:00 PM on Friday, June 10, through 12:00 noon on Monday, June 13.  

Details: In accordance with USG security requirements, OIT will be conducting disaster recovery testing on the Georgia Tech campus network during the weekend of June 11, which will close access to most of PACE’s clusters as well as some other campus resources.  PACE’s Phoenix, Hive, PACE-ICE, and COC-ICE clusters will be impacted. Firebird and Buzzard will remain in production.  

Impact: PACE will set a reservation to prevent any jobs from running during the downtime. You will not be able to log in, access your data, nor run jobs during the outage.  

Longer jobs whose walltime requests would not allow them to finish before the outage will be held until the testing is complete, just as they are during quarterly maintenance periods. Researchers who run long jobs should note the limited duration between PACE’s May maintenance period (May 11-13) and the testing period beginning June 10. In particular, Hive researchers who submit 30-day jobs to the hive-nvme, hive-sas, or hive-nvme-sas queues should note that any 30-day job submitted after April 12 will not begin until at least June 13. Researchers are encouraged to submit jobs with reduced walltimes whenever feasible to make use of the cluster between maintenance and disaster recovery testing.  
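
To make the walltime arithmetic concrete, here is a small illustrative Python sketch using the dates from this announcement: it reports the earliest a job of a given walltime could start, given that it must either finish before the disaster recovery window opens or wait until the window closes. The helper is hypothetical, ignores queue wait time, and is not a PACE tool.

    # Illustrative only: walltime check against the disaster recovery window in this announcement.
    from datetime import datetime, timedelta

    DR_TEST_START = datetime(2022, 6, 10, 17, 0)   # 5:00 PM Friday, June 10
    DR_TEST_END   = datetime(2022, 6, 13, 12, 0)   # 12:00 noon Monday, June 13

    def earliest_start(ready_time, walltime):
        """Earliest a job could begin: right away if it can finish before the
        outage starts, otherwise not until the outage ends."""
        if ready_time + walltime <= DR_TEST_START:
            return ready_time
        return DR_TEST_END

    # Example: a 30-day job that becomes eligible on May 16 (after the May maintenance
    # period) cannot finish before June 10, so it is held until at least June 13.
    print(earliest_start(datetime(2022, 5, 16, 9, 0), timedelta(days=30)))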

Thank you for your understanding and cooperation during this campus network testing. Please contact us at pace-support@oit.gatech.edu with any questions or concerns. 

Hive Gateway Resource Now Available to Campus Champions

Dear Campus Champion Community,

We are pleased to announce the official release of the Hive Gateway at Georgia Tech’s Partnership for an Advanced Computing Environment (PACE) to the Campus Champion community. The Hive Gateway is powered by Apache Airavata and provides access to a portion of the Hive cluster at GT, an NSF MRI-funded supercomputer that delivers nearly 1 petaflop of Linpack performance. For more hardware details, see https://docs.pace.gatech.edu/hive/resources/.

The Hive Gateway is available to *any* XSEDE researcher via federated login (i.e., CILogon) and offers a variety of applications, including Abinit, Psi4, NAMD, and a Python environment with TensorFlow and Keras, among others.

The Hive Gateway is accessible at https://gateway.hive.pace.gatech.edu

Our user guide, available at https://docs.pace.gatech.edu/hiveGateway/gettingStarted/, contains details on the process of getting access. Briefly, go to “Log In” on the site and select XSEDE credentials via CILogon; this should allow you to log into the gateway and will generate a request for our team to approve your gateway access and enable job submissions on the resource.

Please feel free to stop by the Hive gateway site, try it out, and/or direct your researchers to it.

Cheers!

– The PACE Team