Phoenix Project & Scratch Storage Cable Replacement

[Update 09/16/2022 2:18 PM]

Work was completed on September 15 as scheduled in the original post.

[Original post: 09/12/2022 3:40PM]

Summary: Phoenix project & scratch storage cable replacement, with a potential outage and temporarily decreased performance afterward

Details: A cable connecting one enclosure of the Phoenix Lustre device, hosting project and scratch storage, to one of its controllers needs to be replaced, beginning around 1PM Thursday, September 15th, 2022. After the replacement, pools will need to rebuild over the course of about a day.

Impact: Since there is a redundant controller, there should not be an outage during the cable replacement. However, a similar previous replacement caused storage to become unavailable, so an outage remains a possibility. If this happens, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored. In addition, performance will be slower than usual for about a day following the repair as pools rebuild, so jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again. PACE will monitor Phoenix Lustre storage throughout this procedure. If a loss of availability occurs, we will update you.
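
For reference, here is a minimal sketch of cancelling and later resubmitting an affected job using standard Torque commands; the job ID and script name are placeholders:

    qstat -u $USER        # list your queued and running jobs
    qdel <jobid>          # cancel a job that has failed or stalled (placeholder job ID)
    # once storage availability is restored:
    qsub my_job.pbs       # resubmit your job script (placeholder script name)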

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Hive Project & Scratch Storage Cable Replacement

[Update 09/16/2022 2:18 PM]

Work was completed on September 15 as scheduled in the original post.

[Original post: 09/12/2022 3:40PM]

Summary: Hive project & scratch storage cable replacement, with the potential for an outage.

Details: Two cables connecting the Hive GPFS device, hosting project (data) and scratch storage, to one of its controllers need to be replaced, beginning around 10 AM Thursday, September 15th, 2022.

Impact: Since there is a redundant controller, no impact is expected. However, a similar previous replacement caused storage to become unavailable, so an outage remains a possibility. If the redundant controller fails, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored. PACE will monitor Hive GPFS storage throughout this procedure. If a loss of availability occurs, we will update you.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Phoenix Scheduler Outage

Summary: The Phoenix scheduler became non-responsive on Friday, 9/9/2022, between 7:30 PM and 10 PM.

Details: The Torque resource manager on the Phoenix scheduler crashed unexpectedly around 7:30 PM. A bad GPU node, producing the same error message seen previously, caused a segmentation fault on the server, and the crash corrupted a handful of queued jobs with dependencies, requiring those records to be pruned from the system. Around 10 PM, the problematic node was purged from the scheduler and the corrupted jobs were removed, restoring normal operations.

Impact: Running jobs were not interrupted, but commands such as “qsub” and “qstat” were unavailable while the scheduler was down, so no new jobs could be submitted, including via Phoenix Open OnDemand. Corrupted jobs in the queue were cancelled.

Thank you for your patience during this outage. Please contact us at pace-support@oit.gatech.edu with any questions.

Data Center Power Outage

[Update 09/07/2022 10:23 AM]

Cooling has been fully restored to the datacenter, and operations have resumed on all PACE clusters, including Phoenix, Hive, Firebird, Buzzard, PACE-ICE, and COC-ICE. Jobs are now running, and new jobs may be submitted via the command line or Open OnDemand.

Any jobs that were running at the time of the outage have been cancelled and must be resubmitted by the researcher.

Refunds will be issued for jobs that were cancelled on clusters with charge accounts (Phoenix and Firebird).

Thank you for your patience as emergency repairs were completed. Please contact us at pace-support@oit.gatech.edu with any questions.

 

[Update 09/06/2022 4:51 PM]

Summary: Unfortunately, the same issue that occurred yesterday with the primary cooling loop happened today with the secondary cooling loop. OIT operations and Databank requested that we power off all compute nodes so the secondary cooling loop can be repaired. We have captured job information for Phoenix, Hive, COC-ICE, PACE-ICE, and Firebird, and we will refund all running jobs on Firebird and Phoenix.

Impact: Researchers won’t be able to submit new jobs to the clusters, either from the command line or from the Open OnDemand web server; existing running jobs will be killed, and computing credits will be refunded where applicable. Partially completed jobs will lose their progress unless they support checkpointing.

Sorry for the inconvenience and thank you for your patience; please contact us at pace-support@oit.gatech.edu with any questions. We will keep you posted with any updates.

 

[Update 09/06/2022 11:10 AM]

Cooling has been restored to the datacenter, and operations have resumed on all PACE clusters, including Phoenix, Hive, Firebird, Buzzard, PACE-ICE, and COC-ICE. Jobs are now running, and new jobs may be submitted via the command line or Open OnDemand. Any jobs that were running at the time of the outage have been cancelled and must be resubmitted by the researcher. Refunds will be issued for jobs that were cancelled on clusters with charge accounts (Phoenix and Firebird).

Thank you for your patience as emergency repairs were completed to avoid damage to the datacenter. Please contact us at pace-support@oit.gatech.edu with any questions.

 

[Update 09/06/2022 9:58 AM]

Around 8:00 PM on 09/05/2022, OIT operations and Databank requested that PACE power off all compute nodes to avoid additional issues. PACE captured all job information for Phoenix, Hive, COC-ICE, PACE-ICE, and Firebird, and we will refund all running jobs on Firebird and Phoenix after the clusters have been brought back online.

Cooling has been restored in the Coda datacenter, and as of about 7:00 AM (09/06/22), PACE has been cleared to bring the clusters online and test them before releasing them to users.

 

[Update 09/05/2022 8:45 PM]

Summary: Unfortunately, cooling tower issues continue. OIT operations and Databank requested that we power off all compute nodes to avoid additional issues. We have captured job information for Phoenix, Hive, COC-ICE, PACE-ICE, and Firebird, and we will refund all running jobs on Firebird and Phoenix.

Impact: Researchers won’t be able to submit new jobs to the clusters, either from the command line or from the Open OnDemand web server; existing running jobs will be killed, and computing credits will be refunded where applicable. Partially completed jobs will lose their progress unless they support checkpointing.

Sorry for the inconvenience, and thank you for your patience; please contact us at pace-support@oit.gatech.edu with any questions. We will keep you posted with any updates.

[Original post]

Summary: One of the cooling towers in the Coda data center is experiencing issues, and the temperature is rising. We need to pause all PACE cluster schedulers and possibly power down all compute nodes.

Impact: Researchers won’t be able to submit new jobs to the clusters either from the command line or the Open OnDemand web server; the existing jobs should continue running.

Thank you for your patience this afternoon; please contact us at pace-support@oit.gatech.edu with any questions. We will keep you posted with any updates.

Phoenix scheduler outage

[Update 9/2/22 5:23 PM]

The PACE team has identified an issue with the Phoenix scheduler and restored functionality. The scheduler is currently back up and running and new jobs can be submitted. We will continue to monitor scheduler performance and we appreciate your patience as we work through this. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Post 9/2/22 4:33 PM]

Summary: The Phoenix scheduler became non-responsive this afternoon around 4pm.

Details: The Torque resource manager on the Phoenix scheduler shut down unexpectedly around 4:00 PM. The PACE team is working on a resolution.

Impact: Commands such as “qsub” and “qstat” are affected, so new jobs cannot be submitted, including via Phoenix Open OnDemand. Running jobs are not expected to be interrupted, but they cannot currently be queried.

Thank you for your patience this afternoon. Please contact us at pace-support@oit.gatech.edu with any questions.

Hive Cluster Migration to Slurm Scheduler and Update to Software Stack

Dear Hive researchers, 

The Hive cluster will be migrating to the Slurm scheduler, with the first phase scheduled for the August 10-12 maintenance period! PACE has worked closely with the Hive PIs on the migration plan to ensure minimal interruption to research. Slurm is a widely used scheduler on many research computing clusters, so you may have encountered it elsewhere (if commands like ‘sbatch’ and ‘squeue’ sound familiar, then you’ve used Slurm!). Hive will be the first cluster in PACE’s transition from Torque/Moab to Slurm. We expect the new scheduler to provide improved stability and reliability, offering a better user experience. At the same time, we will be updating our software stack. We will be offering extensive support to facilitate this migration.

The first phase will begin with the August maintenance period (August 10-12), during which 100 Hive compute nodes (of 484 total) will join our new “Hive-Slurm” cluster, while the rest remain in the existing Torque/Moab cluster. The 100 nodes will represent each existing queue/node type proportionally. Following the conclusion of maintenance, we strongly encourage all researchers to begin exploring the Slurm-based side of Hive and shifting their workflows over. As part of the phased migration approach, researchers will continue to have access to the existing Hive (Moab/Torque) cluster until the final phase of this migration, to ensure minimal interruption to research. Users will receive detailed communication on how to connect to the Hive-Slurm part of the cluster, along with other documentation and training.

The phased transition is planned in collaboration with the Hive Governance Committee, represented by the PIs on the NSF MRI grant that funds the cluster (Drs. Srinivas Aluru, Surya Kalidindi, C. David Sherrill, Richard Vuduc, and John H. Wise on behalf of Deirdre Shoemaker).  Following the migration of the first 100 nodes, the committee will review the status and consider the timing for migrating the remaining compute nodes to the ‘Hive-Slurm’ cluster.   

In addition to the scheduler migration, another significant change for researchers on Hive will be an update to the PACE Apps software stack. The Hive-Slurm cluster will feature a new set of provided applications, listed in our documentation. Please review this list of software we plan to offer on Hive post-migration and let us know via email (pace-support@oit.gatech.edu) if any software you are currently using on Hive is missing from that list. We encourage you to let us know as soon as possible to avoid any potential delay to your research as the migration process concludes. We have reviewed batch job logs to determine which packages are in use and upgraded them to the latest versions. Researchers installing or writing their own software will also need to recompile applications against the new MPI and other libraries.

PACE will provide documentation, training sessions, and additional support (e.g., an increased frequency of PACE consulting sessions) to aid you as you transition your workflows to Slurm. Prior to the launch, we will have updated documentation as well as a guide for converting job scripts from PBS to Slurm-based commands (a brief conversion sketch follows below). We will also offer specialized virtual training sessions (PACE Slurm Orientation) on the use of Slurm on Hive. Additionally, we are increasing the frequency of our PACE consulting sessions during this migration phase, and you are invited to join PACE Consulting Sessions or to email us for support. The schedule for PACE Slurm orientation and consulting sessions will be communicated soon.
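
To illustrate the kind of conversion the guide will cover, here is a minimal sketch of a simple PBS batch script and a roughly equivalent Slurm script. The job name, resource values, and script names are placeholders for illustration only, not the final Hive-Slurm configuration; please refer to the forthcoming documentation for the exact settings.

    #!/bin/bash
    # --- PBS (Torque/Moab) version, submitted with: qsub my_job.pbs ---
    #PBS -N myjob                  # job name (placeholder)
    #PBS -l nodes=1:ppn=4          # 1 node, 4 processors per node
    #PBS -l walltime=01:00:00      # 1 hour wall time
    cd $PBS_O_WORKDIR              # run from the submission directory
    ./my_program

    #!/bin/bash
    # --- Roughly equivalent Slurm version, submitted with: sbatch my_job.sbatch ---
    #SBATCH -J myjob               # job name (placeholder)
    #SBATCH -N 1                   # 1 node
    #SBATCH --ntasks-per-node=4    # 4 tasks per node
    #SBATCH -t 01:00:00            # 1 hour wall time
    cd $SLURM_SUBMIT_DIR           # run from the submission directory
    ./my_program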

You will notice a few other changes to Hive in the new environment. There continues to be no charge to use Hive. As part of this migration, we are introducing a new feature in which each job will require a “tracking account” to be provided for reporting purposes. Researchers who use the Phoenix cluster will be familiar with this accounting feature; however, the tracking accounts on Hive will have neither balances nor limitations, as they will be used solely for cluster utilization metrics. We will provide additional details prior to the launch of Hive-Slurm. We will also restructure access to GPUs to increase utilization while continuing to support short jobs.
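
For illustration only, we expect the tracking account to be supplied the way Slurm ordinarily handles accounts, via the --account (-A) option; the account name below is a placeholder, and the actual account names will be communicated before launch.

    sbatch -A <tracking-account> my_job.sbatch    # on the command line
    #SBATCH -A <tracking-account>                 # or as a directive inside the script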

We are excited to launch Slurm on PACE as we continue working to improve Georgia Tech’s research computing infrastructure, and we will be providing additional information and support in the coming weeks through documentation, support tickets, and live sessions. Please contact us with any questions or concerns about this transition.  

Best, 

-The PACE Team 

 

[08/08/22 update]

As you already know, the Hive cluster will be migrating to the Slurm scheduler, with the first phase scheduled for the August 10-12 maintenance period! This is a follow-up to our initial notification on 7/27/2022. PACE will provide all the necessary documentation, orientation, and additional PACE consulting sessions to support a smooth transition of your workflows to Slurm.

 

Documentation – Our team is working on documentation to guide you through the new Hive-Slurm environment and the conversion of submission scripts to Slurm. We have drafted material covering 1) login details, partitions, and tracking accounts for the new Hive-Slurm cluster; 2) guidelines on converting existing PBS scripts and commands to Slurm (a brief command mapping is sketched below); and 3) details on using Slurm on Hive, with examples for writing new scripts. Links to the documentation will be provided soon!
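
As a preview of the command-level guidance, the most common Torque commands map to Slurm roughly as follows; this is a general sketch, and the forthcoming documentation will be the authoritative reference for Hive-Slurm.

    qsub my_job.pbs    ->  sbatch my_job.sbatch        # submit a batch job
    qstat -u $USER     ->  squeue -u $USER             # check the status of your jobs
    qdel <jobid>       ->  scancel <jobid>             # cancel a job
    qstat -f <jobid>   ->  scontrol show job <jobid>   # detailed information about a job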

 

Orientation sessions – PACE will host orientation sessions on the migration to Slurm. Attendance is open, and no registration is required. Find the details for the first two sessions here.

When: Tuesday, Aug 16, 1-2 PM and Wednesday, Aug 24, 1-2 PM 

What is discussed: An introduction to the new Hive-Slurm environment and Slurm usage on Hive, plus Q&A for broad questions. The orientation will provide the information you need to get started on converting scripts. PACE will work with individuals and provide hands-on help during the later consulting sessions.

 

PACE Consulting sessions – PACE will offer consulting sessions at a higher frequency to help researchers get onboarded in the new Hive-Slurm environment and provide one-on-one help converting their PBS scripts to Slurm. For the first month following the maintenance period, we will host consulting sessions twice per week rather than once. Starting August 18th, you can join us through the same link we currently use for consulting – find more details here.

When: Tuesdays, 2:00-3:45 PM, and Thursdays, 10:30 AM-12:15 PM, repeating weekly.

Purpose: In addition to any PACE-related queries or issues, you can join a session to get help from experts on converting your scripts to Slurm on the new Hive-Slurm cluster.

 

Software Changes – The Slurm cluster will feature a new set of provided applications listed in our documentation. As a gentle reminder, please review this list of software we plan to offer on Hive post-migration and let us know via email (pace-support@oit.gatech.edu) if any software you currently use on Hive is missing from that list. We encourage you to let us know as soon as possible to avoid any potential delay in your research as the migration process concludes. A couple of points to note:  

  1. Researchers installing or writing their own software will also need to recompile applications to reflect new MPI and other libraries.  
  2. The commands pace-jupyter-notebook and pace-vnc-job will be retired with the migration to Slurm. Instead, OnDemand will be available for Hive-Slurm (online after maintenance day) via the existing portal. Please use OnDemand to access Jupyter notebooks, VNC sessions, and more on Hive-Slurm via your browser. 

We are excited to launch Slurm on Hive as we continue working to improve Georgia Tech’s research computing infrastructure, and we strive to provide all the support you need to complete this transition with minimal interruption to your research. We will follow up with additional updates and timely reminders as needed. In the meantime, please contact us with any questions or concerns about this transition.

 

Hive scheduler outage

Summary: The Hive scheduler became non-responsive last evening and was restored at approximately 8:30 AM today.

Details: The Torque resource manager on the Hive scheduler stopped responding around 7:00 PM yesterday. The PACE team restored its function around 8:30 AM this morning and is continuing to monitor its status. The scheduler was fully functional for some time after the system utility repair yesterday afternoon, and it is not clear if the issues are connected.

Impact: Commands such as “qsub” and “qstat” would not have worked, so new jobs could not be submitted, including via Hive Open OnDemand. Running jobs were not interrupted.

Thank you for your patience last night. Please contact us at pace-support@oit.gatech.edu with any questions.

[Resolved] Utility error prevented new jobs starting on Hive, Phoenix, PACE-ICE, and COC-ICE

(updated to reflect that Hive was impacted as well)

Summary: An error in a system utility resulted in the Hive, Phoenix, PACE-ICE, and COC-ICE clusters temporarily not launching new jobs. It has been repaired, and jobs have resumed launching.

Details: An unintended update to the system utility that checks the health of compute nodes resulted in all Hive, Phoenix, PACE-ICE, and COC-ICE compute nodes being recorded as down shortly before 4:00 PM today, even when there was in fact no issue with them. The scheduler will not launch new jobs on nodes marked down. After the issue was corrected, all nodes are again correctly reporting their status, and jobs resumed launching on all affected clusters as of 6:30 PM.

Impact: As all nodes appeared down, no new jobs could launch; they instead remained in the queue after being submitted. Running jobs were not impacted. Interactive jobs that were waiting to start might have been cancelled, in which case the researcher should re-submit.

Thank you for your patience this afternoon. Please contact us at pace-support@oit.gatech.edu with any questions.

PACE Maintenance Period (August 10 – 12, 2022)

[8/12/22 5:00 PM Update]

PACE continues work to deploy Hive-Slurm. Maintenance on Hive-Slurm only will be extended into next week, as we complete setting up the new environment. At this time, please use the existing (Moab/Torque) Hive, which was released earlier today. We will provide another update next week when the Slurm cluster is ready for research, along with details about how to access and use the new scheduler and updated software stack.

The Slurm Orientation session previously announced for Tuesday, August 16, will be rescheduled for a later time.

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[8/12/22 2:20 PM Update]

The Phoenix, existing Hive (Moab/Torque), Firebird, PACE-ICE, COC-ICE, and Buzzard clusters are now ready for research and learning. We have released all jobs that were held by the scheduler.

We are continuing to work on launching the Hive-Slurm cluster, and we will provide another update to Hive researchers later today. Maintenance on the existing Hive (Moab/Torque) cluster has completed, and researchers can resume using it.

The next maintenance period for all PACE clusters is November 2, 2022, at 6:00 AM through November 4, 2022, at 11:59 PM. Additional maintenance periods are tentatively scheduled for 2023 on January 31 – February 2, May 9-11, August 8-10, and October 31 – November 2.

Status of activities:

ITEMS REQUIRING USER ACTION:

  • [Complete][Utilities] PACE will merge the functionality of pace-whoami into the pace-quota utility. Please use the pace-quota command to view all relevant information about your account, including storage directories and usage, and job charge or tracking accounts. Running pace-whoami will now report the same output as pace-quota (a brief usage sketch follows this list).
  • [In progress][Hive] Slurm migration and software stack update first phase for Hive cluster – see recent announcement for details
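
A minimal usage sketch; output details are described in the PACE documentation rather than reproduced here:

    pace-quota      # reports storage directories and usage, plus job charge or tracking accounts
    pace-whoami     # now produces the same report as pace-quota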

ITEMS NOT REQUIRING USER ACTION:

  • [Complete][Hive][Storage] Cable replacement for GPFS (project/scratch) controller
  • [Complete][Datacenter] Transformer repairs
  • [Complete][Datacenter] Cooling tower cleaning
  • [Complete][Scheduler] Accounting database maintenance
  • [Complete][Firebird][Network] Add VPC redundancy
  • [Complete][Phoenix][Storage] Replace redundant power supply on Lustre storage system

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[8/9/22 Update]

This is a reminder that our next PACE maintenance period is scheduled to begin tomorrow at 6:00 AM on Wednesday, August 10, and end at 11:59 PM on Friday, August 12. As usual, jobs that request durations that would extend into the maintenance period will be held by the scheduler to run after maintenance is complete. During the maintenance window, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.

Tentative list of activities:

ITEMS REQUIRING USER ACTION:

  • [Utilities] PACE will merge the functionality of pace-whoami into the pace-quota utility. Please use the pace-quota command to view all relevant information about your account, including storage directories and usage, and job charge or tracking accounts. Running pace-whoami will now report the same output as pace-quota.
  • [Hive] Slurm migration and software stack update first phase for Hive cluster – see recent announcement for details

ITEMS NOT REQUIRING USER ACTION:

  • [Hive][Storage] Cable replacement for GPFS (project/scratch) controller
  • [Datacenter] Transformer repairs
  • [Datacenter] Cooling tower cleaning
  • [Scheduler] Accounting database maintenance
  • [Firebird][Network] Add VPC redundancy
  • [Phoenix][Storage] Replace redundant power supply on Lustre storage system

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[8/3/22 Update]

This is a reminder that our next PACE maintenance period is scheduled to begin at 6:00 AM on Wednesday, August 10, and end at 11:59 PM on Friday, August 12. As usual, jobs that request durations that would extend into the maintenance period will be held by the scheduler to run after maintenance is complete. During the maintenance window, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.

Tentative list of activities:

ITEMS REQUIRING USER ACTION:

  • [Hive] Slurm migration and software stack update first phase for Hive cluster – see recent announcement for details

ITEMS NOT REQUIRING USER ACTION:

  • [Hive][Storage] Cable replacement for GPFS (project/scratch) controller
  • [Datacenter] Transformer repairs
  • [Datacenter] Cooling tower cleaning
  • [Scheduler] Accounting database maintenance
  • [Firebird][Network] Add VPC redundancy
  • [Phoenix][Storage] Replace redundant power supply on Lustre storage system

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[7/27/22 Update]

As previously announced, our next PACE maintenance period is scheduled to begin at 6:00 AM on Wednesday, August 10, and end at 11:59 PM on Friday, August 12. As usual, jobs that request durations that would extend into the maintenance period will be held by the scheduler to run after maintenance is complete. During the maintenance window, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.

Tentative list of activities:

ITEMS REQUIRING USER ACTION:
• [Hive] Slurm migration and software stack update first phase for Hive cluster – see recent announcement for details

ITEMS NOT REQUIRING USER ACTION:
• [Hive][Storage] Cable replacement for GPFS (project/scratch) controller
• [Datacenter] Transformer repairs
• [Datacenter] Cooling tower cleaning
• [Scheduler] Accounting database maintenance
• [Firebird][Network] Add VPC redundancy

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[7/18/22 Early reminder]

Dear PACE Users,

This is a friendly reminder that our next maintenance period is scheduled to begin at 6:00 AM on Wednesday, 08/10/2022, and is tentatively scheduled to conclude by 11:59 PM on Friday, 08/12/2022. As usual, jobs with resource requests that would extend into the maintenance period will be held by the scheduler until after maintenance is complete. During this maintenance period, access to all PACE-managed computational and storage resources will be unavailable.

As we get closer to this maintenance, we will share further details on the tasks, which will be posted here.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

 

Phoenix scheduler outage

[Update 7/15/22 5:10 PM]

The PACE team has identified some known issues and addressed them to restore scheduler functionality. The scheduler is back up and running, and new jobs can be submitted. We will continue to monitor its performance over the next week, and we appreciate your patience as we work through this. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Post 7/15/22 3:12 PM]

Summary: The Phoenix scheduler has been responding inconsistently today. While we work toward a full resolution, we mitigated the issue by restarting the scheduler at approximately 2:00 PM today. The scheduler will be shut down temporarily so we can continue restoring it to full capacity.

Details: An unexpected scheduler crash earlier this week has resulted in ongoing issues that we are actively working with the vendor to fully resolve. The resource manager has been unable to detect all free resources. As a result, the Torque resource manager on the Phoenix scheduler was not accepting certain interactive jobs this morning, and some jobs have been waiting in the queue for an unusually long time before launching. The PACE team restarted the scheduler and restored some of its function around 2:00 PM and is continuing to work toward a full resolution as soon as possible. As a result, the scheduler will be shut down temporarily for a system-wide restart.

Impact: Interactive job submissions requesting a relatively large number of processors or a large amount of memory were failing and being cancelled without an error message. The wait time for jobs in the queue has also been longer than usual. These issues have been resolved. However, while the scheduler is down, new jobs cannot be submitted and scheduler commands such as qstat will not work. Currently running jobs will complete without interruption.

Thank you for your patience this morning. Please contact us at pace-support@oit.gatech.edu with any questions. We will follow up with another status message later today.

Best,

-The PACE Team