Slow Storage on Phoenix

[Update 12/5/22 10:45 AM]

Performance on Phoenix project & scratch storage has returned to normal. PACE continues to investigate the root cause of last week’s slowness, and we would like to thank the researchers we contacted with questions about their workflows. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Post 12/2/22 1:11 PM]

Summary: Researchers may experience slow performance on Phoenix project & scratch storage.

Details: Over the past three days, Phoenix has experienced intermittent slowness on the Lustre filesystem hosting project & scratch storage due to heavy utilization. PACE is investigating the source of the heavy load on the storage system.

Impact: Any jobs or commands that read or write on project or scratch storage may run more slowly than normal.

Thank you for your patience as we continue to investigate. Please contact us at pace-support@oit.gatech.edu with any questions.

 

Scratch Deletion Resumption on Phoenix & Hive

Monthly scratch deletion will resume on the Phoenix and Hive clusters in December, in accordance with PACE’s scratch deletion policy for files over 60 days old. Scratch deletion has been suspended since May 2022, due to an issue with a software upgrade on Phoenix’s Lustre storage system that was resolved during the November maintenance period. Researchers with data scheduled for deletion will receive warning emails on Tuesday, December 6, and Tuesday, December 13, and files will be deleted on Tuesday, December 20. If you receive an email notification next week, please review the files scheduled for deletion and contact PACE if you need additional time to relocate the files.

Scratch is intended to be temporary storage, and regular deletion of old files allows PACE to offer a large space at no cost to our researchers. Please keep in mind that scratch space is not backed up, and any important data for your research should be relocated to your research group’s project storage.
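If you would like to see which of your scratch files fall under the 60-day policy and move anything you need to keep, a minimal shell sketch is below. The ~/scratch symlink and the destination project-storage path are illustrative placeholders; substitute the actual paths for your account.

    # List files in your scratch directory not modified in the last 60 days.
    # The ~/scratch path is an example; use your actual scratch location.
    find ~/scratch -type f -mtime +60

    # Copy anything you need to keep to your research group's project storage
    # before the deletion date (the destination path here is hypothetical).
    rsync -av ~/scratch/important_results/ ~/p-yourgroup/important_results/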

If you have any questions about scratch or any other storage location on PACE clusters, please contact PACE.

New A100 GPU and AMD CPU nodes available on Phoenix-Slurm

Dear Phoenix researchers, 

We have migrated 800 (out of 1319) nodes of our existing hardware as part of our ongoing Phoenix cluster migration to Slurm. PACE has continued our effort to provide a heterogeneous hardware environment by adding 5 GPU nodes (2x Nvidia A100s per node) and 4 CPU nodes (2x AMD Epyc 7713 processors with 128 cores per node) to the Phoenix-Slurm cluster.

Both service offerings provide exciting new hardware for research computing at PACE. The A100 GPU nodes, which also include 2x AMD Epyc 7513 processors with 64 cores per node, provide a powerful option for GPU computing in machine learning and scientific applications. The AMD Epyc CPU nodes provide a cost-effective alternative to Intel processors, with energy and equipment savings passed on to our users through a lower rate than our current base option. Despite the lower cost, the AMD CPUs provide great value in traditional HPC due to their higher memory bandwidth and core density. You can find out more about our latest costs in our rate study here.

You can find more information on our new nodes in our documentation here. We also provide documentation on how to use the A100 GPU nodes and AMD CPU nodes on Phoenix-Slurm. If you need further assistance using these new resources, please feel free to reach out to us at pace-support@oit.gatech.edu or attend our next consulting session.
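For reference, here is a minimal sketch of a Slurm batch script requesting one of the new A100 GPU nodes. The account name, QOS/queue name, GPU type string, and module name are assumptions for illustration; please confirm the exact values in the PACE documentation linked above.

    #!/bin/bash
    #SBATCH -J a100-test                  # job name
    #SBATCH -A gts-yourpi                 # charge account (hypothetical; use your own)
    #SBATCH -N 1 --ntasks-per-node=8      # one node, eight tasks
    #SBATCH --gres=gpu:A100:1             # one A100 GPU (type string is an assumption)
    #SBATCH -t 1:00:00                    # one-hour wall time
    #SBATCH -q inferno                    # QOS/queue name assumed; check PACE docs

    module load cuda                      # module name may differ on Phoenix-Slurm
    nvidia-smi                            # confirm that the GPU is visible to the job
    srun ./my_gpu_application             # placeholder for your own executable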

Best,  

-The PACE Team

Action Required: Globus Certificate Authority Update

Globus is updating the Certificate Authority (CA) used for its transfer service, and action is required to continue using existing Globus endpoints. PACE updated the Phoenix, Hive, and Vapor server endpoints during the recent maintenance period. To continue using Globus Connect Personal to transfer files between your own computer and PACE or other computing sites, please update your Globus client to version 3.2.0 by December 12, 2022. Full details are available on the Globus website.

Please contact us at pace-support@oit.gatech.edu with any questions.

 

PACE Maintenance Period (November 2 – 4, 2022)

[11/4/2022 Update]

The Phoenix (Moab/Torque and Slurm), Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard clusters are now ready for research and learning. We have released all jobs that were held by the scheduler. 

The second phase of the Phoenix-Slurm cluster migration, covering 300 additional nodes (for a combined total of 800 of 1319 nodes), completed successfully, and researchers can resume using the cluster.

The next maintenance period for all PACE clusters is January 31, 2023, at 6:00 AM through February 2, 2023, at 11:59 PM. Additional maintenance periods are tentatively scheduled for 2023 on May 9-11, August 8-10, and October 31-November 2. Additional phases for the Phoenix-Slurm cluster migration are tentatively scheduled for November 29 in 2022, and January 4, 17, and 31 in 2023. 

Status of activities: 

ITEMS REQUIRING USER ACTION: 

  • [Complete][Hive] New Hive login servers may trigger an SSH security warning because their host keys have changed. If you see this warning, remove the stale host key from your local SSH cache to clear the message (a sketch follows below). 
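For users who see the SSH warning, the usual fix is to remove the stale host key from your known_hosts file. A brief sketch follows; the hostname is an example, so use the Hive login hostname you normally connect to.

    # Remove the old host key for the Hive login node from your known_hosts file;
    # the hostname below is an example -- use the one you normally connect to.
    ssh-keygen -R login-hive.pace.gatech.edu

    # The next connection will prompt you to accept the new host key.
    ssh yourusername@login-hive.pace.gatech.edu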

ITEMS NOT REQUIRING USER ACTION: 

  • [Complete] [Phoenix] Slurm migration for second phase of Phoenix cluster (300 additional nodes for a combined total of 800 nodes [out of 1319]) 
  • [Complete] [Phoenix] Reconfigure Phoenix in PACE DB 
  • [Complete] [Hive][Storage] Cable replacement for GPFS (project/scratch) controller 
  • [Complete] [Firebird][Storage] Migrate some Firebird projects from current file servers to new file server 
  • [Complete] [Firebird] Reconfigure Firebird in PACE DB 
  • [Complete] [OSG] Update Nvidia drivers 
  • [Complete] [OSG][Network] Remove IB drivers on osg-login2 
  • [Complete] [Datacenter] Transformer repairs 
  • [Complete] [Network] Update VRF configuration on compute racks 
  • [Complete] [Storage] Upgrade Globus to 5.4.50 for new CA 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[11/2/2022 Update]

This is a reminder that our next PACE Maintenance period has now begun and is scheduled to end at 11:59 PM on Friday, 11/04/2022. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.

Tentative list of activities: 

ITEMS REQUIRING USER ACTION: 

  • [Hive] New Hive login servers may trigger an SSH security warning because their host keys have changed. If you see this warning, remove the stale host key from your local SSH cache to clear the message. 

ITEMS NOT REQUIRING USER ACTION: 

  • [Phoenix] Slurm migration for second phase of Phoenix cluster (300 additional nodes for a combined total of 800 nodes [out of 1319]) 
  • [Phoenix] Reconfigure Phoenix in PACE DB 
  • [Hive][Storage] Cable replacement for GPFS (project/scratch) controller 
  • [Firebird][Storage] Migrate some Firebird projects from current file servers to new file server 
  • [Firebird] Reconfigure Firebird in PACE DB 
  • [OSG] Update Nvidia drivers 
  • [OSG][Network] Remove IB drivers on osg-login2 
  • [Datacenter] Transformer repairs 
  • [Network] Update VRF configuration on compute racks 
  • [Storage] Upgrade Globus to 5.4.50 for new CA 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[10/31/2022 Update]

This is a reminder that our next PACE Maintenance period is scheduled to begin later this week at 6:00 AM on Wednesday, 11/02/2022, and it is tentatively scheduled to conclude by 11:59 PM on Friday, 11/04/2022. As usual, jobs with resource requests that would overlap the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard. 

Tentative list of activities: 

ITEMS REQUIRING USER ACTION: 

  • [Hive] New Hive login servers may trigger an SSH security warning because their host keys have changed. If you see this warning, remove the stale host key from your local SSH cache to clear the message. 

ITEMS NOT REQUIRING USER ACTION: 

  • [Phoenix] Slurm migration for second phase of Phoenix cluster (300 additional nodes for a combined total of 800 nodes [out of 1319]) 
  • [Phoenix] Reconfigure Phoenix in PACE DB 
  • [Hive][Storage] Cable replacement for GPFS (project/scratch) controller 
  • [Firebird][Storage] Migrate some Firebird projects from current file servers to new file server 
  • [Firebird] Reconfigure Firebird in PACE DB 
  • [OSG] Update Nvidia drivers 
  • [OSG][Network] Remove IB drivers on osg-login2 
  • [Datacenter] Transformer repairs 
  • [Network] Update VRF configuration on compute racks 
  • [Storage] Upgrade Globus to 5.4.50 for new CA 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[10/24/2022 Early Reminder]

Dear PACE Users,

This is a friendly reminder that our next PACE Maintenance period is scheduled to begin at 6:00 AM on Wednesday, 11/02/2022, and it is tentatively scheduled to conclude by 11:59 PM on Friday, 11/04/2022. As usual, jobs with resource requests that would overlap the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.

Tentative list of activities:

ITEMS REQUIRING USER ACTION:

  • [Hive] New Hive login servers may trigger an SSH security warning because their host keys have changed. If you see this warning, remove the stale host key from your local SSH cache to clear the message.

ITEMS NOT REQUIRING USER ACTION:

  • [Phoenix] Slurm migration for second phase of Phoenix cluster (300 additional nodes for a combined total of 800 nodes [out of 1319])
  • [Phoenix] Reconfigure Phoenix in PACE DB
  • [Hive][Storage] Cable replacement for GPFS (project/scratch) controller
  • [Firebird][Storage] Migrate some Firebird projects from current file servers to new file server
  • [Firebird] Reconfigure Firebird in PACE DB
  • [OSG] Update Nvidia drivers
  • [OSG][Network] Remove IB drivers on osg-login2
  • [Datacenter] Transformer repairs
  • [Network] Update VRF configuration on compute racks
  • [Storage] Upgrade Globus to 5.4.50 for new CA

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

Phoenix Scheduler Outage

Summary: The Phoenix scheduler was non-responsive between Wed 10/19/2022 9:30pm and Thurs 10/20/2022 12:30am.

Details: The Torque resource manager on the Phoenix scheduler became non-responsive around 9:30 PM last night. We restarted the scheduler at 12:30 AM this morning.

Impact: Running jobs were not interrupted, but no new jobs could be submitted or cancelled while the scheduler was down, including via Phoenix Open OnDemand. Commands such as “qsub” and “qstat” were affected as well.

Thank you for your patience this morning. Please contact us at pace-support@oit.gatech.edu with any questions.

Firebird Storage Outage

[Update 2022/10/21, 10:00am]
Summary: The Firebird storage outage recurred this morning at approximately 3:45 AM, and repairs were completed at approximately 9:15 AM. ASDL, LANNS, and Montecarlo projects were affected. Orbit and RAMC were not affected.
Details: Storage for three Firebird projects became unavailable this morning, and PACE has now restored the system. Jobs that failed at the time of the outage will be refunded. At this time, we have adjusted several settings, and we continue investigating the root cause of the issue.
Impact: Researchers on ASDL, LANNS, and Montecarlo would have been unable to access Firebird this morning, and running jobs on these projects would have failed as well. Please resubmit any failed jobs to run them again.
Thank you for your patience as we restored the system this morning. Please contact us at pace-support@oit.gatech.edu if you have any questions.
[Update 2022/10/19, 10:00am CST]
Everything is back to normal on Firebird, apologies for any inconvenience!
[Original post]
We are having an issue with Firebird storage. Jobs on ASDL, LANNS, and Montecarlo are affected. Rebooting the storage server causes login node issues on LANNS and Montecarlo. We are actively working on the problem and expect it to be resolved by noon today.
Orbit and RAMC are not affected by this storage outage.

Please contact us at pace-support@oit.gatech.edu if you have any questions.

Firebird inaccessible

[Update 10/3/22 10:45 AM]

Access to Firebird and the PACE VPN has been restored, and all systems should be functioning normally. If you do not see the PACE VPN as an option in the GlobalProtect client, please disconnect from the GT VPN and reconnect for it to appear again.

Urgent maintenance on the GlobalProtect VPN device on Thursday night inadvertently led to the loss of PACE VPN access, which was restored this morning.

Please contact us at pace-support@oit.gatech.edu with questions, or if you are still unable to access Firebird.

 

[Original Message 10/3/22 9:40 AM]

Summary: The Firebird cluster and PACE VPN are currently inaccessible. OIT is working to restore access.

Details: The Firebird cluster was found to be inaccessible over the weekend. PACE is working with OIT colleagues to identify the cause and restore access.

Impact: Researchers are unable to connect to the PACE VPN or access the Firebird cluster.

Thank you for your patience as we work to restore access. Please contact us at pace-support@oit.gatech.edu with questions.

Phoenix Project & Scratch Storage Cables Replacement

[Update 2022/10/05, 12:40PM CST]

Work on the first cable has been completed, and the associated systems connecting to the storage have been restored to normal. We will assess the stability of the system after this first replacement and schedule the second cable replacement sometime next week.

 

[Update 2022/10/05, 10:10AM CST]

While the work is still ongoing, we are experiencing issues with one of the cable replacements. Although a redundant controller is in place, we have already identified an impact on some users whose data are not currently accessible. To minimize the impact on the system, we have paused the scheduler to prevent new jobs from starting and crashing. Running jobs may be impacted by the storage outage.

Please be mindful about opening new tickets to pace-support@oit.gatech.edu if your issue is storage related.

 

[Original post]

Summary: A cable replacement on Phoenix project & scratch storage may cause a potential outage and temporarily decreased performance afterward.

Details: Two cables connecting enclosures of the Phoenix Lustre device, which hosts project and scratch storage, need to be replaced, beginning around 10 AM on Wednesday, October 5, 2022. The cables will be replaced one at a time, and the work is expected to take about 4 hours. After the replacement, pools will need to rebuild over the course of about a day.

Impact: Because a redundant controller remains in place while work is performed on one cable, there should not be an outage during the cable replacement. However, a similar previous replacement caused storage to become unavailable, so this remains a possibility. If this happens, your job may fail or run without making progress; please cancel such a job and resubmit it once storage availability is restored. In addition, performance will be slower than usual for about a day following the repair as pools rebuild, so jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again. PACE will monitor Phoenix Lustre storage throughout this procedure, and if a loss of availability occurs, we will update you.
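If you do need to cancel and resubmit a stalled job, a minimal sketch using the Torque commands in use on Phoenix at this time is below; the job ID and script name are placeholders.

    # Check your queued and running jobs.
    qstat -u $USER

    # Cancel a job stalled by the storage issue, then resubmit it once
    # storage is available again (<jobid> and myjob.pbs are placeholders).
    qdel <jobid>
    qsub myjob.pbs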

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Phoenix Cluster Migration to Slurm Scheduler

Dear Phoenix researchers,

The Phoenix cluster will be migrating to the Slurm scheduler over the next couple of months, with the first phase scheduled for October 10! PACE has worked closely with the PACE Advisory Committee (PAC) on a plan for the migration to ensure minimal interruption to research. Slurm is a widely used, open-source scheduler found on many research computing clusters, so you may have experienced it elsewhere. If commands like ‘sbatch’ and ‘squeue’ sound familiar to you, then you have used Slurm! Phoenix will be the second cluster (after Hive) in PACE’s transition from Torque/Moab to Slurm. We expect the new scheduler to provide improved stability and reliability, offering a better user experience. We will be updating our software stack at the same time and offering support through orientation and consulting sessions to facilitate this migration.

Phased Migration
The phased transition is planned in collaboration with the faculty-led PACE Advisory Committee, which comprises a representative group of PACE staff and faculty members. We are planning a staggered, phased migration for the Phoenix cluster. The six phases include the following dates and numbers of nodes:
  • October 10, 2022 – 500 nodes
  • November 2, 2022 (PACE Maintenance Period) – 300 nodes
  • November 29, 2022 – 200 nodes
  • January 4, 2023 – 100 nodes
  • January 17, 2023 – 100 nodes
  • January 31, 2023 (PACE Maintenance Period) – 119 nodes

The first phase will begin October 10, during which 500 Phoenix compute nodes (of 1319 total) will join our new “Phoenix-Slurm” cluster, while the rest will remain on the existing Phoenix cluster. The 500 nodes will represent each existing node type proportionally. Following the first phase, we strongly encourage all researchers to begin shifting their workflows to the Slurm-based side of Phoenix to take advantage of the improved features and shorter queue wait times. As part of the phased migration approach, researchers will also continue to have access to the existing Phoenix cluster until the final phase of the migration, to ensure minimal interruption to research. Users will receive detailed communication on how to connect to the Phoenix-Slurm cluster along with other documentation and training.

Software Stack
In addition to the scheduler migration, another significant change for researchers on Phoenix will be an update to the PACE Apps software stack. The Phoenix-Slurm cluster will feature a new set of provided applications listed in our documentation. Please review the list of software we plan to offer on Phoenix post-migration and let us know via email (pace-support@oit.gatech.edu) if any software you currently use on Phoenix is missing from that list. We encourage you to let us know as soon as possible to avoid any potential delay to your research as the migration process concludes. We have reviewed batch job logs to determine which packages are in use and have upgraded them to their latest versions. Researchers who install or write their own software will also need to recompile their applications against the new MPI and other libraries (a sketch follows below).
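As an illustration of the recompilation step, a minimal sketch is shown below. The module names and launch command are assumptions; run module avail on Phoenix-Slurm and consult the PACE documentation for the recommended toolchain.

    # Load a compiler and MPI stack from the new software environment.
    # The module names are assumptions; run "module avail" on Phoenix-Slurm
    # to see what is actually provided.
    module load gcc mvapich2

    # Rebuild your application against the new MPI libraries.
    mpicc -O2 -o my_solver my_solver.c

    # Launch under Slurm (srun) rather than mpirun if PACE documentation recommends it.
    srun -n 8 ./my_solver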

Starting after the November PACE Maintenance period (November 2), we will no longer be accepting software installation requests for new software on the existing Phoenix cluster with Torque/Moab. All software requests after November 2 will be for Phoenix-Slurm. Additionally, all new researcher groups joining PACE after November 2 will be onboarded onto Phoenix-Slurm only.

Billing
You will notice a few other changes to Phoenix in the new environment. As with the current Phoenix cluster, faculty and their research teams will receive the full free-tier monthly allocation on Phoenix-Slurm (in addition to the one on the existing Phoenix cluster), equivalent to 10,000 CPU*hours on base hardware and usable on all architectures, as well as access to Embers, our free backfill queue. We will be charging users for jobs on Phoenix-Slurm.

For prepaid accounts (including CODA20 refresh accounts), PACE will split your account balances 50/50 on the Phoenix-Slurm and existing Phoenix (with Torque/Moab) clusters during the migration. For new computing credits purchased after Nov 1st, 75% will be allocated to the Phoenix-Slurm cluster. For new computing credits purchased after Jan 3, 100% will be allocated to the Phoenix-Slurm cluster.

For postpaid (monthly) accounts, PACE will apply the same spending limit, based on existing credits on Phoenix, to Phoenix-Slurm. Please be aware that for postpaid accounts this could lead to potential monthly overcharges if users run both clusters to 100% of the limit. However, we wanted researchers to have access to their full monthly limit for flexibility. For postpaid accounts, Principal Investigators and users are responsible for tracking their spending limits on both the Phoenix-Slurm and Phoenix clusters to avoid going over budget.

Support
PACE will provide documentation, training sessions, and additional support (e.g., an increased frequency of PACE consulting sessions) to aid you as you transition your workflows to Slurm. Prior to the launch, we will have updated documentation as well as a guide for converting job scripts from PBS to Slurm-based commands. We will also offer specialized virtual training sessions (PACE Slurm Orientation) on the use of Slurm on Phoenix. Additionally, we have increased the frequency of our PACE consulting sessions during this migration phase for the Fall and Spring semesters, and you are invited to join our PACE Consulting Sessions or to email us for support. The schedule for the upcoming PACE Phoenix-Slurm orientation sessions will be provided in future communications.
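As a preview of the kind of conversions the guide will cover, here is a sketch mapping common PBS directives and commands to their approximate Slurm equivalents; the account and queue names shown are placeholders, and the official PACE guide should be treated as authoritative.

    # PBS (Torque/Moab) directives           # Approximate Slurm equivalents
    #PBS -N myjob                            #SBATCH -J myjob
    #PBS -A GT-yourpi                        #SBATCH -A gts-yourpi         (account format may differ)
    #PBS -l nodes=2:ppn=24                   #SBATCH -N 2 --ntasks-per-node=24
    #PBS -l walltime=04:00:00                #SBATCH -t 04:00:00
    #PBS -q inferno                          #SBATCH -q inferno            (queue/QOS name assumed)
    #PBS -j oe                               #SBATCH -o myjob.out          (stdout/stderr are combined by default)

    # Submitting and monitoring jobs:
    #   qsub script.pbs      ->  sbatch script.sbatch
    #   qstat -u $USER       ->  squeue -u $USER
    #   qdel <jobid>         ->  scancel <jobid>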

We are excited to launch Slurm on Phoenix as we continue to improve Georgia Tech’s research computing infrastructure, and we will be providing additional information and support in the coming weeks through documentation, support tickets, and live sessions. Please contact us with any questions or concerns about this transition.

Best,
-The PACE Team