
Phoenix Cluster Migration to Slurm Scheduler – Phase 4

[Update 2023/01/04, 2:18PM EST]

Dear Phoenix researchers,

The fourth phase of the migration has been completed successfully – all nodes are back online, with 100 more compute nodes joining the Phoenix-Slurm cluster. In total, we have now migrated 1100 nodes (out of about 1319) from Phoenix to the Phoenix-Slurm cluster.

As a reminder, the final phases of the migration are scheduled to continue in January 2023, during which the remaining 219 nodes will join Phoenix-Slurm: 

  • Phase 5: January 17, 2023 – 100 nodes  
  • Phase 6: January 31, 2023 (PACE Maintenance Period) – about 119 nodes 

We strongly encourage all researchers to shift their workflows to the Slurm-based cluster. PACE provides documentation and consulting sessions to support a smooth transition of your workflows to Slurm. Our next PACE Slurm orientation is scheduled for Friday, January 20th @ 11am-12pm via Zoom. Torque/Moab will no longer be available to Phoenix users as of January 31st at 6 AM ET.
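
For researchers starting the transition, the sketch below illustrates roughly how common Torque/Moab job-script directives map onto Slurm equivalents. It is a general, minimal sketch based on standard Torque and Slurm usage rather than Phoenix-specific settings; the queue, partition, and account names shown are placeholders, and our Phoenix-Slurm documentation remains the authoritative reference.

```python
# Minimal sketch: translate common Torque/PBS directives to approximate Slurm equivalents.
# General mappings only -- queue, partition, and account names on Phoenix-Slurm are
# placeholders here; consult the PACE Slurm documentation for the real values.
#
# Command-line equivalents (for reference):
#   qsub script.pbs -> sbatch script.sbatch
#   qstat -u $USER  -> squeue -u $USER
#   qdel <jobid>    -> scancel <jobid>
import re

PBS_TO_SBATCH = [
    (re.compile(r"^#PBS\s+-N\s+(\S+)"),                 r"#SBATCH --job-name=\1"),
    (re.compile(r"^#PBS\s+-l\s+walltime=(\S+)"),        r"#SBATCH --time=\1"),
    (re.compile(r"^#PBS\s+-l\s+nodes=(\d+):ppn=(\d+)"), r"#SBATCH --nodes=\1 --ntasks-per-node=\2"),
    (re.compile(r"^#PBS\s+-l\s+pmem=(\S+)"),            r"#SBATCH --mem-per-cpu=\1"),  # memory units may need adjusting (e.g. 4gb -> 4G)
    (re.compile(r"^#PBS\s+-q\s+(\S+)"),                 r"#SBATCH --partition=\1"),    # queue/partition names differ between clusters
    (re.compile(r"^#PBS\s+-A\s+(\S+)"),                 r"#SBATCH --account=\1"),
    (re.compile(r"^#PBS\s+-o\s+(\S+)"),                 r"#SBATCH --output=\1"),
    (re.compile(r"^#PBS\s+-j\s+oe"),                    r"# (Slurm merges stdout and stderr by default)"),
]

def translate_line(line: str) -> str:
    """Return the approximate #SBATCH equivalent of a #PBS directive, or the line unchanged."""
    for pattern, replacement in PBS_TO_SBATCH:
        if pattern.match(line):
            return pattern.sub(replacement, line)
    return line

if __name__ == "__main__":
    old_header = [
        "#PBS -N my_job",
        "#PBS -l nodes=2:ppn=24",
        "#PBS -l walltime=12:00:00",
        "#PBS -q myqueue",        # placeholder queue name
        "#PBS -A GT-example",     # placeholder charge account
        "#PBS -j oe",
    ]
    for line in old_header:
        print(translate_line(line))
```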

PACE will be following up with additional updates and reminders in the upcoming weeks. In the meantime, please contact us with any questions or concerns about this transition. 

Best, 

-The PACE Team

[Update 2023/01/04, 6:00AM EST]

Dear Phoenix researchers, 

Just a reminder that the fourth phase of the migration will start today, January 4th, during which 100 additional nodes will join the Phoenix-Slurm cluster. 

The 100 nodes will be taken offline now (6am ET), and we do not expect there to be any impact to other nodes or jobs on Phoenix-Slurm. 

We will follow up with additional updates and reminders as needed. In the meantime, please email us if you have any questions or concerns about the migration. 

Best, 

– The PACE Team 

[Update 2023/01/03, 5:26PM EST]

Dear Phoenix researchers,

We have successfully migrated about 1000 nodes (out of about 1319 total) from Phoenix to the Phoenix-Slurm cluster. As a reminder, the fourth phase is scheduled to start tomorrow, January 4th, when 100 additional nodes will join the Phoenix-Slurm cluster. 

The 100 nodes will be taken offline tomorrow morning (January 4th) at 6am ET, and we do not expect there to be any impact to other nodes or jobs on Phoenix-Slurm. 

As recommended at the beginning of this migration, we strongly encourage all researchers to begin shifting their workflows to the Slurm-based side of Phoenix to take advantage of the improved features and queue wait times. We provide helpful information for the migration in our documentation and will hold a PACE Slurm Orientation on Friday, January 20th @ 11am-12pm via Zoom. 

Please email us or join a PACE Consulting Session if you have any questions or need assistance migrating your workflows to Slurm. 

Best, 

– The PACE Team

Storage-eas read-only during configuration change

[Update 1/9/23 10:58 AM]

The migration of storage-eas data to a new location is complete, and full read/write capability is available for all research groups on the device. Researchers may resume regular use of storage-eas, including writing new data to it.

Thank you for your patience as we completed these configuration changes to improve stability of storage-eas. Please email us at pace-support@oit.gatech.edu with any questions.


[Original Post 12/21/22 11:08 AM]

Summary: Researchers have reported multiple outages of the storage-eas server recently. To stabilize the storage, PACE will make configuration changes. The storage-eas server will become read-only at 3 PM today and will remain read-only until after the Winter Break, while the changes are being implemented. We will provide an update when write access is restored.

Details: PACE will remove the deduplication setting on storage-eas, which is causing performance and stability issues. Beginning this afternoon, the system will become read-only while all data is copied to a new location. After the copy is complete, we will enable access to the storage in the new location, with full read/write capabilities.

Impact: Researchers will not be able to write to storage-eas for up to two weeks. You may continue reading files from it on both PACE and external systems where it is mounted. While this move is in progress, PACE recommends that researchers copy any files they need for Phoenix jobs into their scratch directories and work from there, writing job output to scratch. Scratch provides each researcher with 15 TB of temporary storage on the Lustre parallel filesystem. Files in scratch can be copied to non-PACE storage via Globus.
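
As a rough illustration of the staging pattern described above, the sketch below copies input files from a read-only location into a scratch directory so that a job can read and write there. The paths are hypothetical placeholders rather than actual PACE mount points, and the sketch assumes Python 3.8+.

```python
# Sketch: stage read-only input data into scratch before running a job.
# Paths are hypothetical placeholders, not actual PACE mount points (assumes Python 3.8+).
import shutil
from pathlib import Path

READ_ONLY_SRC = Path("/storage/eas/myproject/inputs")        # placeholder for the read-only source
SCRATCH_DEST = Path.home() / "scratch" / "myproject_inputs"  # placeholder scratch destination

def stage_inputs(src: Path, dest: Path) -> None:
    """Copy files and subdirectories from read-only storage into scratch."""
    if not src.is_dir():
        raise FileNotFoundError(f"Source directory not found: {src}")
    dest.mkdir(parents=True, exist_ok=True)
    for item in src.iterdir():
        target = dest / item.name
        if item.is_dir():
            shutil.copytree(item, target, dirs_exist_ok=True)
        else:
            shutil.copy2(item, target)

if __name__ == "__main__":
    stage_inputs(READ_ONLY_SRC, SCRATCH_DEST)
    print(f"Staged inputs into {SCRATCH_DEST}; write job output there, then move results off scratch (e.g. via Globus).")
```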

Thank you for your patience as we complete these configuration changes to improve stability of storage-eas. Please email us at pace-support@oit.gatech.edu with any questions.

Phoenix Project & Scratch Storage Cable Replacement for Redundant Controller

[Update 2022/12/08, 5:52PM EST]
Work has been completed on the cable replacement for the redundant storage controller, and the associated systems connecting to the storage have been restored to normal. We were able to replace two cables on the controller without interruption to service.

[Update 2022/12/05, 9:00AM EST]
Summary: Phoenix project & scratch storage cable replacement on the redundant controller, with a potential outage and temporarily decreased performance afterward

Details: A cable connecting enclosures of the Phoenix Lustre device, which hosts project and scratch storage, to the redundant controller needs to be replaced, beginning around 10AM on Wednesday, December 8th, 2022. The cable replacement is expected to take about 3-4 hours. After the replacement, pools will need to be rebuilt over the course of about a day.

Impact: Because we are replacing a cable on the redundant controller while maintaining the main controller, there should not be an outage during the cable replacement. However, a similar replacement has previously caused storage to become unavailable, so an outage is possible. If this happens, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored. In addition, performance may be slower than usual for a day following the repair as pools rebuild. Jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again. PACE will monitor Phoenix Lustre storage throughout this procedure. If a loss of availability occurs, we will update you.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Slow Storage on Phoenix

[Update 12/5/22 10:45 AM]

Performance on Phoenix project & scratch storage has returned to normal. PACE continues to investigate the root cause of last week’s slowness, and we would like to thank those researchers we have contacted with questions about your workflows. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Post 12/2/22 1:11 PM]

Summary: Researchers may experience slow performance on Phoenix project & scratch storage.

Details: Over the past three days, Phoenix has experienced intermittent slowness on the Lustre filesystem hosting project & scratch storage due to heavy utilization. PACE is investigating the source of the heavy load on the storage system.

Impact: Any jobs or commands that read or write on project or scratch storage may run more slowly than normal.

Thank you for your patience as we continue to investigate. Please contact us at pace-support@oit.gatech.edu with any questions.


Scratch Deletion Resumption on Phoenix & Hive

Monthly scratch deletion will resume on the Phoenix and Hive clusters in December, in accordance with PACE’s scratch deletion policy for files over 60 days old. Scratch deletion has been suspended since May 2022, due to an issue with a software upgrade on Phoenix’s Lustre storage system that was resolved during the November maintenance period. Researchers with data scheduled for deletion will receive warning emails on Tuesday, December 6, and Tuesday, December 13, and files will be deleted on Tuesday, December 20. If you receive an email notification next week, please review the files scheduled for deletion and contact PACE if you need additional time to relocate the files.
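
For researchers who want to preview which of their scratch files fall under the 60-day policy, a sketch along the lines below lists files that have not been modified in more than 60 days. The scratch path is a placeholder, and the warning emails from PACE remain the authoritative list of files scheduled for deletion.

```python
# Sketch: list files under a scratch directory not modified in more than 60 days.
# The scratch path is a placeholder; PACE's warning emails are the authoritative list.
import os
import time
from pathlib import Path

SCRATCH_DIR = Path.home() / "scratch"   # placeholder; adjust to your scratch location
AGE_LIMIT_SECONDS = 60 * 24 * 60 * 60   # 60 days

def old_files(root: Path, age_limit: float = AGE_LIMIT_SECONDS):
    """Yield (path, age_in_days) for files older than the limit, by modification time."""
    now = time.time()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                age = now - path.stat().st_mtime
            except OSError:
                continue  # skip files that disappear or cannot be stat'ed
            if age > age_limit:
                yield path, age / 86400

if __name__ == "__main__":
    for path, age_days in old_files(SCRATCH_DIR):
        print(f"{age_days:7.1f} days  {path}")
```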

Scratch is intended to be temporary storage, and regular deletion of old files allows PACE to offer a large space at no cost to our researchers. Please keep in mind that scratch space is not backed up, and any important data for your research should be relocated to your research group’s project storage.

If you have any questions about scratch or any other storage location on PACE clusters, please contact PACE.

New A100 GPU and AMD CPU nodes available on Phoenix-Slurm

Dear Phoenix researchers, 

We have migrated 800 (out of 1319) nodes of our existing hardware as part of our ongoing Phoenix cluster migration to Slurm. PACE has continued our effort to provide a heterogeneous hardware environment by adding 5 GPU nodes (2x Nvidia A100s per node) and 4 CPU nodes (2x AMD Epyc 7713 processors with 128 cores per node) to the Phoenix-Slurm cluster.  

Both service offerings provide exciting new hardware for research computing at PACE. The A100 GPU nodes, which also include 2x AMD Epyc 7513 processors with 64 cores per node, give our users a powerful option for GPU computing in machine learning and scientific applications. The AMD Epyc CPU nodes provide a cost-effective alternative to Intel processors, with energy and equipment savings that we pass on to our users through a rate lower than our current base option, while still offering great value for traditional HPC thanks to their higher memory bandwidth and core density. You can find out more about our latest costs in our rate study here. 

You can find out more information on our new nodes in our documentation here. We also provide documentation on how to use the A100 GPU nodes and AMD CPU nodes on Phoenix-Slurm. If you need further assistance with using these new resources, please feel free to reach out to us at pace-support@oit.gatech.edu or attend our next consulting session.
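
As a starting point, the sketch below generates and submits a minimal Slurm batch script that requests one of the new A100 GPU nodes. The partition, account, and GRES type names are assumptions made for illustration; please use the values given in our Phoenix-Slurm documentation.

```python
# Sketch: generate and submit a minimal Slurm batch script targeting an A100 GPU node.
# Partition, account, and GRES type names are assumptions for illustration only;
# use the values from the PACE Phoenix-Slurm documentation.
import subprocess
from pathlib import Path

JOB_SCRIPT = """\
#!/bin/bash
#SBATCH --job-name=a100-test
#SBATCH --account=GT-example     # placeholder charge account
#SBATCH --partition=gpu          # placeholder partition name
#SBATCH --gres=gpu:A100:1        # request one A100 (GRES type name assumed)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=01:00:00

nvidia-smi                       # confirm which GPU was allocated
"""

if __name__ == "__main__":
    script_path = Path("a100_test.sbatch")
    script_path.write_text(JOB_SCRIPT)
    # Submit the generated script with sbatch; comment this out to only inspect the file.
    subprocess.run(["sbatch", str(script_path)], check=True)
```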

Best,  

-The PACE Team

Action Required: Globus Certificate Authority Update

Globus is updating the Certificate Authority (CA) used for its transfer service, and action is required to continue using existing Globus endpoints. PACE updated the Phoenix, Hive, and Vapor server endpoints during the recent maintenance period. To continue using Globus Connect Personal to transfer files to/from your own computers, please update your Globus client to version 3.2.0 by December 12, 2022. Full details are available on the Globus website. This update is required to continue transferring data between your local computer and PACE or other computing sites.

Please contact us at pace-support@oit.gatech.edu with any questions.


PACE Maintenance Period (November 2 – 4, 2022)

[11/4/2022 Update]

The Phoenix (Moab/Torque and Slurm), Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard clusters are now ready for research and learning. We have released all jobs that were held by the scheduler. 

The second phase of the Phoenix-Slurm cluster migration (300 additional nodes, for a combined total of 800 nodes out of 1319) completed successfully, and researchers can resume using the cluster. 

The next maintenance period for all PACE clusters is January 31, 2023, at 6:00 AM through February 2, 2023, at 11:59 PM. Additional maintenance periods are tentatively scheduled for 2023 on May 9-11, August 8-10, and October 31-November 2. Additional phases for the Phoenix-Slurm cluster migration are tentatively scheduled for November 29 in 2022, and January 4, 17, and 31 in 2023. 

Status of activities: 

ITEMS REQUIRING USER ACTION: 

  • [Complete][Hive] New Hive login servers might cause a security message due to changes in the SSH host keys. Please be aware of this and clear the cached host key on your local machine to remove the message 

ITEMS NOT REQUIRING USER ACTION: 

  • [Complete] [Phoenix] Slurm migration for second phase of Phoenix cluster (300 additional nodes for a combined total of 800 nodes [out of 1319]) 
  • [Complete] [Phoenix] Reconfigure Phoenix in PACE DB 
  • [Complete] [Hive][Storage] Cable replacement for GPFS (project/scratch) controller 
  • [Complete] [Firebird][Storage] Migrate some Firebird projects from current file servers to new file server 
  • [Complete] [Firebird] Reconfigure Firebird in PACE DB 
  • [Complete] [OSG] Update Nvidia drivers 
  • [Complete] [OSG][Network] Remove IB drivers on osg-login2 
  • [Complete] [Datacenter] Transformer repairs 
  • [Complete] [Network] Update VRF configuration on compute racks 
  • [Complete] [Storage] Upgrade Globus to 5.4.50 for new CA 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[11/2/2022 Update]

This is a reminder that our next PACE Maintenance period has now begun and is scheduled to end at 11:59PM on Friday, 11/04/2022. During this Maintenance Period, access to all the PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.

Tentative list of activities: 

ITEMS REQUIRING USER ACTION: 

  • [Hive] New Hive login servers might cause a security message due to changes in the SSH host keys. Please be aware of this and clear the cached host key on your local machine to remove the message 

ITEMS NOT REQUIRING USER ACTION: 

  • [Phoenix] Slurm migration for second phase of Phoenix cluster (300 additional nodes for a combined total of 800 nodes [out of 1319]) 
  • [Phoenix] Reconfigure Phoenix in PACE DB 
  • [Hive][Storage] Cable replacement for GPFS (project/scratch) controller 
  • [Firebird][Storage] Migrate some Firebird projects from current file servers to new file server 
  • [Firebird] Reconfigure Firebird in PACE DB 
  • [OSG] Update Nvidia drivers 
  • [OSG][Network] Remove IB drivers on osg-login2 
  • [Datacenter] Transformer repairs 
  • [Network] Update VRF configuration on compute racks 
  • [Storage] Upgrade Globus to 5.4.50 for new CA 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[10/31/2022 Update]

This is a reminder that our next PACE Maintenance period is scheduled to begin later this week at 6:00AM on Wednesday, 11/02/2022, and it is tentatively scheduled to conclude by 11:59PM on Friday, 11/04/2022. As usual, jobs with resource requests that would overlap the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all the PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard. 

Tentative list of activities: 

ITEMS REQUIRING USER ACTION: 

  • [Hive] New Hive login servers might cause a security message due to changes in the SSH host keys. Please be aware of this and clear the cached host key on your local machine to remove the message 

ITEMS NOT REQUIRING USER ACTION: 

  • [Phoenix] Slurm migration for second phase of Phoenix cluster (300 additional nodes for a combined total of 800 nodes [out of 1319]) 
  • [Phoenix] Reconfigure Phoenix in PACE DB 
  • [Hive][Storage] Cable replacement for GPFS (project/scratch) controller 
  • [Firebird][Storage] Migrate some Firebird projects from current file servers to new file server 
  • [Firebird] Reconfigure Firebird in PACE DB 
  • [OSG] Update Nvidia drivers 
  • [OSG][Network] Remove IB drivers on osg-login2 
  • [Datacenter] Transformer repairs 
  • [Network] Update VRF configuration on compute racks 
  • [Storage] Upgrade Globus to 5.4.50 for new CA 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[10/24/2022 Early Reminder]

Dear PACE Users,

This is a friendly reminder that our next PACE Maintenance period is scheduled to begin at 6:00AM on Wednesday, 11/02/2022, and it is tentatively scheduled to conclude by 11:59PM on Friday, 11/04/2022. As usual, jobs with resource requests that would overlap the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all the PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.

Tentative list of activities:

ITEMS REQUIRING USER ACTION:

  • [Hive] New Hive login servers might cause a security message due to changes in the SSH host keys. Please be aware of this and clear the cached host key on your local machine to remove the message

ITEMS NOT REQUIRING USER ACTION:

  • [Phoenix] Slurm migration for second phase of Phoenix cluster (300 additional nodes for a combined total of 800 nodes [out of 1319])
  • [Phoenix] Reconfigure Phoenix in PACE DB
  • [Hive][Storage] Cable replacement for GPFS (project/scratch) controller
  • [Firebird][Storage] Migrate some Firebird projects from current file servers to new file server
  • [Firebird] Reconfigure Firebird in PACE DB
  • [OSG] Update Nvidia drivers
  • [OSG][Network] Remove IB drivers on osg-login2
  • [Datacenter] Transformer repairs
  • [Network] Update VRF configuration on compute racks
  • [Storage] Upgrade Globus to 5.4.50 for new CA

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

Phoenix Scheduler Outage

Summary: The Phoenix scheduler was non-responsive between Wed 10/19/2022 9:30pm and Thurs 10/20/2022 12:30am.

Details: The Torque resource manager on the Phoenix scheduler became non-responsive around 9:30pm last night, and we restarted the scheduler at 12:30am this morning.

Impact: Running jobs were not interrupted, but no new jobs could be submitted or cancelled while the scheduler was down, including via Phoenix Open OnDemand. Commands such as “qsub” and “qstat” were impacted as well.

Thank you for your patience this morning. Please contact us at pace-support@oit.gatech.edu with any questions.

Firebird Storage Outage

[Update 2022/10/21, 10:00am]

Summary: The Firebird storage outage recurred this morning at approximately 3:45 AM, and repairs were completed at approximately 9:15 AM. The ASDL, LANNS, and Montecarlo projects were affected; Orbit and RAMC were not affected.

Details: Storage for three Firebird projects became unavailable this morning, and PACE has now restored the system. Jobs that failed at the time of the outage will be refunded. We have adjusted several settings and continue investigating the root cause of the issue.

Impact: Researchers on ASDL, LANNS, and Montecarlo would have been unable to access Firebird this morning, and running jobs on these projects would have failed as well. Please resubmit any failed jobs.

Thank you for your patience as we restored the system this morning. Please contact us at pace-support@oit.gatech.edu if you have any questions.

[Update 2022/10/19, 10:00am CST]

Everything is back to normal on Firebird – apologies for any inconvenience!

[Original post]

We are having an issue with Firebird storage. Jobs on ASDL, LANNS, and Montecarlo are affected. Rebooting the storage server causes login node issues on LANNS and Montecarlo. We are actively working to resolve the issue and expect it to be resolved by noon today. Orbit and RAMC are not affected by this storage outage.

Please contact us at pace-support@oit.gatech.edu if you have any questions.