Upcoming Firebird Slurm Migration Announcement

The Firebird cluster will be migrating to the Slurm scheduler on October 24-26, 2023. PACE has developed a plan to transition researchers' workflows smoothly. As you may be aware, PACE began the Slurm migration in July 2022, and we have already successfully migrated the Hive, Phoenix, and ICE clusters. Firebird is the last cluster in PACE's transition from Torque/Moab to Slurm, bringing increased job throughput and better scheduling policy enforcement. The new scheduler will better support the new hardware to be added to Firebird soon. We will be updating our software stack at the same time and offering support through orientation and consulting sessions to facilitate this migration.

Software Stack 

In addition to the scheduler migration, the PACE Apps central software stack will also be updated. This software stack supports the Slurm scheduler and already runs successfully on Phoenix, Hive, and ICE. The Firebird cluster will feature the provided applications listed in our documentation. Please review this list of non-CUI software we will offer on Firebird post-migration and let us know via email (pace-support@oit.gatech.edu) if any PACE-installed software you are currently using on Firebird is missing from the list. If you already submitted a reply to the application survey sent to Firebird PIs, there is no need to repeat requests. Researchers installing or writing custom software will need to recompile their applications against the new MPI and other libraries once the new system is ready.
 
New software installations in the PACE central software stack on the Torque-based system will be frozen starting September 1st, 2023. You can continue installing software in your local/shared space without interruption.

No Test Environment 

Due to security and capacity constraints, a progressive rollout like the one we used for Phoenix and Hive is not feasible, so there will not be a test environment. For researchers installing or writing their own software, we highly recommend the following:

  • For those with access to Phoenix, compile non-CUI software on Phoenix now and report any issues you encounter so that we can help you before the migration (a minimal rebuild sketch follows this list).
  • Please report any self-installed CUI software you need that cannot be tested on Phoenix. We will do our best to have all dependent libraries ready and will give higher priority to assisting with reinstallation immediately after the Slurm migration.
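For those rebuilding on Phoenix ahead of the migration, here is a minimal sketch of what a rebuild might look like, assuming a C/MPI code; the module versions and file names below are placeholders, not the actual Phoenix toolchain:

    # Placeholder module names -- run "module avail" on Phoenix for the real ones.
    module load gcc/12.3.0 mvapich2/2.3.7

    # Rebuild the application so it links against the new compiler and MPI libraries.
    mpicc -O2 -o my_app my_app.c

Please report any build or runtime issues you encounter to pace-support@oit.gatech.edu so that we can help you before the migration.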

Support 

PACE will provide documentation, training sessions [register here], and support (consulting sessions and 1-1 sessions) to aid your workflow transition to Slurm. Documentation and a guide for converting job scripts from PBS to Slurm commands will be ready before the migration. We will offer Slurm training right after the migration; future communications will provide the schedule. You are welcome to join our PACE Consulting Sessions or to email us for support.
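As a rough preview of the kind of conversion the guide will cover (the job name, resource requests, queue, and partition below are placeholders rather than Firebird-specific values), a simple batch script translates approximately as follows:

    #!/bin/bash
    # Torque/Moab script (current):
    #PBS -N my_job
    #PBS -l nodes=1:ppn=8
    #PBS -l walltime=02:00:00
    #PBS -q my_queue
    cd $PBS_O_WORKDIR
    ./my_app

    #!/bin/bash
    # Equivalent Slurm script (after migration):
    #SBATCH -J my_job
    #SBATCH -N 1 --ntasks-per-node=8
    #SBATCH -t 02:00:00
    #SBATCH -p my_partition
    # Slurm starts jobs in the submission directory, so no "cd" is needed.
    ./my_app

The submission command changes as well, from "qsub my_job.pbs" to "sbatch my_job.sbatch".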

We are excited to launch Slurm on Firebird to improve Georgia Tech’s research computing infrastructure! Please contact us with any questions or concerns about this transition. 

Phoenix Project & Scratch Storage Cables Replacement

WHAT’S HAPPENING?
Two cables connecting the Phoenix Lustre device, which hosts project and scratch storage, need to be replaced: one to controller 0 and one to controller 1. The cables will be replaced one at a time, and the work is expected to take about 3 hours.

WHEN IS IT HAPPENING?
Monday, April 3rd, 2023 starting 9AM EDT.

WHY IS IT HAPPENING?
Required maintenance.

WHO IS AFFECTED?
All users: there may be a storage access outage and temporarily decreased performance afterward.

WHAT DO YOU NEED TO DO?
Because a redundant controller remains available while work is done on one cable at a time, there should not be an outage during the cable replacement. However, if storage does become unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.
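If you do need to cancel and resubmit, here is a hedged sketch of the relevant Slurm commands (the job ID and script name below are placeholders):

    squeue -u $USER          # list your running and pending jobs
    scancel 1234567          # cancel a job that failed or stopped making progress (placeholder job ID)
    sbatch my_job.sbatch     # resubmit once storage access is restored (placeholder script name)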

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

Hive Project & Scratch Storage Cable Replacement

WHAT’S HAPPENING?
Two cables connecting one of the two controllers of the Hive Lustre device need to be replaced. Cables will be replaced one at a time, taking about 3 hours to complete the work.

WHEN IS IT HAPPENING?
Monday, April 3rd, 2023 starting 9AM EDT.

WHY IS IT HAPPENING?
Required maintenance.

WHO IS AFFECTED?
All users: there may be a storage access outage and temporarily decreased performance afterward.

WHAT DO YOU NEED TO DO?
Because a redundant controller remains available while work is done on one cable at a time, there should not be an outage during the cable replacement. However, if storage does become unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

Connecting new cooling doors to power

[Updated 2023/04/04, 12:25PM ET]

Electricians needed to complete some additional checks before performing the final connection, so the task has been re-scheduled for Thursday, 6-April.

[Original post 2023/03/31, 4:51PM ET]

WHAT’S HAPPENING?
To complete the Coda data center expansion on time and under budget, low-risk electrical work will be performed: the 12 additional uSystems cooling doors will be wired to the distribution panels and left powered off. Adding the circuit breaker is the only work on the “powered” side of the circuits.

WHEN IS IT HAPPENING?
Tuesday, April 4th, 2023. The work will be performed during business hours.

WHY IS IT HAPPENING?
Required maintenance.

WHO IS AFFECTED?
No user jobs should be affected. The connection work is very low risk, and most of it will be done on the “unpowered” side of the panel. In the worst case, we could lose power to up to 20 cooling doors, which we expect to recover in less than 1 minute. If recovery takes longer than 5 minutes, we will initiate an emergency power-down of the affected nodes.

WHAT DO YOU NEED TO DO?
Nothing.

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

New compute nodes on the Phoenix cluster

In February, we added several compute nodes to the Phoenix cluster. This gives Phoenix users access to more powerful nodes for their computations and reduces wait times for high-demand hardware.

There are three groups of new nodes:

  1. 40 32-core Intel-CPU high-memory nodes (768 GB of RAM per node). These nodes are part of our “cpu-large” partition, and this addition increases the number of “cpu-large” nodes from 68 to 108. The nodes have Dual Intel Xeon Gold 6226R processors @ 2.9 GHz (with 32 instead of 24 cores per node). Any jobs that require more than 16 GB of memory per CPU will end up on the nodes from the “cpu-large” partition.
  2. 4 AMD-CPU nodes with 128 cores per node. These nodes are part of our “cpu-amd” partition, and this addition increases the number of “cpu-amd” nodes from 4 to 8. The nodes have dual AMD Epyc 7713 processors @ 2.0 GHz (128 cores per node) with 512 GB of memory. For comparison, most of the older Phoenix compute nodes have 24 cores per node (and have Intel processors rather than AMD). To target these nodes specifically, you can specify the flag “-C amd” in your sbatch script or salloc command (see the example after this list): https://docs.pace.gatech.edu/phoenix_cluster/slurm_guide_phnx/#amd-cpu-jobs
  3. 7 AMD-CPU nodes with 64 cores per node and two Nvidia A100 GPUs per node (40 GB of GPU memory each). These nodes are part of our “gpu-a100” partition, and this addition increases the number of “gpu-a100” nodes from 5 to 12. These nodes have dual AMD Epyc 7513 processors @ 2.6 GHz (64 cores per node) with 512 GB of RAM. To target these nodes, you can specify the flag “--gres=gpu:A100:1” (to get one GPU per node) or “--gres=gpu:A100:2” (to get both GPUs for each requested node) in your sbatch script or salloc command: https://docs.pace.gatech.edu/phoenix_cluster/slurm_guide_phnx/#gpu-jobs
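As an illustrative sketch only (the job name, charge account, walltime, and executable are placeholders; the “-C amd” and “--gres” flags come from the documentation linked above), a batch script targeting the new AMD CPU nodes might begin like this, followed by an interactive request for both A100 GPUs on one node:

    #!/bin/bash
    #SBATCH -J amd-example                 # placeholder job name
    #SBATCH -A my-charge-account           # placeholder charge account
    #SBATCH -N 1 --ntasks-per-node=128     # one full 128-core AMD node
    #SBATCH -C amd                         # constrain the job to the AMD nodes
    #SBATCH -t 01:00:00                    # placeholder walltime
    srun ./my_app                          # placeholder executable

    # Interactive session with both A100 GPUs on one node:
    salloc -A my-charge-account -N 1 --gres=gpu:A100:2 -t 01:00:00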

To see the up-to-date specifications of the Phoenix compute nodes, please refer to our website: 

https://docs.pace.gatech.edu/phoenix_cluster/slurm_resources_phnx/

If you have any other questions, please send us a ticket by emailing pace-support@oit.gatech.edu.

Phoenix Project & Scratch Storage Cables Replacement

WHAT’S HAPPENING?
Two cables connecting the Phoenix Lustre device, which hosts project and scratch storage, need to be replaced: one to controller 0 and one to controller 1. The cables will be replaced one at a time, and the work is expected to take about 3 hours.

WHEN IS IT HAPPENING?
Tuesday, February 21st, 2023 starting 9AM ET.

WHY IS IT HAPPENING?
Required maintenance.

WHO IS AFFECTED?
All users: there may be a storage access outage and temporarily decreased performance afterward.

WHAT DO YOU NEED TO DO?
Because a redundant controller remains available while work is done on one cable at a time, there should not be an outage during the cable replacement. However, if storage does become unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

Phoenix Project & Scratch Storage Cables Replacement

WHAT’S HAPPENING?
Two cables connecting the Phoenix Lustre device, which hosts project and scratch storage, need to be replaced: one to controller 0 and one to controller 1. The cables will be replaced one at a time, and the work is expected to take about 4 hours.

WHEN IS IT HAPPENING?
Wednesday, January 18th, 2023 starting 1PM ET.

WHY IS IT HAPPENING?
Required maintenance.

WHO IS AFFECTED?
All users: there may be a storage access outage and temporarily decreased performance afterward.

WHAT DO YOU NEED TO DO?
Because a redundant controller remains available while work is done on one cable at a time, there should not be an outage during the cable replacement. However, if storage does become unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

Hive Project & Scratch Storage Cable Replacement

WHAT’S HAPPENING?
Two cables connecting one of the two controllers of the Hive Lustre device need to be replaced. The cables will be replaced one at a time, and the work is expected to take about 2 hours.

WHEN IS IT HAPPENING?
Wednesday, January 18th, 2023 starting 10AM ET.

WHY IS IT HAPPENING?
Required maintenance.

WHO IS AFFECTED?
All users: there may be a storage access outage and temporarily decreased performance afterward.

WHAT DO YOU NEED TO DO?
Because a redundant controller remains available while work is done on one cable at a time, there should not be an outage during the cable replacement. However, if storage does become unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

Firebird Storage Outage

[Update 2022/10/21, 10:00am]
Summary: The Firebird storage outage recurred this morning at approximately 3:45 AM, and repairs were completed at approximately 9:15 AM. ASDL, LANNS, and Montecarlo projects were affected. Orbit and RAMC were not affected.
Details: Storage for three Firebird projects became unavailable this morning, and PACE has now restored the system. Jobs that failed at the time of the outage will be refunded. At this time, we have adjusted several settings, and we continue investigating the root cause of the issue.
Impact: Researchers on ASDL, LANNS, and Montecarlo would have been unable to access Firebird this morning. Running jobs on these projects would have failed as well. Please resubmit any failed jobs to run them again.
Thank you for your patience as we restored the system this morning. Please contact us at pace-support@oit.gatech.edu if you have any questions.
[Update 2022/10/19, 10:00am CST]
Everything is back to normal on Firebird, apologies for any inconvenience!
[Original post]
We are experiencing an issue with Firebird storage. Jobs on ASDL, LANNS, and Montecarlo are affected. Rebooting the storage server caused login node issues on LANNS and Montecarlo. We are actively working to resolve these issues and expect them to be resolved by noon today.
Orbit and RAMC are not affected by this storage outage.

Please contact us at pace-support@oit.gatech.edu if you have any questions.

Phoenix Project & Scratch Storage Cables Replacement

[Update 2022/10/05, 12:40PM CST]

Work has been completed on one cable, and the associated systems connecting to the storage have been restored to normal. We will assess the stability of the system after this first cable replacement and schedule the second cable replacement sometime next week.

 

[Update 2022/10/05, 10:10AM CST]

While the work is still ongoing, we are experiencing issues with one of the cable replacements. Although the redundant controller is still in place, we have already identified an impact on some users whose data are not currently accessible. To minimize the impact on the system, we have decided to pause the scheduler to prevent new jobs from starting and crashing. Running jobs may be impacted by the storage outage.

If your issue is storage related, please be mindful that it may be due to this known problem before opening a new ticket with pace-support@oit.gatech.edu.

 

[Original post]

Summary: Phoenix project & scratch storage cable replacement; potential outage and subsequent temporarily decreased performance.

Details: Two cables connecting enclosures of the Phoenix Lustre device, which hosts project and scratch storage, need to be replaced, beginning around 10 AM on Wednesday, October 5th, 2022. The cables will be replaced one at a time, and the work is expected to take about 4 hours. After the replacement, pools will need to rebuild over the course of about a day.

Impact: Because a redundant controller remains available while work is done on one cable, there should not be an outage during the cable replacement. However, a similar previous replacement caused storage to become unavailable, so this remains a possibility. If this happens, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored. In addition, performance will be slower than usual for about a day following the repair as pools rebuild, so jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again. PACE will monitor Phoenix Lustre storage throughout this procedure. If a loss of availability occurs, we will update you.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.