PACE Maintenance Period (Aug 8 – Aug 10, 2023) 

[Update 8/11/2023 8:33pm]

The controller replacement on the scratch storage system successfully passed four rounds of testing. Phoenix is back in production and is ready for research. We have released all jobs that were held by the scheduler. Please let us know if you have any problems.

I apologize for the inconvenience, but I believe this delayed return to production will help decrease future downtime.

The next planned maintenance period for all PACE clusters is October 24, 2023, at 6:00 AM through October 26, 2023, at 11:59 PM. Additional maintenance periods are tentatively scheduled for January 23-25, 2024, and May 7-9, 2024.

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you, 

Pam Buffington 

PACE Director 

[Update 8/10/2023 5:00pm]

The Hive, ICE, Firebird, and Buzzard clusters are now ready for research. We have released all jobs that were held by the scheduler. 

Unfortunately, Phoenix storage issues continue. All maintenance work was completed, but the scratch storage failed initial stress tests. The vendor is sending us a replacement controller, which will arrive and be installed early tomorrow. We will then stress-test the storage again. If it passes, Phoenix will be brought into production. If it fails, we will revert to the old scratch infrastructure that was in use prior to May 2023 while we search for a new solution. In that case, Phoenix would be brought into production while the scratch file system is still syncing: we have already begun syncing data, but transferring roughly 800TB may take approximately 1 week. Not all files will be present immediately, but they will reappear as the sync progresses. In the meantime, you may temporarily see files that existed in your scratch directory before the May maintenance period but have since been deleted; these will disappear again as the sync completes.

The monthly deletion of old scratch directories scheduled for next week is canceled. Please disregard the notification you may have received last week.  

I apologize for the inconvenience, but I believe this delay will help decrease future downtime.  

The next planned maintenance period for all PACE clusters is October 24, 2023, at 6:00 AM through October 26, 2023, at 11:59 PM. 

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you, 

Pam Buffington 

PACE Director 

[Update 8/8/2023 6:00am]

PACE Maintenance Period starts now at 6:00AM on Tuesday, 08/08/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 08/10/2023.

[Update 8/7/2023 12:00pm]

This is a reminder that PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 08/08/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 08/10/2023.

[Update 8/2/2023 1:43pm]

WHEN IS IT HAPPENING?

This is a reminder that PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 08/08/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 08/10/2023. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete.

WHAT DO YOU NEED TO DO?  

As usual, jobs whose requested wall time would extend into the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all the PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime.
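
If you would like to check whether one of your pending jobs is being held for the maintenance window, a Slurm query along the following lines can help. This is only a sketch: the exact hold-reason text depends on the scheduler configuration, and the batch script name is a placeholder.

    # List your pending jobs with the scheduler's reason for holding them; jobs blocked by
    # the maintenance reservation typically show a reason such as
    # "ReqNodeNotAvail, Reserved for maintenance".
    squeue -u $USER -t PENDING -o "%.10i %.15j %.12l %.30r"

    # A job whose requested wall time ends before 6:00AM on 08/08 can still start, e.g.:
    sbatch --time=08:00:00 my_job.sbatch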

WHAT IS HAPPENING? 

ITEMS REQUIRING USER ACTION: 

  • [Phoenix] Create Interactive CPU and GPU partitions on Phoenix 
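
Once these partitions are available, requesting an interactive session would look roughly like the following. This is a sketch only; the partition names are placeholders until the actual names are published in the PACE documentation.

    # Interactive CPU session: 1 node, 4 cores, 1 hour (partition name is a placeholder)
    salloc -p <interactive-cpu> -N 1 --ntasks-per-node=4 -t 1:00:00

    # Interactive GPU session: 1 node, 1 GPU, 1 hour (partition name is a placeholder)
    salloc -p <interactive-gpu> -N 1 --gres=gpu:1 -t 1:00:00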

ITEMS NOT REQUIRING USER ACTION: 

  • [Phoenix, Hive, ICE] Re-image login nodes to increase /tmp to 20GB 
  • [Phoenix, Hive, ICE] Open XDMoD to campus 
  • [Phoenix] Replace Phoenix project storage controller 
  • [Firebird] Upgrade firewall device firmware supporting CUI 
  • [Firebird] Add additional InfiniBand switches and cables to increase redundancy and capacity 
  • [OSG][Network] Move ScienceDMZ VRF to new network fabric 
  • [Network] Install leaf module to InfiniBand director switch 
  • [Network] Configure VPC pair redundancy to Research hall network switches 
  • [Network][Firebird] Install high-speed IB NIC on storage appliance for improved performance and capacity 
  • [Storage] DDN Controller firmware & Disk firmware upgrade
  • [Storage] Reboot the backup controller to synchronize with the main controller 
  • [Storage] Increase storage capacity for PACE backup servers 
  • [Storage] Increase storage capacity for EAS group storage servers 
  • [Storage] Replace cables on storage controller
  • [Software] Move pace-apps to Slurm on admin nodes 
  • [Datacenter] Datacenter cooling maintenance

WHY IS IT HAPPENING? 

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED? 

All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you, 

-The PACE Team 

[Update 7/26/2023 4:39pm]

WHEN IS IT HAPPENING? 

PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 08/08/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 08/10/2023. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete. 

WHAT DO YOU NEED TO DO?  

As usual, jobs whose requested wall time would extend into the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all the PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. 

WHAT IS HAPPENING? 

ITEMS NOT REQUIRING USER ACTION: 

  • [Phoenix, Hive, ICE] Re-image login nodes to increase /tmp to 20GB 
  • [Phoenix, Hive, ICE] Open XDMoD to campus 
  • [Phoenix] Replace Phoenix project storage controller 
  • [Firebird] Upgrade firewall device firmware supporting CUI 
  • [Firebird] Add additional InfiniBand switches and cables to increase redundancy and capacity 
  • [OSG][Network] Move ScienceDMZ VRF to new network fabric 
  • [Network] Install leaf module to InfiniBand director switch 
  • [Network] Configure VPC pair redundancy to Research hall network switches 
  • [Network][Firebird] Install high-speed IB NIC on storage appliance for improved performance and capacity 
  • [Storage] Reboot the backup controller to synchronize with the main controller 
  • [Storage] Increase storage capacity for PACE backup servers 
  • [Storage] Increase storage capacity for EAS group storage servers 
  • [Storage] Replace cables on storage controller 
  • [Datacenter] Datacenter cooling maintenance 

WHY IS IT HAPPENING? 

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system. 

WHO IS AFFECTED? 

All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you, 

-The PACE Team 

Hive Storage SAS Cable Replacement

[Update 7/25/2023 1:04pm]
The SAS cable has been replaced with no interruption on production.

[Update 7/24/2023 3:13pm]
Hive Storage SAS Cable Replacement

WHAT’S HAPPENING?

One SAS cable between the enclosure and the controller of the Hive storage system needs to be replaced. The replacement is expected to take about 2 hours.

WHEN IS IT HAPPENING?

Tuesday, July 25th, 2023 starting at 10AM EDT.

WHY IS IT HAPPENING?

Required maintenance.

WHO IS AFFECTED?

All users may be affected by a potential storage access outage and temporarily decreased performance afterward.

WHAT DO YOU NEED TO DO?

During the cable replacement, one of the controllers will be shut down and the redundant controller will take all the traffic. Data access should be preserved, but there have been cases where storage has become inaccessible during similar work. If storage becomes unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.
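
For reference, cancelling and resubmitting an affected job under Slurm looks roughly like this; the job ID and batch script name below are placeholders.

    # Find the IDs of your running jobs
    squeue -u $USER

    # Cancel a job that has failed or stopped making progress (12345 is a placeholder job ID)
    scancel 12345

    # Resubmit once storage access is restored (script name is a placeholder)
    sbatch my_job.sbatch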

WHO SHOULD YOU CONTACT FOR QUESTIONS?

For any questions, please contact PACE at pace-support@oit.gatech.edu.

Phoenix Project Storage & Login Node Outage

[Update 7/21/2023 3:30pm]

Dear Phoenix Users,

The Lustre project storage filesystem on Phoenix is back up and available. We have completed the cable replacements, reseated and replaced a couple of hard drives, and restarted the controller. We have run tests to confirm that the storage is working correctly. Performance may still be degraded while redundant drives rebuild, but it is better than it has been over the last few days.

Phoenix’s head nodes, which were unresponsive earlier this morning, are available again without issue. We will continue to monitor the login nodes for any other issues.

You should be able to start jobs on the scheduler without issue. We will refund any job that failed after 8:00 AM this morning due to the outage.

Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.

[Original Post 7/21/2023 9:46 am]

Summary: The Lustre project storage filesystem on Phoenix became unresponsive this morning. Researchers may be unable to access data in their project storage. Multiple Phoenix login nodes have also become unresponsive, which may also prevent logins. We have paused the scheduler, preventing new jobs from starting, while we investigate.

Details: The PACE team is currently investigating an outage on the Lustre project storage filesystem for Phoenix. The cause is not yet known, but PACE is working with the vendor to find a resolution.

Impact: The project storage filesystem may not be reachable at this time, so read, write, or ls attempts on project storage may fail, including via Globus. This may impact logins as well. Job scheduling is now paused, so jobs can be submitted, but no new jobs will start. Jobs that were already running will continue, though those on project storage may not progress.

Thank you for your patience as we continue investigating this issue. Please contact us at pace-support@oit.gatech.edu with any questions.

Phoenix Cluster Outage and Fix

Summary: The scratch file system became unresponsive yesterday evening (~5:50pm) when some of the network controllers stopped working, causing an outage that may have resulted in difficulties logging into login nodes and writing to scratch.

Details: The file system was recovered this morning after restarting the controllers and all the Lustre components. The Slurm scheduler was also paused to troubleshoot issues with the cluster and has been re-released.

Impact: The file system and scheduler should now be fully functional. Users may have had issues accessing the Phoenix cluster yesterday evening and this morning. Compute jobs running during that window may also have been affected, so we recommend reviewing any jobs that ran during that time.

Thank you for your patience. Please contact us at pace-support@oit.gatech.edu with any questions.

PACE Maintenance Period (January 31 – February 7, 2023)

[Updated 2023/02/03, 4:33 PM EST]

Dear Phoenix Users, 

The Phoenix cluster is now ready for research and learning. We have released all jobs that were held by the scheduler. 

Phoenix users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows work on the Slurm-based cluster. Please contact us if you need additional help shifting your workflows to the Slurm-based cluster. PACE provides documentation, PACE Consulting Sessions, and PACE Slurm Orientation Sessions to support the smooth transition of your workflows to Slurm. We will host a Slurm Orientation Session (for users new to Slurm) on Friday, February 17, at 11am. 
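
For reference, the basic Torque/Moab commands (qsub, qstat, qdel) correspond to sbatch, squeue, and scancel under Slurm, and a minimal Slurm batch script looks roughly like the sketch below. The account, module, and script names are placeholders; please see the PACE Slurm documentation for the exact directives recommended on Phoenix.

    #!/bin/bash
    #SBATCH -J my_analysis               # job name (placeholder)
    #SBATCH -A <charge-account>          # Phoenix charge account (placeholder)
    #SBATCH -N 1 --ntasks-per-node=8     # 1 node, 8 tasks
    #SBATCH -t 2:00:00                   # 2-hour wall time
    #SBATCH -o Report-%j.out             # output file named with the job ID

    cd $SLURM_SUBMIT_DIR                 # run from the submission directory
    module load anaconda3                # module name is a placeholder
    srun python my_script.py             # application is a placeholder

Submit the script with sbatch and monitor it with squeue -u $USER.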

The transfer of remaining funds on Phoenix Moab/Torque to Slurm is ongoing and is expected to be completed next week.  January statements will report the accurate balance when they are sent out. 

The next maintenance period for all PACE clusters is May 9, 2023, at 6:00 AM through May 11, 2023, at 11:59 PM. Additional maintenance periods are tentatively scheduled for 2023 on August 8-10, and October 31-November 2. 

Status of activities: 

ITEMS REQUIRING USER ACTION: 

  • [Complete] [Phoenix] Slurm migration for the sixth and final phase of Phoenix cluster (about 123 additional nodes for a final total of about 1323). Phoenix users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows work on the Slurm-based cluster. 
  • [Complete] [Phoenix] New Phoenix login servers may cause an SSH security warning due to changed host keys. Please be aware of this and remove the old host key from your local known_hosts cache to clear the warning (see the sketch after this list). 
  • [Complete] [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 
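
If your SSH client warns about a changed host key after the login server changes, removing the stale entry from your local known_hosts file clears the warning. A minimal sketch, assuming a typical Phoenix login hostname; substitute the hostname you actually connect to.

    # Remove the cached host key for the re-imaged login host (hostname is an example)
    ssh-keygen -R login-phoenix.pace.gatech.edu
    # The next connection will prompt you to verify and accept the new host key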

ITEMS NOT REQUIRING USER ACTION: 

  • [Complete] [ALL CLUSTERS] Update GID on all file systems. Over 1.7 billion files were updated 
  • [Complete] [Phoenix] Re-image last Phoenix login node; re-enable load balancer 
  • [Complete] [Phoenix] Migrate Remaining Phoenix-Moab Funds to Phoenix-Slurm 
  • [Complete] [Network] Code upgrade to PACE departmental Palo Alto 
  • [Complete] [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Complete] [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Complete] [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Complete] [Storage] Update sysctl parameters on ZFS servers 
  • [Complete] [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Complete] [Datacenter] Databank: High Temp Chiller & Tower Maintenance 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu 

Thank You,
– The PACE Team

[Updated 2023/02/02, 4:20 PM EST]

Dear Hive Users, 

The Hive cluster is now ready for research and learning. We have released all jobs that were held by the scheduler. 

The next maintenance period for all PACE clusters is May 9, 2023, at 6:00 AM through May 11, 2023, at 11:59 PM. Additional maintenance periods are tentatively scheduled for 2023 on August 8-10, and October 31-November 2. 

We are still working on maintenance for the Phoenix cluster and will provide more updates as more information is available. 

Status of activities: 

ITEMS REQUIRING USER ACTION: 

  • [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 billion files will be updated 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Hive][Storage] Replace 40G cables on storage-hive 
  • [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers 
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Thank You,
– The PACE Team

[Updated 2023/02/02, 4:05 PM EST]

Dear Firebird Users, 

The Firebird cluster is now ready for research and learning. We have released all jobs that were held by the scheduler. 

The next maintenance period for all PACE clusters is May 9, 2023, at 6:00 AM through May 11, 2023, at 11:59 PM. Additional maintenance periods are tentatively scheduled for 2023 on August 8-10, and October 31-November 2. 

We are still working on maintenance for the Phoenix cluster and will provide more updates as more information is available. 

Status of activities: 

ITEMS REQUIRING USER ACTION: 

  • [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 billion files will be updated 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu 

Thank You,
– The PACE Team

[Updated 2023/02/01, 4:05 PM EST]

Dear Buzzard Users,

The Buzzard cluster is now ready for research and learning. We have released all jobs that were held by the scheduler. 

The next maintenance period for all PACE clusters is May 9, 2023, at 6:00 AM through May 11, 2023, at 11:59 PM. Additional maintenance periods are tentatively scheduled for 2023 on August 8-10, and October 31-November 2. 

We are still working on maintenance for the Phoenix cluster and will provide more updates as more information is available. 

Status of activities: 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 billion files will be updated 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers 
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu 

Thank You,
– The PACE Team

[Updated 2023/02/01, 4:00 PM EST]

The PACE-ICE and COC-ICE instructional clusters are ready for learning. As usual, we have released all user jobs that were held by the scheduler. You may resume using PACE-ICE and COC-ICE at this time. PACE’s research clusters remain under maintenance as planned.

[Updated 2023/01/31, 6:00AM EST]

WHEN IS IT HAPPENING?
Maintenance Period starts now at 6 AM EST on Tuesday, 01/31/2023, and is tentatively scheduled to conclude by 11:59PM on Tuesday, 02/07/2023. 

The Phoenix project file system changes are estimated to take seven days to complete.  The other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) are anticipated to finish earlier than seven days. PACE will release them as soon as maintenance and migration work are complete.

WHAT DO YOU NEED TO DO?
During this extended Maintenance Period, access to all the PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard. Phoenix is expected to take the full week, while the other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) will be released as storage updates are completed during the maintenance window.

Torque/Moab will no longer be available to Phoenix users starting now. We strongly encourage all researchers to shift their workflows to the Slurm-based cluster. PACE provides documentation, PACE Consulting Sessions, and PACE Slurm Orientation Sessions to support the smooth transition of your workflows to Slurm. These can be found at: https://pace.gatech.edu/

Users using Singularity on the command line need to use Apptainer commands moving forward. 
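
In most cases this is a drop-in rename of the command; for example (the container image and command names below are placeholders):

    # Previously, with Singularity:
    #   singularity exec my_container.sif my_command
    # Now, with Apptainer:
    apptainer exec my_container.sif my_command

    # Pulling images works the same way:
    apptainer pull docker://python:3.11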

WHAT IS HAPPENING?  
PACE Maintenance Period starts now and will run until it is complete. Phoenix downtime could last until Tuesday, 02/07/2023 or beyond. 

ITEMS REQUIRING USER ACTION: 

  • [Phoenix] Slurm migration for the sixth and final phase of Phoenix cluster (123 additional nodes for a final total of about 1323). Phoenix users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows work on the Slurm-based cluster. 
  • [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 billion files will be updated 
  • [Phoenix] Re-image last Phoenix login node; re-enable load balancer 
  • [Phoenix] Migrate Remaining Phoenix-Moab Funds to Phoenix-Slurm 
  • [ICE] Update cgroups limits on ICE head nodes 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Hive][Storage] Replace 40G cables on storage-hive 
  • [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers 
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance

WHY IS IT HAPPENING? 
The extended maintenance period is required to remove GIDs that conflict with those assigned campus-wide, allowing the expansion of research storage across campus. It is a required component of a strategic initiative and lays the foundation for additional storage options and capacity for researchers. The remaining items are part of our regularly scheduled Maintenance Periods, which are announced in advance at https://pace.gatech.edu/. Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?
All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank You,
– The PACE Team

[Updated 2023/01/27, 2:06PM EST]

WHEN IS IT HAPPENING?
Reminder that the next Maintenance Period starts at 6:00AM on Tuesday, 01/31/2023, and is tentatively scheduled to conclude by 11:59PM on Tuesday, 02/07/2023.

The Phoenix project file system changes are estimated to take seven days to complete.  The other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) are anticipated to finish earlier than seven days. PACE will release them as soon as maintenance and migration work is complete.

WHAT DO YOU NEED TO DO?
As usual, jobs whose requested wall time would extend into the Maintenance Period will be held by the scheduler until after the maintenance. During this extended Maintenance Period, access to all the PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard. Phoenix is expected to take the full week, while the other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) will be released as storage updates are completed during the maintenance window. If you have a critical deadline that this will impact, please let us know, and we can collaborate on possible alternatives. Please plan accordingly for the projected downtime.

Torque/Moab will no longer be available to Phoenix users starting January 31st, at 6 AM ET. We strongly encourage all researchers to shift their workflows to the Slurm-based cluster. PACE provides documentation, PACE Consulting Sessions, and PACE Slurm Orientation Sessions to support the smooth transition of your workflows to Slurm. These can be found at: https://pace.gatech.edu/

Users using Singularity on the command line need to use Apptainer commands moving forward. 

WHAT IS HAPPENING?  
The next PACE Maintenance Period starts 01/31/2023 at 6am and will run until complete. Phoenix downtime could last until Feb 7 or beyond.

ITEMS REQUIRING USER ACTION: 

  • [Phoenix] Slurm migration for the sixth and final phase of Phoenix cluster (123 additional nodes for a final total of about 1323). Phoenix users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows work on the Slurm-based cluster. 
  • [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 billion files will be updated 
  • [Phoenix] Re-image last Phoenix login node; re-enable load balancer 
  • [Phoenix] Migrate Remaining Phoenix-Moab Funds to Phoenix-Slurm 
  • [ICE] Update cgroups limits on ICE head nodes 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Hive][Storage] Replace 40G cables on storage-hive 
  • [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers 
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance

WHY IS IT HAPPENING? 
The extended maintenance period is required to remove GIDs that conflict with those assigned campus-wide, allowing the expansion of research storage across campus. It is a required component of a strategic initiative and lays the foundation for additional storage options and capacity for researchers. The remaining items are part of our regularly scheduled Maintenance Periods, which are announced in advance at https://pace.gatech.edu/. Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?
All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank You,
– The PACE Team

[Updated 2023/01/20, 8:45AM EST]

WHEN IS IT HAPPENING?
The next Maintenance Period starts at 6:00AM on Tuesday, 01/31/2023, and is tentatively scheduled to conclude by 11:59PM on Tuesday, 02/07/2023.

The Phoenix project file system changes are estimated to take seven days to complete.  The other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) are anticipated to finish earlier than seven days. PACE will release them as soon as maintenance and migration work is complete.

WHAT DO YOU NEED TO DO?
As usual, jobs whose requested wall time would extend into the Maintenance Period will be held by the scheduler until after the maintenance. During this extended Maintenance Period, access to all the PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard. Phoenix is expected to take the full week, while the other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) will be released as storage updates are completed during the maintenance window. If you have a critical deadline that this will impact, please let us know, and we can collaborate on possible alternatives. Please plan accordingly for the projected downtime.

Torque/Moab will no longer be available to Phoenix users starting January 31st, at 6 AM ET. We strongly encourage all researchers to shift their workflows to the Slurm-based cluster. PACE provides documentation, PACE Consulting Sessions, and PACE Slurm Orientation Sessions to support the smooth transition of your workflows to Slurm. These can be found at: https://pace.gatech.edu/

Users using Singularity on the command line need to use Apptainer commands moving forward. 

WHAT IS HAPPENING?  
The next PACE Maintenance Period starts 01/31/2023 at 6am and will run until complete. Phoenix downtime could last until Feb 7 or beyond.

ITEMS REQUIRING USER ACTION: 

  • [Phoenix] Slurm migration for the sixth and final phase of Phoenix cluster (about 119 additional nodes for a final total of about 1319). Phoenix users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows work on the Slurm-based cluster. 
  • [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 billion files will be updated 
  • [Phoenix] Re-image last Phoenix login node; re-enable load balancer 
  • [Phoenix] Migrate Remaining Phoenix-Moab Funds to Phoenix-Slurm 
  • [ICE] Update cgroups limits on ICE head nodes 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Hive][Storage] Replace 40G cables on storage-hive 
  • [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers 
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance

WHY IS IT HAPPENING? 
The extended maintenance period is required to remove GIDs that conflict with those assigned campus-wide, allowing the expansion of research storage across campus. It is a required component of a strategic initiative and lays the foundation for additional storage options and capacity for researchers. The remaining items are part of our regularly scheduled Maintenance Periods, which are announced in advance at https://pace.gatech.edu/. Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?
All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank You,
– The PACE Team

Upcoming Maintenance Period Extension Required January 31-February 7 (estimated)

WHAT IS HAPPENING? 
PACE is updating the group ID of every group and file in our storage infrastructure to remove conflicts with the IDs assigned campus-wide by OIT. The expected time per cluster varies greatly with the size of the related storage. During the maintenance period, PACE will release clusters as soon as each is complete.  
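
For illustration only (this is not PACE's actual migration tooling), a remap of this kind amounts to finding every file carrying an old, conflicting numeric GID and reassigning it to the new campus-assigned GID, roughly as follows:

    # Illustrative sketch only; the path and GID values are placeholders, not PACE's real values.
    OLD_GID=12345
    NEW_GID=67890
    # Reassign the group of every file currently owned by the old numeric GID (symlinks included).
    find /example/storage/path -gid "$OLD_GID" -print0 | xargs -0 chgrp -h "$NEW_GID"
    # Group names and permissions are untouched, so ls -l output looks the same afterward.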

WHEN IS IT HAPPENING? 
Maintenance period starts on Tuesday, January 31, 2023. The changes to the Phoenix project file system are estimated to take seven days to complete. Thus, the maintenance period will be extended from the typical three days to seven days for the Phoenix cluster. The other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) are anticipated to finish earlier. PACE will release them as soon as maintenance and migration work are complete. Researchers can expect those clusters to be ready earlier.   

WHY IS IT HAPPENING? 
This is a critical step for us to be able to make new storage available to campus users, removing group-ID conflicts we currently have with the Georgia Tech Enterprise Directory (GTED). This will allow us to provide campus- and PACE-mountable storage to our researchers and provide a foundation for additional self-service capabilities. This change will also allow us to evolve the PACE user management tools and processes. We understand that the short-term impact of this outage is problematic, but as we increase storage utilization, the problem will only get worse if left unaddressed. We expect the long-term impact of this update to be low, since ownership, group names, and permissions will remain unchanged. 

WHO IS AFFECTED? 
All users across all PACE’s clusters.  

WHAT DO YOU NEED TO DO? 
Please plan accordingly for an extended Maintenance Period for the Phoenix cluster starting Tuesday, January 31, 2023. The other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) will be released as storage updates are completed during the maintenance window. If you have a critical deadline that this will impact, please let us know and we can collaborate on possible alternatives.  

WHO SHOULD YOU CONTACT FOR QUESTIONS? 
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns. If escalation is required, please also email me, Pam Buffington (pam@gatech.edu), directly.

Best, 
– The PACE Team 
– Pam Buffington – PACE Director 

Phoenix Cluster Migration to Slurm Scheduler – Phase 5

[Updated 2023/01/17, 4:02PM EST]

Dear Phoenix researchers,  

The fifth phase of Phoenix Slurm Migration has been successfully completed – all nodes are back online, with 100 more compute nodes joining the Phoenix-Slurm cluster. We have successfully migrated about 1200 nodes (out of about 1319) from Phoenix to the Phoenix-Slurm cluster. 

As a reminder, the final phase of the migration is scheduled to complete later this month, during which the remaining 119 nodes will join Phoenix-Slurm:

  • Phase 6: January 31, 2023 (PACE Maintenance Period) – remaining 119 nodes 

We strongly encourage all researchers to shift their workflows to the Slurm-based cluster. PACE provides documentation and consulting sessions to support the smooth transition of your workflows to Slurm. Our next PACE Slurm orientation is scheduled this Friday, January 20th @ 11am-12pm via Zoom. Torque/Moab will no longer be available to Phoenix users starting January 31st, at 6 AM ET. 

PACE will be following up with additional updates and reminders in the upcoming weeks. In the meantime, please contact us with any questions or concerns about this transition. 

Best,
– The PACE Team

[Updated 2023/01/17, 6:00AM EST]

WHAT IS HAPPENING? 
For Phase 5 of Phoenix Cluster Slurm migration, 100 nodes will be taken offline on the Phoenix cluster and migrated to the Phoenix-Slurm cluster. 

WHEN IS IT HAPPENING? 
Today – Tuesday, January 17th at 6am ET 

WHY IS IT HAPPENING? 
This is part of the ongoing migration from Phoenix to the Phoenix-Slurm cluster. We do not expect there to be any impact to other nodes or jobs on the Phoenix or Phoenix-Slurm clusters. 

WHO IS AFFECTED? 
All Phoenix cluster users. 

WHO SHOULD YOU CONTACT FOR QUESTIONS? 
We will follow up with additional updates and reminders as needed. If you have any questions or concerns about the migration, please email us.

Best,
– The PACE Team

[Updated 2023/01/13, 1:05PM EST]

WHAT IS HAPPENING? 
For Phase 5 of Phoenix Cluster Slurm migration, 100 nodes will be taken offline on the Phoenix cluster and migrated to the Phoenix-Slurm cluster. 

WHEN IS IT HAPPENING? 
Tuesday, January 17th at 6am ET 

WHY IS IT HAPPENING? 
This is part of the ongoing migration from Phoenix to the Phoenix-Slurm cluster. So far, we have successfully migrated about 1100 nodes (out of about 1319 total). For this fifth phase of the migration, 100 additional nodes will join the Phoenix-Slurm cluster. We do not expect there to be any impact to other nodes or jobs on the Phoenix or Phoenix-Slurm clusters. 

WHO IS AFFECTED? 
All Phoenix cluster users. 

WHAT DO YOU NEED TO DO? 
As recommended at the beginning of the migration, we strongly encourage all researchers to continue shifting their workflows to the Slurm side of Phoenix to take advantage of the improved features and queue wait times. We provide helpful information for the migration in our documentation and will provide a PACE Slurm Orientation on Friday, January 20th @ 11am-12pm via Zoom. 

WHO SHOULD YOU CONTACT FOR QUESTIONS? 
Please email us or join a PACE Consulting Session if you have any questions or need assistance migrating your workflows to Slurm.

Best,
– The PACE Team

Phoenix Cluster Migration to Slurm Scheduler – Phase 4

[Update 2023/01/04, 2:18PM EST]

Dear Phoenix researchers,

The fourth phase of migration has been successfully completed – all nodes are back online, with 100 more compute nodes joining the Phoenix-Slurm cluster. We have successfully migrated 1100 nodes (out of about 1319) from Phoenix to the Phoenix-Slurm cluster.

As a reminder, the final phases of the migration are scheduled to continue in January 2023, during which the remaining 219 nodes will join Phoenix-Slurm: 

  • Phase 5: January 17, 2023 – 100 nodes  
  • Phase 6: January 31, 2023 (PACE Maintenance Period) – about 119 nodes 

We strongly encourage all researchers to shift their workflows to the Slurm-based cluster. PACE provides documentation and consulting sessions to support the smooth transition of your workflows to Slurm. Our next PACE Slurm orientation is scheduled for Friday, January 20th @ 11am-12pm via Zoom. Torque/Moab will no longer be available to Phoenix users on January 31st, at 6 AM ET.

PACE will be following up with additional updates and reminders in the upcoming weeks. In the meantime, please contact us with any questions or concerns about this transition. 

Best, 

-The PACE Team

[Update 2023/01/04, 6:00AM EST]

Dear Phoenix researchers, 

Just a reminder that the fourth phase of the migration will start today, January 4th, during which 100 additional nodes will join the Phoenix-Slurm cluster. 

The 100 nodes will be taken offline now (6am ET), and we do not expect there to be any impact to other nodes or jobs on Phoenix-Slurm. 

We will follow up with additional updates and reminders as needed. In the meantime, please email us if you have any questions or concerns about the migration. 

Best, 

– The PACE Team 

[Update 2023/01/03, 5:26PM EST]

Dear Phoenix researchers,

We have successfully migrated about 1000 nodes (out of about 1319 total) from Phoenix to the Phoenix-Slurm cluster. As a reminder, the fourth phase is scheduled starting tomorrow, January 4th, during which 100 additional nodes will join the Phoenix-Slurm cluster. 

The 100 nodes will be taken offline tomorrow morning (January 4th) at 6am ET, and we do not expect there to be any impact to other nodes or jobs on Phoenix-Slurm. 

As recommended at the beginning of this migration, we strongly encourage all researchers to begin shifting over their workflows to the Slurm-based side of Phoenix to take advantage of the improved features and queue wait times. We provide helpful information for the migration in our documentation and will provide a PACE Slurm Orientation on Friday, January 20th @ 11am-12pm via Zoom. 

Please email us or join a PACE Consulting Session if you have any questions or need assistance migrating your workflows to Slurm. 

Best, 

– The PACE Team

Phoenix Project & Scratch Storage Cables Replacement for Redundant Controller

[Update 2022/12/08, 5:52PM EST]
Work on the cable replacement on the redundant storage controller has been completed, and the associated systems connecting to the storage were restored to normal. We were able to replace 2 cables on the controller without interruption to service.

[Update 2022/12/05, 9:00AM EST]
Summary: Phoenix project & scratch storage cable replacement on the redundant controller, with a potential outage and subsequent temporarily decreased performance

Details: A cable connecting enclosures of the Phoenix Lustre device, which hosts project and scratch storage, to the redundant controller needs to be replaced, beginning around 10AM Wednesday, December 8th, 2022. The cable replacement is expected to take about 3-4 hours. After the replacement, pools will need to be rebuilt over the course of about a day.

Impact: Because we are replacing a cable on the redundant controller while maintaining the main controller, there should not be an outage during the cable replacement. However, a similar replacement has previously caused storage to become unavailable, so an outage is possible. If this happens, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored. In addition, performance may be slower than usual for a day following the repair as pools rebuild. Jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again. PACE will monitor Phoenix Lustre storage throughout this procedure. If a loss of availability occurs, we will update you.
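
To check afterward whether a job on the Slurm side of Phoenix was cancelled at its wall-time limit (and should therefore be resubmitted), the accounting records can be queried roughly as follows; the job ID and script name are placeholders.

    # A State of TIMEOUT indicates the job hit its wall-time limit and was cancelled by the scheduler
    sacct -j 12345 --format=JobID,JobName,State,Elapsed,Timelimit
    # Resubmit the job if it timed out (script name is a placeholder)
    sbatch my_job.sbatch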

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

New A100 GPU and AMD CPU nodes available on Phoenix-Slurm

Dear Phoenix researchers, 

We have migrated 800 (out of 1319) nodes of our existing hardware as part of our ongoing Phoenix cluster migration to Slurm. PACE has continued our effort to provide a heterogeneous hardware environment by adding 5 GPU nodes (2x Nvidia A100s per node) and 4 CPU nodes (2x AMD Epyc 7713 processors with 128 cores per node) to the Phoenix-Slurm cluster.  

Both service offerings provide exciting new hardware for research computing at PACE. The A100 GPU nodes, which also include 2x AMD Epyc 7513 processors with 64 cores per node, provide a powerful option for GPU compute in machine learning and scientific applications. The AMD Epyc CPU nodes provide a cost-effective alternative to Intel processors, with energy and equipment savings that we pass on to our users through a lower rate than our current base option, while still providing great value in traditional HPC due to their higher memory bandwidth and core density. You can find out more about our latest costs in our rate study here. 

You can find out more information on our new nodes in our documentation here. We also provide documentation on how to use the A100 GPU nodes and AMD CPU nodes on Phoenix-Slurm. If you need further assistance with using these new resources, please feel free to reach out to us at pace-support@oit.gatech.edu or attend our next consulting session.
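
As a starting point, a batch script requesting one of the new A100 GPU nodes might look like the sketch below. The GRES name, charge account, module, and application are assumptions, and the AMD CPU nodes are requested similarly through the partition or constraint listed in the PACE documentation, so please confirm the exact flags there.

    #!/bin/bash
    #SBATCH -J a100-test                  # job name (placeholder)
    #SBATCH -A <charge-account>           # charge account (placeholder)
    #SBATCH -N 1 --gres=gpu:A100:1        # one A100 GPU; the GRES name "A100" is an assumption
    #SBATCH -t 1:00:00                    # 1-hour wall time

    module load cuda                      # module name is an assumption
    srun ./my_gpu_application             # application is a placeholder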

Best,  

-The PACE Team