Posts

New compute nodes on the Phoenix cluster

In February, we added several new compute nodes to the Phoenix cluster. This gives Phoenix users access to more powerful nodes for their computations and reduces waiting times for high-demand hardware.

There are three groups of new nodes:

  1. 40 32-core Intel CPU high-memory nodes (768 GB of RAM per node). These nodes are part of our “cpu-large” partition, and this addition increases the number of “cpu-large” nodes from 68 to 108. The nodes have dual Intel Xeon Gold 6226R processors @ 2.9 GHz (32 cores per node, compared to 24 on most older nodes). Any job that requests more than 16 GB of memory per CPU is routed to the nodes in the “cpu-large” partition.
  2. 4 AMD CPU nodes with 128 cores per node. These nodes are part of our “cpu-amd” partition, and this addition increases the number of “cpu-amd” nodes from 4 to 8. The nodes have dual AMD EPYC 7713 processors @ 2.0 GHz (128 cores per node) with 512 GB of memory. For comparison, most of the older Phoenix compute nodes have 24 cores per node (and have Intel processors rather than AMD). To target these nodes specifically, you can specify the flag “-C amd” in your sbatch script or salloc command: https://docs.pace.gatech.edu/phoenix_cluster/slurm_guide_phnx/#amd-cpu-jobs
  3. 7 64-core AMD CPU nodes with two Nvidia A100 GPUs per node (40 GB of GPU memory each). These nodes are part of our “gpu-a100” partition, and this addition increases the number of “gpu-a100” nodes from 5 to 12. These nodes have dual AMD EPYC 7513 processors @ 2.6 GHz (64 cores per node) with 512 GB of RAM. To target these nodes, you can specify the flag “--gres=gpu:A100:1” (to get one GPU per node) or “--gres=gpu:A100:2” (to get both GPUs for each requested node) in your sbatch script or salloc command: https://docs.pace.gatech.edu/phoenix_cluster/slurm_guide_phnx/#gpu-jobs (a sample batch script is sketched after this list).
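
For reference, below is a minimal sketch of an sbatch script using the flags described above. The job name, charge account, node/task counts, walltime, and executable are placeholders, and only one of the hardware-targeting options would normally be active at a time; please consult the linked Slurm documentation for the authoritative syntax.

    #!/bin/bash
    # Job name, charge account, node/task counts, and walltime are placeholders.
    #SBATCH -J example-job
    #SBATCH -A <your-charge-account>
    #SBATCH -N 1
    #SBATCH --ntasks-per-node=16
    #SBATCH -t 01:00:00
    # Option 1: target the new AMD CPU nodes.
    #SBATCH -C amd
    # Option 2 (commented out): request one A100 GPU per node; use ":2" for both GPUs.
    ##SBATCH --gres=gpu:A100:1
    # Option 3 (commented out): requesting more than 16 GB per CPU routes the job to "cpu-large".
    ##SBATCH --mem-per-cpu=24G

    cd "$SLURM_SUBMIT_DIR"
    # Replace with your own executable or commands.
    srun ./my_program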

To see the up-to-date specifications of the Phoenix compute nodes, please refer to our website: 

https://docs.pace.gatech.edu/phoenix_cluster/slurm_resources_phnx/

If you have any other questions, please send us a ticket by emailing pace-support@oit.gatech.edu.

OIT Network Maintenance, Saturday, February 25

WHAT’S HAPPENING?

OIT Network Services will be upgrading the Coda Data Center firewall appliances. This will briefly disrupt connections to PACE, impacting login sessions, interactive jobs, and Open OnDemand sessions. Details on the maintenance are available on the OIT status page.

WHEN IS IT HAPPENING?
Saturday, February 25, 2023, 6:00 AM – 12:00 noon

WHY IS IT HAPPENING?
Required maintenance

WHO IS AFFECTED?

Any researcher or student with an active connection to PACE clusters (Phoenix, Hive, Buzzard, PACE-ICE, and COC-ICE) may lose their connection briefly during the maintenance window. Firebird will not be impacted.

This impacts ssh sessions and interactive jobs. Running batch jobs will not be impacted. Open OnDemand sessions that are disrupted may be resumed via the web interface once the network is restored if their walltime has not expired.

WHAT DO YOU NEED TO DO?

No action is required.

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

Phoenix Project & Scratch Storage Cables Replacement

WHAT’S HAPPENING?
Two cables on the Phoenix Lustre device, which hosts project and scratch storage, need to be replaced: one connecting to controller 0 and one connecting to controller 1. The cables will be replaced one at a time, and the work is expected to take about 3 hours.

WHEN IS IT HAPPENING?
Tuesday, February 21, 2023, starting at 9:00 AM ET.

WHY IS IT HAPPENING?
Required maintenance.

WHO IS AFFECTED?
All users may experience a brief storage access outage and temporarily decreased performance.

WHAT DO YOU NEED TO DO?
Because the redundant controller remains active while each cable is replaced, no outage is expected during the work. If storage does become unavailable during the replacement, your jobs may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

PACE Spending Deadlines for FY23

As you plan your spending on the PACE Phoenix and Firebird clusters for the end of Georgia Tech’s FY23 on June 30, 2023, we would like to alert you to several deadlines:

  1. Due to the time it takes to process purchase requests, we would like to receive all prepaid compute and lump-sum storage purchase requests exceeding $5,000 by March 31, 2023. Please contact us if you know there will be any purchases exceeding that amount so that we may help you with planning.
    1. Purchases under $5,000 can continue without restrictions.
  2. All spending after May 31, 2023 will be held for processing in July, in FY24. This includes postpaid compute jobs run in June, monthly storage payments for June, and new prepaid purchases requested after May 31, 2023.
    1. State funds (DE worktags) expiring on June 30, 2023, may not be used for June spending.
    2. Grant funds (GR worktags) expiring June 30, 2023, may be used for postpaid compute and monthly storage in June.
  3. Existing refresh (CODA20), FY20, and prepaid compute are not impacted, nor is existing prepaid storage.
  4. For worktags that are not expiring, your normal monthly limits on postpaid compute (if selected) will apply in May and June. Monthly storage will continue to be billed as normal.

Find out more about paid compute and storage services available on PACE on our website. If you have any questions or would like to meet with us to discuss the best options for your specific needs, please email us.

PACE Maintenance Period (January 31 – February 7, 2023)

[Updated 2023/02/03, 4:33 PM EST]

Dear Phoenix Users, 

The Phoenix cluster is now ready for research and learning. We have released all jobs that were held by the scheduler. 

Phoenix users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows work on the Slurm-based cluster. Please contact us if you need additional help shifting your workflows to the Slurm-based cluster. PACE provides documentation, PACE Consulting Sessions, and PACE Slurm Orientation Sessions to support the smooth transition of your workflows to Slurm. We will host a Slurm Orientation Session (for users new to Slurm) on Friday, February 17, at 11 AM.
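
For quick reference, here is a minimal sketch of how common Torque/Moab directives and commands map to their Slurm equivalents. The job, account, partition, and resource values are placeholders; the PACE Slurm documentation and orientation sessions remain the authoritative guides.

    # Torque/Moab (old)                    # Slurm (new)
    # #PBS -N myjob                        #SBATCH -J myjob
    # #PBS -A GT-<account>                 #SBATCH -A GT-<account>
    # #PBS -l nodes=2:ppn=24               #SBATCH -N 2 --ntasks-per-node=24
    # #PBS -l walltime=04:00:00            #SBATCH -t 04:00:00
    # #PBS -l pmem=4gb                     #SBATCH --mem-per-cpu=4G
    #
    # Submit a job:  qsub myjob.pbs        sbatch myjob.sbatch
    # Check status:  qstat -u $USER        squeue -u $USER
    # Cancel a job:  qdel <jobid>          scancel <jobid>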

The transfer of remaining funds on Phoenix Moab/Torque to Slurm is ongoing and is expected to be completed next week.  January statements will report the accurate balance when they are sent out. 

The next maintenance period for all PACE clusters is May 9, 2023, at 6:00 AM through May 11, 2023, at 11:59 PM. Additional maintenance periods are tentatively scheduled for 2023 on August 8-10, and October 31-November 2. 

Status of activities: 

ITEMS REQUIRING USER ACTION: 

  • [Complete] [Phoenix] Slurm migration for the sixth and final phase of Phoenix cluster (about 123 additional nodes for a final total of about 1323). Phoenix users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows work on the Slurm-based cluster. 
  • [Complete] [Phoenix] New Phoenix login servers may trigger a security warning due to changes in the SSH host keys. If you see this warning, remove the old host key from your local SSH cache to clear it (see the sketch after this list).
  • [Complete] [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 
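
A minimal sketch of clearing the cached host key, assuming a hypothetical login hostname (substitute the hostname you actually connect to):

    # Remove the stale Phoenix host key from ~/.ssh/known_hosts.
    # "login-phoenix.pace.gatech.edu" is a placeholder; use your own login hostname.
    ssh-keygen -R login-phoenix.pace.gatech.edu
    # The next connection will prompt you to accept the new host key.
    ssh <your-username>@login-phoenix.pace.gatech.edu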

ITEMS NOT REQUIRING USER ACTION: 

  • [Complete] [ALL CLUSTERS] Update GID on all file systems. Over 1.7 Billion files were updated 
  • [Complete] [Phoenix] Re-image last Phoenix login node; re-enable load balancer 
  • [Complete] [Phoenix] Migrate Remaining Phoenix-Moab Funds to Phoenix-Slurm 
  • [Complete] [Network] Code upgrade to PACE departmental Palo Alto 
  • [Complete] [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Complete] [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Complete] [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Complete] [Storage] Update sysctl parameters on ZFS servers 
  • [Complete] [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Complete] [Datacenter] Databank: High Temp Chiller & Tower Maintenance 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu 

Thank You,
– The PACE Team

[Updated 2023/02/02, 4:20 PM EST]

Dear Hive Users, 

The Hive cluster is now ready for research and learning. We have released all jobs that were held by the scheduler. 

The next maintenance period for all PACE clusters is May 9, 2023, at 6:00 AM through May 11, 2023, at 11:59 PM. Additional maintenance periods are tentatively scheduled for 2023 on August 8-10, and October 31-November 2. 

We are still working on maintenance for the Phoenix cluster and will provide more updates as more information is available. 

Status of activities: 

ITEMS REQUIRING USER ACTION: 

  • [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 Billion files will be updated 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Hive][Storage] Replace 40G cables on storage-hive 
  • [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers 
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Thank You,
– The PACE Team

[Updated 2023/02/02, 4:05 PM EST]

Dear Firebird Users, 

The Firebird cluster is now ready for research and learning. We have released all jobs that were held by the scheduler. 

The next maintenance period for all PACE clusters is May 9, 2023, at 6:00 AM through May 11, 2023, at 11:59 PM. Additional maintenance periods are tentatively scheduled for 2023 on August 8-10, and October 31-November 2. 

We are still working on maintenance for the Phoenix cluster and will provide more updates as more information is available. 

Status of activities: 

ITEMS REQUIRING USER ACTION: 

  • [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 Billion files will be updated 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu 

Thank You,
– The PACE Team

[Updated 2023/02/01, 4:05 PM EST]

Dear Buzzard Users,

The Buzzard cluster is now ready for research and learning. We have released all jobs that were held by the scheduler. 

The next maintenance period for all PACE clusters is May 9, 2023, at 6:00 AM through May 11, 2023, at 11:59 PM. Additional maintenance periods are tentatively scheduled for 2023 on August 8-10, and October 31-November 2. 

We are still working on maintenance for the Phoenix cluster and will provide more updates as more information is available. 

Status of activities: 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 Billion files will be updated 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers 
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu 

Thank You,
– The PACE Team

[Updated 2023/02/01, 4:00 PM EST]

The PACE-ICE and COC-ICE instructional clusters are ready for learning. As usual, we have released all user jobs that were held by the scheduler. You may resume using PACE-ICE and COC-ICE at this time. PACE’s research clusters remain under maintenance as planned.

[Updated 2023/01/31, 6:00AM EST]

WHEN IS IT HAPPENING?
Maintenance Period starts now at 6 AM EST on Tuesday, 01/31/2023, and is tentatively scheduled to conclude by 11:59PM on Tuesday, 02/07/2023. 

The Phoenix project file system changes are estimated to take seven days to complete.  The other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) are anticipated to finish earlier than seven days. PACE will release them as soon as maintenance and migration work are complete.

WHAT DO YOU NEED TO DO?
During this extended Maintenance Period, access to all the PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard. Phoenix is expected to take the full week, while the other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) will be released as storage updates are completed during the maintenance window.

Torque/Moab will no longer be available to Phoenix users starting now. We strongly encourage all researchers to shift their workflows to the Slurm-based cluster. PACE provides documentation, PACE Consulting Sessions, and PACE Slurm Orientation Sessions to support the smooth transition of your workflows to Slurm. These can be found at: https://pace.gatech.edu/

Users using Singularity on the command line need to use Apptainer commands moving forward. 
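
In most cases this is a one-for-one command substitution, since Apptainer continues the Singularity project with the same subcommands. A minimal sketch with placeholder image and script names:

    # Old (Singularity):
    #   singularity pull docker://ubuntu:22.04
    #   singularity exec my_image.sif python3 my_script.py
    # New (Apptainer) -- same subcommands, new command name:
    apptainer pull docker://ubuntu:22.04
    apptainer exec my_image.sif python3 my_script.py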

WHAT IS HAPPENING?  
PACE Maintenance Period starts now and will run until it is complete. Phoenix downtime could last until Tuesday, 02/07/2023 or beyond. 

ITEMS REQUIRING USER ACTION: 

  • [Phoenix] Slurm migration for the sixth and final phase of Phoenix cluster (123 additional nodes for a final total of about 1323). Phoenix users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows work on the Slurm-based cluster. 
  • [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 Billion files will be updated 
  • [Phoenix] Re-image last Phoenix login node; re-enable load balancer 
  • [Phoenix] Migrate Remaining Phoenix-Moab Funds to Phoenix-Slurm 
  • [ICE] Update cgroups limits on ICE head nodes 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Hive][Storage] Replace 40G cables on storage-hive 
  • [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers 
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance

WHY IS IT HAPPENING? 
The extended maintenance period is required to remove GIDs that conflict with those assigned campus-wide, which allows the expansion of research storage across campus. It is a required component of a strategic initiative and provides foundational work for additional storage options and capacity for researchers. The additional items are part of our regularly scheduled Maintenance Periods, which are announced in advance at https://pace.gatech.edu/. Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?
All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank You,
– The PACE Team

[Updated 2023/01/27, 2:06PM EST]

WHEN IS IT HAPPENING?
Reminder that the next Maintenance Period starts at 6:00AM on Tuesday, 01/31/2023, and is tentatively scheduled to conclude by 11:59PM on Tuesday, 02/07/2023.

The Phoenix project file system changes are estimated to take seven days to complete.  The other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) are anticipated to finish earlier than seven days. PACE will release them as soon as maintenance and migration work is complete.

WHAT DO YOU NEED TO DO?
As usual, jobs with resource requests that would be running during the Maintenance Period will be held until after the maintenance by the scheduler. During this extended Maintenance Period, access to all the PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard. Phoenix is expected to take the full week, while the other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) will be released as storage updates are completed during the maintenance window. If you have a critical deadline that this will impact, please let us know, and we can collaborate on possible alternatives. Please plan accordingly for the projected downtime.

Torque/Moab will no longer be available to Phoenix users starting January 31st, at 6 AM ET. We strongly encourage all researchers to shift their workflows to the Slurm-based cluster. PACE provides documentation, PACE Consulting Sessions, and PACE Slurm Orientation Sessions to support the smooth transition of your workflows to Slurm. These can be found at: https://pace.gatech.edu/

Users using Singularity on the command line need to use Apptainer commands moving forward. 

WHAT IS HAPPENING?  
The next PACE Maintenance Period starts 01/31/2023 at 6am and will run until complete. Phoenix downtime could last until Feb 7 or beyond.

ITEMS REQUIRING USER ACTION: 

  • [Phoenix] Slurm migration for the sixth and final phase of Phoenix cluster (123 additional nodes for a final total of about 1323). Phoenix users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows work on the Slurm-based cluster. 
  • [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 Billion files will be updated 
  • [Phoenix] Re-image last Phoenix login node; re-enable load balancer 
  • [Phoenix] Migrate Remaining Phoenix-Moab Funds to Phoenix-Slurm 
  • [ICE] Update cgroups limits on ICE head nodes 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Hive][Storage] Replace 40G cables on storage-hive 
  • [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers 
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance

WHY IS IT HAPPENING? 
The extended maintenance period is required to remove GIDs that conflict with those assigned campus-wide, which allows the expansion of research storage across campus. It is a required component of a strategic initiative and provides foundational work for additional storage options and capacity for researchers. The additional items are part of our regularly scheduled Maintenance Periods, which are announced in advance at https://pace.gatech.edu/. Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?
All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank You,
– The PACE Team

[Updated 2023/01/20, 8:45AM EST]

WHEN IS IT HAPPENING?
The next Maintenance Period starts at 6:00AM on Tuesday, 01/31/2023, and is tentatively scheduled to conclude by 11:59PM on Tuesday, 02/07/2023.

The Phoenix project file system changes are estimated to take seven days to complete.  The other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) are anticipated to finish earlier than seven days. PACE will release them as soon as maintenance and migration work is complete.

WHAT DO YOU NEED TO DO?
As usual, jobs with resource requests that would be running during the Maintenance Period will be held until after the maintenance by the scheduler. During this extended Maintenance Period, access to all the PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard. Phoenix is expected to take the full week, while the other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) will be released as storage updates are completed during the maintenance window. If you have a critical deadline that this will impact, please let us know, and we can collaborate on possible alternatives. Please plan accordingly for the projected downtime.

Torque/Moab will no longer be available to Phoenix users starting January 31st, at 6 AM ET. We strongly encourage all researchers to shift their workflows to the Slurm-based cluster. PACE provides documentation, PACE Consulting Sessions, and PACE Slurm Orientation Sessions to support the smooth transition of your workflows to Slurm. These can be found at: https://pace.gatech.edu/

Users using Singularity on the command line need to use Apptainer commands moving forward. 

WHAT IS HAPPENING?  
The next PACE Maintenance Period starts 01/31/2023 at 6am and will run until complete. Phoenix downtime could last until Feb 7 or beyond.

ITEMS REQUIRING USER ACTION: 

  • [Phoenix] Slurm migration for the sixth and final phase of Phoenix cluster (about 119 additional nodes for a final total of about 1319). Phoenix users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows work on the Slurm-based cluster. 
  • [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 Billion files will be updated 
  • [Phoenix] Re-image last Phoenix login node; re-enable load balancer 
  • [Phoenix] Migrate Remaining Phoenix-Moab Funds to Phoenix-Slurm 
  • [ICE] Update cgroups limits on ICE head nodes 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Hive][Storage] Replace 40G cables on storage-hive 
  • [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers 
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance

WHY IS IT HAPPENING? 
The extended maintenance period is required to remove GIDs that conflict with those assigned campus-wide, which allows the expansion of research storage across campus. It is a required component of a strategic initiative and provides foundational work for additional storage options and capacity for researchers. The additional items are part of our regularly scheduled Maintenance Periods, which are announced in advance at https://pace.gatech.edu/. Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?
All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank You,
– The PACE Team

Upcoming Maintenance Period Extension Required January 31-February 7 (estimated)

WHAT IS HAPPENING? 
PACE is updating the group ID (GID) of every group and file in our storage infrastructure to remove conflicts with the GIDs assigned campus-wide by OIT. The expected time per cluster varies greatly with the size of the associated storage. During the maintenance period, PACE will release clusters as soon as each is complete.  
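
For context, the update amounts to walking each file system and rewriting group ownership from every conflicting GID to its newly assigned value. A simplified sketch of that kind of operation (performed by PACE administrators; users do not need to run anything) might look like:

    # Illustrative only: remap files owned by a conflicting group ID to its new ID.
    # OLD_GID, NEW_GID, and the storage path are placeholders.
    find /path/to/storage -gid "$OLD_GID" -exec chgrp -h "$NEW_GID" {} +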

WHEN IS IT HAPPENING? 
Maintenance period starts on Tuesday, January 31, 2023. The changes to the Phoenix project file system are estimated to take seven days to complete. Thus, the maintenance period will be extended from the typical three days to seven days for the Phoenix cluster. The other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) are anticipated to finish earlier. PACE will release them as soon as maintenance and migration work are complete. Researchers can expect those clusters to be ready earlier.   

WHY IS IT HAPPENING? 
This is a critical step for us to be able to make new storage available to campus users, removing group ID conflicts we currently have with the Georgia Tech Enterprise Directory (GTED). This will allow us to provide campus- and PACE-mountable storage to our researchers and provide a foundation for additional self-service capabilities. This change will also allow us to evolve the PACE user management tools and processes. We understand that the short-term impact of this outage is problematic, but as storage utilization increases, the problem will only get worse if left unaddressed. We expect the long-term impact of this update to be low, since ownership, group names, and permissions will remain unchanged. 

WHO IS AFFECTED? 
All users across all PACE’s clusters.  

WHAT DO YOU NEED TO DO? 
Please plan accordingly for an extended Maintenance Period for the Phoenix cluster starting Tuesday, January 31, 2023. The other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) will be released as storage updates are completed during the maintenance window. If you have a critical deadline that this will impact, please let us know and we can collaborate on possible alternatives.  

WHO SHOULD YOU CONTACT FOR QUESTIONS? 
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns. If escalation is required, please also email me, Pam Buffington (pam@gatech.edu), directly.

Best, 
– The PACE Team 
– Pam Buffington – PACE Director 

Phoenix Cluster Migration to Slurm Scheduler – Phase 5

[Updated 2023/01/17, 4:02PM EST]

Dear Phoenix researchers,  

The fifth phase of Phoenix Slurm Migration has been successfully completed – all nodes are back online, with 100 more compute nodes joining the Phoenix-Slurm cluster. We have successfully migrated about 1200 nodes (out of about 1319) from Phoenix to the Phoenix-Slurm cluster. 

As a reminder, the final phase of the migration is scheduled to complete later this month, during which the remaining 119 nodes will join Phoenix-Slurm:

  • Phase 6: January 31, 2023 (PACE Maintenance Period) – remaining 119 nodes 

We strongly encourage all researchers to shift their workflows to the Slurm-based cluster. PACE provides documentation and consulting sessions to support the smooth transition of your workflows to Slurm. Our next PACE Slurm orientation is scheduled for this Friday, January 20, from 11 AM to 12 PM via Zoom. Torque/Moab will no longer be available to Phoenix users starting January 31st, at 6 AM ET. 

PACE will be following up with additional updates and reminders in the upcoming weeks. In the meantime, please contact us with any questions or concerns about this transition. 

Best,
– The PACE Team

[Updated 2023/01/17, 6:00AM EST]

WHAT IS HAPPENING? 
For Phase 5 of Phoenix Cluster Slurm migration, 100 nodes will be taken offline on the Phoenix cluster and migrated to the Phoenix-Slurm cluster. 

WHEN IS IT HAPPENING? 
Today – Tuesday, January 17th at 6am ET 

WHY IS IT HAPPENING? 
This is part of the ongoing migration from Phoenix to the Phoenix-Slurm cluster. We do not expect there to be any impact to other nodes or jobs on the Phoenix or Phoenix-Slurm clusters. 

WHO IS AFFECTED? 
All Phoenix cluster users. 

WHO SHOULD YOU CONTACT FOR QUESTIONS? 
We will follow up with additional updates and reminders as needed. If you have any additional questions or concerns about the migration, please email us.

Best,
– The PACE Team

[Updated 2023/01/13, 1:05PM EST]

WHAT IS HAPPENING? 
For Phase 5 of Phoenix Cluster Slurm migration, 100 nodes will be taken offline on the Phoenix cluster and migrated to the Phoenix-Slurm cluster. 

WHEN IS IT HAPPENING? 
Tuesday, January 17th at 6am ET 

WHY IS IT HAPPENING? 
This is part of the ongoing migration from Phoenix to the Phoenix-Slurm cluster. So far, we have successfully migrated about 1100 nodes (out of about 1319 total). For this fifth phase of the migration, 100 additional nodes will join the Phoenix-Slurm cluster. We do not expect there to be any impact to other nodes or jobs on the Phoenix or Phoenix-Slurm clusters. 

WHO IS AFFECTED? 
All Phoenix cluster users. 

WHAT DO YOU NEED TO DO? 
As recommended at the beginning of the migration, we strongly encourage all researchers to continue shifting their workflows to the Slurm side of Phoenix to take advantage of the improved features and queue wait times. We provide helpful information for the migration in our documentation and will hold a PACE Slurm Orientation on Friday, January 20, from 11 AM to 12 PM via Zoom. 

WHO SHOULD YOU CONTACT FOR QUESTIONS? 
Please email us or join a PACE Consulting Session if you have any questions or need assistance migrating your workflows to Slurm.

Best,
– The PACE Team

Phoenix Project & Scratch Storage Cables Replacement

WHAT’S HAPPENING?
Two cables on the Phoenix Lustre device, which hosts project and scratch storage, need to be replaced: one connecting to controller 0 and one connecting to controller 1. The cables will be replaced one at a time, and the work is expected to take about 4 hours.

WHEN IS IT HAPPENING?
Wednesday, January 18, 2023, starting at 1:00 PM ET.

WHY IS IT HAPPENING?
Required maintenance.

WHO IS AFFECTED?
All users may experience a brief storage access outage and temporarily decreased performance.

WHAT DO YOU NEED TO DO?
Because the redundant controller remains active while each cable is replaced, no outage is expected during the work. If storage does become unavailable during the replacement, your jobs may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

Hive Project & Scratch Storage Cable Replacement

WHAT’S HAPPENING?
Two cables connecting one of the two controllers of the Hive Lustre device need to be replaced. The cables will be replaced one at a time, and the work is expected to take about 2 hours.

WHEN IS IT HAPPENING?
Wednesday, January 18, 2023, starting at 10:00 AM ET.

WHY IS IT HAPPENING?
Required maintenance.

WHO IS AFFECTED?
All users may experience a brief storage access outage and temporarily decreased performance.

WHAT DO YOU NEED TO DO?
Because the redundant controller remains active while each cable is replaced, no outage is expected during the work. If storage does become unavailable during the replacement, your jobs may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

PyTorch Security Risk: Please Check & Update

WHAT’S HAPPENING?

Researchers who install their own copies of PyTorch may have downloaded a compromised package and should uninstall it immediately.

WHEN IS IT HAPPENING?

PyTorch-nightly builds installed between December 25 and December 30, 2022, are impacted. Please uninstall them immediately if you installed a nightly build during that window.

WHY IS IT HAPPENING?

A malicious Triton dependency was added to the Python Package Index. See https://pytorch.org/blog/compromised-nightly-dependency/ for details.

WHO IS AFFECTED?

Researchers who installed PyTorch on PACE or other systems and updated it with nightly packages between December 25 and 30 are affected. PACE has scanned all .conda and .local directories on our systems and has not identified any copies of the compromised Triton package.

Affected services: All PACE clusters

WHAT DO YOU NEED TO DO?

Please uninstall the compromised package immediately. Details are available at https://pytorch.org/blog/compromised-nightly-dependency/. In addition, please alert PACE at pace-support@oit.gatech.edu if you identify an installation on our systems.
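
A minimal sketch of the check-and-remove steps, adapted from the linked PyTorch advisory (adjust the pip command for your own environment, for example a conda environment on PACE):

    # Check whether the malicious dependency is present in your environment.
    pip3 show torchtriton && echo "torchtriton found -- please uninstall it and notify PACE"

    # Uninstall the affected nightly packages along with the malicious dependency,
    # then clear the pip cache (as recommended in the PyTorch advisory).
    pip3 uninstall -y torch torchvision torchaudio torchtriton
    pip3 cache purge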

WHO SHOULD YOU CONTACT FOR QUESTIONS?

Please contact PACE at pace-support@oit.gatech.edu with questions, or if you are unsure if you have installed the compromised package on PACE.