Posts

Phoenix filesystem intermittent slowness

Summary: Phoenix’s filesystem response has been inconsistent starting today. We are observing high utilization on all of the head-nodes.

Details: File access is intermittently slow on home, project, and scratch storage. Even simple commands such as ‘ls’ on the head-node can respond slowly. Slowness in file access was first detected by a couple of users around 3 PM yesterday, and we have started receiving more reports this afternoon. The PACE team is actively working to identify the root cause and resolve the issue as soon as possible.

Impact: Users may continue to experience intermittent slowness when using the head-nodes, submitting jobs, compiling code, running interactive sessions, and reading or writing files.

Thank you for your patience. Please contact us at pace-support@oit.gatech.edu with any questions. We will continue to monitor performance and follow up with another status message tomorrow morning.

06/08/2023 Update

Phoenix home, project, and scratch storage are all fully functional. Filesystem performance has been normal for the last 12 hours. We will continue to investigate the root cause and monitor performance.

As of now, the utilization on our servers has stabilized. The issue has not impacted any jobs running or waiting in the queue. Users can resume using Phoenix as usual.

For questions, please contact PACE at pace-support@oit.gatech.edu.

PACE Maintenance Period, May 9-11, 2023

[Update 5/11/23]

The Phoenix, Hive, Firebird, and Buzzard clusters are now ready for research. We have released all jobs that were held by the scheduler. 

The ICE instructional cluster remains under maintenance until tomorrow. Summer instructors will be notified when the upgraded ICE is ready for use.

The next maintenance period for all PACE clusters is August 8, 2023, at 6:00 AM through August 10, 2023, at 11:59 PM. An additional maintenance period for 2023 is tentatively scheduled for October 24-26, 2023 (note revised date).  

Status of activities:

  • [Complete][Login nodes] Implement enforcement of usage limits on the login nodes, limiting each individual to 1 CPU, 4 GB memory, and 5000 open files. These limits should reduce the possibility of one individual’s processes causing a login node outage. Researchers are reminded to use interactive jobs for resource-intensive activities via OnDemand Interactive Shell or the command line (Phoenix, Hive, and Firebird instructions); a sample interactive-session command appears after this list.
  • [In progress][ICE] The instructional cluster will be migrated to the Slurm scheduler; new Lustre-based scratch storage will be added; and home directories will be migrated. PACE-ICE and COC-ICE will be merged. Additional information will be available for instructors on ICE.
  • [Complete][Phoenix Storage] Phoenix scratch will be migrated to a new Lustre device, which will result in fully independent project & scratch filesystems. Researchers will find their scratch data remains accessible at the same path via symbolic link or directly via the same mount location.
  • [Complete][Datacenter] Connect new cooling doors to power for datacenter expansion
  • [Complete][Datacenter] High-temperature loop pump maintenance
  • [Complete][Storage] Replace cables on Hive and Phoenix parallel filesystems
  • [Complete][Network] Upgrade ethernet switch code in Enterprise Hall
  • [Complete][Network] Configure virtual pair between ethernet switches in Research Hall
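
For researchers adapting to the new login-node limits, here is a minimal command-line sketch of requesting an interactive session on a compute node; the charge account, resource amounts, and walltime are placeholders, not PACE-prescribed values:

    # Hypothetical example: request 4 cores, 16 GB of memory, and 1 hour on a
    # compute node; replace the account with your own charge account.
    salloc -A gts-exampleaccount -N 1 --ntasks-per-node=4 --mem=16G -t 1:00:00

Running resource-intensive work inside such an allocation (or an OnDemand Interactive Shell) keeps it off the login nodes and within the new per-user limits.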

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[Update 5/2/23]

This is a reminder that the next PACE Maintenance Period starts at 6:00AM on Tuesday, 05/09/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 05/11/2023.

Maintenance on the ICE instructional cluster is expected to continue through Friday, 05/12/2023.

Updated planned activities:

WHAT IS HAPPENING?  

ITEMS NOT REQUIRING USER ACTION: 

  • [Login nodes] Implement enforcement of usage limits on the login nodes, limiting each individual to 1 CPU, 4 GB memory, and 5000 open files. These limits should reduce the possibility of one individual’s processes causing a login node outage. Researchers are reminded to use interactive jobs for resource-intensive activities via OnDemand Interactive Shell or the command line (Phoenix, Hive, and Firebird instructions).
  • [ICE] The instructional cluster will be migrated to the Slurm scheduler; new Lustre-based scratch storage will be added; and home directories will be migrated. PACE-ICE and COC-ICE will be merged. Additional information will be available for instructors on ICE.
  • [Phoenix Storage] Phoenix scratch will be migrated to a new Lustre device, which will result in fully independent project & scratch filesystems. Researchers will find their scratch data remains accessible at the same path via symbolic link or directly via the same mount location.
  • [Datacenter] Connect new cooling doors to power for datacenter expansion
  • [Datacenter] High-temperature loop pump maintenance
  • [Storage] Replace cables on Hive and Phoenix parallel filesystems
  • [Network] Upgrade ethernet switch code in Enterprise Hall
  • [Network] Configure virtual pair between ethernet switches in Research Hall

[Original Announcement 4/24/23]

WHEN IS IT HAPPENING?
The next PACE Maintenance Period starts at 6:00AM on Tuesday, 05/09/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 05/11/2023.

Maintenance on the ICE instructional cluster is expected to continue through Friday, 05/12/2023.

WHAT DO YOU NEED TO DO?
As usual, jobs with resource requests that would overlap the Maintenance Period will be held by the scheduler until after the maintenance. During the Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.
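
If you would like to confirm that a queued job is simply waiting out the maintenance window rather than being blocked for another reason, a quick check along the following lines can help; this is a sketch, and the exact reason text depends on the scheduler configuration, but jobs held by a maintenance reservation typically show a reason such as "ReqNodeNotAvail, Reserved for maintenance":

    # Show your queued jobs with their state and the scheduler's reason for
    # holding them (%T is the job state, %R the reason for pending jobs).
    squeue -u $USER --format="%.10i %.9P %.30j %.8T %.25R"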

WHAT IS HAPPENING?  

ITEMS NOT REQUIRING USER ACTION: 

  • [Login nodes] Implement enforcement of usage limits on the login nodes, limiting each individual to 1 CPU, 4 GB memory, and 5000 open files. These limits should reduce the possibility of one individual’s processes causing a login node outage. Researchers are reminded to use interactive jobs for resource-intensive activities via OnDemand Interactive Shell or the command line (Phoenix, Hive, and Firebird instructions).
  • [ICE] The instructional cluster will be migrated to the Slurm scheduler; new Lustre-based scratch storage will be added; and home directories will be migrated. PACE-ICE and COC-ICE will be merged. Additional information will be available for instructors on ICE.
  • [Datacenter] Connect new cooling doors to power for datacenter expansion
  • [Datacenter] High-temperature loop pump maintenance
  • [Storage] Replace Input/Output Modules on two storage devices

WHY IS IT HAPPENING? 
Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system. Future maintenance dates may be found on our homepage.

WHO IS AFFECTED?
All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Phoenix Scratch Storage & Scheduler Outages

[Update 4/3/23 5:30 PM]

Phoenix’s scratch storage & scheduler are again fully functional.

The scratch storage system was repaired by 3 PM. We rebooted one of the storage servers, with the redundant controllers taking over the load, and brought it back online to restore responsiveness.

The scheduler outage was caused by a number of communication timeouts, later exacerbated by jobs stuck on scratch storage. After processing the backlog, the scheduler began launching jobs again around 4:20 PM this afternoon, and we have been monitoring it since then. At this time, due to high utilization, the Phoenix CPU nodes are nearly completely occupied.

We will refund any job that failed after 10:30 AM today due to the outage.

Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.

[Original Post 4/3/23 2:30 PM]

Summary: Scratch storage is currently inaccessible on Phoenix. In addition, jobs are not able to start. The login nodes experienced high load earlier today, rendering them non-responsive, which was resolved through a reboot.

Details: Phoenix is currently experiencing multiple issues, and the PACE team is investigating. The scratch storage system is inaccessible, as the Lustre service has been timing out since approximately 11:30 AM today. The scheduler has also been failing to launch jobs since about 10:30 AM today. Finally, we experienced high load on all four Phoenix login nodes around 1:00 PM today; the login nodes were repaired through a reboot. All issues, including any potential root cause, are being investigated by the PACE team today.

Impact: Researchers on login nodes may have been disconnected during the reboots required to restore functionality. Scratch storage is unreachable at this time. Home and project storage are not impacted, and already-running jobs using those directories should continue. Jobs running against scratch storage may not be making progress. New jobs are not launching and will remain in the queue.

Thank you for your patience as we investigate these issues and restore Phoenix to full functionality. For questions, please contact PACE at pace-support@oit.gatech.edu.

Phoenix Project & Scratch Storage Cables Replacement

WHAT’S HAPPENING?
Two cables connecting the Phoenix Lustre device, which hosts project and scratch storage, need to be replaced: one to controller 0 and one to controller 1. The cables will be replaced one at a time, and the work is expected to take about 3 hours.

WHEN IS IT HAPPENING?
Monday, April 3rd, 2023, starting at 9 AM EDT.

WHY IS IT HAPPENING?
Required maintenance.

WHO IS AFFECTED?
All users; a brief storage access outage and temporarily decreased performance are possible.

WHAT DO YOU NEED TO DO?
Because a redundant controller remains active while work is performed on one cable at a time, no outage is expected during the cable replacement. If storage does become unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.
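
If one of your jobs does stall because storage becomes unreachable, the recovery is simply to cancel it and resubmit once storage is back. A minimal sketch, where the job ID and script name are placeholders:

    scancel 1234567              # cancel the job that is stuck on storage I/O
    # ...wait until storage availability is restored...
    sbatch my_job_script.sbatch  # resubmit the same job script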

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

Hive Project & Scratch Storage Cable Replacement

WHAT’S HAPPENING?
Two cables connecting one of the two controllers of the Hive Lustre device need to be replaced. Cables will be replaced one at a time, taking about 3 hours to complete the work.

WHEN IS IT HAPPENING?
Monday, April 3rd, 2023, starting at 9 AM EDT.

WHY IS IT HAPPENING?
Required maintenance.

WHO IS AFFECTED?
All users; a brief storage access outage and temporarily decreased performance are possible.

WHAT DO YOU NEED TO DO?
Because a redundant controller remains active while work is performed on one cable at a time, no outage is expected during the cable replacement. If storage does become unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored (see the example in the Phoenix cable replacement notice above).

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

Connecting new cooling doors to power

[Updated 2023/04/04, 12:25PM ET]

Electricians needed to complete some additional checks before performing the final connection, so the task has been rescheduled for Thursday, April 6.

[Original post 2023/03/31, 4:51PM ET]

WHAT’S HAPPENING?
In order to complete the Coda data center expansion on time and under budget, low-risk electrical work will be performed: the 12 additional uSystems cooling doors will be wired to the distribution panels and left powered off. Adding the circuit breaker is the only work on the “powered” side of the circuits.

WHEN IS IT HAPPENING?
Tuesday, April 4th, 2023; the work will be performed during business hours.

WHY IS IT HAPPENING?
Required maintenance.

WHO IS AFFECTED?
No user jobs should be affected. The connection work is very low risk, and most of it will be done on the “unpowered” side of the panel. In the worst case, we would lose power to up to 20 cooling doors, which we expect to recover in less than 1 minute. If recovery takes longer than 5 minutes, we will initiate an emergency power-down of the affected nodes.

WHAT DO YOU NEED TO DO?
Nothing.

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

Phoenix project storage outage

[Updated 2023/03/17 3:30 PM]

Phoenix project storage is again available, and we have resumed the scheduler, allowing new jobs to begin. Queued jobs will begin as resources are available.

The storage issue arose when one metadata server rebooted shortly after 1:00 PM yesterday and the high-availability configuration automatically switched to the secondary server, which became overloaded. After extensive investigation yesterday evening and today, in collaboration with our storage vendor, we identified and stopped a specific series of jobs that were heavily taxing storage and also replaced several cables, fully restoring Phoenix project storage availability.

Jobs that were running as of 1:00 PM yesterday and that have failed or will fail due to the project storage outage will be refunded to the charge account provided. Please resubmit these failed jobs to Slurm to continue research.
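
To identify which of your jobs failed during the outage window before resubmitting them, a query along these lines may help; this is a sketch, with the start time set to the beginning of the outage and the output fields chosen only as a suggestion:

    # List your jobs that ended in a failure state since the outage began
    # around 1:00 PM on 3/16, including the charge account for each.
    sacct -u $USER --starttime=2023-03-16T13:00:00 \
          --state=FAILED,NODE_FAIL,TIMEOUT \
          --format=JobID,JobName,Partition,Account,State,Elapsed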

Thank you for your patience as we repaired project storage. Please contact us with any questions.

[Updated 2023/03/16, 11:55PM ET]

We’re still experiencing significant slowness of the filesystem. We will keep job scheduling paused overnight, and the PACE team will resume troubleshooting as early as possible in the morning.

[Updated 2023/03/16, 6:50PM ET]

Troubleshooting continues with the vendor’s assistance. The filesystem is currently stable, but one of the metadata servers continues to carry an abnormal workload. We are working to resolve this issue to avoid additional filesystem failures.

[Original post 2023/03/16, 2:48PM ET]

Summary: Phoenix project storage is currently unavailable. The scheduler is paused, preventing any additional jobs from starting until the issue is resolved.

Details: A metadata server (MDS) for the Phoenix Lustre parallel filesystem hosting project storage has encountered errors and rebooted. The PACE team is investigating at this time and working to restore project storage availability.

Impact: Project storage is slow or unreachable at this time. Home and scratch storage are not impacted, and already-running jobs using those directories should continue. Jobs running against project storage may not be making progress. To avoid further job failures, we have paused the scheduler, so no new jobs will start on Phoenix, regardless of the storage used.

Thank you for your patience as we investigate this issue and restore Phoenix storage to full functionality.

For questions, please contact PACE at pace-support@oit.gatech.edu.

New compute nodes on the Phoenix cluster

In February, we added several compute nodes to the Phoenix cluster. These give Phoenix users access to more powerful nodes for their computations and reduce waiting times for high-demand hardware.

There are three groups of new nodes:

  1. 40 32-core Intel CPU high-memory nodes (768 GB of RAM per node). These nodes are part of our “cpu-large” partition, and this addition increases the number of “cpu-large” nodes from 68 to 108. The nodes have dual Intel Xeon Gold 6226R processors @ 2.9 GHz (with 32 rather than 24 cores per node). Any job that requires more than 16 GB of memory per CPU will land on nodes in the “cpu-large” partition.
  2. 4 AMD CPU nodes with 128 cores per node. These nodes are part of our “cpu-amd” partition, and this addition increases the number of “cpu-amd” nodes from 4 to 8. The nodes have dual AMD Epyc 7713 processors @ 2.0 GHz (128 cores per node) and 512 GB of memory. For comparison, most of the older Phoenix compute nodes have 24 cores per node (and have Intel rather than AMD processors). To target these nodes specifically, you can specify the flag “-C amd” in your sbatch script or salloc command: https://docs.pace.gatech.edu/phoenix_cluster/slurm_guide_phnx/#amd-cpu-jobs
  3. 7 AMD CPU nodes with 64 cores per node and two Nvidia A100 GPUs per node (40 GB of GPU memory each). These nodes are part of our “gpu-a100” partition, and this addition increases the number of “gpu-a100” nodes from 5 to 12. These nodes have dual AMD Epyc 7513 processors @ 2.6 GHz (64 cores per node) and 512 GB of RAM. To target these nodes, you can specify the flag “--gres=gpu:A100:1” (to get one GPU per node) or “--gres=gpu:A100:2” (to get both GPUs for each requested node) in your sbatch script or salloc command (a short sbatch sketch covering this flag and the “-C amd” constraint above follows this list): https://docs.pace.gatech.edu/phoenix_cluster/slurm_guide_phnx/#gpu-jobs
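
As a quick illustration of the flags mentioned above, here is a minimal sbatch sketch; the charge account, resource sizes, and program name are placeholders rather than PACE-prescribed values:

    #!/bin/bash
    # Constrain the job to the new AMD CPU nodes with "-C amd".
    #SBATCH -A gts-exampleaccount
    #SBATCH -N 1
    #SBATCH --ntasks-per-node=8
    #SBATCH -t 2:00:00
    #SBATCH -C amd
    # For the A100 GPU nodes, replace the constraint with a GPU request, e.g.:
    #   #SBATCH --gres=gpu:A100:1   (one GPU per node; use :2 for both GPUs)

    srun ./my_program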

To see the up-to-date specifications of the Phoenix compute nodes, please refer to our website: 

https://docs.pace.gatech.edu/phoenix_cluster/slurm_resources_phnx/

If you have any other questions, please send us a ticket by emailing pace-support@oit.gatech.edu.

OIT Network Maintenance, Saturday, February 25

WHAT’S HAPPENING?

OIT Network Services will be upgrading the Coda Data Center firewall appliances. This will briefly disrupt connections to PACE, impacting login sessions, interactive jobs, and Open OnDemand sessions. Details on the maintenance are available on the OIT status page.

WHEN IS IT HAPPENING?
Saturday, February 25, 2023, 6:00 AM – 12:00 noon

WHY IS IT HAPPENING?
Required maintenance

WHO IS AFFECTED?

Any researcher or student with an active connection to PACE clusters (Phoenix, Hive, Buzzard, PACE-ICE, and COC-ICE) may lose their connection briefly during the maintenance window. Firebird will not be impacted.

This impacts ssh sessions and interactive jobs. Running batch jobs will not be impacted. Open OnDemand sessions that are disrupted may be resumed via the web interface once the network is restored if their walltime has not expired.

WHAT DO YOU NEED TO DO?

No action is required.

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

Phoenix Project & Scratch Storage Cables Replacement

WHAT’S HAPPENING?
Two cables connecting the Phoenix Lustre device, which hosts project and scratch storage, need to be replaced: one to controller 0 and one to controller 1. The cables will be replaced one at a time, and the work is expected to take about 3 hours.

WHEN IS IT HAPPENING?
Tuesday, February 21st, 2023, starting at 9 AM ET.

WHY IS IT HAPPENING?
Required maintenance.

WHO IS AFFECTED?
All users; a brief storage access outage and temporarily decreased performance are possible.

WHAT DO YOU NEED TO DO?
Because a redundant controller remains active while work is performed on one cable at a time, no outage is expected during the cable replacement. If storage does become unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.