
PACE Data Retention Updates

PACE is nearing the conclusion of our user records and data audit, and we would like to share several updates on how PACE handles data stored on our systems and on the role of schools in managing that data.

PACE takes responsibility for storing all data from our old systems in the Rich datacenter that has not been assigned to a current member of the faculty. All of this data has been inaccessible to all users since December 1, 2020. Any school may request the data of a former faculty member and thereby assume responsibility for the data and for the cost of continuing to store it on PACE (unless it is relocated), in accordance with the Data Retention Guidelines provided by the GT administration and shared with you at this time. If a school does not make such a request, PACE will cover the cost of storing this data until July 2024 and then delete anything that has not been requested.

All data left on PACE by faculty who departed the Institute after July 1, 2021, will follow the Data Retention Guidelines. Under these guidelines, the faculty member’s former school is responsible for the data and for the cost of storing it on PACE, relocating it to another storage system, or determining that it can be deleted, while complying with any relevant regulations and contracts. The 1 TB provided to each school at no charge on Phoenix may be used to store these files. Schools also have the option of purchasing PACE Archive storage, which is designed for long-term retention of data that does not need to be used regularly on PACE compute clusters.

If you have any questions about PACE storage, please contact PACE at pace-support@oit.gatech.edu, and we’ll be happy to discuss it with you.

[Resolved] Phoenix scheduler outage

[Update 3/1/22 6:40 PM]

The Phoenix scheduler has been restored after the PACE team rebooted the server hosting the Moab workload manager. Thank you for your patience while we performed this repair. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Original Post 3/1/22 5:20 PM]

Summary: Phoenix scheduler outage prevents new jobs from starting

What’s happening and what are we doing: The Phoenix scheduler is currently experiencing an outage as the server hosting the Moab workload manager is unavailable. The PACE team is actively investigating and working to bring the scheduler back in service.

How does this impact me: New jobs on Phoenix will not be able to start. While you can submit jobs, they will remain in the queue until service is restored. Running jobs should not be impacted and will continue to run. You may find that Moab commands, including “showq” and others, are not responsive.

What we will continue to do: PACE will continue working to restore functionality to the Phoenix scheduler. We will provide updates as they are available.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Battery replacement on Phoenix project & scratch storage will impact performance on Thursday, March 3

Summary: Battery replacement on Phoenix project & scratch storage will impact performance on Thursday, March 3.

What’s happening and what are we doing: Power supply units on the Phoenix Lustre storage device, holding project and scratch storage, need to be replaced. During the replacement, which will begin at approximately 10 AM on Thursday, March 3, storage will shift to write-through mode, and performance will be impacted. Once the UPS batteries in the new units are sufficiently charged, performance will return to normal.

How does this impact me: Phoenix project and scratch performance will be impacted until the fresh batteries have sufficiently charged, which should take several hours. Jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again.

What we will continue to do: PACE will monitor Phoenix storage throughout this procedure.

Thank you for your patience as we complete this replacement. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Resolved] Phoenix Lustre project & scratch storage degraded performance

[Update 2/23/2022 3:00 PM]

Phoenix Lustre project and scratch storage performance has now recovered. You may now resume submitting jobs as normal. Thank you for your patience as we investigated the root cause.

This issue was caused by certain jobs performing heavy read/write operations on network storage. We thank the researchers we contacted for their cooperation in adjusting their jobs.

If your workflow requires extensive access to large files from multiple nodes, please contact us, and we will be happy to work with you to design a workflow that may speed up your research while also maintaining the stability of network storage. PACE will also continue to work on improvements to our systems and monitoring.

If your work requires generating temporary files during a run, especially if they are large and/or numerous, you may benefit from using local disk on Phoenix compute nodes. Writing intermediate files to local storage avoids network latency and can speed up your calculations while lessening load on the system. Most Phoenix nodes have at least 1 TB of local NVMe storage available, while our SAS nodes have at least 7 TB of local storage. At the end of your job, you can transfer only the relevant output files to network storage (project or scratch).
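
To illustrate this pattern, below is a minimal sketch of a batch script that stages work through node-local disk and copies only the final outputs back to network storage. It assumes a PBS-style scheduler and that a job-specific local directory is available via $TMPDIR; the project paths and executable name are placeholders, so adapt them to your own workflow.

    #!/bin/bash
    #PBS -N local-scratch-example        # placeholder job name; directives are illustrative
    #PBS -l nodes=1:ppn=8
    #PBS -l walltime=04:00:00

    # Work in node-local storage (assumes $TMPDIR points to local disk on the node;
    # substitute the local NVMe/SAS path used on your nodes if it does not).
    LOCAL_DIR="${TMPDIR:-/tmp/$USER}/work"
    mkdir -p "$LOCAL_DIR"
    cd "$LOCAL_DIR"

    # Copy inputs from network (project) storage once, at the start of the job.
    cp ~/p-myproject/inputs/* .          # ~/p-myproject is a placeholder project path

    # Run the application so intermediate files land on local disk, not Lustre.
    ~/p-myproject/bin/my_simulation > run.log 2>&1

    # Copy back only the output files you need to project or scratch storage.
    cp results.out run.log ~/p-myproject/outputs/

    # Clean up local disk before the job ends.
    cd ~ && rm -rf "$LOCAL_DIR"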

We apologize for this inconvenience, and if you have any questions or concerns, please do not hesitate to reach out to us at pace-support@oit.gatech.edu.

[Update 2/22/2022 5:28pm]

We are following up with an update on the Phoenix storage (project and scratch) slowness issues that have persisted since our previous report. We have engaged our storage and fabric vendors as we work to address this issue. Based on our current assessment, we have identified possibly problematic server racks, which have been taken offline. The scheduler remains online, but Phoenix is operating at reduced capacity, and we ask users to refrain from submitting new jobs unless they are urgent. We will continue to provide daily updates as we work to address this issue.

Please accept our sincere apologies for this inconvenience, and if you have any questions or concerns, please do not hesitate to reach out to us at pace-support@oit.gatech.edu.


[Update 2/17/22 5:15 PM]

The PACE team and our storage vendor continue actively working to restore Lustre’s performance. We will provide updates as additional information becomes available. Please contact us at pace-support@oit.gatech.edu if you have any questions.

[Original Post 2/17/22 10:30 AM]

Summary: Phoenix Lustre project & scratch storage degraded performance

What’s happening and what are we doing: Phoenix project and scratch storage have been performing more slowly than normal since late yesterday afternoon. We have determined that the Phoenix Lustre device, hosting project and scratch storage, is experiencing errors and are working with our storage support vendor to restore performance.

How does this impact me: Researchers may experience slow performance using Phoenix project and scratch storage. This may include slowness in listing files in directories, reading files, or running jobs on Lustre storage. Home directories should not be impacted.

What we will continue to do: PACE is actively working, in coordination with our support vendor, to restore Lustre to full performance. We will update you as more information becomes available.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Complete PACE Maintenance Period – February 9 – 11, 2022] PACE Clusters Ready for Research!

Dear PACE Users,

All PACE clusters, including Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard, are ready for research. As usual, we have released all user jobs that were held by the scheduler.

Due to complications with the RHEL7.9 upgrade, 36% of Phoenix compute nodes remain under maintenance. We will work to return the cluster to full strength in the coming days. All node classes and queues have nodes available, and all storage is accessible.

Researchers who did not complete workflow testing on our Testflight environments on Phoenix and Hive, and Firebird users for whom a testing environment was not available, could experience errors related to the upgrade (see blog post). Please submit a support ticket to pace-support@oit.gatech.edu for assistance if you encounter any issues.

Our next maintenance period is tentatively scheduled to begin at 6:00 AM on Wednesday, May 11, 2022, and conclude by 11:59 PM on Friday, May 13, 2022. Additional maintenance periods are tentatively scheduled for August 10-12 and November 2-4.

The following tasks were part of this maintenance period:

ITEMS REQUIRING USER ACTION:

  • [Complete on most nodes][System] Phoenix, Hive, and Firebird clusters’ operating system will be upgraded to RHEL7.9

ITEMS NOT REQUIRING USER ACTION:

  • [Deferred][Datacenter] Databank will repair/replace the DCR, requiring that all PACE compute nodes be powered off.
  • [Complete][Storage/Hive] Upgrade GPFS controller firmware
  • [Complete][Storage/Phoenix] Reintegrate storage previously borrowed for scratch into project storage
  • [Complete][Storage/Phoenix] Replace redundant storage controller and cables
  • [Complete][System] System configuration management updates
  • [Complete][Network] Upgrade IB switch firmware

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team


[Resolved] Coda Datacenter Cooling Issue

[Update – 02/04/2022 10:24AM]

Dear PACE Researchers,

We are following up to inform you that all PACE clusters have resumed normal operations and are accepting new user jobs. After the cooling loop was restored last night, the datacenter’s operating temperatures returned to normal and have remained stable.

As previously mentioned, this outage should not have impacted any running jobs, as PACE powered off only idle compute nodes, so no user action is required. Thank you for your patience as we worked through this emergency outage in coordination with Databank. If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

Best,
The PACE Team


[Original Post]

Dear PACE Researchers,

Due to a cooling issue in the Coda datacenter, we were asked to power off as many nodes as possible to control temperature in the research hall. At this time, Databank has recovered the cooling loop, and temperatures have stabilized. However, all PACE job schedulers will remain paused to help expedite the return to normal operating temperatures in the datacenter.

These events should have had no impact on running jobs, so no action is required at this time. We expect normal operation to resume in the morning. As always, if you have any questions, please contact us at pace-support@oit.gatech.edu.

Best,
The PACE Team

[Postponed] Phoenix Project & Scratch Storage Cable Replacement

[Update 1/26/22 6:00 PM]

Due to complications associated with a similar repair on the Hive cluster this morning, we have decided to postpone replacement of the storage cable on the Phoenix cluster. This repair to the Phoenix Lustre project & scratch storage will now occur during our upcoming maintenance period.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Original Post 1/25/22 9:30 AM]

Summary: Cable replacement on Phoenix project & scratch storage, with a potential outage and temporarily decreased performance afterward

What’s happening and what are we doing: A cable connecting one enclosure of the Phoenix Lustre device, hosting project and scratch storage, to one of its controllers needs to be replaced, beginning around 12:00 noon Wednesday (January 26). After the replacement, pools will need to rebuild over the course of about a day.

How does this impact me: Since there is a redundant controller, there should not be an outage during the cable replacement. However, a similar previous replacement caused storage to become unavailable, so an outage remains possible. If this happens, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.
In addition, performance will be slower than usual for a day following the repair as pools rebuild. Jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again.

What we will continue to do: PACE will monitor Phoenix Lustre storage throughout this procedure. If a loss of availability occurs, we will update you.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Resolved] Hive Project & Scratch Storage Cable Replacement

[Update 1/26/22 5:45 PM]

The PACE team, working with our support vendor, has restored the Hive GPFS project & scratch storage system, and the scheduler is again starting jobs.

We have followed up directly with all individuals with potentially impacted jobs from this morning. Please resubmit any jobs that failed.

Please accept our sincere apology for any inconvenience that this outage may have caused you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Update 1/26/22 10:40 AM]

The Hive GPFS storage system is down at this time, so Hive project (data) and scratch storage are unavailable. The PACE team is currently working to restore access. In order to avoid further disruption, we have paused the Hive scheduler, so no additional jobs will start. Jobs that were already running may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.

We will update you when the system is restored.

[Original Post 1/24/22 1:25 PM]

Summary: Cable replacement on Hive project & scratch storage, with a potential outage and temporarily decreased performance afterward

What’s happening and what are we doing: A cable connecting one enclosure of the Hive GPFS device, hosting project (data) and scratch storage, to one of its controllers needs to be replaced, beginning around 10:00 AM Wednesday (January 26). After the replacement, pools will need to rebuild over the course of about a day.

How does this impact me: Since there is a redundant controller, there should not be an outage during the cable replacement. However, a similar previous replacement caused storage to become unavailable, so an outage remains possible. If this happens, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.
In addition, performance will be slower than usual for a day following the repair as pools rebuild. Jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again.

What we will continue to do: PACE will monitor Hive GPFS storage throughout this procedure. If a loss of availability occurs, we will update you.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Operating System Upgrade to RHEL7.9

[Update 1/10/22 3:45 PM]

Testflight environments are now available for you to prepare for the upgrade of PACE’s Phoenix, Hive, and Firebird clusters to the Red Hat Enterprise Linux (RHEL) 7.9 operating system from RHEL 7.6 during the February 9-11 maintenance period. The required upgrade will improve the security of our clusters to comply with GT Cybersecurity policies. 

All PACE researchers are strongly encouraged to test all workflows they regularly run on PACE. Please conduct your testing at your earliest convenience to avoid delays to your research. An OpenFabrics Enterprise Distribution (OFED) upgrade requires rebuilding our MPI software, including updates and modifications to our scientific software repository. PACE is providing updated modules for all of our Message Passing Interface (MPI) options.

For details of what to test and how to access our Testflight-Coda (Phoenix) and Testflight-Hive environments, please visit our RHEL7.9 upgrade documentation.  

Please let us know if you encounter any issues with the upgraded environment. Our weekly PACE Consulting Sessions are a great opportunity to work with PACE’s facilitation team on your testing and upgrade preparation. Visit the schedule of upcoming sessions to find the next opportunity.  

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Original Post 12/7/21 3:30 PM]

Summary: Operating System Upgrade to RHEL7.9

What’s happening and what are we doing: PACE will upgrade our Phoenix, Hive, and Firebird clusters to the Red Hat Enterprise Linux (RHEL) 7.9 operating system from RHEL 7.6 during the February 9-11 maintenance period. The upgrade timing of the ICE clusters will be announced later. The required upgrade will improve the security of our clusters to comply with GT Cybersecurity policies and will also update our software repository.

PACE will provide researchers with access to a “testflight” environment in advance of the upgrade, allowing you the opportunity to ensure your software works in the new environment. More details will follow at a later time, including how to access the testing environment for each research cluster.

How does this impact me:

  • An OpenFabrics Enterprise Distribution (OFED) upgrade requires rebuilding our MPI software. PACE is providing updated modules for all of our Message Passing Interface (MPI) options and testing their compatibility with all software PACE installs in our scientific software repository.
  • Researchers who built their own software may need to rebuild it in the new environment and are encouraged to use the testflight environment to do so. Researchers who have contributed to PACE Community applications (Tier 3) should test their software and upgrade it if necessary to ensure continued functionality.
  • Researchers who have installed their own MPI code independently of PACE’s MPI installations will need to rebuild it in the new environment (see the sketch after this list).
  • Due to the pending upgrade, software installation requests may be delayed in the coming months. Researchers are encouraged to submit a software request and discuss their specific needs with our software team’s research scientists. As our software team focuses on preparing the new environment and ensuring that existing software is compatible, requests for new software may take longer than usual to fulfill.
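
For researchers who compile their own MPI applications, a rebuild in the testflight environment might look roughly like the sketch below. The compiler and MPI module names are placeholders for whichever rebuilt modules PACE provides under RHEL 7.9; check module avail or our documentation for the actual versions.

    # On a testflight node; module names below are illustrative placeholders.
    module purge
    module avail                     # list the rebuilt compiler and MPI modules
    module load gcc/10.3.0           # example compiler module
    module load mvapich2/2.3.6       # example MPI module rebuilt against the new OFED

    # Recompile your application against the rebuilt MPI stack.
    cd ~/my-mpi-app                  # placeholder path to your source tree
    make clean && make

    # Finally, submit a short test job in the testflight environment to confirm the binary runs.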

What we will continue to do: PACE will ensure that our scientific software repository is compatible with the new environment and will provide researchers with a testflight environment in advance of the migration, where you will be able to test the upgraded software or rebuild your own software. We will provide additional details as they become available.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Improvements to job accounting and queue wait times on PACE clusters

We would like to share two updates with you regarding improvements to job accounting and queue wait times on the Phoenix and Firebird clusters.
  • Due to an error, some users have seen the wrong account names listed in our pace-quota and pace-whoami utilities in recent months. We have corrected this, and all users can now use pace-quota to see the charge accounts and balances available to them on Phoenix or Firebird. We have also improved the utility so that balances are now visible for all accounts, including multi-PI or school-owned accounts that previously displayed a zero balance, so researchers can always check their available balances. Read our documentation for more details about the charge accounts available to you and what they mean. The pace-quota command is available on Phoenix, Hive, Firebird, and ICE and provides user-specific details:
    • your storage usage on that cluster
    • your charge account information for that cluster (Phoenix and Firebird only)
  • Additionally, to improve utilization of our clusters and reduce wait times, we have enabled spillover between node classes, which allows waiting jobs to run on underutilized, more capable nodes than those requested, with no user action required and at no additional charge. Spillover was enabled on GPU nodes in September, and CPU nodes gained the capability last week, on both Phoenix and Firebird.
Please note that targeting a specific, more expensive node class to reduce wait time is no longer effective or necessary. Simply request the resources your job requires; the job will continue to be charged at the rate for the resources it requests, even if it is assigned to run on more expensive hardware.
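
As a rough illustration, a PBS-style resource request that simply states what the job needs might look like the following sketch. The queue name, charge account, and resource amounts are placeholders; your job is charged for the resources requested here, even if spillover places it on more capable hardware.

    #PBS -N my_job                   # placeholder job name
    #PBS -q inferno                  # example queue name; use the queue for your cluster
    #PBS -A GT-myaccount             # example charge account; pace-quota lists yours
    #PBS -l nodes=1:ppn=8            # request only the cores the job actually needs
    #PBS -l mem=32gb                 # and only the memory it needs
    #PBS -l walltime=02:00:00        # a realistic wall time helps the scheduler place the job

    cd $PBS_O_WORKDIR
    ./my_program                     # placeholder executable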
As always, please contact us if you have any questions.