Posts

[RESOLVED] Phoenix Scratch Outage

Starting around 4 PM Sunday, the Phoenix scratch filesystem became non-responsive, causing issues with access to files and directories stored in ~/scratch. Functionality was restored Monday morning, and at this time, all systems are performing as expected. If you were running jobs that used scratch storage during this outage, they may have been negatively impacted; please reach out to pace-support@oit.gatech.edu with the job IDs of any such jobs.

[Completed – PACE Clusters Ready for Research] PACE Maintenance Period: August 11-13, 2021

[Update – 08/13/2021 – 10:00AM]

Dear PACE Users,

Our scheduled maintenance has completed ahead of schedule! All Coda datacenter clusters are ready for research. As usual, we have released all user jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.

Our next maintenance period is tentatively scheduled to begin at 6:00AM on Wednesday, 11/03/2021, and it is tentatively scheduled to conclude by 11:59PM on Friday, 11/05/2021.

Here is an update on the tasks performed during this maintenance period.

ITEMS REQUIRING USER ACTION:

  • None.

ITEMS NOT REQUIRING USER ACTION:

  • [Complete] [Datacenter] Databank replaced components of one of the transformers feeding the room, which required a complete power-off of the research hall that includes the PACE-managed clusters.
  • [Complete] [Storage] Upgrade controller for the storage appliances: SFA200NV, SFA18KE
  • [Complete] [Storage] Replace a miniSAS cable on Hive storage appliance: SFA14KXE
  • [Complete] [Storage] Replace a failed hard drive on a pre-production OSG cluster
  • [Complete] [System/Security] Operating system patch installs
  • [Complete] [System/Security] Endpoint Protection Updates
  • [Complete] [Benchmarks] Conduct IO500 and HPCG benchmarks for Hive and Phoenix clusters
  • [Complete] [System] Update NVIDIA drivers and add NVIDIA-specific libraries
  • [Complete] [System] Reboot scheduler nodes

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

 

[Original Message – 07/13/2021 – 4:15PM, updated on August 4, 2021 with the list of tasks]

Dear PACE Users,

This is another friendly reminder that our next Maintenance Period is scheduled to begin at 6:00AM on Wednesday, 08/11/2021, and is tentatively scheduled to conclude by 11:59PM on Friday, 08/13/2021.  Please note that, as usual, jobs with resource requests that would be running during the Maintenance Period will be held by the scheduler until after the Maintenance Period. During the Maintenance Period, all PACE-managed computational and storage resources will be unavailable.
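As a rough illustration of the hold rule described above, the sketch below checks whether a job's requested walltime would overlap the maintenance window. The function name and the overlap test are illustrative assumptions, not the scheduler's actual logic; only the window dates come from this announcement.

```python
from datetime import datetime, timedelta

# Window taken from the announcement above.
MAINT_START = datetime(2021, 8, 11, 6, 0)
MAINT_END = datetime(2021, 8, 13, 23, 59)


def overlaps_maintenance(start, walltime, maint_start=MAINT_START, maint_end=MAINT_END):
    """Return True if a job starting at `start` with the requested `walltime`
    could still be running at some point inside the maintenance window.

    Hypothetical helper: the real scheduler's hold criteria may differ.
    """
    end = start + walltime
    return start < maint_end and end > maint_start


if __name__ == "__main__":
    start = datetime(2021, 8, 10, 8, 0)
    for hours in (12, 24):
        held = overlaps_maintenance(start, timedelta(hours=hours))
        print(f"{hours}h request starting {start}: {'held' if held else 'can run'}")
```

Under these assumptions, a 12-hour request starting at 8:00AM on 08/10/2021 would finish before the window opens, while a 24-hour request starting at the same time would run past 6:00AM on 08/11/2021 and so would be held.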

Please see the list of activities to be completed:

ITEMS REQUIRING USER ACTION:

  • Currently, none.

ITEMS NOT REQUIRING USER ACTION:

  • [Datacenter] Databank will replace components of one of the transformers feeding the room, which will require a complete power-off of the research hall that includes the PACE-managed clusters.
  • [Storage] Upgrade controller for the storage appliances: SFA200NV, SFA18KE
  • [Storage] Replace a miniSAS cable on Hive storage appliance: SFA14KXE
  • [Storage] Replace a failed hard drive on a pre-production OSG cluster
  • [System/Security] Operating system patch installs
  • [System/Security] Endpoint Protection Updates
  • [Benchmarks] Conduct IO500 and HPCG benchmarks for Hive and Phoenix clusters
  • [System] Update NVIDIA drivers and add NVIDIA-specific libraries
  • [System] Reboot scheduler nodes

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

[Resolved] OIT’s Data Warehouse Service Outage

[Update – July 13, 2021] 

OIT restored operation of the Data Warehouse service on July 12 at 11:22AM.  Shortly after, PACE restored functionality to our database and our administrative services.   OIT has continued to monitor the Data Warehouse service.  At this time, all PACE user-facing utilities such as pace-check-queue, pace-quota, and pace-whoami are operational.

Please accept our sincere apology for any inconvenience that this temporary limitation may have caused you.  If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu

[Original Message – July 12, 2021]

Dear PACE Users,

We are reaching out to inform you that on Saturday at about 10:00am, there was an outage of OIT’s Enterprise Data Warehouse service, which hosts the PACE database instance; that instance subsequently went down at 11:07am.  The impact to PACE from this service outage is mainly limited to the administrative side, with some impact to user-facing utilities such as pace-check-queue; however, there is no impact to users’ jobs or their ability to submit jobs.

What’s happening and what we are doing:  Currently, OIT is investigating the outage impacting the Data Warehouse service that occurred on Saturday, and this outage is tracked at OIT’s status page.   PACE is monitoring this development closely.

How does this impact me:  This Data Warehouse service outage leaves user-facing utilities such as pace-check-queue, pace-quota, and pace-whoami partially or fully nonfunctional.   In addition, until the Data Warehouse service is restored, PACE will be unable to process new user and PI account requests.

What we will continue to do:  The PACE team will continue to monitor this development, and we will report as needed.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you.  If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu

Best,

The PACE Team

OIT NetApp upgrade

A low-risk upgrade is planned for Georgia Tech OIT’s NetApp storage appliances, beginning Saturday, July 10, at 6:00 AM. We do not expect any impact on PACE systems from this upgrade.

OIT’s NetApp appliance is in use on PACE’s Phoenix, PACE-ICE, and COC-ICE clusters. It hosts home directories as well as pace-apps, our software module repository. Should there be an unexpected disruption, users may face issues with logins, access to home directories, and loading or using PACE-supported software modules. We will provide updates in the unlikely event of a disruption this weekend.

Please contact us at pace-support@oit.gatech.edu with any questions.

pace-support.sh is disabled on PACE Clusters — please email pace-support directly for inquiries

Dear PACE Users,

It has come to our attention that we are not receiving support requests generated by the pace-support.sh script, which allows submission of support tickets directly from PACE clusters. Our investigation is ongoing.

At this time, please email us at pace-support@oit.gatech.edu from a non-PACE system for all support requests, to ensure that we receive your message.

From our initial investigation, it appears that this outage began at some point in May. We apologize for any lost messages since then. If you have been trying to reach us via the pace-support script, please email us instead. You should receive an automated acknowledgement email from Service Desk when your request is successfully processed.

Please contact us at pace-support@oit.gatech.edu with questions.

The PACE Team

[Urgent] Hive Cluster Storage Controller Cable Replacement – Performance Impact

[Update – 06/25 11:40PM]

The storage controller cable on the Hive cluster was replaced this evening and brought back online.  Unfortunately, after the repairs, GPFS storage mounts became unavailable, which interrupted users’ running jobs this evening.   We paused the scheduler briefly while we restarted the GPFS services across the cluster.  The storage mounts were restored, and the scheduler has been resumed.

Users’ jobs that were running or queued between about 7:00pm and 10:30pm today (6/25/2021) may have been interrupted; we recommend that users check on their jobs and resubmit them as needed.  Please accept our sincerest apology for this inconvenience.

We will continue to monitor the services and update as needed.  If you have any questions, please contact us at pace-support@oit.gatech.edu.

[Original Message – 06/25 5:12PM]

Dear Hive Users,

We are reaching out to inform you that one of our storage controllers for the Hive cluster has a bad cable that needs to be replaced to ensure optimal performance and data integrity.   We have the cable on hand and are in the process of replacing it this evening, Friday 06/25/2021.  This work will briefly impact storage performance, which users may experience as storage slowness, because we are routing all traffic to a secondary controller during this operation.

What’s happening and what we are doing:  More specifically, PACE has observed a high failure rate of the disks in one of the enclosures for the storage controller with the bad cable.  As a precaution, we will shut down that controller to unfail the disks and to ensure data integrity of the system.  We will replace the cable this evening, during which time the controller will be shut down.  During this work, all storage traffic will be routed to a secondary controller that is fully operational.   Given the anticipated load on the secondary controller, we expect users to experience performance degradation.

How does this impact me:  With only one storage controller in operation, users may experience storage slowness.  In the highly unlikely event of a failure, this could cause storage downtime, which would impact all users’ running jobs; however, we do not anticipate any storage outage during this operation.

What we will continue to do:  The PACE team will work on the cable replacement, restore the storage to optimal operation, and update the community as needed.

Please accept our sincere apology for any inconvenience that this may cause you.  If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Best,

The PACE Team

Phoenix scratch storage update

We would like to remind you about the scratch storage policy on Phoenix. Scratch is designed for temporary storage and is never backed up. Each week, files not modified for more than 60 days are automatically deleted from your scratch directory. As part of Phoenix’s start-up, regular cleanup of scratch has now been implemented. Each week, users with files set to be deleted receive a warning email about the files to be deleted in the coming week, with additional information included. Those of you who used PACE prior to the migration to Phoenix or who use Hive are already familiar with this workflow.

Some of you will receive such an email this week. The first deletion of old scratch files in Phoenix will occur on July 7, covering files noted in these messages. We are extending the time beyond the normal one-week notification for this first round to give you time to adjust to this weekly process again.
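If you would like to check ahead of the warning email which files might fall under the 60-day rule, a minimal sketch along the following lines could help. It assumes that "not modified" corresponds to a file's modification time (mtime) and that your scratch directory is reachable at ~/scratch; the function name stale_files is hypothetical, and the actual cleanup process may apply different criteria.

```python
import os
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def stale_files(root, max_age_days=60, now=None):
    """Yield (path, age_in_days) for files under `root` whose modification
    time is more than `max_age_days` in the past.

    Assumption: the policy is approximated by mtime only.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file disappeared or is unreadable; skip it
            if mtime < cutoff:
                yield path, (now - mtime) / SECONDS_PER_DAY


if __name__ == "__main__":
    scratch = Path.home() / "scratch"  # assumed location of scratch
    for path, age in stale_files(scratch):
        print(f"{age:6.0f} days  {path}")
```

The weekly warning email remains the authoritative list of files scheduled for deletion; this sketch only gives an early, approximate view.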

Phoenix project storage is the intended location for your important research data. You can find out more about Phoenix storage at http://docs.pace.gatech.edu/phoenix_cluster/storage_phnx/.

Please contact us at pace-support@oit.gatech.edu with any questions about how to manage your data stored on Phoenix.

[Resolved] Hive scheduler outage

The Hive scheduler experienced an outage this afternoon, as the resource and workload managers were unable to communicate. Our team traced the problem to a missing library file and corrected it, restoring functionality at approximately 5 PM today.
Jobs submitted this afternoon would not have been able to start until the repair was implemented. Already-running jobs should not have been affected.
Please contact us at pace-support@oit.gatech.edu with any questions.

[Resolved] Phoenix scratch outage

[Update 6/12/21 6:30 PM]

Phoenix Lustre scratch has been restored. We paused the scheduler at 4:40 PM to prevent additional jobs from starting and resumed scheduling at 6:20 PM. As noted, please contact us with the job number for any job that began prior to 4:40 PM and was affected by the scratch outage, in order to receive a refund.

[Original post, 6/12/21 4:30 PM]

We are experiencing an outage on Phoenix’s Lustre scratch storage. Our team is currently investigating and has confirmed that this issue is related to the scratch mount and does not affect home or project storage. Users may be unable to list, read, or write files in their scratch directories.
If your running job has failed or runs without producing output as a result of this outage, please contact us at pace-support@oit.gatech.edu with the affected job number(s), and we will refund the value of the job(s) to your charge account. Please refrain from submitting additional jobs utilizing your networked Lustre scratch directory until the service is repaired, in order to avoid increasing the number of failed jobs.

[Completed – PACE Clusters Ready for Research] PACE Maintenance Period 5/19/2021-5/21/2021

[Update – 05/20/2021, 2:10PM]

Dear PACE Users,

Our scheduled maintenance has completed one day ahead of schedule! All Coda datacenter clusters are ready for research. As usual, we have released all user jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.

Our next maintenance period is tentatively scheduled to begin at 6:00AM on Wednesday, 08/11/2021, and it is tentatively scheduled to conclude by 11:59PM on Friday, 08/13/2021.

Here is an update on the tasks performed during this maintenance period, which includes an additional task that was added to our list as the maintenance progressed:

New task added during maintenance period:

  • [COMPLETE] [Datacenter/Network] Departmental Firewall firmware upgrade:  This task is part of a scheduled OIT maintenance for Friday, 05/21/2021 (8:00pm – 2:00am on 5/22). PACE was able to decouple it from the OIT maintenance period and include it in our current maintenance, which allows us to avoid any further interruptions to the research community after the PACE maintenance period completes.

Items Not Requiring User Action:

  • [COMPLETE] [Network] Replace InfiniBand cables on login-hive1.
  • [COMPLETE] [Network] Upgrade of InfiniBand firmware on PACE admin nodes.
  • [COMPLETE] [Network] Switch all affected CloudBolt Virtual Machines to DHCP.
  • [COMPLETE] [Network] Upgrade DDN SFA14KXE GridScaler and SFA Firmware.
  • [CANCELLED] [Network] Update KVM/qemu hosts in CUI clusters. UPDATE: Cancelled as it was deemed unnecessary due to firmware being up-to-date for our RHEL version.
  • [COMPLETE] [Archive] Removal of InfiniteIO from pace-archive.
  • [COMPLETE] [System] Remove /opt/pace directories everywhere.
  • [COMPLETE] [Firewall] PACE departmental firewall will be updated.
  • [COMPLETE] [Scheduler] Upgrade Torque on the Phoenix, Hive, Firebird, PACE-ICE, and COC-ICE schedulers. In order to improve user experience and prevent a recurrence of the recent Phoenix scheduler outage (May 13), we will deploy this recent patch from the vendor. The patch resolves a race condition the scheduler can face with array jobs and jobs with dependencies, which should provide PACE users with a more reliable experience.
  • [COMPLETE] [Hive] Add InfiniBand Topo Map for the Hive Scheduler.
  • [COMPLETE] [Datacenter] Georgia Power Microgrid Testing

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Happy Computing!
The PACE Team

 

[Update – 05/18/2021, 5:18pm]

We are writing to notify you of our next Maintenance Period, which will begin at 6:00 AM on Wednesday, 5/19/2021, and is scheduled to conclude by 11:59 PM on Friday, 5/21/2021. As the systems will be powered off during the period, jobs with resource requests that would be running during the Maintenance Period will be held by the scheduler until after the Maintenance Period. During the Maintenance Period, the computational and storage resources will be unavailable.

For your reference, the following tasks are scheduled:

Items Requiring User Action:

  • None currently scheduled.

Items Not Requiring User Action:

  • [Network] Replace InfiniBand cables on login-hive1.
  • [Network] Upgrade of InfiniBand firmware on PACE admin nodes.
  • [Network] Switch all affected CloudBolt Virtual Machines to DHCP.
  • [Network] Upgrade DDN SFA14KXE GridScaler and SFA Firmware.
  • [Network] Update KVM/qemu hosts in CUI clusters.
  • [Archive] Removal of InfiniteIO from pace-archive.
  • [System] Remove /opt/pace directories everywhere.
  • [Firewall] PACE departmental firewall will be updated.
  • [Scheduler] Upgrade Torque on the Phoenix, Hive, Firebird, PACE-ICE, and COC-ICE schedulers. In order to improve user experience and prevent a recurrence of the recent Phoenix scheduler outage (May 13), we will deploy this recent patch from the vendor. The patch resolves a race condition the scheduler can face with array jobs and jobs with dependencies, which should provide PACE users with a more reliable experience.
  • [Hive] Add InfiniBand Topo Map for the Hive Scheduler.
  • [Datacenter] Georgia Power Microgrid Testing

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.