[Urgent] Hive Cluster Storage Controller Cable Replacement – Performance Impact

[Update – 06/25 11:40PM]

The storage controller cable on the Hive cluster was replaced this evening, and the controller was brought back online. Unfortunately, after the repair, GPFS storage mounts became unavailable, which interrupted users’ running jobs this evening. We paused the scheduler briefly while we restarted the GPFS services across the cluster. The storage mounts have been restored, and the scheduler has resumed.

Users’ jobs that were running or queued between approximately 7:00pm and 10:30pm today (6/25/2021) may have been interrupted, and we recommend that users check on their jobs and resubmit them as needed; a short sketch of the relevant commands follows below. Please accept our sincerest apologies for this inconvenience.
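
As a minimal sketch of what checking and resubmitting might look like from a Hive login node, assuming the Torque commands referenced elsewhere in these updates (the job ID and the script name my_job.pbs are placeholders):

    # List your jobs known to the Torque scheduler; jobs killed during the
    # interruption will either be gone from this list or show an error state.
    qstat -u $USER

    # Inspect one job in detail (replace 123456 with your actual job ID).
    qstat -f 123456

    # Resubmit an interrupted job from its original submission script
    # (my_job.pbs is a placeholder for your own script).
    qsub my_job.pbs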

We will continue to monitor the services and provide updates as needed. If you have any questions, please contact us at pace-support@oit.gatech.edu.

[Original Message – 06/25 5:12PM]

Dear Hive Users,

We are reaching out to inform you that one of the storage controllers for the Hive cluster has a bad cable that needs to be replaced to ensure optimal performance and data integrity. We have the cable on hand and are in the process of replacing it this evening, Friday 06/25/2021. This work will briefly impact storage performance, which users may experience as storage slowness, because all traffic will be routed to a secondary controller during this operation.

What’s happening and what we are doing: PACE has observed a high failure rate among the disks in one of the enclosures attached to the storage controller with the bad cable. As a precaution, we will shut down that controller to unfail the disks and ensure the data integrity of the system. We will replace the cable this evening, during which time the controller will remain shut down. During this work, all storage traffic will be routed to a secondary controller that is fully operational. Given the additional load on the secondary controller, we anticipate that users will experience some performance degradation.

How does this impact me: With only one storage controller in operation, users may experience storage slowness. In the highly unlikely event of a further failure, this could cause storage downtime that would impact all users’ running jobs; however, we do not anticipate any storage outage during this operation.

What we will continue to do: The PACE team will complete the cable replacement, restore the storage to optimal operation, and update the community as needed.

Please accept our sincere apologies for any inconvenience this may cause. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Best,

The PACE Team

[Completed – PACE Clusters Ready for Research] PACE Maintenance Period 5/19/2021-5/21/2021

[Update – 05/20/2021, 2:10PM]

Dear PACE Users,

Our scheduled maintenance has been completed one day ahead of schedule! All Coda datacenter clusters are ready for research. As usual, we have released all user jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.

Our next maintenance period is tentatively scheduled to begin at 6:00AM on Wednesday, 08/11/2021, and to conclude by 11:59PM on Friday, 08/13/2021.

Here is an update on the tasks performed during this maintenance period, which includes an additional task that was added to our list as the maintenance progressed:

New task added during maintenance period:

  • [COMPLETE] [Datacenter/Network] Departmental Firewall firmware upgrade: This task was originally part of a scheduled OIT maintenance for Friday, 05/21/2021 (8:00pm – 2:00am on 5/22). PACE was able to decouple it from the OIT maintenance window and include it in our current maintenance period, which allows us to avoid further interruptions to the research community after the PACE maintenance period completes.

Items Not Requiring User Action:

  • [COMPLETE] [Network] Replace InfiniBand cables on login-hive1.
  • [COMPLETE] [Network] Upgrade of InfiniBand firmware on PACE admin nodes.
  • [COMPLETE] [Network] Switch all affected CloudBolt Virtual Machines to DHCP.
  • [COMPLETE] [Network] Upgrade DDN SFA14KXE GridScaler and SFA Firmware.
  • [CANCELLED] [Network] Update KVM/qemu hosts in CUI clusters. UPDATE: Cancelled, as it was deemed unnecessary because the firmware is already up to date for our RHEL version.
  • [COMPLETE] [Archive] Removal of InfiniteIO from pace-archive.
  • [COMPLETE] [System] Remove /opt/pace directories everywhere.
  • [COMPLETE] [Firewall] PACE departmental firewall will be updated.
  • [COMPLETE] [Scheduler] Upgrade Torque on the Phoenix, Hive, Firebird, PACE-ICE, and COC-ICE schedulers. In order to improve user experience and prevent a recurrence of the recent Phoenix scheduler outage (May 13), we have deployed this recent patch from the vendor. The patch resolves a race condition the scheduler can face with array jobs and jobs with dependencies (a sketch of such a submission pattern follows this list), which should provide PACE users with a more reliable experience.
  • [COMPLETE] [Hive] Add InfiniBand Topo Map for the Hive Scheduler.
  • [COMPLETE] [Datacenter] Georgia Power Microgrid Testing
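
For context on the submission pattern mentioned in the Torque item above, here is an illustrative sketch only, using standard Torque options for array jobs and job dependencies (script names and the job ID are placeholders):

    # Submit an array job of 10 tasks (Torque array syntax);
    # preprocess.pbs is a placeholder script name.
    qsub -t 1-10 preprocess.pbs

    # Submit a follow-up job that depends on a previously submitted job,
    # running only after job 123456 completes successfully.
    qsub -W depend=afterok:123456 postprocess.pbs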

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Happy Computing!
The PACE Team

[Update – 05/18/2021, 5:18pm]

We are writing to notify you of our next Maintenance Period, which will begin at 6:00 AM on Wednesday, 5/19/2021, and is scheduled to conclude by 11:59 PM on Friday, 5/21/2021. Because the systems will be powered off during this period, jobs whose resource requests (e.g., walltime) would extend into the Maintenance Period will be held by the scheduler until the Maintenance Period ends. During the Maintenance Period, access to computational and storage resources will be unavailable.

For your reference, the following tasks are scheduled:

Items Requiring User Action:

  • None currently scheduled.

Items Not Requiring User Action:

  • [Network] Replace InfiniBand cables on login-hive1.
  • [Network] Upgrade of InfiniBand firmware on PACE admin nodes.
  • [Network] Switch all affected CloudBolt Virtual Machines to DHCP.
  • [Network] Upgrade DDN SFA14KXE GridScaler and SFA Firmware.
  • [Network] Update KVM/qemu hosts in CUI clusters.
  • [Archive] Removal of InfiniteIO from pace-archive.
  • [System] Remove /opt/pace directories everywhere.
  • [Firewall] PACE departmental firewall will be updated.
  • [Scheduler] Upgrade Torque on the Phoenix, Hive, Firebird, PACE-ICE, and COC-ICE schedulers. In order to improve user experience and prevent a recurrence of the recent Phoenix scheduler outage (May 13), we will deploy this recent patch from the vendor. The patch resolves a race condition the scheduler can face with array jobs and jobs with dependencies, which should provide PACE users with a more reliable experience.
  • [Hive] Add InfiniBand Topo Map for the Hive Scheduler.
  • [Datacenter] Georgia Power Microgrid Testing

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[RESOLVED] Phoenix Scheduler is Down

Update (5/13 2:00pm): We are happy to report that the Phoenix Scheduler is now online and accepting jobs.

We are sorry for the inconvenience this has caused. Please let us know if you continue to observe any problems (pace-support@oit.gatech.edu).
—-
At around 10:30am this morning, we restarted the Phoenix scheduler to apply a new license file. The scheduler is having trouble coming back online, and we are actively troubleshooting the issue. So far, we know the issue is unrelated to the license; rather, some leftover job files may be causing it. We are working to revive the scheduler as soon as possible.

This issue doesn’t impact any running jobs, or those submitted before the incident. Only new job submissions will fail with an error.

We’ll update this post (http://blog.pace.gatech.edu/?p=7075) and send a follow up message once the issue is resolved.

Thank you for your patience and sorry for this inconvenience.

OIT Scheduled Service for MATLAB – 05/07/2021, 10:00AM – noon

OIT will perform work on Georgia Tech’s MATLAB license server tomorrow morning, 05/07/2021, 10:00 AM – noon, which will impact any MATLAB jobs running on PACE at the time of the outage (as well as elsewhere on campus).

During the outage window, attempts to open new MATLAB instances in batch or interactive jobs will fail. In addition, we expect running MATLAB instances to stop working, although the surrounding jobs will continue to run.

PACE aims to identify affected jobs tomorrow morning and follow up with the impacted users.

We recommend that you avoid submitting additional MATLAB jobs to PACE that will not finish before 10 AM on Friday (May 7) and instead submit them after the work is complete; one way to manage already-queued jobs is sketched below.
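
As a minimal sketch of one way to keep already-queued MATLAB jobs from starting during the outage window, assuming the Torque commands in use on PACE (the job ID is a placeholder, and holding/releasing jobs is simply one option, not a required procedure):

    # List your queued jobs to spot MATLAB submissions that might start
    # during the outage window.
    qstat -u $USER

    # Place a user hold on a queued job so it does not start during the
    # window (replace 123456 with the actual job ID).
    qhold 123456

    # After the license work is complete (around noon), release the hold
    # so the job becomes eligible to run again.
    qrls 123456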

OIT will be providing up-to-date progress on Georgia Tech’s Status page, http://status.gatech.edu. 

If you have any questions, please contact us at pace-support@oit.gatech.edu.

PACE Advisory Committee Assembled

Dear PACE research community, 

We are pleased to announce that the faculty-led PACE advisory committee was formed and assembled on March 30, 2021. The PACE Advisory Committee is a joint effort between the EVPR and OIT to ensure that shared research computing services are both meeting faculty needs and resourced in a sustainable way.   The committee consists of a representative group of PACE and faculty members, encompassing a wide range of experience and expertise on advanced computational and data capabilities provided by OIT’s research cyberinfrastructure.  An important goal of the committee is to provide essential feedback which will help continuously improve this critical service. The committee will meet regularly and: 

  1. Function as a communication channel between the broader research computing community and PACE.
  2. Serve as a sounding board for major changes to the PACE infrastructure.
  3. Maintain an Institute-level view of the shared resource.
  4. Help craft strategies that balance the value and benefits provided by the resources with a sustainable cost structure in the face of ever-increasing demand.

PACE Advisory Committee Members: 

  • Srinivas Aluru, IDEaS Director (ex-officio) 
  • Omar Asensio, Public Policy 
  • Dhruv Batra, Interactive Computing/ML@GT 
  • Mehmet Belgin, PACE 
  • Annalisa Bracco, Earth and Atmospheric Sciences 
  • Neil Bright, PACE 
  • Laura Cadonati, Physics 
  • Umit Catalyurek, Computational Science and Engineering 
  • Sudheer Chava, Scheller College of Business 
  • Yongtao Hu, Civil and Environmental Engineering 
  • Lew Lefton, EVPR/Math (ex-officio) 
  • Steven Liang, Mechanical Engineering/GTMI 
  • AJ Medford, Chemical and Biomolecular Engineering 
  • Joe Oefelein, Aerospace Engineering 
  • Annalise Paaby, Biological Sciences 
  • Tony Pan, IDEaS 
  • David Sherrill, Chemistry and Biochemistry 
  • Huan Tran, Materials Science and Engineering  

If you have any questions or comments, please direct them to the PACE Team <pace-support@oit.gatech.edu> and/or to Dr. Lew Lefton <lew.lefton@gatech.edu>.  

All the best, 

The PACE Team 

PACE Update: Compute and Storage Billing

Dear PACE research community,

During our extended grace period, nearly 1M user jobs from nearly 160 PI groups completed, consuming nearly 40M CPU hours on the Phoenix cluster. The average wait time in queue per job was less than 0.5 hours, confirming the effectiveness of the measures we put in place to ensure fair use of the Phoenix cluster while maintaining an exceptional quality of service.

With the billing for both storage and compute usage in effect as of April 1st, we are following up to provide an update on a few important points.

Compute billing started April 1: 

Throughout March, we sent communications to all PIs in accordance with PACE’s new cost model, including the amount of compute credits allocated based on the compute equipment refreshed as part of the migration to the Coda data center and/or equipment recently purchased in the FY20 Phase 1/2/3 purchase(s).

As part of our compute audit, PACE has identified and fixed some discrepancies in the information we initially communicated, including resources that were purchased but not provisioned on time. We apologize for this oversight and encourage users to run the pace-quota command to verify the updated list of charge accounts. We will follow up with the impacted PIs/users in a separate communication.

Please note that most school-owned accounts, as well as those jointly purchased by multiple faculty members, will show a zero balance, but you can still run jobs with them. We are working to make the balances in those accounts visible to you.

As of April 1, all jobs that run on the Phoenix and/or Firebird clusters will be charged to the provided charge account (e.g., GT-gburdell3, GT-gburdell3-CODA20), and a statement will be sent to PIs at the start of May.

This does NOT necessarily mean that you must immediately begin providing funding to use Phoenix. All faculty and their research groups have access to our free tier. Additionally, if you had access to paid resources in Rich, they have been refreshed with an equivalent prepaid account intended to last for 5 years. 

Project storage billing started on April 1: 

As announced, quotas on Phoenix project storage were applied on March 31 based on PI choices as part of our storage audit. Users may run the pace-quota command to check their research group’s utilization and quota at any time. For further information about Phoenix storage, please see our documentation. April is the first month in which storage quotas incur charges for PIs who have chosen quotas above the 1 TB funded by the Institute.

Showback statements sent to PIs: 

Throughout March, we sent out “showback” statements for the prior months’ usage on the Phoenix cluster, covering October 2020 through February 2021. We are in the process of sending the March 2021 showback statements, which will also include a storage report. Overall, these statements provided PIs with an opportunity to review their group’s usage and follow up with PACE as needed. Explanations for each of the metrics can be found in our documentation.

No charges were incurred for usage during the grace period, so the showback statements are solely for your information and to guide your usage plans going forward. 

User account audit completed: 

Users of ECE and Prometheus resources migrated in Nov 2020 did not have all of their charge accounts provisioned during their groups’ migration. Since then, we have provided the impacted users with access to these additional accounts. We apologize for any inconvenience this may have caused. Also, as part of our preparation to start billing for computation, the PACE team sent a notification to PIs on Feb 8 asking them to review their job submission accounts and corresponding user lists. We appreciate PIs’ input throughout this process, and if any changes have occurred in your group since then, or if you would like to add new user(s) to your account(s), please don’t hesitate to send a request to pace-support@oit.gatech.edu. Users may run the pace-whoami command to see a list of charge accounts they may use.
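
For quick reference, a minimal sketch of these self-service checks from a login node follows; both commands are run without arguments, and the exact output format may vary:

    # List the charge accounts your user may submit jobs against.
    pace-whoami

    # Check your charge account(s), credit balance, and your research
    # group's project-storage utilization and quota.
    pace-quota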

Additionally, we have created a blog page for the frequently asked questions we have received from our community after the end of the extended grace period on March 31, which we would like to share with you at this time.

If you have any questions, concerns or comments about the Phoenix cluster or the new cost model, please direct them to pace-support@oit.gatech.edu.

Thank you,

The PACE Team

FAQ after the end of the grace period on the Phoenix cluster

The following are frequently asked questions we have received from our user community after the end of the extended grace period on March 31, in accordance with the new cost model:

Q: Where can I find an updated NSF style facilities and equipment document? 

A:  Please see our page at https://pace.gatech.edu/sample-nsf-application-boilerplate-describing-pace-hpc  

Q: I had a cluster that I bought back in 2013. Can I still access it? 

A: No. We have decommissioned all clusters in the Rich datacenter as part of the Rich-to-Coda datacenter migration plan. As part of our earlier communications, PIs who owned a cluster in the Rich datacenter received a detailed summary of their charge account(s) for the Phoenix cluster, including the amount of compute credits allocated to their account based on the compute equipment that was refreshed. To see your list of available charge account(s) and their credit balance, please run pace-quota on the Phoenix cluster. 

Q: I do not have funds to pay for usage of the Phoenix cluster at this time. Can I get access to Phoenix at no cost? 

A: As part of this transition, PACE has taken the opportunity to provide all Institute faculty with computational and data resources at a modest level. All academic and research faculty (“PIs”) participating in PACE are automatically granted a certain level of resources in addition to any funding they may bring. Each PI is provided 1TB of project storage and 68 compute credits per month, equivalent to 10,000 CPU-hours on a 192GB compute node (a back-of-the-envelope example follows below). These credits may be used towards any computational resources (e.g., GPUs, high-memory nodes) that are available within the Phoenix cluster. In addition, all PACE users also have access to the preemptable backfill queue at no cost. 
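
To illustrate the conversion above, here is a rough, illustrative calculation only, based on the stated rate of 68 credits per month being roughly equivalent to 10,000 CPU-hours on a 192GB node; the core count and walltime below are assumptions chosen purely for the example:

    #!/bin/bash
    # Back-of-the-envelope credit estimate from this FAQ's stated rate:
    # 68 credits/month is roughly equivalent to 10,000 CPU-hours.
    CREDITS_PER_MONTH=68
    CPU_HOURS_PER_MONTH=10000
    CORES=24           # assumed cores per node, for illustration only
    WALLTIME_HOURS=24  # example job: one full node for 24 hours

    # CPU-hours consumed by the example job (24 * 24 = 576).
    JOB_CPU_HOURS=$(( CORES * WALLTIME_HOURS ))

    # Approximate credits consumed: 576 * 68 / 10000, about 3.9 credits.
    awk -v c="$CREDITS_PER_MONTH" -v t="$CPU_HOURS_PER_MONTH" -v j="$JOB_CPU_HOURS" \
        'BEGIN { printf "Example job uses %d CPU-hours, about %.2f credits\n", j, j * c / t }'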

Q: Do I need to immediately begin providing funding to use Phoenix beyond the free tier? 

A: Not necessarily. If you had access to paid resources in Rich, you now have access to a refreshed CODA20 account with an existing balance, as described to each faculty owner. The number of credits in that account is equivalent in computational power to 5 years of continuous use of your old cluster in the Rich datacenter. 

PACE Archive Storage Update and New Allocation Moratorium

Dear PACE Users,

We are reaching out to provide you with a status update on PACE’s Archive storage service and to inform you of the moratorium on new archive storage user creation and allocations that we are instituting effective immediately. This moratorium on new archive storage deployments reduces potential negative impacts on transfers and backups that could result from a large influx of new files.

What’s happening and what we are doing: The original PACE Archive storage is currently hosted on vendor hardware with limited support capacity, as the vendor has ceased operations. PACE has initiated a two-phase plan to transfer PACE Archive storage from the current hardware to a permanent storage solution. At this time, phase 1 is underway, and archive storage data is being replicated to a temporary storage solution. PACE aims to finish the archive system transfer and configuration for this phase by the May Maintenance Period (5/19/2021 – 5/21/2021). Phase 1 is a temporary solution while PACE explores a more cost-efficient permanent option; that permanent solution will require a second migration of the data, which will take place in phase 2 of the plan, and we will follow up with details accordingly.

How does this impact me: There is no service impact to current PACE archive storage users. With the moratorium in effect, new user/allocation requests for archive storage are delayed until after the maintenance period. New requests for archive storage may be processed starting 05/22/2021.

What we will continue to do: The PACE team will continue to monitor the transfer of the data to the NetApp storage and will report as needed.

Please accept our sincere apologies for any inconvenience that this temporary limitation may cause. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Best,

The PACE Team

[Resolved] Network Connectivity Issues

[Update – March 25, 2021 – 3:07pm] 

This is a follow-up to yesterday’s message about the campus network connectivity issues that impacted PACE. By 3:24pm yesterday, OIT’s network team had resolved the connectivity issues, and the status page linked earlier was quickly updated. Analysis of the incident, made available to us later, identified the cause as a network spanned into the Coda data center from the Rich building that experienced a spanning tree issue (a network loop). Under this specific failure scenario, the loop caused a cascade of problems with core network equipment, resulting in widespread connectivity issues across campus. OIT’s network team fixed the affected network, which resolved the remaining connectivity issues on campus, and will conduct further investigation to prevent future occurrences.

Since about 3:30pm yesterday, all PACE users should have been able to access PACE-managed resources without issues. There was no impact to running jobs unless they required external resources (outside of PACE). If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Original Message – March 24, 2021 2:48pm]

Dear PACE Users,

At around 2:30pm, OIT’s network team reported connectivity issues. This may impact users’ ability to connect to PACE-managed resources at Coda, such as Phoenix, Hive, Firebird, PACE-ICE, CoC-ICE, and Testflight-Coda. The source of the problem is currently being investigated, but at this time there is no impact to running jobs unless they require external resources (e.g., from the web). We will provide further information as it becomes available.

Please refer to the OIT’s status page for the developments on this issue: https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/605b8495e2838505358d3af3

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu

We apologize for this inconvenience.

Best,

The PACE Team

Update on the New Cost Model: April 1, 2021 is the Start Date for Compute and Storage Billing

Dear PACE research community,

Since the start of 2021, nearly 500k user jobs from nearly 150 PI groups have completed, accounting for nearly 16M CPU hours on the Phoenix cluster, while maintaining an exceptional quality of service: the average wait time in queue per user job was about 0.5 hours. The measures we implemented on December 14 (see blog post) to ensure fair use of the Phoenix cluster have been effective in enabling research groups to leverage the scalability of Phoenix and the new system while maintaining a high quality of service for the user community.

At this time, we want to share an update regarding the new cost model. We are moving the start date for compute and storage billing from March 1, 2021 to April 1, 2021. This means that users will not be charged for their usage of compute and storage resources until April 1, 2021. This grace-period extension allows us to achieve the following:

  • Gain input from the faculty-led PACE advisory committee that is being organized.
  • Align the start of compute and storage billing for all services (including CUI).
  • Provide additional time for the research community to adapt to the Phoenix cluster and the new cost model.
  • Provide an opportunity to send “showback” statements for the prior months during March, giving PIs time to review these past statements and follow up with PACE with any questions prior to the start of billing on April 1, 2021.

If you have any questions, concerns or comments, please direct them to pace-support@oit.gatech.edu.

Best,

The PACE Team