OIT NetApp upgrade

A low-risk upgrade is planned for Georgia Tech OIT’s NetApp storage appliances, beginning Saturday, July 10, at 6:00 AM. We do not expect any impact on PACE systems from this upgrade.

OIT’s NetApp appliance is in use on PACE’s Phoenix, PACE-ICE, and COC-ICE clusters. It hosts home directories as well as pace-apps, our software module repository. Should there be an unexpected disruption, users may face issues with logins, access to home directories, and loading or using PACE-supported software modules. We will provide updates in the unlikely event of a disruption this weekend.

Please contact us at pace-support@oit.gatech.edu with any questions.

Phoenix scratch storage update

We would like to remind you about the scratch storage policy on Phoenix. Scratch is designed for temporary storage and is never backed up. Each week, files not modified for more than 60 days are automatically deleted from your scratch directory. As part of Phoenix’s start-up, regular cleanup of scratch has now been implemented. Each week, users whose files are scheduled for deletion receive a warning email listing the files to be removed in the coming week, along with additional information. Those of you who used PACE prior to the migration to Phoenix or who use Hive are already familiar with this workflow.
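If you would like to see which of your files are at risk before a cleanup runs, a command along the following lines can list them. This is a minimal sketch that assumes your scratch directory is reachable at ~/scratch; adjust the path to match your own environment.

    # List files in scratch that have not been modified in more than 60 days
    # (~/scratch is an assumed location; substitute your actual scratch path)
    find ~/scratch -type f -mtime +60 -ls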

Some of you will receive such an email this week. The first deletion of old scratch files on Phoenix will occur on July 7, covering files noted in these messages. We are extending the time beyond the normal one-week notification for this first round to give you time to adjust to this weekly process again.

Phoenix project storage is the intended location for your important research data. You can find out more about Phoenix storage at http://docs.pace.gatech.edu/phoenix_cluster/storage_phnx/.
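If you have scratch data worth keeping, one way to move it into project storage is with rsync. The sketch below uses ~/scratch and a placeholder project-storage path (~/p-mypi-0); replace both with the actual locations for your group, which you can confirm in the storage documentation linked above.

    # Copy results worth keeping from scratch into project storage
    # (both paths below are placeholders; substitute your own)
    rsync -av ~/scratch/important_results/ ~/p-mypi-0/important_results/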

Please contact us at pace-support@oit.gatech.edu with any questions about how to manage your data stored on Phoenix.

[Resolved] Hive scheduler outage

The Hive scheduler experienced an outage this afternoon, as the resource and workload managers were unable to communicate. Our team traced the problem to a missing library file and corrected it, restoring functionality at approximately 5 PM today.
Jobs submitted this afternoon would not have been able to start until the repair was implemented. Already-running jobs should not have been affected.
Please contact us at pace-support@oit.gatech.edu with any questions.

[Resolved] Phoenix scratch outage

[Update 6/12/21 6:30 PM]

Phoenix Lustre scratch has been restored. We paused the scheduler at 4:40 PM to prevent additional jobs from starting and resumed scheduling at 6:20 PM. As noted, please contact us with the job number for any job that began prior to 4:40 PM and was affected by the scratch outage, in order to receive a refund.

[Original post, 6/12/21 4:30 PM]

We are experiencing an outage on Phoenix’s Lustre scratch storage. Our team is currently investigating and has confirmed that this issue is related to the scratch mount and does not affect home or project storage. Users may be unable to list, read, or write files in their scratch directories.
If your running job has failed or runs without producing output as a result of this outage, please contact us at pace-support@oit.gatech.edu with the affected job number(s), and we will refund the value of the job(s) to your charge account. Please refrain from submitting additional jobs that use your networked Lustre scratch directory until the service is repaired, in order to avoid increasing the number of failed jobs.
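If you need to identify or stop jobs that depend on scratch, commands along these lines may help. This is a sketch assuming the Torque/Moab (PBS) scheduler in use on Phoenix at the time, with 12345 standing in for an actual job ID from your own job list.

    # List your queued and running jobs to find the IDs of jobs affected by the outage
    qstat -u $USER

    # Cancel a job that relies on Lustre scratch until the outage is resolved
    # (12345 is a placeholder job ID)
    qdel 12345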

Parallel Computing with MATLAB and Scaling to HPC on PACE clusters at Georgia Tech

MathWorks and PACE are partnering to offer a two-part parallel computing virtual workshop, taught by a MathWorks engineer, to PACE users and other members of the Georgia Tech community.

During this self-paced, hands-on workshop, you will be introduced to parallel and GPU computing in MATLAB for speeding up your application and offloading computations. By working through common scenarios and workflows, you will gain an understanding of the parallel constructs in MATLAB, their capabilities, and some of the issues that may arise when using them. You will also learn how to take advantage of PACE resources, which are available to all researchers at Georgia Tech (including a free tier), to scale your MATLAB computations.

Register by noon on May 14 at https://gatech.co1.qualtrics.com/jfe/form/SV_cD7prAcGZRthKCO.

Highlights
·      Speeding up programs with parallel computing
·      Working with large data sets
·      GPU computing
·      Scaling to PACE clusters (Phoenix, Hive, ICE, or Firebird)

Agenda
This virtual workshop will be held in two parts:
Part I, Tuesday, May 18, 1-4 PM, will focus on speeding up MATLAB with Parallel Computing Toolbox.
Part II, Tuesday, May 25, 1-4 PM, will focus on running MATLAB parallel code on PACE clusters.

Who should attend?
PhD students, postdocs, and faculty at Georgia Tech who want to (Part I) use parallel and GPU computing in MATLAB and (Part II) scale their computations to take advantage of PACE resources.

Requirements
·      Basic working knowledge of MATLAB
·      Access to the Georgia Tech VPN. You do NOT need to be a PACE user, and all participants will receive access to PACE-ICE for hands-on activities.

Please contact PACE at pace-support@oit.gatech.edu with any questions.

Phoenix Project Storage Quotas Begin March 31

[Update 3/31/21 10:45 AM]

As previously announced, we have applied quotas to Phoenix project storage today based on each faculty PI’s choice. You can run the pace-quota command to check your research group’s utilization and quota at any time. Project storage quotas on Phoenix are set for a faculty member’s research group, not for individual users. Each user also has 10 GB of home storage and 15 TB of short-term scratch storage (not backed up).
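For reference, checking usage is a single command from a Phoenix login node; if you also want a per-directory breakdown of what is consuming space, du can help. The du step and the project-storage path below are our own illustration rather than part of the pace-quota tool, and the path is a placeholder.

    # Show your research group's project storage utilization and quota
    pace-quota

    # Optional: per-directory breakdown of usage in your project storage
    # (~/p-mypi-0 is a placeholder path; substitute your group's actual directory)
    du -sh ~/p-mypi-0/*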

You can learn more about Phoenix storage at http://docs.pace.gatech.edu/phoenix_cluster/storage_phnx/. April will be the first month in which storage quotas incur charges for PIs choosing quotas beyond the 1 TB funded by the Institute.

If your research group exceeds its quota, you will not be able to write to your project storage, and jobs running in your project storage may fail. We are directly contacting all users whose storage projects are over quota today.

Please contact PACE Support with any questions about Phoenix project storage quotas. Faculty may also choose to contact their PACE Research Scientist liaison.

 

[Update 3/24/21 12:30 PM]

We’d like to remind you that storage quotas on Phoenix project storage will be set one week from today, on March 31.

You can run the pace-quota command to check your research group’s utilization at any time. Your PI/faculty sponsor is choosing the quota that will be set, based on your group’s storage needs. You can learn more about Phoenix storage at http://docs.pace.gatech.edu/phoenix_cluster/storage_phnx/.

After quotas are set on March 31, we will notify all users, and you will be able to see your quota via pace-quota.

Users and faculty should contact their PACE Research Scientist liaison or PACE Support with any questions about Phoenix project storage quotas.

 

[Original Post]

As part of completing the migration to Phoenix, we will set quotas on Phoenix project storage on March 31, ending the period of unlimited project storage. Project storage quotas on Phoenix are set for a faculty member’s research group, not for individual users. Each user also has 10 GB of home storage and 15 TB of short-term scratch storage (not backed up).

PACE’s free tier offers 1 TB of Institute-funded project storage to GT faculty. Faculty members must fund additional storage beginning in April. PACE has provided faculty members holding Phoenix storage allocations (except those recently created) with information regarding their group’s storage needs. Users can contact their advisors if they have concerns about their allocation.

All users can run the “pace-quota” command on Phoenix to see their research group’s storage usage. Quotas will generally show as unlimited (zero) until March 31.

Please contact us at pace-support@oit.gatech.edu with any questions about Phoenix project storage.

Hive Scratch Storage Update

We would like to remind you about the scratch storage policy on Hive. Scratch is designed for temporary storage and is never backed up. Each week, files not modified for more than 60 days are automatically deleted from your scratch directory. As part of Hive’s start-up, regular cleanup of scratch has now been implemented. Each week, users whose files are scheduled for deletion receive a warning email listing the files to be removed in the coming week, along with additional information. Those of you who use the main PACE system are already familiar with this workflow.

Some of you received such an email yesterday. As always, if you need additional time to migrate valuable data off of scratch, please respond to the email as directed to request a delay.

Please contact us at pace-support@oit.gatech.edu with any questions about how to manage your data stored on Hive.

[Reopened] Network (Infiniband Subnet Manager) Issues in Rich

[ Update 8/14/20 7:00 PM ]

After an additional nearly-48-hour outage in the Rich datacenter due to network/InfiniBand issues, we have brought PACE resources on the affected systems back up and released user jobs. We thank you for your patience and understanding during this unprecedented outage, as we understand the significant impact it has continued to have on your research throughout this week. Please note that PACE clusters in the Coda datacenter (Hive and testflight-coda) and CUI clusters in Rich have not been impacted.

While new jobs have not begun over the past two days, already-running jobs have continued. Please check the output of any jobs that are still running. If they are failing or not producing output, please cancel them and resubmit to run again. Some running user jobs were killed in the process of repairing the network, and those should also be resubmitted to the queue.

In addition to previously reported repairs, we removed a problematic spine module from a network switch this morning and further adjusted connections. This module appeared to be causing intermittent failures when under heavy load.

Currently, our network is running at reduced capacity. We have ordered a replacement spine module for the part that was removed. Today we conducted extensive stress testing of the network and storage, going far beyond the tests run earlier in the week, and the results indicate the system is healthy. We will continue to monitor the systems for any further network abnormalities.

Again, thank you for your patience and understanding this week while we addressed one of the most significant outages in the history of PACE.

Please contact us at pace-support@oit.gatech.edu with any questions or if you observe unexpected behavior on the cluster.

[ Update 8/13/20 8:30 PM ]

We continue to work on the network issues impacting the Rich datacenter. We have partitioned the network and adjusted connections in an effort to isolate the problem. As mentioned this morning, we have ordered parts to address potentially problematic switches as we continue systematic troubleshooting of them. We continue to run tests on InfiniBand, and we are running an overnight stress test on the network to monitor for recurrence of errors. The schedulers remain paused to prevent further jobs from being launched on the cluster. We will follow up tomorrow with an update on the Rich cluster network.

Thank you for your continued patience and understanding during this outage.

[ Update 8/13/20 10:10 AM ]

Unfortunately, after the nearly-80-hour outage earlier this week, we must report another network outage. We apologize for this inconvenience, as we do understand its impact on your research. The network/InfiniBand issues in the Rich datacenter began recurring late yesterday evening, and we are aware of them. We are currently working to resolve them, and we have ordered replacements for the network switch components that appear problematic. The issue was not detected by our deterministic testing methods and occurred only after restarting user production jobs caused very heavy network utilization. We will provide further updates once more information is available. As before, you may experience slowness in accessing storage (home, project, and/or scratch) and/or issues with communication within MPI jobs.
We have paused all the schedulers for clusters in Rich datacenter that are accessed by the following headnodes/login nodes: login-s, login-d, login7-d, novazohar, gryphon, and testflight-login. This pause prevents additional jobs from starting, but already-running jobs have not been stopped. However, there is a chance they will be killed as we continue to work to resolve the network issues.
Please note that this network issue does not impact the Coda datacenter (Hive and testflight-coda) or CUI clusters in the Rich datacenter.
Thank you for your continued patience as we continue to work to resolve this issue.
Please contact us with any questions or concerns at pace-support@oit.gatech.edu.

[ Update 8/12/20 6:20 PM ]

After nearly 80 hours of outage in the Rich datacenter due to network/InfiniBand issues, we have been able to bring the PACE compute nodes back up, and user jobs have begun to run again. We thank you for your patience during this period, and we understand the significant impact of this outage on your research this week.
For any user jobs that were killed due to restarts yesterday, please resubmit the jobs to the queue at this time. Please check the output of any recent jobs and resubmit any that did not succeed.
As noted yesterday evening, we have carefully brought nodes back into production in small groups to identify issues, and we have turned off nodes that we identified as having network difficulties. Our findings point to multiple hardware problems that caused InfiniBand connectivity problems between nodes. We addressed these issues, and we are no longer observing the errors after our extensive testing. We will continue to monitor the systems, but please contact us immediately at pace-support@oit.gatech.edu if you notice your job running slowly or failing to produce output.
Please note that we will continue to work on problematic nodes that are currently offline in order to restore compute access to all PACE users, and we will contact affected users as needed.
Again, thank you for your patience and understanding this week while we addressed one of the most impactful outages in the history of PACE.
Please contact us at pace-support@oit.gatech.edu with any questions.

[ Update 8/12/20 12:30 AM ]

We continue to work to bring PACE nodes back into production. After turning off all the compute nodes and reseating the faulty network connections we identified, we have been slowly bringing nodes back up to avoid overwhelming the network fabric, which has been clean so far. We are carefully testing each group to ensure full functionality, and we continue to identify problematic nodes and repair them where possible. At this time, the schedulers remain paused while we turn on and test nodes. We will provide additional updates as more progress is made.

[ Update 8/11/20 5:15 PM]

We continue to troubleshoot the network issues in the Rich datacenter. Unfortunately, our efforts to avoid disturbing running jobs have complicated the troubleshooting, which has not led to a resolution. At this time, we need to begin systematic rebooting of many nodes, which will kill some running user jobs. We will contact users with current running jobs directly to alert you to the effect on your jobs.

Our troubleshooting today has included reseating multiple spine modules in the main datacenter switch, adjusting uplinks between the two main switches to isolate problems, and rebooting switches and some nodes.

We will continue to provide updates as more information becomes available. Thank you for your patience during this outage.

[ Update 8/10/20 11:35 PM ]

We have made several changes to create a more stable InfiniBand network, including deploying an updated subnet manager, bypassing bad switch links, and repairing GPFS filesystem errors. However, we have not yet been able to uncover all issues the network is facing, so affected schedulers remain paused for now, to ensure that new jobs do not begin when they cannot produce results.

We will provide an update on Tuesday as more information becomes available. We greatly appreciate your patience as we continue to troubleshoot.

[ Update 8/10/20 6:20 PM ]

We are continuing to troubleshoot network issues in Rich. At this time, we are working to deploy an older backup subnet manager, and we will test the network again to determine if communication has been restored after that step.

The schedulers on the affected clusters remain paused, to ensure that new jobs do not begin when they cannot produce results.

We recognize that this outage has a significant impact on your research, and we are working to restore functionality in Rich as soon as possible. We will provide an update when more information becomes available.

[ Update 8/9/20 11:55 PM]

We have restarted PACE’s Subnet Manager in Rich, but some network slowness remains. We are continuing to troubleshoot the problem. At this time, we plan to leave the Rich schedulers paused overnight in order to ensure that the issue is fully resolved before additional jobs begin, so that they will be able to run successfully.
We will provide further updates on Monday.

[ Original Post]

At approximately noon today, we began experiencing issues with our primary InfiniBand Subnet Manager in the Rich data center. PACE is investigating this issue. We will provide an update when additional information or a resolution is available. At this time, you may experience slowness in accessing storage (home, project, or scratch) or issues with communication within MPI jobs.

In order to minimize impact to jobs, we have paused all schedulers on the affected clusters (accessed via login-s, login-d, login7-d, novazohar, gryphon, and testflight-login headnodes). This will prevent additional jobs from starting, but jobs that are already running will not be stopped, although they may fail to produce results due to the network issues.

This issue does not impact the Coda data center (Hive & testflight-coda clusters) or CUI clusters in the Rich data center.

Please contact us with any questions or concerns at pace-support@oit.gatech.edu.

[Resolved] [testflight-coda] Lustre scratch outage

[ Update 8/11/20 10:15 AM]

Lustre scratch has been repaired. We identified a broken ethernet port on a switch and moved to another port, restoring access.

[ Original Post ]

There is an outage affecting our Lustre scratch, which is currently used only in testflight-coda. We are working with the vendor to restore the system. Storage on all PACE production systems is unaffected.

You may continue your testing in testflight-coda to prepare for your Coda migration by using Lustre project storage, accessed via the “data” symbolic link in your testflight-coda home directory.

We will provide an update when the Lustre scratch system is restored. Please contact us at pace-support@oit.gatech.edu with questions.