Georgia Power will be conducting additional tests of the MicroGrid powering the Coda datacenter (Hive and testflight-coda) this week. Unlike the last round, this new set of tests is expected to pose a low risk of power interruption to compute nodes.
Author: mweiner3
[Reopened] Network (InfiniBand Subnet Manager) Issues in Rich
[ Update 8/14/20 7:00 PM ]
After an additional nearly-48-hour outage in the Rich datacenter due to network/InfiniBand issues, we have brought PACE resources on the affected systems back up and released user jobs. We thank you for your patience and understanding during this unprecedented outage, as we understand the significant impact it has continued to have on your research throughout this week. Please note that PACE clusters in the Coda datacenter (Hive and testflight-coda) and CUI clusters in Rich have not been impacted.
While new jobs have not begun over the past two days, already-running jobs have continued. Please check the output of any jobs that are still running. If they are failing or not producing output, please cancel them and resubmit to run again. Some running user jobs were killed in the process of repairing the network, and those should also be resubmitted to the queue.
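For those who prefer to script this check, below is a minimal sketch that lists your jobs so you can review their output and error files before deciding what to cancel and resubmit. It assumes the Torque/Moab-style qstat, qdel, and qsub commands available on PACE login nodes; the job ID and script name shown are hypothetical.

```python
# Minimal sketch: list this user's jobs so their output/error files can be
# reviewed before cancelling and resubmitting.
# Assumes Torque/Moab-style qstat/qdel/qsub commands on a PACE login node.
import getpass
import subprocess

user = getpass.getuser()

# Show only this user's jobs.
jobs = subprocess.run(["qstat", "-u", user], capture_output=True, text=True)
print(jobs.stdout)

# After checking a job's output, cancel a stalled job and resubmit its script:
# subprocess.run(["qdel", "1234567"])            # hypothetical job ID
# subprocess.run(["qsub", "my_job_script.pbs"])  # hypothetical script name
```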
In addition to previously reported repairs, we removed a problematic spine module from a network switch this morning and further adjusted connections. This module appeared to be causing intermittent failures when under heavy load.
Currently, our network is running at reduced capacity. We have ordered a new spine module to replace the removed part. Today we conducted extensive stress tests of the network and storage, far more demanding than those run earlier in the week, and the results indicate the system is healthy. We will continue to monitor the systems for any further network abnormalities.
Again, thank you for your patience and understanding this week while we addressed one of the most significant outages in the history of PACE.
Please contact us at pace-support@oit.gatech.edu with any questions or if you observe unexpected behavior on the cluster.
[ Update 8/13/20 8:30 PM ]
We continue to work on the network issues impacting the Rich datacenter. We have partitioned the network and adjusted connections in an effort to isolate the problem. As mentioned this morning, we have ordered parts to address potentially problematic switches as we continue systematic troubleshooting. We continue to run tests on InfiniBand, and we are running an overnight stress test on the network to monitor for recurrence of errors. The schedulers remain paused to prevent further jobs from being launched on the cluster. We will follow up tomorrow with an update on the Rich cluster network.
Thank you for your continued patience and understanding during this outage.
[ Update 8/13/20 10:10 AM ]
[ Update 8/12/20 6:20 PM ]
[ Update 8/12/20 12:30 AM ]
We continue to work to bring PACE nodes back into production. After turning off all the compute nodes and reseating the faulty network connections we identified, we have been slowly bringing nodes back up to avoid overwhelming the network fabric, which has remained clean so far. We are carefully testing each group to ensure full functionality, and we continue to identify problem nodes and repair them where possible. At this time, the schedulers remain paused while we turn on and test nodes. We will provide additional updates as more progress is made.
[ Update 8/11/20 5:15 PM ]
We continue to troubleshoot the network issues in the Rich datacenter. Unfortunately, our efforts to avoid disturbing running jobs have complicated the troubleshooting, which has not yet led to a resolution. At this time, we need to begin systematically rebooting many nodes, which will kill some running user jobs. We will contact users with currently running jobs directly to alert them to the effect on their jobs.
Our troubleshooting today has included reseating multiple spine modules in the main datacenter switch, adjusting uplinks between the two main switches to isolate problems, and rebooting switches and some nodes.
We will continue to provide updates as more information becomes available. Thank you for your patience during this outage.
[ Update 8/10/20 11:35 PM ]
We have made several changes to create a more stable InfiniBand network, including deploying an updated subnet manager, bypassing bad switch links, and repairing GPFS filesystem errors. However, we have not yet been able to uncover all of the issues the network is facing, so affected schedulers remain paused for now to ensure that new jobs do not begin when they cannot produce results.
We will provide an update on Tuesday as more information becomes available. We greatly appreciate your patience as we continue to troubleshoot.
[ Update 8/10/20 6:20 PM ]
We are continuing to troubleshoot network issues in Rich. At this time, we are working to deploy an older backup subnet manager, and we will test the network again to determine if communication has been restored after that step.
The schedulers on the affected clusters remain paused, to ensure that new jobs do not begin when they cannot produce results.
We recognize that this outage has a significant impact on your research, and we are working to restore functionality in Rich as soon as possible. We will provide an update when more information becomes available.
[ Update 8/9/20 11:55 PM ]
[ Original Post ]
At approximately noon today, we began experiencing issues with our primary InfiniBand Subnet Manager in the Rich data center. PACE is investigating this issue. We will provide an update when additional information or a resolution is available. At this time, you may experience slowness in accessing storage (home, project, or scratch) or issues with communication within MPI jobs.
In order to minimize impact to jobs, we have paused all schedulers on the affected clusters (accessed via login-s, login-d, login7-d, novazohar, gryphon, and testflight-login headnodes). This will prevent additional jobs from starting, but jobs that are already running will not be stopped, although they may fail to produce results due to the network issues.
This issue does not impact the Coda data center (Hive & testflight-coda clusters) or CUI clusters in the Rich data center.
Please contact us with any questions or concerns at pace-support@oit.gatech.edu.
[Resolved] [testflight-coda] Lustre scratch outage
[ Update 8/11/20 10:15 AM ]
Lustre scratch has been repaired. We identified a broken Ethernet port on a switch and moved the connection to another port, restoring access.
[ Original Post ]
There is an outage affecting our Lustre scratch, which is currently used only in testflight-coda. We are working with the vendor to restore the system. Storage on all PACE production systems is unaffected.
You may continue your testing in testflight-coda to prepare for your Coda migration by using Lustre project storage, accessed via the “data” symbolic link in your testflight-coda home directory.
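As an illustration, the minimal sketch below confirms that the "data" link resolves and stages a small test file under Lustre project storage; the subdirectory and file names are placeholders, not prescribed paths.

```python
# Minimal sketch: verify the "data" symbolic link in the testflight-coda home
# directory resolves, then write a small test file under Lustre project storage.
# The subdirectory and file names are placeholders for illustration.
import os

data_link = os.path.expanduser("~/data")       # the symlink described above
print("data ->", os.path.realpath(data_link))  # show where the link points

workdir = os.path.join(data_link, "coda_migration_test")  # placeholder name
os.makedirs(workdir, exist_ok=True)
with open(os.path.join(workdir, "hello.txt"), "w") as f:
    f.write("Lustre project storage is reachable.\n")
```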
We will provide an update when the Lustre scratch system is restored. Please contact us at pace-support@oit.gatech.edu with questions.
[Mitigated] Globus Access Restored
PACE’s globus-internal server, which hosts the PACE Internal endpoint, experienced an outage beginning earlier this afternoon. We have redirected traffic to an alternate interface, and access to PACE storage via Globus is restored.
The PACE Internal endpoint provides access to the main PACE system in Rich, including home, project, and scratch storage, in addition to serving as the interface to PACE Archive storage. Hive is accessed via a separate Globus endpoint and was not affected.
As a reminder, you can find instructions on how to use Globus for file transfer to/from PACE at http://docs.pace.gatech.edu/storage/globus/. Please contact us at pace-support@oit.gatech.edu with any questions.
VPN Upgrades
We would like to inform you of several upcoming updates to Georgia Tech’s VPNs, which you use to connect to PACE from off-campus locations.
The GlobalProtect VPN client will be updated on August 4, 8-10 PM. This update will improve support for macOS 10.15.4+ (removing the Legacy System Extension message) and address other bugs. The update will be applied automatically, but you may choose to test it early, as described at faq.oit.gatech.edu/content/how-do-i-get-started-globalprotect-campus-vpn#labportal.
The AnyConnect VPN client will also be upgraded. As with previous upgrades, your client will automatically download the new version the first time you attempt to connect after the update. You may choose to upgrade early by connecting your client to dev.vpn.gatech.edu, then returning to the normal address once the update is installed. The PACE VPN (used for CUI/ITAR clusters only) will be upgraded on August 4, 8-10 PM. The anyc VPN (used for most PACE resources and the rest of the GT campus) will be upgraded on August 11, 8-10 PM.
Please visit status.gatech.edu for further details on all pending updates to Georgia Tech’s VPN service.
[Resolved] Georgia Power Micro Grid Testing (continued)
[Update 7/22/20 1:00 PM]
Hive and testflight-coda systems were restored early this morning. Systems have returned to normal operation, and user jobs are running. If you were notified of a lost job, please resubmit it at this time.
Georgia Power does not plan to conduct any tests today. No additional information about the cause of yesterday’s outage is available at this time.
[Update 7/21/20 11:00 PM]
[Update 7/21/20 3:15 PM]
Unfortunately, the planned testing of the Georgia Power Micro Grid this week has led to a loss of power in the Coda research hall, home to compute nodes for Hive & testflight-coda. Any running jobs on those clusters will have failed at this time. Access to login nodes and storage, housed in the Coda enterprise hall, is uninterrupted.
We are sorry for what we know is a significant interruption to your work.
At this time, teams are working to restore power to the system. We will provide an update when available.
[Update 7/14/20 4:00 PM]
Georgia Power will be conducting additional bypass tests for the MicroGrid power generation facility for the Coda datacenter (Hive & testflight-coda clusters) during the week of July 20-24. These tests represent a slightly higher risk of disruption than the tests conducted in June, but the risk has been substantially lowered by additional testing last month.
As before, we do not expect any disruption to PACE compute resources. PACE’s storage and head nodes have UPS and generator backup power, but compute nodes do not. In the event of an unexpected complication during testing, compute nodes could lose power for a brief period, disrupting running jobs. Georgia Power, DataBank, OIT’s network team, and PACE will all have staff on standby during these tests to ensure a quick repair in the event of an unexpected outage.
Please contact us at pace-support@oit.gatech.edu with any questions.
Visit http://blog.pace.gatech.edu/?p=6778 for full details on this power testing.
[Resolved] PACE License Server Outage
The PACE license server experienced an outage earlier this afternoon, which has since been resolved.
The following software licenses were not available on PACE during the outage: Intel compiler, Gurobi, Allinea, PGI. If you experienced difficulty accessing these services earlier today, please retry your job at this time.
The outage did not affect the College of Engineering license server, which hosts campus-wide licenses for software widely used on PACE, including MATLAB, COMSOL, Abaqus, and Ansys.
Please contact us at pace-support@oit.gatech.edu with any questions.
DNS/DHCP maintenance
OIT will be conducting scheduled maintenance on Thursday, June 25, 5:00 – 8:00 AM to patch gtipam and DNS/DHCP servers. Due to redundant servers, the risk of any interruption to PACE is very low. If there is an interruption, you may find yourself unable to connect to PACE or lose your open connection to a login node, interactive job, VNC session, or Jupyter notebook. Running batch jobs should not be affected, even in the event of an interruption.
Please contact us at pace-support@oit.gatech.edu with any questions.
Emergency Network Maintenance Tomorrow (6/18)
The network team is beginning additional emergency network maintenance immediately (at noon today) and continuing through 7 PM this evening to reverse changes made yesterday evening. It will have the same effect as yesterday's outage, so you will likely lose your VPN connection and/or PACE connection at some point this afternoon during intermittent outages.
Please contact us at pace-support@oit.gatech.edu with any questions.
- At some point during this maintenance, users may experience up to a 20-minute interruption that will cause any open VPN connection to close. If you are connected to PACE from off-campus via the GT VPN during the interruption, you will likely lose your connection, and any terminal session, interactive job, VNC session, or running Jupyter notebook will be interrupted. Please be prepared for such an interruption when working tomorrow evening. Note that this may also interrupt any connection you have made over the GT VPN to non-PACE locations. Connections to PACE from within the campus firewall may also be interrupted, which could affect resources outside of PACE that PACE jobs rely on, such as queries to some software license servers, including those for MATLAB or COMSOL. Batch jobs already running on PACE should not be affected.
- In addition, about midway through the maintenance, there will be a period of approximately 20-30 minutes where authentication will be unavailable. This will prevent any new connections to the VPN, to PACE, and to any cloud service that authenticates using GT credentials. It is also possible for this interruption to cause new job starts to fail due to the loss of access to the authentication service.
We will alert you if there is any change of plans for this emergency maintenance.