[Mitigated] Globus Access Restored

PACE’s globus-internal server, which hosts the PACE Internal endpoint, experienced an outage beginning earlier this afternoon. We have redirected traffic to an alternate interface, and access to PACE storage via Globus is restored.

The PACE Internal endpoint provides access to the main PACE system in Rich, including home, project, and scratch storage, in addition to serving as the interface to PACE Archive storage. Hive is accessed via a separate Globus endpoint and was not affected.

As a reminder, you can find instructions on how to use Globus for file transfer to/from PACE at http://docs.pace.gatech.edu/storage/globus/. Please contact us at pace-support@oit.gatech.edu with any questions.
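
For users who prefer to script transfers, the same endpoints can be driven from the Globus Python SDK. The sketch below is illustrative only and rests on several assumptions: the globus-sdk package is installed, you already hold a valid transfer access token, and the endpoint UUIDs and paths (placeholders here) are replaced with your own. The documentation linked above remains the authoritative guide.

    # Minimal sketch of a Globus transfer using the globus-sdk Python package.
    # The token, endpoint UUIDs, and paths below are placeholders, not real values.
    import globus_sdk

    TRANSFER_TOKEN = "REPLACE_WITH_A_VALID_TRANSFER_TOKEN"   # hypothetical access token
    PACE_INTERNAL_ENDPOINT = "REPLACE_WITH_ENDPOINT_UUID"    # hypothetical UUID of the PACE Internal endpoint
    LOCAL_ENDPOINT = "REPLACE_WITH_ENDPOINT_UUID"            # hypothetical UUID of your local endpoint

    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
    )

    # Describe a transfer from PACE storage to the local endpoint.
    tdata = globus_sdk.TransferData(
        tc,
        source_endpoint=PACE_INTERNAL_ENDPOINT,
        destination_endpoint=LOCAL_ENDPOINT,
        label="PACE data pull",
    )
    tdata.add_item("/path/on/pace/results.tar.gz", "/path/on/local/results.tar.gz")

    # Submit the transfer; the task ID can be monitored in the Globus web app.
    task = tc.submit_transfer(tdata)
    print("Submitted Globus transfer task:", task["task_id"])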


VPN Upgrades

We would like to inform you of several upcoming updates to Georgia Tech’s VPNs, which you use to connect to PACE from off-campus locations.

The GlobalProtect VPN client will be updated on August 4, 8-10 PM. This update will improve support for macOS 10.15.4+ (removing the Legacy System Extension message) and address other bugs. The update will be applied automatically, but you may choose to test it early, as described at faq.oit.gatech.edu/content/how-do-i-get-started-globalprotect-campus-vpn#labportal.

The AnyConnect VPN client will also be upgraded. As with previous upgrades, your client will automatically download the new version the first time you attempt to connect after the update. You may choose to upgrade early by connecting your client to dev.vpn.gatech.edu, then returning to the normal address once the update is installed. The PACE VPN (used for CUI/ITAR clusters only) will be upgraded on August 4, 8-10 PM. The anyc VPN (used for most PACE resources and the rest of the GT campus) will be upgraded on August 11, 8-10 PM.

Please visit status.gatech.edu for further details on all pending updates to Georgia Tech’s VPN service.

[Resolved] Georgia Power Micro Grid Testing (continued)

[Update 7/22/20 1:00 PM]

Hive and testflight-coda systems were restored early this morning. Systems have returned to normal operation, and user jobs are running. If you were notified of a lost job, please resubmit it at this time.

Georgia Power does not plan to conduct any tests today. No additional information about the cause of yesterday’s outage is available at this time.

[Update 7/21/20 11:00 PM]

The failed power feed in Coda has been bypassed, and power is returning to the Coda research hall. However, because the cooling plant has been offline for so long, it will require about 2 hours to restart and stabilize before we can resume full operation. Due to the late hour, we will begin bringing systems back online in the morning and will provide another update when we are back to normal operation. Georgia Power will be researching the root cause of this outage in the morning, and we will share details if available.

[Update 7/21/20 3:15 PM]

Unfortunately, the planned testing of the Georgia Power Micro Grid this week has led to a loss of power in the Coda research hall, home to compute nodes for Hive & testflight-coda. Any running jobs on those clusters will have failed at this time. Access to login nodes and storage, housed in the Coda enterprise hall, is uninterrupted.

We are sorry for what we know is a significant interruption to your work.

We will follow up with users who had jobs running at the time of the power outage to provide more specific information.

At this time, teams are working to restore power to the system. We will provide an update when available.


[Update 7/14/20 4:00 PM]

Georgia Power will be conducting additional bypass tests for the MicroGrid power generation facility for the Coda datacenter (Hive & testflight-coda clusters) during the week of July 20-24. These tests represent a slightly higher risk of disruption than the tests conducted in June, but the risk has been substantially lowered by additional testing last month.

As before, we do not expect any disruption to PACE compute resources. PACE’s storage and head nodes have UPS and generator backup power, but compute nodes do not. In the event of an unexpected complication during testing, compute nodes could lose power for a brief period, disrupting running jobs. Georgia Power, DataBank, OIT’s network team, and PACE will all have staff on standby during these tests to ensure a quick repair in the event of an unexpected outage.

Please contact us at pace-support@oit.gatech.edu with any questions.

Visit http://blog.pace.gatech.edu/?p=6778 for full details on this power testing.

[Resolved] PACE License Server Outage

The PACE license server experienced an outage earlier this afternoon, which has since been resolved.

The following software licenses were not available on PACE during the outage: Intel compiler, Gurobi, Allinea, PGI. If you experienced difficulty accessing these services earlier today, please retry your job at this time.

The outage did not affect the College of Engineering license server, which hosts campus-wide licenses for other widely used software on PACE, including MATLAB, COMSOL, Abaqus, and Ansys.

Please contact us at pace-support@oit.gatech.edu with any questions.

DNS/DHCP maintenance

OIT will be conducting scheduled maintenance on Thursday, June 25, 5:00 – 8:00 AM to patch gtipam and DNS/DHCP servers. Due to redundant servers, the risk of any interruption to PACE is very low. If there is an interruption, you may find yourself unable to connect to PACE or lose your open connection to a login node, interactive job, VNC session, or Jupyter notebook. Running batch jobs should not be affected, even in the event of an interruption.
Please contact us at pace-support@oit.gatech.edu with any questions.

Emergency Network Maintenance Tomorrow (6/18)

[Update 6/19/20 12:10 PM]

The network team is beginning additional emergency network maintenance immediately (at noon today), continuing through 7 PM this evening, to reverse changes from yesterday evening. It will have the same effect as yesterday’s outage, so you will likely lose your VPN connection and/or PACE connection at some point this afternoon during intermittent outages.

Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Post]
The GT Network team will be performing emergency maintenance on the campus firewall beginning at 8 PM tomorrow (Thursday) night, with targeted completion by 2 AM Friday morning. Although every effort is being made to avoid outages, this maintenance may cause two interruptions:
  • At some point during this maintenance, users may experience up to a 20-minute interruption that will cause any open VPN connection to close. If you are connected to PACE from off-campus via the GT VPN during the interruption, you will likely lose your connection, and any terminal session, interactive job, VNC session, or running Jupyter notebook will be interrupted. Please be prepared for such an interruption when working tomorrow evening. Note that this may also interrupt any connection you have made over the GT VPN to non-PACE locations. Connections to PACE from within the campus firewall may also be interrupted, which means that resources outside of PACE required for PACE jobs, such as license checkouts for software including MATLAB or COMSOL, may also be briefly unavailable. Batch jobs already running on PACE should not be affected.
  • In addition, about midway through the maintenance, there will be a period of approximately 20-30 minutes where authentication will be unavailable. This will prevent any new connections to the VPN, to PACE, and to any cloud service that authenticates using GT credentials.  It is also possible for this interruption to cause new job starts to fail due to the loss of access to the authentication service.

We will alert you if there is any change of plans for this emergency maintenance.

Please contact us at pace-support@oit.gatech.edu with any questions.

Emergency Network Maintenance

The GT Network team will be performing emergency maintenance on the campus firewall beginning at 8 PM tonight, with targeted completion by midnight. At some point during this maintenance, users will experience a brief interruption that will cause any open VPN connection to close. If you are connected to PACE from off-campus via the GT VPN during the interruption, you will lose your connection, and any terminal session, interactive job, VNC session, or running Jupyter notebook will be interrupted. Please be prepared for such an interruption when working this evening.
Note that this will also interrupt any connection you have made over the GT VPN to non-PACE locations.
Batch jobs running on PACE should not be affected, nor will connections from within the campus firewall.
We will alert you if there is any change of plans for this emergency maintenance.
Please contact us at pace-support@oit.gatech.edu with any questions.

Georgia Power Micro Grid Testing (Week of June 8)

[Update 6/15/20 12:45 PM]

Georgia Power will continue low-risk testing of the power supply to PACE’s Hive and testflight-coda clusters in the Coda data center this week.

In addition, Georgia Power is planning further testing in Coda at a later date, and we are working with them and other stakeholders to identify the best times and lowest-risk approach for completing this work.

[Update 6/12/20 6:45 PM]

Georgia Power will continue low-risk testing of the power supply to the Coda data center next week.

[Original Post]

During the week of June 8, Georgia Power will perform a series of bypass tests for the power that feeds the Coda data center, housing PACE’s Hive and testflight-coda clusters. This is a further step in establishing a Micro Grid power generation facility for Coda, after progress during the last maintenance period.
Georgia Power has classified all of these tests as low risk, and we do not expect any disruption to PACE compute resources. PACE’s storage and head nodes have UPS and generator backup power, but compute nodes do not. In the event of an unexpected complication during testing, compute nodes could lose power for a brief period, disrupting running jobs. Georgia Power, DataBank, OIT’s network team, and PACE will all have staff on standby during these tests to ensure a quick repair in the event of an unexpected outage.
Please contact us at pace-support@oit.gatech.edu with any questions.

[Resolved] Home directory failures

[Update 5/18/20 4:25 PM]

Reliable access to home directories was restored early this afternoon. The issue was with DNS on the GT network: the DNS server that resolves connections to the home and utility storage devices was responding slowly but was not completely down, so it never failed over to the backup server. In concert with OIT, we have reordered the DNS servers, and access is restored. Please contact us at pace-support@oit.gatech.edu with any questions.

If jobs failed due to the outage, please resubmit them to run again.

[ Issue began approximately 2 PM on 5/17/20 ]

We are experiencing an intermittent outage on PACE affecting home directories and certain other mounted utility directories. We are currently working to restore access. Thank you to those of you who reported the issue to us this afternoon. This intermittent mount failure can cause the following issues:

  • Home directories not loading on login nodes
  • Login sessions starting with “bash” instead of “~” as the prompt, with warning messages displayed
  • Batch or interactive jobs failing immediately after launch, with an error message such as “no such file or directory”, because files cannot be loaded
  • “pace-check-queue” and other PACE utilities failing to report information as expected
  • Missing home directories in file transfer utilities (scp or sftp)
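
If you want a quick way to tell whether your home directory is currently reachable from a login node, a minimal Python check along the lines of the sketch below can help; it simply tries to list the directory and reports the error if the mount is unavailable (the interpreter and path are assumed to be those of a standard PACE login session).

    # Quick sanity check: can the home directory be listed right now?
    import os

    home = os.path.expanduser("~")
    try:
        entries = os.listdir(home)
        print(f"{home} is reachable ({len(entries)} entries visible)")
    except OSError as err:
        print(f"{home} is NOT reachable: {err}")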


For jobs that have failed, please wait until after we have completed the repair and then resubmit your jobs.

We will provide updates as they become available. Thank you for your patience.

[Resolved] Emergency Switch Reboot

[Resolved 5/15/20 9:30 PM]

Extensive repairs during our quarterly maintenance period resolved remaining Infiniband issues.

[Update 5/9/20 4:20 PM]

We are continuing to work to resolve remaining issues with connectivity in the Rich datacenter. We have made additional adjustments since this morning, which have improved connectivity and reliability. Read/write access to GPFS (data and scratch) at the normal rate has been restored to nearly all nodes, and the few nodes with remaining difficulties have been offlined so that no new jobs will start on them, although jobs already running there may hang. However, we continue to see intermittent issues with MPI jobs on active nodes, and we will continue to investigate next week. Please check any running jobs to see if they are producing output or hanging. If a job is hanging, please cancel it. Please resubmit any jobs that have failed, as most non-MPI and MPI jobs should work if resubmitted at this point. Keep in mind that any job with a walltime request that will not complete by 6 AM on Thursday will be held until after the scheduled maintenance period.
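
If you are unsure whether a job is hanging, one rough check is to list your jobs and see whether the job's output file is still being written. The sketch below is only an illustration and makes assumptions: the qstat/qdel commands of the Torque/Moab scheduler in use on PACE, and a hypothetical output file path that you would replace with the one your own job script writes.

    # Sketch: list my jobs, then check how recently a job's output file was updated.
    import getpass
    import os
    import subprocess
    import time

    user = getpass.getuser()
    subprocess.run(["qstat", "-u", user], check=False)  # list this user's jobs

    output_file = os.path.expanduser("~/scratch/myjob.out")  # hypothetical output file
    age_minutes = (time.time() - os.path.getmtime(output_file)) / 60
    print(f"{output_file} last modified {age_minutes:.0f} minutes ago")

    # If the job should be writing but the file has not changed for a long time,
    # cancel it by job ID (hypothetical) and resubmit once repairs are complete:
    # subprocess.run(["qdel", "1234567"], check=False)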
Thank you for your patience during this emergency repair.

[Update 5/9/20 10:45 AM]

We are continuing to work to resolve remaining issues with connectivity in the Rich datacenter. We have deployed the replacement switch, previously planned for the upcoming maintenance period, and it has been in place since approximately 11:45 PM Friday evening. We are continuing to troubleshoot access to GPFS (data and scratch) and MPI job functionality.
Users most affected, those with long-running jobs, have been contacted directly with instructions for checking the progress of their jobs.
Thank you for your patience during this emergency repair.

[Update 5/8/20 11:00 AM]

Our team worked into the early hours of this morning to complete the emergency maintenance, but we have not yet completely resolved all issues. New jobs were released to run around 1:15 AM. We are continuing to isolate and fix errors in the InfiniBand network affecting read/write on GPFS storage (data and scratch) and possibly MPI jobs. Please contact us at pace-support@oit.gatech.edu about any running jobs where you encounter slow performance, which will help us in identifying specific nodes with issues.
Many affected jobs may run more slowly than normal. In order to mitigate loss of research due to these issues, we have administratively added 24 hours to the walltime request of any job currently running. Please note that this extension will not extend job completion times beyond 6:00 AM on Thursday, when our scheduled maintenance period begins. If you resubmit a job, please keep in mind that any job that will not complete by Thursday morning will be held until after scheduled maintenance is complete.
We apologize for the disruption, and we will continue to update you on the status of this repair.

[Update 5/7/20 4:05 PM]

We encountered a complication during the reboot, and our engineers are currently working to complete the repair. We will provide updates as they become available.

[Original message]

We have an emergency need to reboot an InfiniBand switch in the Rich datacenter today, as it is likely to fail shortly without intervention. We will conduct this reboot at 3 PM today, and we expect the outage to last approximately 15 minutes. Any jobs running at 3 PM today are likely to fail if they attempt to read/write files to/from data or scratch directories during the outage or if they use MPI. We have stopped all new jobs from starting in order to reduce the number of affected jobs, and we will release them after the reboot. For any job that is already running, please check the output and resubmit if your job fails. Jobs that do not read/write in the data or scratch directories during the outage window should not be affected.
We have planned a long-term repair to this equipment during next week’s maintenance period, but this emergency reboot is necessary in the meantime.
PACE resources in the Coda datacenter, including Hive and testflight-coda, will not be impacted. CUI/ITAR resources in Rich are also unaffected.