
[UPDATE] shared-scheduler Degraded Performance

7/31/2020 UPDATE

Dear Researchers,

In addition to the previously announced maintenance day activities, we will be migrating the Torque component of shared-sched to a dedicated server to address the recent performance issues. This move should improve the scheduler’s response time to client queries such as qstat, and decrease job submission and start times when compute resources are available. While you do not need to do anything to prepare for this migration, we advise that you make note of any jobs queued at the start of maintenance just in case. As always, please direct any questions or concerns to pace-support@oit.gatech.edu. We thank you for your patience.
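If you would like a record of your queued jobs before maintenance begins, a command along the following lines (assuming the standard Torque client tools on the scheduler login nodes; the output filename is only an example) will save the current list for later reference:

# List your own jobs as reported by the scheduler and keep a copy to compare after maintenance
qstat -u $USER > jobs_before_maintenance.txt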

The PACE Team

 

7/29/2020 UPDATE

Dear Researchers,

At this time the scheduler is functional, although some commands may be slow to respond. We will continue investigating to identify the source of these problems and will provide updates as we learn more. Thank you.

[ORIGINAL MESSAGE]

We are aware of a significant slowdown in the performance of the shared-scheduler since last week. Initial attempts to resolve the issue towards the end of the week appeared successful, but the problems have restarted and we are continuing our investigation along with scheduler support. We appreciate your patience as we work to restore full functionality to shared-scheduler.

The PACE Team

[Resolved]: PACE Maintenance Days 8/6/2020-8/8/2020

Dear PACE Users,

RESOLVED: PACE is now ready for research.

We are preparing for our next PACE maintenance period, which will begin at 6:00 AM on August 6th, 2020 and conclude at 11:59 PM on August 8th, 2020. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.
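As an illustration only (the walltime you request should reflect your job's actual needs), a PBS walltime directive like the following allows the scheduler to start a job before the maintenance window, provided the job can finish before the 6:00 AM cutoff on August 6:

# Example PBS directive: a job that can complete before 6:00 AM on August 6
# may start normally; jobs that cannot will be held until maintenance ends.
#PBS -l walltime=12:00:00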

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEMS REQUIRING USER ACTION:

– None currently.

ITEMS NOT REQUIRING USER ACTION:

– [Resolved] Coda Lustre Upgrade (this began on Wednesday, 08/05, impacted testflight-coda only, and a scheduler reservation was put in place to prevent any jobs from running past 6:00 AM on Wednesday, August 5).

– [Resolved] Install additional line cards for the CS8500 InfiniBand switch.

– [Resolved] Deploy PBSTools RPM on schedulers.

– [Resolved] Upgrade firmware on Hive InfiniBand switches to version 3.9.0914.

– [Resolved] Upgrade firmware on Coda InfiniBand director switches to version 3.9.0914.

– [Resolved] Move DNS appliance from Rich to Coda.

– [Resolved] Update coda-apps file system mounts to use qtrees from NetApp on all servers.

– [Deferred] Update Nvidia GPU drivers in Coda to support the CUDA 11 SDK.

– [Resolved] Reboot all nodes.

– [Resolved] Reboot the subnet manager.

 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

 

The PACE Team

[Resolved] Georgia Power Micro Grid Testing (continued)

[Update 7/22/20 1:00 PM]

Hive and testflight-coda systems were restored early this morning. Systems have returned to normal operation, and user jobs are running. If you were notified of a lost job, please resubmit it at this time.
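If one of your jobs was lost, resubmitting it is typically just a matter of re-running qsub on your existing submission script, for example (my_job.pbs is a placeholder for your own script name):

# Resubmit the same submission script used before the outage
qsub my_job.pbs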

Georgia Power does not plan to conduct any tests today. No additional information about the cause of yesterday’s outage is available at this time.

[Update 7/21/20 11:00 PM]

The power outage in CODA has been bypassed, and power is returning to the Coda research hall.  However, because the cooling plant has been offline for so long, it will require about 2 hours to restart and stabilize before we can resume full operation.  Due to the late hour, we will begin to bring systems back on in the morning and provide another update when we’re back to normal operation.  Georgia Power will be researching the root cause of this outage in the morning, and we will share details if available.

[Update 7/21/20 3:15 PM]

Unfortunately, the planned testing of the Georgia Power Micro Grid this week has led to a loss of power in the Coda research hall, home to compute nodes for Hive & testflight-coda. Any running jobs on those clusters will have failed at this time. Access to login nodes and storage, housed in the Coda enterprise hall, is uninterrupted.

We are sorry for what we know is a significant interruption to your work.

We will follow up with users who had jobs running at the time of the power outage to provide more specific information.

At this time, teams are working to restore power to the system. We will provide an update when available.

 

[Update 7/14/20 4:00 PM]

Georgia Power will be conducting additional bypass tests for the MicroGrid power generation facility for the Coda datacenter (Hive & testflight-coda clusters) during the week of July 20-24. These tests represent a slightly higher risk of disruption than the tests conducted in June, but the risk has been substantially lowered by additional testing last month.

As before, we do not expect any disruption to PACE compute resources. PACE’s storage and head nodes have UPS and generator backup power, but compute nodes do not. In the event of an unexpected complication during testing, compute nodes could lose power for a brief period, disrupting running jobs. Georgia Power, DataBank, OIT’s network team, and PACE will all have staff on standby during these tests to ensure a quick repair in the event of an unexpected outage.

Please contact us at pace-support@oit.gatech.edu with any questions.

Visit http://blog.pace.gatech.edu/?p=6778 for full details on this power testing.

[Resolved] PACE License Server Outage

The PACE license server experienced an outage earlier this afternoon, which has since been resolved.

The following software licenses were not available on PACE during the outage: Intel compiler, Gurobi, Allinea, PGI. If you experienced difficulty accessing these services earlier today, please retry your job at this time.

The outage did not affect the College of Engineering license server, which hosts campus-wide licenses for some licensed software widely used on PACE, including MATLAB, COMSOL, Abaqus, and Ansys.

Please contact us at pace-support@oit.gatech.edu with any questions.

[Resolved] Emergency Shutdown of all Compute Nodes, Schedulers, and Login Nodes in Rich Data Center

[Update – June 28, 2020, 2:42pm]

We are following up with another update. Campus cooling is currently configured to support buildings as well as possible, though it is not in normal operation. Facilities has indicated to us that we should be able to resume operation.

According to the most recent news from Atlanta Water, they have isolated the 36″ water main failure and are working on repairs, which may conclude late Wednesday at the earliest and Friday at the latest.

State of PACE: We have brought compute nodes online along with the remaining services. As is typical after an event like this, a few nodes require specific manual attention, and we will continue to work on bringing back those straggling nodes. We will contact the users whose jobs were terminated due to yesterday’s emergency shutdown. We encourage all users to verify their recent jobs. Again, our storage system did not lose data.
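To verify a recent job, commands along these lines may help (replace 12345 with your own job ID; how long completed jobs remain visible to qstat depends on the scheduler’s configuration):

# Show the full record for a specific job, if it is still known to the scheduler
qstat -f 12345

# List all of your current jobs in one view
qstat -u $USER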

Monitoring and Risk: OIT Operations staff will continue to monitor the temperature and cooling systems and will alert us upon any major change. PACE will remain on standby should we need to shut down services again if we are unable to maintain cooling.

The Coda data center, which houses the TestFlight-Coda and Hive clusters, and our backup data facilities are not affected by this outage.

Thank you again for your patience while we address emergency operations.

[Update – June 27, 2020, 9:36pm]

Water pressure and cooling have been partially restored at the Rich data center. During this emergency shutdown, our storage did not experience data loss. At this time, we have partially restored services to the cluster login nodes, and we continue to work on restoring the gryphon login node. We have restored storage, schedulers, and data mover/Globus services.

For safety, we will keep the compute nodes offline overnight, and we aim to begin restoring the compute nodes on Sunday, June 28, along with any remaining services.

Thank you for your patience as we work through this incident.

 

[Original Note – June 27, 2020, 4:22pm]

Dear PACE Users,

There has been a water main break on a 36-inch transmission main at Ferst Dr NW and Hemphill Ave NW, causing a loss of water pressure to the campus chiller plants that provide cooling to the Rich and other data centers. GT Facilities is in the process of shutting down the chiller plants. The Operations team is monitoring the temperature in Rich and beginning to deploy spot chillers.

This issue does not impact the Coda datacenter (Hive and testflight-coda clusters).

We are initiating an emergency shutdown of Rich resources to prevent overheating. This will impact running jobs. We will keep storage systems online as long as possible, but may need to power them down as the situation requires.

Please save your work if possible, and refrain from submitting new jobs. We’ll keep you updated via email and the PACE blog as we continue to monitor developments.

[Resolved] Issue with InfiniBand Fabric and subnet managers

Early today, the InfiniBand Fabric located in the Rich Datacenter (where most PACE resources are located) developed issues reaching the subnet managers. After on-site troubleshooting, the subnet manager was initialized. As of 11:30 AM local time, the InfiniBand Fabric is operational.

Some running jobs may have been affected during the outage period, and new jobs using MPI may also have experienced issues.

Please check your jobs for any potential issues. We deeply apologize for any inconvenience this may have caused.

DNS/DHCP maintenance

OIT will be conducting scheduled maintenance on Thursday, June 25, 5:00 – 8:00 AM to patch gtipam and DNS/DHCP servers. Due to redundant servers, the risk of any interruption to PACE is very low. If there is an interruption, you may find yourself unable to connect to PACE or lose your open connection to a login node, interactive job, VNC session, or Jupyter notebook. Running batch jobs should not be affected, even in the event of an interruption.
Please contact us at pace-support@oit.gatech.edu with any questions.

Emergency Network Maintenance Tomorrow (6/18)

[Update 6/19/20 12:10 PM]

The network team is beginning additional emergency network maintenance immediately (at noon today), continuing through 7 PM this evening, to reverse changes from yesterday evening. It will have the same effect as yesterday’s outage, so you will likely lose your VPN connection and/or PACE connection at some point this afternoon during intermittent outages.

Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Post]
The GT Network team will be performing emergency maintenance on the campus firewall beginning at 8 PM tomorrow (Thursday) night, with targeted completion by 2AM Friday morning. Although every effort is being made to avoid outages, this maintenance may cause two interruptions:
  • At some point during this maintenance, users may experience up to a 20-minute interruption that will cause any open VPN connection to close. If you are connected to PACE from off-campus via the GT VPN during the interruption, you will likely lose your connection, and any terminal session, interactive job, VNC session, or running Jupyter notebook will be interrupted. Please be prepared for such an interruption when working tomorrow evening. Note that this may also interrupt any connection you have made over the GT VPN to non-PACE locations. Connections to PACE from within the campus firewalls may also be interrupted, which means that resources outside of PACE required for PACE jobs, such as queries to some software licenses used on PACE, including MATLAB or COMSOL, may be interrupted.  Batch jobs already running on PACE should not be affected.
  • In addition, about midway through the maintenance, there will be a period of approximately 20-30 minutes where authentication will be unavailable. This will prevent any new connections to the VPN, to PACE, and to any cloud service that authenticates using GT credentials.  It is also possible for this interruption to cause new job starts to fail due to the loss of access to the authentication service.

We will alert you if there is any change of plans for this emergency maintenance.

Please contact us at pace-support@oit.gatech.edu with any questions.

Emergency Network Maintenance

The GT Network team will be performing emergency maintenance on the campus firewall beginning at 8 PM tonight, with targeted completion by midnight. At some point during this maintenance, users will experience a brief interruption that will cause any open VPN connection to close. If you are connected to PACE from off-campus via the GT VPN during the interruption, you will lose your connection, and any terminal session, interactive job, VNC session, or running Jupyter notebook will be interrupted. Please be prepared for such an interruption when working this evening.
Note that this will also interrupt any connection you have made over the GT VPN to non-PACE locations.
Batch jobs running on PACE should not be affected, nor will connections from within the campus firewall.
We will alert you if there is any change of plans for this emergency maintenance.
Please contact us at pace-support@oit.gatech.edu with any questions.

Georgia Power Micro Grid Testing (Week of June 8)

[Update 7/14/20 4:00 PM]

Georgia Power will be conducting additional bypass tests for the MicroGrid power generation facility for the Coda datacenter (Hive & testflight-coda clusters) during the week of July 20-24. These tests represent a slightly higher risk of disruption than the tests conducted in June, but the risk has been substantially lowered by additional testing last month.

As before, we do not expect any disruption to PACE compute resources. PACE’s storage and head nodes have UPS and generator backup power, but compute nodes do not. In the event of an unexpected complication during testing, compute nodes could lose power for a brief period, disrupting running jobs. Georgia Power, DataBank, OIT’s network team, and PACE will all have staff on standby during these tests to ensure a quick repair in the event of an unexpected outage.

Please contact us at pace-support@oit.gatech.edu with any questions.

 

[Update 6/15/20 12:45 PM]

Georgia Power will continue low-risk testing of the power supply to PACE’s Hive and testflight-coda clusters in the Coda data center this week.

In addition, Georgia Power is planning further testing in Coda at a later date, and we are working with them and other stakeholders to identify the best times and lowest-risk approach for completing this work.

[Update 6/12/20 6:45 PM]

Georgia Power will continue low-risk testing of the power supply to the Coda data center next week.

[Original Post]

During the week of June 8, Georgia Power will perform a series of bypass tests for the power that feeds the Coda data center, housing PACE’s Hive and testflight-coda clusters. This is a further step in establishing a Micro Grid power generation facility for Coda, after progress during the last maintenance period.
Georgia Power has classified all of these tests as low risk, and we do not expect any disruption to PACE compute resources. PACE’s storage and head nodes have UPS and generator backup power, but compute nodes do not. In the event of an unexpected complication during testing, compute nodes could lose power for a brief period, disrupting running jobs. Georgia Power, DataBank, OIT’s network team, and PACE will all have staff on standby during these tests to ensure a quick repair in the event of an unexpected outage.
Please contact us at pace-support@oit.gatech.edu with any questions.