OIT’s Planned Network Maintenance

We are following up with an update on the schedule for OIT’s planned network maintenance.  OIT’s Network Engineering team will conduct two maintenance activities on the evening of Friday, September 11th, from 7:00pm to 4:00am.  These activities will affect PACE’s connection to the outside Internet for a period anticipated to last 30 minutes or less from the start of each activity, which may occur at any point during this updated, longer maintenance window.

What’s about to happen:  On Friday, September 11th, from 7:00pm until 4:00am (September 12th), the Network Engineering team will upgrade the data center firewall appliances to the latest code recommended by Palo Alto, whose latest release addresses serious security vulnerabilities.  To reassure you, OIT’s network team has been operating with controls in place to mitigate these vulnerabilities, and this planned upgrade will further reduce our risk.  The second maintenance activity also starts on Friday, September 11th: the Network Engineering team will migrate service to a more capable Network Address Translation (NAT) appliance in the Coda datacenter, as the appliance currently in Coda is overloaded.  These activities will affect PACE’s connection to/from the Internet.

Who is impacted: PACE users may be unable to connect to PACE resources, and/or may lose existing connections, during an interruption that may last 30 minutes or less from the start of each activity, at any point during the maintenance window.  We encourage users to avoid running interactive jobs (e.g., VNC/X11) that rely on an active SSH connection to a PACE cluster during this time frame, to avoid sudden interruptions due to a loss of connection to PACE resources.  Batch jobs that are running or queued in the PACE schedulers will operate normally; however, any jobs that require resources outside of PACE or on the Internet may be interrupted during these maintenance activities.  These maintenance activities will not affect any of the PACE storage systems.

What PACE will do:  PACE will remain on standby during these activities to monitor the systems, conduct testing, and report on any interruptions in service.  For up-to-date progress, please check Georgia Tech’s status page, https://status.gatech.edu.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

OIT’s Planned Network Maintenance

[Update – September 8, 2020, 1:31PM] 

We are following up with an update on the schedule for OIT’s planned network maintenance.  OIT’s Network Engineering team will be conducting two maintenance activities scheduled for the evening of Friday, September 11th, from 7:00pm to 4:00am. Blog entry: http://blog.pace.gatech.edu/?p=6889

[Update – August 28, 2020, 6:05PM] 

As of 4:46PM, the Office of Information Technology has postponed this evening’s data center firewall upgrade and network migration. The maintenance will be rescheduled in an effort to maintain access to systems that are critical to COVID-19 data collection and contact tracing. Please expect additional communications within the next two weeks regarding new dates.

 

[Original Note – August 26, 2020, 5:44PM] 

OIT’s Network Engineering team will conduct two maintenance activities on the evening of Friday, August 28th, starting at 8:00pm and anticipated to last 3–4 hours.  These activities will affect PACE’s connection to the outside Internet for a period anticipated to last 30 minutes or less from the start of each activity, which may occur at any point during the maintenance window.

What’s about to happen:  On Friday, August 28th, from 8:00pm to 11:59pm, the Network Engineering team will upgrade the data center firewall appliances to the latest code recommended by Palo Alto, whose latest release addresses serious security vulnerabilities.  To reassure you, OIT’s network team has been operating with controls in place to mitigate these vulnerabilities, and this planned upgrade will further reduce our risk.  The second maintenance activity also takes place on Friday, August 28th, from 8:00pm to 11:00pm: the Network Engineering team will migrate service to a more capable Network Address Translation (NAT) appliance in the Coda datacenter, as the appliance currently in Coda is overloaded.  These activities will affect PACE’s connection to/from the Internet.

Who is impacted: PACE users may be unable to connect to PACE resources, and/or may lose existing connections, during an interruption that may last 30 minutes or less from the start of each activity, at any point during the maintenance window.  We encourage users to avoid running interactive jobs (e.g., VNC/X11) that rely on an active SSH connection to a PACE cluster during this time frame, to avoid sudden interruptions due to a loss of connection to PACE resources.  Batch jobs that are running or queued in the PACE schedulers will operate normally; however, any jobs that require resources outside of PACE or on the Internet may be interrupted during these maintenance activities.  These maintenance activities will not affect any of the PACE storage systems.

What PACE will do:  PACE will remain on standby during these activities to monitor the systems, conduct testing and report on any interruptions in service.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Resolved/Monitoring] Brief Network/InfiniBand Interruptions

Dear Researchers,

As we continue to monitor our network closely after the recent issues with our network/InfiniBand fabric, we want to alert you to a brief network glitch this afternoon that impacted the connection between GPFS and the compute nodes, as well as node-to-node communication.

What happened and what we did: At 12:45pm, we began experiencing connection issues between our two main InfiniBand switches, to which GPFS and the compute nodes connect.  We observed various errors that we were able to quickly diagnose, and by 1:55pm we had resolved the issue by rebooting one of the main switches.

Who is impacted: During this brief network glitch, users may have experienced slow read/write performance and/or errors on GPFS directories from the compute nodes, which may have impacted running MPI jobs.  We encourage users to check on jobs that were running earlier this afternoon and resubmit any that may have been interrupted.

What we will continue to do: We will continue to monitor the network and report as needed.    We appreciate your continued understanding and patience during these recent network interruptions. Please rest assured that we are doing everything we can to keep this network fabric operational.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Resolved/Monitoring] GPFS – network issues

We began experiencing a network issue earlier today at approximately 2:00AM with the connection between our GPFS filesystem (data and scratch directories) and about one third of PACE’s compute nodes in the Rich datacenter. Affected nodes are on these racks, indicated by the second section of the node name (e.g., rich133-s40-20 or iw-s40-21 would be on rack s40):

b13, b14, b16, b17, c32, c34, c36, c38, g13, g14, g15, g16, g17, h31, h33, k35.

As a result of this network issue, users may have experienced slow read/write performance on GPFS directories from these nodes, which may also have impacted running MPI jobs on these nodes. We finished making a repair late this afternoon, but the slowness could return, and we are continuing to monitor the system. Thank you to the users who reported the issue today via support tickets. Please continue to contact us if the slowness returns.
If your jobs have been running on the impacted nodes and are not producing output, please cancel and resubmit them.  To check which nodes your job is running on, run the following command: qstat -u USER_NAME -n, replacing USER_NAME with your username.
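For example, with a hypothetical username of gburdell3 (substitute your own), the following lists each of your jobs along with the nodes assigned to it:

    qstat -u gburdell3 -n    # gburdell3 is a placeholder; use your own username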
If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Resolved]: PACE Maintenance Days 8/6/2020-8/8/2020

Dear PACE Users,

RESOLVED: PACE is now ready for research.

We are preparing for our next PACE maintenance period, which will begin at 6:00 AM on August 6th, 2020 and conclude at 11:59 PM on August 8th, 2020. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEMS REQUIRING USER ACTION:

– None currently.

ITEMS NOT REQUIRING USER ACTION:

– [Resolved] Coda Lustre Upgrade (this started on Wednesday, 08/05, impacting testflight-coda only; a scheduler reservation was put in place to prevent any jobs from running past 6:00AM on Wednesday, August 5).

– [Resolved] Install additional line cards for CS8500 Infiniband switch.

– [Resolved] Deploy PBSTools RPM on schedulers.

– [Resolved] Upgrade Hive Infiniband switches firmware to version 3.9.0914

– [Resolved] Upgrade Coda Infiniband director switches firmware to version 3.9.0914

– [Resolved] Move DNS appliance from Rich to Coda.

– [Resolved] Update coda-apps file system mounts to use qtrees from NetApp on all servers.

– [Deferred] Update Nvidia GPU Drivers in Coda to support Cuda 11 SDK.

– [Resolved] Reboot of all nodes.

– [Resolved] Reboot of the subnet manager.

 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

 

The PACE Team

[Resolved] Emergency Shutdown of all Compute Nodes, Schedulers, and Login Nodes in Rich Data Center

[Update – June 28, 2020, 2:42pm]

We are following up with another update.  Cooling on campus is currently configured to support buildings as well as possible, but it is not “normal” operation.  Facilities has indicated to us that we should be able to resume operation.

According to the most recent news from Atlanta Water, they have isolated the 36″ water main failure and are working on repairs that may conclude late on Wednesday at the earliest, or on Friday at the latest.

State of PACE: We have brought the compute nodes back online along with the remaining services.  As is often the case, a few nodes require specific manual action, and we will continue to work on bringing those straggling nodes back online.  We will contact the users whose jobs were terminated due to yesterday’s emergency shutdown.  We encourage all users to verify their recent jobs.  Again, our storage system did not lose data.

Monitoring and Risk: OIT Operations staff will continue to monitor the temperature and cooling systems and will alert us upon any major change.  PACE will remain on standby should we need to shut down services again if we are unable to maintain cooling.

The Coda data center, which houses the TestFlight-Coda and Hive clusters, and our backup data facilities are not affected by this outage.

Thank you again for your patience while we address emergency operations.

[Update – June 27, 2020, 9:36pm]

Water pressure and cooling have been partially restored at the Rich data center.  During this emergency shutdown, our storage did not experience data loss.  At this time, we have partially restored services on the cluster login nodes, and we continue to work on restoring the gryphon login node.  We have restored storage, schedulers, and data mover/Globus services.

For safety, we will keep the compute nodes offline overnight, and we aim to begin restoring the compute nodes on Sunday, June 28, along with any other services.

Thank you for your patience as we work through this incident.

 

[Original Note – June 27, 2020, 4:22pm]

Dear PACE Users,

There has been a water main break on a 36-inch transmission main at Ferst Dr NW and Hemphill Ave NW, causing a loss of water pressure to the campus chiller plants that provide cooling to the Rich and other data centers. GT Facilities is in the process of shutting down the chiller plants. The Operations team is monitoring temperatures in Rich and starting to deploy spot chillers.

This issue does not impact the Coda datacenter (Hive and TestFlight-Coda clusters).

We are initiating an emergency shutdown of Rich resources to prevent overheating. This will impact running jobs. We will keep the storage systems online as long as possible, but may need to power them off as the situation requires.

Please save your work if possible, and refrain from submitting new jobs. We will keep you updated via email and the PACE blog as we continue to monitor developments.

[Resolved] Issue with InfiniBand Fabric and subnet managers

Early today, the InfiniBand Fabric located in the Rich Datacenter (where most PACE resources are located) developed issues reaching the subnet managers. After on-site troubleshooting, the subnet manager was initialized. As of 11:30 AM local time, the InfiniBand Fabric is operational.

Some running jobs may have been affected during the outage, and new jobs using MPI may also have experienced issues.

Please check your jobs for any potential issues. We deeply apologize for any inconvenience that may have occurred.

OIT Network Services Team Firewall upgrades (5/5/2020)

PACE has been informed that the OIT Network Services Team is preparing for software upgrades on multiple firewall servers across the Georgia Institute of Technology Atlanta campus on 5/5/2020 20:00 – 23:59, 5/7/2020 20:00 – 23:59, 5/8/2020 19:00 – 5/9/2020 02:00. While there are no direct impacts on the Rich and Coda Datacenter networks, there is potential for interruptions in connections to license servers, which can lead to job failures. Applications which may be impacted include

  • Abaqus
  • Ansys
  • Comsol
  • Dymola
  • Matlab

and any other application that may have a license server not internal to PACE. Due to potential interruptions, please check any jobs scheduled to run during these periods. PACE apologizes for any impact on your research workflow that this may cause. 

The Network Team will report their status for the project via status.gatech.edu. Please check blog.pace.gatech.edu for updates.

[RESOLVED] Rich data center storage problems (/usr/local) – Paused Jobs

Dear PACE Users,

At this time, the issues experienced earlier have been resolved, and all schedulers have been restored to functionality. The problems addressed over the course of this fix include

  • A manual swap to the backup InfiniBand Subnet Manager to correct for a partial failure. This issue caused trouble with the GPFS storage volumes, most notably scratch due to its high activity.
  • TruNAS hpctn1 lost access to drives due to a jostled SAS cable on a drive replaced as part of a CAB “standard change”, of which we were not informed. The cable was reseated to restore connectivity.
  • Additionally, there was a missing license file on unit 1a of TruNAS hpctn1; the license file was updated accordingly.
  • Failed mounts on compute nodes were cleaned up after the aforementioned storage volumes were brought back online.

Please be advised that due to failed storage access, jobs running for the duration of this outage may have failed. Please inspect the results of jobs completed recently to ensure correctness; if an unexplained failure occurred (e.g. the job was terminated for a wallclock violation when previous iterations ran without issue),  please resubmit the job. If you have any questions or concerns, please contact pace-support@oit.gatech.edu.

Thank you.

[Original Message]

In addition to this morning’s ongoing project/data and scratch storage problems, the fileserver that serves the shared “/usr/local” filesystem on all PACE machines in the Rich Data Center started experiencing problems. This issue causes several widespread problems, including:

  • Unavailability of the PACE repository (which is in “/usr/local/pacerepov1”)
  • Crashing of newly started jobs that run applications in the PACE repository
  • Hanging of new logins

Running applications that have their executables cached in memory may continue to run without problems, but it’s very difficult to tell exactly how different applications will be impacted.

At this point, we have paused all schedulers for Rich-based resources.   With the exception of Hive and Testflight-Coda, no new jobs will be started until the situation is resolved.   Additionally, due to the slowness in accessing data and software repositories, jobs that were running may fail due to reaching wallclock limits or other errors. Updates will continue to be posted on the blog as they are available. We appreciate your patience as we work to resolve the underlying cause. Again, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

This storage problem and scheduler pause does not impact Coda data center’s Hive and TestFlight-Coda cluster.

We are working to resolve these problems ASAP and will keep you updated on this post.

Hive Cluster — Scheduler modifications/Policy Update

Dear Hive Users,

The Hive cluster has been in production for over half a year, and we are pleased with the continued growth of its user community and its consistently high utilization.  As the cluster has begun to near 100% utilization more frequently, we have received additional feedback from users that compels us to make further changes to ensure continued productivity for all Hive users.  The Hive PIs have recently approved the following changes, which will be deployed on April 10:

  1. Hive-gpu-short: We are creating a new GPU queue with a maximum walltime of 12 hours.  This queue will consist of 2 nodes migrated from the hive-gpu queue.  This addresses the longer job wait times that some users have experienced on the hive-gpu queue, and it will help users with short, interactive, or machine learning jobs further develop and grow on this cluster.
  2. Adjust dynamic priority: We will adjust the dynamic priority to reflect PI groups in addition to individual users.  This will provide an equal and fair opportunity for each research team to access the cluster.
  3. Hive-interact: We will reduce the hive-interact queue from the current 32 nodes to 16 nodes due to its low utilization.

Who is impacted:  All Hive users will be impacted by the adjustment to the dynamic priority.

User Action:  To use the new hive-gpu-short queue, update your PBS scripts to specify the new queue; the requested walltime may not exceed 12 hours.  A sketch of such a script is shown below.
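As a rough illustration only (the job name, resource request, and program below are placeholders rather than a prescribed configuration, and the GPU resource syntax may need adjusting to match PACE’s documentation), a submission script targeting the new queue might look like:

    #PBS -N gpu-short-example
    #PBS -q hive-gpu-short
    #PBS -l nodes=1:ppn=6:gpus=1
    #PBS -l walltime=12:00:00

    # Run from the directory the job was submitted from
    cd $PBS_O_WORKDIR
    ./my_gpu_program

The key points are the queue name (hive-gpu-short) and a walltime request at or under the 12-hour limit; everything else should follow your existing job scripts.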

Our documentation will be updated on April 10 to reflect these queue changes; you can access it at http://docs.pace.gatech.edu/hive/gettingStarted/. If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Best,
The PACE Team