mweiner3 – Page 9 – Partnership for an Advanced Computing Environment

Emergency Network Maintenance

The GT Network team will be performing emergency maintenance on the campus firewall beginning at 8 PM tonight, with targeted completion by midnight. At some point during this maintenance, users will experience a brief interruption that will cause any open VPN connection to close. If you are connected to PACE from off-campus via the GT VPN during the interruption, you will lose your connection, and any terminal session, interactive job, VNC session, or running Jupyter notebook will be interrupted. Please be prepared for such an interruption when working this evening.

Note that this will also interrupt any connection you have made over the GT VPN to non-PACE locations.

Batch jobs running on PACE should not be affected, nor will connections from within the campus firewall.

We will alert you if there is any change of plans for this emergency maintenance.

You can find more details from the OIT status page at https://status.gatech.edu/pages/maintenance/5be9af0e5638b904c2030699/5ee24846390bc604b71b8a15.

Please contact us at pace-support@oit.gatech.edu with any questions.

Georgia Power Micro Grid Testing (Week of June 8)

[Update 7/14/20 4:00 PM]

Georgia Power will be conducting additional bypass tests for the MicroGrid power generation facility for the Coda datacenter (Hive & testflight-coda clusters) during the week of July 20-24. These tests represent a slightly higher risk of disruption than the tests conducted in June, but the risk has been substantially lowered by additional testing last month.

As before, we do not expect any disruption to PACE compute resources. PACE’s storage and head nodes have UPS and generator backup power, but compute nodes do not. In the event of an unexpected complication during testing, compute nodes could lose power for a brief period, disrupting running jobs. Georgia Power, DataBank, OIT’s network team, and PACE will all have staff on standby during these tests to ensure a quick repair in the event of an unexpected outage.

Please contact us at pace-support@oit.gatech.edu with any questions.

[Update 6/15/20 12:45 PM]

Georgia Power will continue low-risk testing of the power supply to PACE’s Hive and testflight-coda clusters in the Coda data center this week.

In addition, Georgia Power is planning further testing in CODA for a later time, and we are working with them and other stakeholders to identify the best times and lowest-risk manner for completing this work in Coda.

[Update 6/12/20 6:45 PM]

Georgia Power will continue low-risk testing of the power supply to the Coda data center next week.

[Original Post]

During the week of June 8, Georgia Power will perform a series of bypass tests for the power that feeds the Coda data center, housing PACE’s Hive and testflight-coda clusters. This is a further step in establishing a Micro Grid power generation facility for Coda, after progress during the last maintenance period.
Georgia Power has classified all of these tests as low risk, and we do not expect any disruption to PACE compute resources. PACE’s storage and head nodes have UPS and generator backup power, but compute nodes do not. In the event of an unexpected complication during testing, compute nodes could lose power for a brief period, disrupting running jobs. Georgia Power, DataBank, OIT’s network team, and PACE will all have staff on standby during these tests to ensure a quick repair in the event of an unexpected outage.
Please contact us at pace-support@oit.gatech.edu with any questions.

[Resolved] Home directory failures

[Update 5/18/20 4:25 PM]

Reliable access to home directories was restored early this afternoon. There was an issue with DNS on the GT network, where the DNS server allowing for a connection to the home and utility storage devices was reacting slowly but not completely down, so it did not fail over onto the backup server. In concert with OIT, we have reordered the DNS servers, and access is restored. Please contact us at pace-support@oit.gatech.edu with any questions.

If jobs failed due to the outage, please resubmit them to run again.

[ Issue began approximately 2 PM on 5/17/20 ]

We are experiencing an intermittent outage on PACE affecting home directories and certain other mounted utility directories. We are currently working to restore access. Thank you to those of you who reported the issue to us this afternoon. This intermittent mount failure can cause the following issues:

Home directories not loading on login nodes.
Login sessions starting with “bash” instead of “~” as the prompt and having warning messages displayed
Batch or interactive jobs failing immediately after launch due to an inability to load files with an error message such as “no such file or directory”
“pace-check-queue” and other PACE utilities failing to report information as expected
Missing home directories on file transfer utilities (scp or sftp)

For jobs that have failed, please wait until after we have completed the repair and then resubmit your jobs.

We will provide updates as they become available. Thank you for your patience.

[Resolved] Emergency Switch Reboot

[Resolved 5/15/20 9:30 PM]

Extensive repairs during our quarterly maintenance period resolved remaining Infiniband issues.

[Update 5/9/20 4:20 PM]

We are continuing to work to resolve remaining issues with connectivity in the Rich datacenter. We have made additional adjustments since this morning, which have improved connectivity and reliability. Read/write access to GPFS (data and scratch) at the normal rate has been restored to nearly all nodes, and the few nodes with remaining difficulties have been offlined, so no new jobs will start on them, although jobs that were already running may hang. However, we continue to see intermittent issues with MPI jobs on active nodes, and we will continue to investigate next week. Please check any running jobs to see if they are producing output or hanging. If they are hanging, please cancel the job. Please resubmit any jobs that have failed, as most non-MPI and MPI jobs should work if resubmitted at this point. Keep in mind that any job with a walltime request that will not complete by 6 AM on Thursday will be held until after the schedule maintenance period.
Thank you for your patience during this emergency repair.

[Update 5/9/20 10:45 AM]

We are continuing to work to resolve remaining issues with connectivity in the Rich datacenter. We have deployed the replacement switch, previously planned for the upcoming maintenance period, and it has been in place since approximately 11:45 PM Friday evening. We are continuing to troubleshoot access to GPFS (data and scratch) and MPI job functionality.
Users who are most affected with long-running jobs have been contacted directly with instructions to check progress of jobs.
Thank you for your patience during this emergency repair.

[Update 5/8/20 11:00 AM]

Our team worked into the early hours of this morning to complete the emergency maintenance, but we have not yet completely resolved all issues. New jobs were released to run around 1:15 AM. We are continuing to isolate and fix errors in the InfiniBand network affecting read/write on GPFS storage (data and scratch) and possibly MPI jobs. Please contact us at pace-support@oit.gatech.edu about any running jobs where you encounter slow performance, which will help us in identifying specific nodes with issues.
Many affected jobs may run more slowly than normal. In order to mitigate loss of research due to these issues, we have administratively added 24 hours to the walltime request of any job currently running. Please note that this extension will not extend job completion times beyond 6:00 AM on Thursday, when our scheduled maintenance period begins. If you resubmit a job, please keep in mind that any job that will not complete by Thursday morning will be held until after scheduled maintenance is complete.
We apologize for the disruption, and we will continue to update you on the status of this repair.

[Update 5/7/20 4:05 PM]

We encountered a complication during the reboot, and our engineers are currently working to complete the repair. We will provide updates as they become available.

[Original message]

We have an emergency need to reboot an InfiniBand switch in the Rich datacenter today, as it is likely to fail shortly without intervention. We will conduct this reboot at 3 PM today, and we expect the outage to last approximately 15 minutes. Any jobs running at 3 PM today are likely to fail if they attempt to read/write files to/from data or scratch directories during the outage or if they are employing MPI. We have stopped all new jobs from beginning in order to reduce the number of affected jobs, and we will release them after the reboot. For any job that is already running, please check the output and resubmit if your job fails. Jobs that do not read/write in the data or scratch directories during the outage window should not be affected.
We have planned a long-term repair to this equipment during next week’s maintenance period, but this emergency reboot is necessary in the meantime.
PACE resources in the Coda datacenter, including Hive and testflight-coda, will not be impacted. CUI/ITAR resources in Rich are also unaffected.

[Complete] PACE Maintenance – May 14-16

[Update 5/15/20 9:30 PM]

We are pleased to announce that our May 2020 maintenance period has completed ahead of schedule. We have restored access to computational resources, and previously queued jobs will start as resources allow. The login nodes and storage systems are now accessible.
As usual, there are a small number of straggling nodes that will require additional intervention.

A summary of the changes and actions accomplished during this maintenance period:
– (Completed) [Hive/Testflight-Coda] Georgia Power began work to establish a Micro Grid power generation facility for Coda. Power has been restored.
– (Completed) [Hive] Default modules were changed to point to pace/2020.01 from pace/2019.08, which uses an updated MPI and compiler. Users employing default modules will need to update their PBS scripts to ensure their workflows will succeed. This is described at http://docs.pace.gatech.edu/hive/software_guide/#meta-module-updates-2020.
Functionally, default MPI and compiler are only patch updates, with the use of Intel 19.0.5 (previously 19.0.3) and MVAPICH 2.3.2 (previously 2.3.1). However, the base of software under the new hierarchy have been rebuilt, which may impact specific use of compiled and MPI-compiled applications.
Users may be impacted. Access to the old PACE software basis can be done interactively and within scripts simply by loading “module load pace/2019.08”.
PACE encourages users to migrate to using the new defaults, as new software will continue to be added to the newest software basis. Users may preserve current workflows by using the older pace module described above.

– (Completed) Performed upgrades and replacements on several infiniband switches in the Rich datacenter.
– (Completed) Replaced other switches and hardware in the Rich datacenter.
– (Completed) Updated software modules in Hive.
– (Completed) Updated salt configuration management settings on all the production servers.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.
Thank you for your patience!

[Update 5/13/20 10:30 AM]

We would like to remind you that we are preparing for our next PACE maintenance period, which will begin at 6:00 AM tomorrow and conclude at 11:59 PM on May 16. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

Georgia Power will begin work to establish a Micro Grid power generation facility for Coda beginning on Thursday, after initial testing during the February maintenance period. This means that the research hall of the Coda datacenter, including the Hive and testflight-coda clusters, will be powered down for an expected 12-14 hours. Should any issues and resultant delays occur that extend the outage for Hive & testflight-coda, users will be notified accordingly.

ITEMS REQUIRING USER ACTION:

– [Hive] Default modules will change to point to pace/2020.01 from pace/2019.08, which uses an updated MPI and compiler. Users employing default modules will need to update their PBS scripts to ensure their workflows will succeed. This is described at http://docs.pace.gatech.edu/hive/software_guide/#meta-module-updates-2020.

Functionally, default MPI and compiler are only patch updates, with the use of Intel 19.0.5 (previously 19.0.3) and MVAPICH 2.3.2 (previously 2.3.1). However, the base of software under the new hierarchy have been rebuilt, which may impact specific use of compiled and MPI-compiled applications.

Users may be impacted. Access to the old PACE software basis can be done interactively and within scripts can be done simply by loading “module load pace/2019.08”.

PACE encourages users to migrate to using the new defaults, as new software will continue to be added to the newest software basis. Users may preserve current workflows by using the older pace module described above.

ITEMS NOT REQUIRING USER ACTION:

– Perform upgrades and replacements on several infiniband switches in the Rich datacenter.

– Replace other switches and hardware in the Rich datacenter.

– Update software modules in Hive.

– Update salt configuration management settings on all the production servers.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[Update 5/11/20 8:30 AM]

We would like to remind you that we are preparing for our next PACE maintenance period, which will begin at 6:00 AM on May 14 and conclude at 11:59 PM on May 16. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEMS REQUIRING USER ACTION:
– [Hive] Default modules will change to point to pace/2020.01 from pace/2019.08, which uses an updated MPI and compiler. Users employing default modules will need to update their PBS scripts to ensure their workflows will succeed. This is described at http://docs.pace.gatech.edu/hive/software_guide/#meta-module-updates-2020.
Functionally, default MPI and compiler are only patch updates, with the use of Intel 19.0.5 (previously 19.0.3) and MVAPICH 2.3.2 (previously 2.3.1). However, the base of software under the new hierarchy have been rebuilt, which may impact specific use of compiled and MPI-compiled applications.
Users may be impacted. Access to the old PACE software basis can be done interactively and within scripts can be done simply by loading “module load pace/2019.08”.
PACE encourages users to migrate to using the new defaults, as new software will continue to be added to the newest software basis. Users may preserve current workflows by using the older pace module described above.

ITEMS NOT REQUIRING USER ACTION:
– Perform upgrades and replacements on several infiniband switches in the Rich datacenter.
– Replace other switches and hardware in the Rich datacenter.
– Update software modules in Hive.
– Update salt configuration management settings on all the production servers.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[Original Post]

We are preparing for our next PACE maintenance period, which will begin at 6:00 AM on May 14 and conclude at 11:59 PM on May 16. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

We are still finalizing planned activities for the maintenance period. Here is a current list:

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[Resolved again] Rich scratch mount down

[Update 4/19/20 7:15 AM]

In coordination with our support vendor, we restored access to all scratch volumes at approximately 11:30 PM last night. Users on the affected scratch volumes should check any jobs that ran yesterday and resubmit if the job failed.
We are continuing to work with the support vendor to determine the source of the issue and make hardware changes to improve reliability of the scratch system in Rich going forward. Thank you for your patience yesterday. Please contact us at pace-support@oit.gatech.edu with any remaining concerns.

[Update 4/18/20 8:00 PM]

We are experiencing ongoing issues with our scratch filesystem. Users on volumes 1, 2, and 6 of scratch are currently unable to access their scratch directories. Volumes 0, 3, 4, 5, 7, 8, and 9 are unaffected.
You can identify your scratch volume by running the command “ll” in your home directory and looking for the scratch symbolic link’s destination. The volume is a digit 0-9 immediately preceding a slash and then your username at the end of the path.
e.g. “scratch -> /gpfs/scratch1/8/gburdell3” means that George is in scratch volume 8.

We are currently working to repair access to scratch and will update you when that is complete. We apologize for the continued disruption.

[Update 4/18/20 5:15 PM]

We have restored access to the GPFS mounted scratch filesystem in Rich, and compute nodes are again online and accepting jobs.
During a routine disk swap this morning, one of the dual controllers needed to be restarted, which caused an unexpected disruption. The system was automatically offlined to preserve data integrity. We have recovered and verified the filesystem, and nodes are back online. Users should check any jobs that were running earlier today, especially those that were accessing scratch, and resubmit if the job failed.
A few nodes will need additional fixes and remain offline. These will be released individually as they are repaired.
Please note that systems in Coda (Hive and testflight-coda) were unaffected. CUI/ITAR clusters in Rich were also unaffected.
Again, we apologize for the disruption. Please contact us at pace-support@oit.gatech.edu with any remaining concerns.

[Original Post]

The GPFS mounted scratch system (~/scratch) in Rich is currently down again. This means that you cannot currently access your scratch directory, and jobs writing to scratch will fail.
Due to the loss of the scratch mount, most PACE nodes are now marked “down or offline” to prevent new jobs from starting and failing.
We are working to restore the mount and will update you when a repair is in place. We apologize for the disruption.

PACE systems in Coda (Hive and testflight-coda) are unaffected.

[Resolved] Scratch inaccessible on datamover node

[Update]

This issue has been resolved. We still encourage users to take advantage of Globus for an improved data transfer experience.

[Original Post]

While the scratch filesystem is once again available on the login & compute nodes, it is still inaccessible on the datamover node (iw-dm-4), which many of you use to access your files via scp or sftp protocols. Your data directories are currently available there. We always encourage you to use Globus instead of scp or sftp, and that is the best workaround at this time to move files between scratch and non-PACE locations. For instructions on using Globus, please visit http://docs.pace.gatech.edu/storage/globus/. The datamover node may eventually be decommissioned, so now is a good time to begin using Globus if you have not already done so. Please contact us at pace-support@oit.gatech.edu if you have any questions. We apologize for the ongoing disruption.

[Resolved] Scratch filesystem issue

[Update 2/20/20 4:40 PM]

Use of the scratch filesystem is restored. It appears that the automated migration task did run but could not keep up with the rate of scratch usage. We will monitor scratch for recurrence of this issue.

Please check any running jobs for errors and resubmit if necessary.

[Original message 2/20/20 4:30 PM]

Shortly before 4 PM, we noticed that PACE’s mounted GPFS scratch filesystem (~/scratch) is experiencing an issue that is preventing users from writing to their scratch directories. Any running jobs that write to scratch may experience failures due to write errors.

The scratch filesystem writes first to SSDs, and an automated task migrates data to another location when those drives near capacity. This task did not run as expected, prompting users to receive errors that scratch was full. We have manually begun the migration and will update the blog post when scratch is again available.

We apologize for this disruption. Please contact us at pace-support@oit.gatech.edu with any concerns.

[Restored] GPFS Filesystem Issue

[Update 1/29/20 5:32 PM]

We are happy to report that our GPFS filesystem was restored to functionality early this afternoon. Our CI team was able to identify a failed switch as the source of problems on a group of nodes. We restored the switch, and we are investigating the deployment of improved backup systems to handle such cases in the future.

We apologize for the recent issues you have faced. As always, please send an email to pace-support@oit.gatech.edu with any concerns, so we can investigate.

[Original Post 1/28/20 12:46 PM]

We have been experiencing intermittent disruptions on our GPFS filesystem, especially on the mounted GPFS scratch (i.e., ~/scratch) filesystem, since yesterday. The PACE team is actively investigating the source of this issue, and we are working with our support vendor to restore the system to full functionality. A number of users have reported slow reads of files, hanging commands, and jobs that run more slowly than usual or do not appear to progress. We apologize for any interruptions you may be experiencing on PACE resources at this time, and we will alert you when the issue is resolved.

Globus authentication and endpoints

We became aware this morning of an issue with Globus authentication to the “gatechpace#datamover” endpoint that many of you use to transfer files to/from PACE resources. We are working to repair this right now, but please use the “PACE Internal” endpoint instead. This endpoint provides access to the same filesystem that you use with the datamover endpoint (plus PACE Archive storage, for those who have signed up for our archive service). Going forward, you may continue to use this newer endpoint instead of the older datamover one, even once we have datamover functioning again soon. For full instructions on using Globus with PACE, visit our Globus documentation page. PACE Internal functions in exactly the same way as gatechpace#datamover when interacting with Globus.

Please keep in mind that Globus is the best way to transfer files to/from PACE resources. Contact us at pace-support@oit.gatech.edu if you have any questions about using Globus.

Partnership for an Advanced Computing Environment

Author: mweiner3

Emergency Network Maintenance

Georgia Power Micro Grid Testing (Week of June 8)

[Resolved] Home directory failures

[Resolved] Emergency Switch Reboot

[Complete] PACE Maintenance – May 14-16

[Resolved again] Rich scratch mount down

[Resolved] Scratch inaccessible on datamover node

[Resolved] Scratch filesystem issue

[Restored] GPFS Filesystem Issue

Globus authentication and endpoints

Georgia Institute of Technology