[Complete] PACE Maintenance – May 14-16

[Update 5/15/20 9:30 PM]

We are pleased to announce that our May 2020 maintenance period has completed ahead of schedule. We have restored access to computational resources, and previously queued jobs will start as resources allow. The login nodes and storage systems are now accessible.
As usual, there are a small number of straggling nodes that will require additional intervention.

A summary of the changes and actions accomplished during this maintenance period:
– (Completed) [Hive/Testflight-Coda] Georgia Power began work to establish a Micro Grid power generation facility for Coda. Power has been restored.
– (Completed) [Hive] Default modules were changed from pace/2019.08 to pace/2020.01, which uses an updated MPI and compiler. Users employing default modules will need to update their PBS scripts to ensure their workflows will succeed. This is described at http://docs.pace.gatech.edu/hive/software_guide/#meta-module-updates-2020.
Functionally, the default MPI and compiler are only patch updates, with the use of Intel 19.0.5 (previously 19.0.3) and MVAPICH 2.3.2 (previously 2.3.1). However, the base of software under the new hierarchy has been rebuilt, which may impact some compiled and MPI-compiled applications.
Users may be impacted. The old PACE software basis remains accessible, both interactively and within scripts, by running “module load pace/2019.08”.
PACE encourages users to migrate to the new defaults, as new software will continue to be added to the newest software basis. Users may preserve current workflows by using the older pace module described above; a minimal job-script sketch follows.
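
The sketch below assumes a standard Torque/PBS submission script; the resource request and application name are illustrative, not taken from PACE documentation:

  #PBS -l nodes=1:ppn=4        # illustrative resource request
  #PBS -l walltime=1:00:00     # illustrative walltime

  cd $PBS_O_WORKDIR
  module load pace/2019.08     # pin the previous software basis
  # ...then load compiler/MPI/application modules as in your existing workflow...
  mpirun ./my_app              # hypothetical application binary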

– (Completed) Performed upgrades and replacements on several InfiniBand switches in the Rich datacenter.
– (Completed) Replaced other switches and hardware in the Rich datacenter.
– (Completed) Updated software modules in Hive.
– (Completed) Updated Salt configuration management settings on all production servers.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.
Thank you for your patience!

[Update 5/13/20 10:30 AM]

We would like to remind you that we are preparing for our next PACE maintenance period, which will begin at 6:00 AM tomorrow and conclude at 11:59 PM on May 16. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.
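
A quick way to check whether your queued jobs are being held for the maintenance window (a sketch; output columns vary slightly by scheduler configuration):

  qstat -u $USER
  # jobs held for maintenance show state "H"; they return to "Q"
  # and start once maintenance completes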

Georgia Power will begin work on Thursday to establish a Micro Grid power generation facility for Coda, following initial testing during the February maintenance period. This means that the research hall of the Coda datacenter, including the Hive and testflight-coda clusters, will be powered down for an expected 12-14 hours. Should any issues arise that extend the outage for Hive and testflight-coda, users will be notified accordingly.

ITEMS REQUIRING USER ACTION:

– [Hive] Default modules will change from pace/2019.08 to pace/2020.01, which uses an updated MPI and compiler. Users employing default modules will need to update their PBS scripts to ensure their workflows will succeed. This is described at http://docs.pace.gatech.edu/hive/software_guide/#meta-module-updates-2020.

Functionally, the default MPI and compiler are only patch updates, with the use of Intel 19.0.5 (previously 19.0.3) and MVAPICH 2.3.2 (previously 2.3.1). However, the base of software under the new hierarchy has been rebuilt, which may impact some compiled and MPI-compiled applications.

Users may be impacted. The old PACE software basis remains accessible, both interactively and within scripts, by running “module load pace/2019.08”.

PACE encourages users to migrate to the new defaults, as new software will continue to be added to the newest software basis. Users may preserve current workflows by using the older pace module described above.

ITEMS NOT REQUIRING USER ACTION:

– Perform upgrades and replacements on several InfiniBand switches in the Rich datacenter.

– Replace other switches and hardware in the Rich datacenter.

– Update software modules in Hive.

– Update Salt configuration management settings on all production servers.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[Update 5/11/20 8:30 AM]

We would like to remind you that we are preparing for our next PACE maintenance period, which will begin at 6:00 AM on May 14 and conclude at 11:59 PM on May 16. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

Georgia Power will begin work on Thursday to establish a Micro Grid power generation facility for Coda, following initial testing during the February maintenance period. This means that the research hall of the Coda datacenter, including the Hive and testflight-coda clusters, will be powered down for an expected 12-14 hours. Should any issues arise that extend the outage for Hive and testflight-coda, users will be notified accordingly.

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEMS REQUIRING USER ACTION:
– [Hive] Default modules will change from pace/2019.08 to pace/2020.01, which uses an updated MPI and compiler. Users employing default modules will need to update their PBS scripts to ensure their workflows will succeed. This is described at http://docs.pace.gatech.edu/hive/software_guide/#meta-module-updates-2020.
Functionally, the default MPI and compiler are only patch updates, with the use of Intel 19.0.5 (previously 19.0.3) and MVAPICH 2.3.2 (previously 2.3.1). However, the base of software under the new hierarchy has been rebuilt, which may impact some compiled and MPI-compiled applications.
Users may be impacted. The old PACE software basis remains accessible, both interactively and within scripts, by running “module load pace/2019.08”.
PACE encourages users to migrate to the new defaults, as new software will continue to be added to the newest software basis. Users may preserve current workflows by using the older pace module described above.

ITEMS NOT REQUIRING USER ACTION:
– Perform upgrades and replacements on several InfiniBand switches in the Rich datacenter.
– Replace other switches and hardware in the Rich datacenter.
– Update software modules in Hive.
– Update Salt configuration management settings on all production servers.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[Original Post]

We are preparing for our next PACE maintenance period, which will begin at 6:00 AM on May 14 and conclude at 11:59 PM on May 16. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

Georgia Power will begin work on Thursday to establish a Micro Grid power generation facility for Coda, following initial testing during the February maintenance period. This means that the research hall of the Coda datacenter, including the Hive and testflight-coda clusters, will be powered down for an expected 12-14 hours. Should any issues arise that extend the outage for Hive and testflight-coda, users will be notified accordingly.

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEMS REQUIRING USER ACTION:
– [Hive] Default modules will change from pace/2019.08 to pace/2020.01, which uses an updated MPI and compiler. Users employing default modules will need to update their PBS scripts to ensure their workflows will succeed. A link with detailed documentation of this change and the necessary user actions will be provided prior to the maintenance period.

ITEMS NOT REQUIRING USER ACTION:
– Perform upgrades and replacements on several InfiniBand switches in the Rich datacenter.
– Replace other switches and hardware in the Rich datacenter.
– Update software modules in Hive.
– Update Salt configuration management settings on all production servers.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

OIT Network Services Team Firewall upgrades (5/5/2020)

PACE has been informed that the OIT Network Services Team is preparing software upgrades on multiple firewall servers across the Georgia Institute of Technology Atlanta campus during three windows: 5/5/2020 20:00 – 23:59, 5/7/2020 20:00 – 23:59, and 5/8/2020 19:00 – 5/9/2020 02:00. While there is no direct impact on the Rich and Coda datacenter networks, there is potential for interruptions in connections to license servers, which can lead to job failures. Applications that may be impacted include

  • Abaqus
  • Ansys
  • Comsol
  • Dymola
  • Matlab

and any other application whose license server is not internal to PACE. Due to these potential interruptions, please check any jobs scheduled to run during these periods. PACE apologizes for any impact this may have on your research workflow.

The Network Team will report their status for the project via status.gatech.edu. Please check blog.pace.gatech.edu for updates.

[Resolved again] Rich scratch mount down

[Update 4/19/20 7:15 AM]

In coordination with our support vendor, we restored access to all scratch volumes at approximately 11:30 PM last night. Users on the affected scratch volumes should check any jobs that ran yesterday and resubmit if the job failed.
We are continuing to work with the support vendor to determine the source of the issue and make hardware changes to improve reliability of the scratch system in Rich going forward. Thank you for your patience yesterday. Please contact us at pace-support@oit.gatech.edu with any remaining concerns.

[Update 4/18/20 8:00 PM]

We are experiencing ongoing issues with our scratch filesystem. Users on volumes 1, 2, and 6 of scratch are currently unable to access their scratch directories. Volumes 0, 3, 4, 5, 7, 8, and 9 are unaffected.
You can identify your scratch volume by running the command “ll” in your home directory and looking at the destination of the scratch symbolic link. The volume number is the single digit (0-9) immediately before your username at the end of the path.
e.g. “scratch -> /gpfs/scratch1/8/gburdell3” means that George is in scratch volume 8.
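
Equivalently, a one-line sketch (assuming the symlink in your home directory is named “scratch”):

  readlink ~/scratch
  # prints e.g. /gpfs/scratch1/8/gburdell3, i.e. volume 8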

We are currently working to repair access to scratch and will update you when that is complete. We apologize for the continued disruption.

[Update 4/18/20 5:15 PM]

We have restored access to the GPFS-mounted scratch filesystem in Rich, and compute nodes are again online and accepting jobs.
During a routine disk swap this morning, one of the dual controllers needed to be restarted, which caused an unexpected disruption. The system was automatically offlined to preserve data integrity. We have recovered and verified the filesystem, and nodes are back online. Users should check any jobs that were running earlier today, especially those that were accessing scratch, and resubmit if the job failed.
A few nodes will need additional fixes and remain offline. These will be released individually as they are repaired.
Please note that systems in Coda (Hive and testflight-coda) were unaffected. CUI/ITAR clusters in Rich were also unaffected.
Again, we apologize for the disruption. Please contact us at pace-support@oit.gatech.edu with any remaining concerns.

[Original Post]

The GPFS-mounted scratch system (~/scratch) in Rich is currently down again. This means that you cannot currently access your scratch directory, and jobs writing to scratch will fail.
Due to the loss of the scratch mount, most PACE nodes are now marked “down or offline” to prevent new jobs from starting and failing.
We are working to restore the mount and will update you when a repair is in place. We apologize for the disruption.

PACE systems in Coda (Hive and testflight-coda) are unaffected.

[Resolved] Scratch inaccessible on datamover node

[Update]

This issue has been resolved. We still encourage users to take advantage of Globus for an improved data transfer experience.

[Original Post]

While the scratch filesystem is once again available on the login & compute nodes, it is still inaccessible on the datamover node (iw-dm-4), which many of you use to access your files via scp or sftp protocols. Your data directories are currently available there. We always encourage you to use Globus instead of scp or sftp, and that is the best workaround at this time to move files between scratch and non-PACE locations. For instructions on using Globus, please visit http://docs.pace.gatech.edu/storage/globus/. The datamover node may eventually be decommissioned, so now is a good time to begin using Globus if you have not already done so. Please contact us at pace-support@oit.gatech.edu if you have any questions. We apologize for the ongoing disruption.

[RESOLVED] Rich data center storage problems (/usr/local) – Paused Jobs

Dear PACE Users,

At this time, the issues experienced earlier have been resolved, and all schedulers have been restored to functionality. The problems addressed over the course of this fix include

  • A manual swap to the backup InfiniBand Subnet Manager to correct for a partial failure. This issue caused trouble with the GPFS storage volumes, most notably scratch due to its high activity.
  • TrueNAS hpctn1 lost access to drives due to a jostled SAS cable on a drive replaced as part of a CAB “standard change”, of which we were not informed. The cable was reseated to restore connectivity.
  • Additionally, there was a missing license file on unit 1a of TrueNAS hpctn1; the license file was updated accordingly.
  • Failed mounts on compute nodes were cleaned up after the aforementioned storage volumes were brought back online.

Please be advised that due to failed storage access, jobs running for the duration of this outage may have failed. Please inspect the results of jobs completed recently to ensure correctness; if an unexplained failure occurred (e.g. the job was terminated for a wallclock violation when previous iterations ran without issue), please resubmit the job. If you have any questions or concerns, please contact pace-support@oit.gatech.edu.

Thank you.

[Original Message]

In addition to this morning’s ongoing project/data and scratch storage problems, our fileserver that serves the shared “/usr/local” on all PACE machines in the Rich Data Center started experiencing problems. This issue causes several widespread problems, including:

  • Unavailability of the PACE repository (which is in “/usr/local/pacerepov1”)
  • Crashing of newly started jobs that run applications in the PACE repository
  • Hanging of new logins

Running applications that have their executables cached in memory may continue to run without problems, but it’s very difficult to tell exactly how different applications will be impacted.

At this point, we have paused all schedulers for Rich-based resources. With the exception of Hive and Testflight-Coda, no new jobs will be started until the situation is resolved. Additionally, due to the slowness in accessing data and software repositories, jobs that were running may fail due to reaching wallclock limits or other errors. Updates will continue to be posted on the blog as they become available. We appreciate your patience as we work to resolve the underlying cause. Again, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

This storage problem and scheduler pause do not impact the Coda data center’s Hive and testflight-coda clusters.

We are working to resolve these problems ASAP and will keep you updated on this post.

[RESOLVED] Rich Data/Project and Scratch Storage Slow Performance

[RESOLVED]:
At this time, the issues experienced earlier have been resolved, and all schedulers have been restored to functionality. The problems addressed over the course of this fix include

  • A manual swap to the backup InfiniBand Subnet Manager to correct for a partial failure. This issue caused trouble with the GPFS storage volumes, most notably scratch due to its high activity.
  • TrueNAS hpctn1 lost access to drives due to a jostled SAS cable on a drive replaced as part of a CAB “standard change”, of which we were not informed. The cable was reseated to restore connectivity.
  • Additionally, there was a missing license file on unit 1a of TrueNAS hpctn1; the license file was updated accordingly.
  • Failed mounts on compute nodes were cleaned up after the aforementioned storage volumes were brought back online.

Please be advised that due to failed storage access, jobs running for the duration of this outage may have failed. Please inspect the results of jobs completed recently to ensure correctness; if an unexplained failure occurred (e.g. the job was terminated for a wallclock violation when previous iterations ran without issue), please resubmit the job. If you have any questions or concerns, please contact pace-support@oit.gatech.edu.

Thank you.

[UPDATE]:
The issues from this morning’s storage problems are still ongoing. At this point, we have paused all schedulers for Rich-based resources. With the exception of Hive and Testflight-Coda, no new jobs will be started until the situation is resolved. Additionally, due to the slowness in accessing data and software repositories, jobs that were running may fail due to reaching wallclock limits or other errors. Updates will continue to be posted on the blog as they are available. We appreciate your patience as we work to resolve the underlying cause. Again, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

[Original Post]:
We have identified slow performance in the Rich data/project and scratch storage volumes. Jobs utilizing these volumes may experience problems, so please verify results accordingly. We are actively working to resolve the issue.

PACE License Manager and Server Issues

Overnight, we experienced issues with several of our servers, including our license manager, the GTLib server, and the Testflight and Novazohar queues. We are actively addressing the problem and have restored functionality to the license manager and Novazohar. We are still working on Testflight and will provide updates as they become available. As always, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

Hive Cluster — Scheduler modifications/Policy Update

Dear Hive Users,

The Hive cluster has been in production for over half a year, and we are pleased with the continued growth of its user community and its consistently high utilization. As the cluster has begun to near 100% utilization more frequently, we have received additional feedback from users that compels us to make changes to ensure continued productivity for all users of Hive. Recently, Hive PIs approved the following changes, which will be deployed on April 10:

  1. Hive-gpu-short: We are creating a new GPU queue with a maximum walltime of 12 hours. This queue will consist of 2 nodes migrated from the hive-gpu queue. It will address the longer job wait times that some users experienced on the hive-gpu queue and will better serve short, interactive, and machine-learning jobs on this cluster.
  2. Adjust dynamic priority: We will adjust the dynamic priority to reflect PI groups in addition to individual users. This will provide an equal and fair opportunity for each research team to access the cluster.
  3. Hive-interact: We will reduce the hive-interact queue from the current 32 nodes to 16 nodes due to its low utilization.

Who is impacted: All Hive users will be impacted by the adjustment to the dynamic priority.

User Action: Users who want to use the new hive-gpu-short queue will need to update their PBS scripts to request the new queue, with a walltime not exceeding 12 hours, as in the sketch below.
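
A minimal sketch of the relevant PBS directives (the node/GPU request is illustrative, not taken from PACE documentation):

  #PBS -q hive-gpu-short             # request the new queue
  #PBS -l walltime=12:00:00          # must not exceed 12 hours
  #PBS -l nodes=1:ppn=4:gpus=1       # illustrative node/GPU request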

Our documentation will be updated on April 10 to reflect these queue changes; the updated guide will be accessible from http://docs.pace.gatech.edu/hive/gettingStarted/. If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Best,
The PACE Team

Emergency Firewall Maintenance

Dear Researchers,

The GT network team will undertake an emergency code upgrade on the departmental Palo Alto firewalls beginning at 8pm tonight. Because this is a high-availability pair of devices, the upgrade should not cause a major disruption to any traffic to or from the PACE systems. The same upgrade has already been accomplished successfully on other firewall devices with the same hardware and software versions, and it caused no disruptions.

With that said, there is a possibility that connections to the PACE login servers may see a temporary interruption between 8pm and 11pm TONIGHT as the firewalls are upgraded. This should not impact any running jobs unless a request to a license server elsewhere on campus (e.g., abaqus) happens to coincide with the exact moment of the firewall changeover. Additionally, there is a possibility that users may experience interruptions during interactive sessions (e.g., edit sessions, screen, VNC jobs, Jupyter notebooks). Batch jobs that are already scheduled and/or running on the clusters should otherwise progress normally.

Please check the status and completion of any jobs that ran this evening for unexpected errors, and re-submit should you believe an interruption was the cause. We apologize in advance for any inconvenience this required emergency code upgrade may cause.

You may follow the status of this maintenance at GT’s status page (status.gatech.edu).

As always, if you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

[RESOLVED] RHEL7 Dedicated Scheduler Down

[RESOLVED] We have restored functionality to the RHEL7 dedicated scheduler. Thank you for your patience.

[UPDATE] The RHEL7 dedicated scheduler, accessed via login7-d, is again down. We are actively working to resolve the issue at this time, and we will update you when the scheduler is restored. Please follow the same blog post (http://blog.pace.gatech.edu/?p=6715) for updates. If you have any questions, please contact pace-support@oit.gatech.edu.

[RESOLVED] We have rebooted the RHEL7 Dedicated scheduler, and functionality has been restored. Thank you for your patience.

[ORIGINAL MESSAGE] Roughly 30 minutes ago, we identified an issue with the scheduler for dedicated RHEL7 clusters; this scheduler is responsible for all jobs submitted from the dedicated RHEL7 headnode, login7-d. All other schedulers are operating as expected. We are actively working to resolve the problem, but in the meantime you will be unable to submit new jobs or query the status of queued or running jobs.

If you have any questions, please contact pace-support@oit.gatech.edu.