Posts

[Resolved] Emergency Switch Reboot

[Resolved 5/15/20 9:30 PM]

Extensive repairs during our quarterly maintenance period resolved remaining Infiniband issues.

[Update 5/9/20 4:20 PM]

We are continuing to work to resolve remaining connectivity issues in the Rich datacenter. We have made additional adjustments since this morning, which have improved connectivity and reliability. Read/write access to GPFS (data and scratch) at the normal rate has been restored to nearly all nodes. The few nodes with remaining difficulties have been offlined, so no new jobs will start on them, although jobs already running there may hang. However, we continue to see intermittent issues with MPI jobs on active nodes, and we will continue to investigate next week. Please check any running jobs to see whether they are producing output or hanging; if a job is hanging, please cancel it. Please resubmit any jobs that have failed, as most non-MPI and MPI jobs should work if resubmitted at this point. Keep in mind that any job with a walltime request that will not complete by 6 AM on Thursday will be held until after the scheduled maintenance period.
Thank you for your patience during this emergency repair.
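One quick way to tell whether a job is still producing output is to check the modification time of its output file. A minimal shell sketch, assuming a placeholder output file named myjob.out (substitute your job's actual output path and job ID):

```shell
# Placeholder output file; substitute your job's real output path.
OUTFILE=./myjob.out
touch "$OUTFILE"   # stand-in file so this sketch runs on its own

# If the file was modified within the last 30 minutes, the job is
# probably still writing; otherwise it may be hanging.
if [ -n "$(find "$OUTFILE" -mmin -30 2>/dev/null)" ]; then
    echo "recent output"
else
    echo "no recent output; consider canceling with: qdel <jobid>"
fi
```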

[Update 5/9/20 10:45 AM]

We are continuing to work to resolve remaining issues with connectivity in the Rich datacenter. We have deployed the replacement switch, previously planned for the upcoming maintenance period, and it has been in place since approximately 11:45 PM Friday evening. We are continuing to troubleshoot access to GPFS (data and scratch) and MPI job functionality.
Users with long-running jobs who are most affected have been contacted directly with instructions for checking the progress of their jobs.
Thank you for your patience during this emergency repair.

[Update 5/8/20 11:00 AM]

Our team worked into the early hours of this morning to complete the emergency maintenance, but we have not yet completely resolved all issues. New jobs were released to run around 1:15 AM. We are continuing to isolate and fix errors in the InfiniBand network affecting read/write on GPFS storage (data and scratch) and possibly MPI jobs. Please contact us at pace-support@oit.gatech.edu about any running jobs where you encounter slow performance, which will help us in identifying specific nodes with issues.
Many affected jobs may run more slowly than normal. In order to mitigate loss of research due to these issues, we have administratively added 24 hours to the walltime request of any job currently running. Please note that this extension will not extend job completion times beyond 6:00 AM on Thursday, when our scheduled maintenance period begins. If you resubmit a job, please keep in mind that any job that will not complete by Thursday morning will be held until after scheduled maintenance is complete.
We apologize for the disruption, and we will continue to update you on the status of this repair.
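When resubmitting, the walltime request in your PBS script determines whether the job can finish before maintenance begins. A hedged sketch (the job name and resource request are illustrative):

```shell
#PBS -N myjob                # illustrative job name
#PBS -l nodes=1:ppn=4        # illustrative resource request
#PBS -l walltime=12:00:00    # request only what can complete before 6 AM Thursday
```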

[Update 5/7/20 4:05 PM]

We encountered a complication during the reboot, and our engineers are currently working to complete the repair. We will provide updates as they become available.

[Original message]

We have an emergency need to reboot an InfiniBand switch in the Rich datacenter today, as it is likely to fail shortly without intervention. We will conduct this reboot at 3 PM today, and we expect the outage to last approximately 15 minutes. Any jobs running at 3 PM today are likely to fail if they attempt to read/write files to/from data or scratch directories during the outage or if they are employing MPI. We have stopped all new jobs from beginning in order to reduce the number of affected jobs, and we will release them after the reboot. For any job that is already running, please check the output and resubmit if your job fails. Jobs that do not read/write in the data or scratch directories during the outage window should not be affected.
We have planned a long-term repair to this equipment during next week’s maintenance period, but this emergency reboot is necessary in the meantime.
PACE resources in the Coda datacenter, including Hive and testflight-coda, will not be impacted. CUI/ITAR resources in Rich are also unaffected.

[Complete] PACE Maintenance – May 14-16

[Update 5/15/20 9:30 PM]

We are pleased to announce that our May 2020 maintenance period has completed ahead of schedule. We have restored access to computational resources, and previously queued jobs will start as resources allow. The login nodes and storage systems are now accessible.
As usual, there are a small number of straggling nodes that will require additional intervention.

A summary of the changes and actions accomplished during this maintenance period:
– (Completed) [Hive/Testflight-Coda] Georgia Power began work to establish a Micro Grid power generation facility for Coda. Power has been restored.
– (Completed) [Hive] Default modules were changed to point to pace/2020.01 from pace/2019.08, which uses an updated MPI and compiler. Users employing default modules will need to update their PBS scripts to ensure their workflows will succeed. This is described at http://docs.pace.gatech.edu/hive/software_guide/#meta-module-updates-2020.
Functionally, the default MPI and compiler are only patch updates, moving to Intel 19.0.5 (previously 19.0.3) and MVAPICH 2.3.2 (previously 2.3.1). However, the base of software under the new hierarchy has been rebuilt, which may impact specific uses of compiled and MPI-compiled applications.
Users may be impacted. The old PACE software stack remains accessible, interactively and within scripts, by running “module load pace/2019.08”.
PACE encourages users to migrate to the new defaults, as new software will continue to be added to the newest stack. Users may preserve current workflows by using the older pace module described above.
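For example, the top of a PBS script that pins the older stack might look like this sketch (the queue-independent header values are illustrative; only the module line comes from the note above):

```shell
#PBS -N myjob              # illustrative job name
#PBS -l walltime=01:00:00  # illustrative walltime

# Pin the pre-maintenance software stack explicitly:
module load pace/2019.08
# Omit the line above to pick up the new pace/2020.01 defaults.
```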

– (Completed) Performed upgrades and replacements on several InfiniBand switches in the Rich datacenter.
– (Completed) Replaced other switches and hardware in the Rich datacenter.
– (Completed) Updated software modules in Hive.
– (Completed) Updated Salt configuration management settings on all production servers.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.
Thank you for your patience!

[Update 5/13/20 10:30 AM]

We would like to remind you that we are preparing for our next PACE maintenance period, which will begin at 6:00 AM tomorrow and conclude at 11:59 PM on May 16. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.


Georgia Power will begin work to establish a Micro Grid power generation facility for Coda on Thursday, following initial testing during the February maintenance period. This means that the research hall of the Coda datacenter, including the Hive and testflight-coda clusters, will be powered down for an expected 12-14 hours. Should any issues cause delays that extend the outage for Hive and testflight-coda, users will be notified accordingly.


ITEMS REQUIRING USER ACTION:

– [Hive] Default modules will change to point to pace/2020.01 from pace/2019.08, which uses an updated MPI and compiler. Users employing default modules will need to update their PBS scripts to ensure their workflows will succeed. This is described at http://docs.pace.gatech.edu/hive/software_guide/#meta-module-updates-2020.

Functionally, the default MPI and compiler are only patch updates, moving to Intel 19.0.5 (previously 19.0.3) and MVAPICH 2.3.2 (previously 2.3.1). However, the base of software under the new hierarchy has been rebuilt, which may impact specific uses of compiled and MPI-compiled applications.

Users may be impacted. The old PACE software stack remains accessible, interactively and within scripts, by running “module load pace/2019.08”.

PACE encourages users to migrate to the new defaults, as new software will continue to be added to the newest stack. Users may preserve current workflows by using the older pace module described above.


ITEMS NOT REQUIRING USER ACTION:

– Perform upgrades and replacements on several InfiniBand switches in the Rich datacenter.

– Replace other switches and hardware in the Rich datacenter.

– Update software modules in Hive.

– Update Salt configuration management settings on all production servers.


If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[Update 5/11/20 8:30 AM]

We would like to remind you that we are preparing for our next PACE maintenance period, which will begin at 6:00 AM on May 14 and conclude at 11:59 PM on May 16. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

Georgia Power will begin work to establish a Micro Grid power generation facility for Coda on Thursday, following initial testing during the February maintenance period. This means that the research hall of the Coda datacenter, including the Hive and testflight-coda clusters, will be powered down for an expected 12-14 hours. Should any issues cause delays that extend the outage for Hive and testflight-coda, users will be notified accordingly.

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEMS REQUIRING USER ACTION:
– [Hive] Default modules will change to point to pace/2020.01 from pace/2019.08, which uses an updated MPI and compiler. Users employing default modules will need to update their PBS scripts to ensure their workflows will succeed. This is described at http://docs.pace.gatech.edu/hive/software_guide/#meta-module-updates-2020.
Functionally, the default MPI and compiler are only patch updates, moving to Intel 19.0.5 (previously 19.0.3) and MVAPICH 2.3.2 (previously 2.3.1). However, the base of software under the new hierarchy has been rebuilt, which may impact specific uses of compiled and MPI-compiled applications.
Users may be impacted. The old PACE software stack remains accessible, interactively and within scripts, by running “module load pace/2019.08”.
PACE encourages users to migrate to the new defaults, as new software will continue to be added to the newest stack. Users may preserve current workflows by using the older pace module described above.

ITEMS NOT REQUIRING USER ACTION:
– Perform upgrades and replacements on several InfiniBand switches in the Rich datacenter.
– Replace other switches and hardware in the Rich datacenter.
– Update software modules in Hive.
– Update Salt configuration management settings on all production servers.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.


[Original Post]

We are preparing for our next PACE maintenance period, which will begin at 6:00 AM on May 14 and conclude at 11:59 PM on May 16. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

Georgia Power will begin work to establish a Micro Grid power generation facility for Coda on Thursday, following initial testing during the February maintenance period. This means that the research hall of the Coda datacenter, including the Hive and testflight-coda clusters, will be powered down for an expected 12-14 hours. Should any issues cause delays that extend the outage for Hive and testflight-coda, users will be notified accordingly.

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEMS REQUIRING USER ACTION:
– [Hive] Default modules will change to point to pace/2020.01 from pace/2019.08, which uses an updated MPI and compiler. Users employing default modules will need to update their PBS scripts to ensure their workflows will succeed. A link with detailed documentation of this change and necessary action by users will be provided prior to the maintenance period.

ITEMS NOT REQUIRING USER ACTION:
– Perform upgrades and replacements on several InfiniBand switches in the Rich datacenter.
– Replace other switches and hardware in the Rich datacenter.
– Update software modules in Hive.
– Update Salt configuration management settings on all production servers.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

OIT Network Services Team Firewall upgrades (5/5/2020)

PACE has been informed that the OIT Network Services Team is preparing software upgrades on multiple firewall servers across the Georgia Institute of Technology Atlanta campus during three windows: 5/5/2020 20:00 – 23:59, 5/7/2020 20:00 – 23:59, and 5/8/2020 19:00 – 5/9/2020 02:00. While there is no direct impact on the Rich and Coda datacenter networks, there is potential for interruptions in connections to license servers, which can lead to job failures. Applications that may be impacted include

  • Abaqus
  • Ansys
  • Comsol
  • Dymola
  • Matlab

and any other application whose license server is not internal to PACE. Due to these potential interruptions, please check any jobs scheduled to run during these windows. PACE apologizes for any impact this may cause on your research workflow.

The Network Team will report their status for the project via status.gatech.edu. Please check blog.pace.gatech.edu for updates.

[Resolved again] Rich scratch mount down

[Update 4/19/20 7:15 AM]

In coordination with our support vendor, we restored access to all scratch volumes at approximately 11:30 PM last night. Users on the affected scratch volumes should check any jobs that ran yesterday and resubmit if the job failed.
We are continuing to work with the support vendor to determine the source of the issue and make hardware changes to improve reliability of the scratch system in Rich going forward. Thank you for your patience yesterday. Please contact us at pace-support@oit.gatech.edu with any remaining concerns.


[Update 4/18/20 8:00 PM]

We are experiencing ongoing issues with our scratch filesystem. Users on volumes 1, 2, and 6 of scratch are currently unable to access their scratch directories. Volumes 0, 3, 4, 5, 7, 8, and 9 are unaffected.
You can identify your scratch volume by running the command “ll” in your home directory and looking at the destination of the scratch symbolic link. The volume is the digit 0-9 that immediately precedes your username at the end of the target path.
e.g. “scratch -> /gpfs/scratch1/8/gburdell3” means that George is in scratch volume 8.
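If you prefer not to read the path by eye, the volume digit can be pulled out with shell parameter expansion. A small sketch using the hypothetical link target above (on a real login node you could obtain the target with readlink ~/scratch):

```shell
# Hypothetical link target from the example above.
link_target="/gpfs/scratch1/8/gburdell3"

# Strip everything through "scratch1/", then drop the trailing
# "/username", leaving only the volume digit.
vol="${link_target#*scratch1/}"
vol="${vol%%/*}"
echo "scratch volume: $vol"   # prints: scratch volume: 8
```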

We are currently working to repair access to scratch and will update you when that is complete. We apologize for the continued disruption.


[Update 4/18/20 5:15 PM]

We have restored access to the GPFS mounted scratch filesystem in Rich, and compute nodes are again online and accepting jobs.
During a routine disk swap this morning, one of the dual controllers needed to be restarted, which caused an unexpected disruption. The system was automatically offlined to preserve data integrity. We have recovered and verified the filesystem, and nodes are back online. Users should check any jobs that were running earlier today, especially those that were accessing scratch, and resubmit if the job failed.
A few nodes will need additional fixes and remain offline. These will be released individually as they are repaired.
Please note that systems in Coda (Hive and testflight-coda) were unaffected. CUI/ITAR clusters in Rich were also unaffected.
Again, we apologize for the disruption. Please contact us at pace-support@oit.gatech.edu with any remaining concerns.


[Original Post]

The GPFS mounted scratch system (~/scratch) in Rich is currently down again. This means that you cannot currently access your scratch directory, and jobs writing to scratch will fail.
Due to the loss of the scratch mount, most PACE nodes are now marked “down or offline” to prevent new jobs from starting and failing.
We are working to restore the mount and will update you when a repair is in place. We apologize for the disruption.

PACE systems in Coda (Hive and testflight-coda) are unaffected.

[Resolved] Scratch inaccessible on datamover node

[Update]

This issue has been resolved. We still encourage users to take advantage of Globus for an improved data transfer experience.

[Original Post]

While the scratch filesystem is once again available on the login & compute nodes, it is still inaccessible on the datamover node (iw-dm-4), which many of you use to access your files via scp or sftp protocols. Your data directories are currently available there. We always encourage you to use Globus instead of scp or sftp, and that is the best workaround at this time to move files between scratch and non-PACE locations. For instructions on using Globus, please visit http://docs.pace.gatech.edu/storage/globus/. The datamover node may eventually be decommissioned, so now is a good time to begin using Globus if you have not already done so. Please contact us at pace-support@oit.gatech.edu if you have any questions. We apologize for the ongoing disruption.

[RESOLVED] Rich data center storage problems (/usr/local) – Paused Jobs

Dear PACE Users,

At this time, the issues experienced earlier have been resolved, and all schedulers have been restored to functionality. The problems addressed over the course of this fix include

  • A manual swap to the backup InfiniBand Subnet Manager to correct for a partial failure. This issue caused trouble with the GPFS storage volumes, most notably scratch due to its high activity.
  • TruNAS hpctn1 lost access to drives due to a jostled SAS cable on a drive replaced as part of a CAB “standard change”, of which we were not informed. The cable was reseated to restore connectivity.
  • Additionally, there was a missing license file on unit 1a of TruNAS hpctn1; the license file was updated accordingly.
  • Failed mounts on compute nodes were cleaned up after the aforementioned storage volumes were brought back online.

Please be advised that due to failed storage access, jobs running for the duration of this outage may have failed. Please inspect the results of jobs completed recently to ensure correctness; if an unexplained failure occurred (e.g. the job was terminated for a wallclock violation when previous iterations ran without issue), please resubmit the job. If you have any questions or concerns, please contact pace-support@oit.gatech.edu.

Thank you.

[Original Message]

In addition to this morning’s ongoing project/data and scratch storage problems, the fileserver that serves the shared “/usr/local” directory on all PACE machines in the Rich datacenter started experiencing problems. This issue causes several widespread problems, including:

  • Unavailability of the PACE repository (which is in “/usr/local/pacerepov1”)
  • Crashing of newly started jobs that run applications in the PACE repository
  • New logins will hang

Running applications that have their executables cached in memory may continue to run without problems, but it’s very difficult to tell exactly how different applications will be impacted.

At this point, we have paused all schedulers for Rich-based resources. With the exception of Hive and Testflight-Coda, no new jobs will be started until the situation is resolved. Additionally, due to the slowness in accessing data and software repositories, jobs that were running may fail due to reaching wallclock limits or other errors. Updates will continue to be posted on the blog as they are available. We appreciate your patience as we work to resolve the underlying cause. Again, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

This storage problem and scheduler pause do not impact the Coda datacenter’s Hive and TestFlight-Coda clusters.

We are working to resolve these problems ASAP and will keep you updated on this post.

[RESOLVED] Rich Data/Project and Scratch Storage Slow Performance

[RESOLVED]:
At this time, the issues experienced earlier have been resolved, and all schedulers have been restored to functionality. The problems addressed over the course of this fix include

  • A manual swap to the backup InfiniBand Subnet Manager to correct for a partial failure. This issue caused trouble with the GPFS storage volumes, most notably scratch due to its high activity.
  • TruNAS hpctn1 lost access to drives due to a jostled SAS cable on a drive replaced as part of a CAB “standard change”, of which we were not informed. The cable was reseated to restore connectivity.
  • Additionally, there was a missing license file on unit 1a of TruNAS hpctn1; the license file was updated accordingly.
  • Failed mounts on compute nodes were cleaned up after the aforementioned storage volumes were brought back online.

Please be advised that due to failed storage access, jobs running for the duration of this outage may have failed. Please inspect the results of jobs completed recently to ensure correctness; if an unexplained failure occurred (e.g. the job was terminated for a wallclock violation when previous iterations ran without issue), please resubmit the job. If you have any questions or concerns, please contact pace-support@oit.gatech.edu.

Thank you.

[UPDATE]:
The issues from this morning’s storage problems are still ongoing. At this point, we have paused all schedulers for Rich-based resources. With the exception of Hive and Testflight-Coda, no new jobs will be started until the situation is resolved. Additionally, due to the slowness in accessing data and software repositories, jobs that were running may fail due to reaching wallclock limits or other errors. Updates will continue to be posted on the blog as they are available. We appreciate your patience as we work to resolve the underlying cause. Again, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

[Original Post]:
We have identified slow performance in the Rich data/project and scratch storage volumes. Jobs utilizing these volumes may experience problems, so please verify results accordingly. We are actively working to resolve the issue.

PACE License Manager and Server Issues

Overnight we experienced issues with several of our servers, including our License manager, GTLib server, and the Testflight and Novazohar queues. We are actively addressing the problem, having restored functionality to the License manager and Novazohar. We are still working on Testflight, and will provide updates as they are available. As always, if you have any questions or concerns, please contact pace-support@oit.gatech.edu accordingly.

Hive Cluster — Scheduler modifications/Policy Update

Dear Hive Users,

The Hive cluster has been in production for over half a year, and we are pleased with the continued growth of its user community and its consistently high utilization. As the cluster has begun to near 100% utilization more frequently, we have received feedback from users that compels us to make additional changes to ensure continued productivity for all Hive users. The Hive PIs have approved the following changes, which will be deployed on April 10:

  1. Hive-gpu-short: We are creating a new GPU queue with a maximum walltime of 12 hours. This queue will consist of 2 nodes migrated from the hive-gpu queue. It will address the longer job wait times that some users experienced on the hive-gpu queue and will support users running short, interactive, or machine-learning jobs.
  2. Adjust dynamic priority: We will adjust the dynamic priority to account for PI groups in addition to individual users. This will provide an equal and fair opportunity for each research team to access the cluster.
  3. Hive-interact: We will reduce the hive-interact queue from 32 nodes to 16 due to its low utilization.

Who is impacted: All Hive users will be affected by the adjustment to the dynamic priority.

User Action: To use the new hive-gpu-short queue, update your PBS scripts to specify the new queue with a walltime request that does not exceed 12 hours.
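A hedged sketch of the relevant PBS directives (the queue name and walltime limit come from the announcement above; the resource request is illustrative):

```shell
#PBS -q hive-gpu-short         # new queue announced above
#PBS -l walltime=12:00:00      # must not exceed the 12-hour maximum
#PBS -l nodes=1:ppn=4:gpus=1   # illustrative resource/GPU request
```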

Our documentation will be updated on April 10 to reflect these changes to queues that you will be able to access from the http://docs.pace.gatech.edu/hive/gettingStarted/. If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Best,
The PACE Team

Emergency Firewall Maintenance

Dear Researchers,

The GT network team will undertake an emergency code upgrade on the departmental Palo Alto firewalls beginning at 8 PM tonight. Because this is a high-availability pair of devices, the upgrade should not cause a major disruption to any traffic to or from the PACE systems. The same upgrade has already been accomplished successfully on other firewall devices with the same hardware and software versions, without any observed disruptions.

With that said, there is a possibility that connections to the PACE login servers may see a temporary interruption between 8 PM and 11 PM tonight as the firewalls are upgraded. This should not impact any running jobs unless a request to a license server elsewhere on campus (e.g., abaqus) happens to coincide with the exact moment of the firewall changeover. Additionally, there is a possibility that users may experience interruptions during interactive sessions (e.g., edit sessions, screen, VNC jobs, Jupyter notebooks). Batch jobs that are already scheduled and/or running on the clusters should otherwise progress normally.

Please check the status and completion of your jobs that have run this evening for any unexpected errors, and re-submit should you believe an interruption was the cause. We apologize in advance for any inconvenience this required emergency code upgrade may cause.

You may follow the status of this maintenance at GT’s status page.

As always, if you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team