Posts

Release of Updated PACE User Documentation

PACE is pleased to announce the release of our updated PACE User Documentation. We have updated many of our existing guides and added new guides with detailed instructions for various tasks, along with examples such as PBS scripts for applications you may want to run as batch or interactive jobs. The documentation is built using GitHub, which helps us maintain it more easily, and this modular design will mitigate any interruption in service should we decide to upgrade or change the web hosting technology for our overall site.

Over the Fall semester, we will begin phasing out the older documentation from the many pages on the PACE website and redirecting those pages to the new documentation. As some of you may recall from our prior PACE Consulting Sessions and Clusters Orientation classes, your feedback about the documentation (which was in beta at the time) was invaluable, and you will find much of it incorporated in this release. We hope you find this documentation helpful, and if you have any questions or comments, please don’t hesitate to let us know.

To access the new documentation, visit our home page or either of the links below:

https://pace.gatech.edu/pace-user-documentation

or

https://docs.pace.gatech.edu

 

PACE Ready for Research

Our August 2019 maintenance ( http://blog.pace.gatech.edu/?p=6511 ) is complete one day ahead of schedule!  We have brought compute nodes online and released previously submitted jobs.  Login nodes are accessible and your data are available.

As usual, there are a small number of straggling nodes that we will address over the coming days.

  • (Complete) Network connections to PACE-RTR will be upgraded. Connectivity in and out of the Rich Data Center will be disrupted on Friday morning. VAPOR network will not be affected.
  • (Complete) Additional space will be configured for license server.
  • (Complete) OS and application patches will be applied to Red Hat Enterprise Linux (RHEL) 7 servers, effectively upgrading to RHEL 7.6.
  • (Complete) OS and application patches will be applied to testflight nodes, to begin testing new versions of kernel and libraries.
  • (Complete) PACE management scripts and utilities will be upgraded, to improve reliability and performance.
  • (Complete) The submit filter for jobs on the RHEL 6 clusters will be modified to allow proper formatting of commands. This filter is already in place on RHEL 7 clusters.
  • (Complete) Upgrade DNS appliances; no downtime is expected due to redundant configuration.

[Resolved] Campus-wide network outage impacting PACE

A campus-wide DNS server failure occurred on the morning of Monday, August 5. OIT resolved the issue at 10:06 AM, and all PACE services should now be working normally. The failure led to a storage problem that made home directories temporarily unavailable, with symptoms such as hanging codes, commands, and login attempts.
We believe that most jobs have resumed operation after the issue was resolved, but we cannot be sure. Please check to see if you have any crashed jobs, and report any issues to pace-support@oit.gatech.edu.
For details on the DNS failure, please visit the OIT status update.

Thank you for your attention to this, and we apologize for the inconvenience.

Network outages across the GT campus

On the morning of August 1, 2019, a distribution router in the Rich data center failed around 9:22 AM, producing network outages across the GT campus. This outage included the single sign-on server, which prevented login authentication to numerous systems across campus, including PACE. OIT has identified the issue, and connectivity was restored around 9:50 AM, but issues remain.

Logins to PACE should now be possible, though intermittent issues may remain. Running and queued jobs should be unaffected. Please contact us at pace-support@oit.gatech.edu if you have any questions or persisting issues. The login failures also affected our access to view user help requests, and we apologize for any delay in responding to requests this morning.

For details on the OIT issue, please visit the link below.

https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/5d42eda56788b204bf9f11d4

We apologize for the inconvenience.

[Complete] PACE Quarterly Maintenance – August 8-10

[August 9, 2019 Update] Our August 2019 maintenance ( http://blog.pace.gatech.edu/?p=6511 ) is complete one day ahead of schedule!  We have brought compute nodes online and released previously submitted jobs.  Login nodes are accessible and your data are available.

[August 2, 2019 Update]

NO USER ACTION NEEDED ITEMS:

  • Network connections to PACE-RTR will be upgraded. Connectivity in and out of the Rich Data Center will be disrupted on Friday morning. VAPOR network will not be affected.
  • Additional space will be configured for license server.
  • OS and application patches will be applied to Red Hat Enterprise Linux (RHEL) 7 servers, effectively upgrading to RHEL 7.6.
  • OS and application patches will be applied to testflight nodes, to begin testing new versions of kernel and libraries.
  • PACE management scripts and utilities will be upgraded, to improve reliability and performance.
  • The submit filter for jobs on the RHEL 6 clusters will be modified to allow proper formatting of commands. This filter is already in place on RHEL 7 clusters.
  • Upgrade DNS appliances; no downtime is expected due to redundant configuration.

Please send questions and/or comments to pace-support@oit.gatech.edu

 

[July 23, 2019] We are preparing for a three-day maintenance period on August 8-10, 2019. It will start on Thursday, August 8, and run through Saturday, August 10.

As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.  
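The hold rule described above can be illustrated with a short sketch. It assumes the scheduler holds any job whose requested walltime would run past the start of maintenance; the exact rule the PACE scheduler applies is not stated in this post. The function name `would_be_held` and the 6:00 AM start time are hypothetical, since the announcement gives only the date.

```python
from datetime import datetime, timedelta

# Assumption: the announcement gives only the date (Thursday, August 8, 2019).
# The 6:00 AM start time is a placeholder chosen for illustration.
MAINTENANCE_START = datetime(2019, 8, 8, 6, 0)


def would_be_held(now: datetime, walltime: timedelta,
                  maintenance_start: datetime = MAINTENANCE_START) -> bool:
    """Return True if a job starting at `now` with the requested
    walltime could still be running when maintenance begins."""
    return now + walltime > maintenance_start


if __name__ == "__main__":
    now = datetime(2019, 8, 6, 12, 0)  # hypothetical submission time
    for hours in (12, 48, 72):
        held = would_be_held(now, timedelta(hours=hours))
        print(f"walltime {hours:>2}h -> {'held' if held else 'eligible to run'}")
```

Under these assumptions, requesting a shorter walltime that ends before the maintenance window is the way to keep a job eligible to start before the shutdown.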

In general, we will be upgrading all of the RHEL7 production nodes to the latest 7.6 kernel, updating connections to and from the PACE routers, and adding disk capacity to our license server. While we are still finalizing the task list and details, none of these tasks are expected to require any user action.

[Resolved] Campus wide Intermittent network outage impacting PACE

Today at around 1:55pm, OIT reported intermittent campus-wide network slowness after one of the DNS servers went down, causing trouble with authentication, GRS, and more. OIT resolved this issue as of 4:12pm, and we have recovered the storage that exports home directories, which was affected by this issue. The problem with storage caused temporary unavailability of home directories, with symptoms such as hanging codes, commands, and login attempts.

We believe that most jobs have resumed operation after the issue was resolved, but we cannot be sure. Please check to see if you have any crashed jobs, and report any issues to pace-support@oit.gatech.edu

For details on the OIT issue reported, please visit their link 

Thank you for your attention to this, and apologies for the inconvenience.

[Resolved] Dedicated Scheduler – Job Submissions Paused

[Update – July 11, 2019 – 2:45pm] The dedicated scheduler is back online and operational after we corrected the node-to-queue associations, which had been broken by a faulty configuration. We have corrected our automated procedure to prevent such an incident in the future. We have removed the pause on job submission, and you may now resume submitting your jobs. Please check on any jobs submitted since 3:30pm yesterday (7/10/2019), as many of those jobs were terminated.

Again, apologies for the inconvenience this has caused.

[Original Post – July 11, 2019 – 10:38am] Today, at approximately 10:10am, we paused job submissions to queues that are managed by the dedicated scheduler. Research teams will not be able to submit new jobs to the following queues: kennedy-lab, granulous, atlas-dufek, chow, athena-debug, cochlea, atlas-6, njord-6, atlantis, jabberwocky6, megatron, acceptance, hadoop, aces, drive, complexity, corso, blue, monkeys-k33, athena-6, core, ase1-debug-6, microbio-1, radius, medprint-6, monkeys_gpu, pampa-6, monkeys, keeneland, athena-intel, atlas-intel, apurimac-bg-6, staml, ofed-test, semap-6, martini, skade, tmlhpc-6, atlas-debug, wohler, rozell, mps, prv-5-6, aryabhata-6, hadean-gpu, epictetus, neutrons-6, davenporter, atlas, athena-8core, uranus-6, hadean, ase1-6, atlas-simon, enterprise, pampa-debug-6, skadi

We took this action to address an issue we have experienced since the evening of July 10, in which jobs were erroneously terminated after failing to reach their appropriate nodes. Pausing job submission prevents any new jobs from being terminated while we work to resolve the issue as quickly as possible. In the meantime, we ask that you refrain from submitting jobs to the queues listed above. We will follow up with an update as we work through this issue. Thank you for your attention to this, and we are sorry for the inconvenience.

Scheduled UPS Fan Replacement in Rich Data Center

[June 12, 2019 – 4:45pm] The OIT operations team notified PACE of planned maintenance on Saturday, June 15, from 7:00AM – 2:00PM to replace a fan in one of the UPS units in the Rich Data Center. This work may require that UPS to run in maintenance bypass, which would temporarily leave that room without power backup. No outage is expected from this work; however, if an outage does occur, PACE clusters and the jobs running on them are at risk of being interrupted. Again, a power outage during this maintenance period is unlikely.

For further details about this planned OIT maintenance, please visit the following  OIT link .

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

[Resolved] Temporary Network Interruption

On Friday, May 24, at 4pm, we experienced a partial failure of our primary InfiniBand (IB) subnet manager (SM) that may have impacted running and starting MPI jobs that use IP over IB. Our backup SM did not take over because the primary SM failed only partially. On Saturday, May 25, at 12:15pm, we switched to a new subnet manager and restored the network. This service outage lasted from Friday, May 24, 4:00pm – Saturday, May 25, 12:15pm. Since this network interruption may have impacted running jobs, please check whether any of your jobs crashed and report any problems to pace-support@oit.gatech.edu
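To decide which jobs are worth checking, one approach is to test whether a job's run interval overlapped the outage window given above. The following is a minimal sketch of that interval check; the function name `overlapped_outage` and the example job times are hypothetical, and obtaining real start and end times from the scheduler is left to the reader.

```python
from datetime import datetime

# Outage window as stated in the post (local time).
OUTAGE_START = datetime(2019, 5, 24, 16, 0)
OUTAGE_END = datetime(2019, 5, 25, 12, 15)


def overlapped_outage(job_start: datetime, job_end: datetime,
                      outage_start: datetime = OUTAGE_START,
                      outage_end: datetime = OUTAGE_END) -> bool:
    """Return True if the job interval [job_start, job_end)
    intersects the outage window [outage_start, outage_end)."""
    return job_start < outage_end and job_end > outage_start


if __name__ == "__main__":
    # Hypothetical jobs: (label, start, end)
    jobs = [
        ("finished before outage", datetime(2019, 5, 24, 10, 0), datetime(2019, 5, 24, 15, 0)),
        ("running when outage began", datetime(2019, 5, 24, 15, 0), datetime(2019, 5, 24, 17, 0)),
        ("started after recovery", datetime(2019, 5, 25, 13, 0), datetime(2019, 5, 25, 14, 0)),
    ]
    for label, start, end in jobs:
        status = "check this job" if overlapped_outage(start, end) else "unaffected by timing"
        print(f"{label}: {status}")
```

An overlap only indicates that a job was exposed to the outage; whether it actually crashed still has to be confirmed from its output and logs.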

PACE Ready for Research

Our May 2019 maintenance (https://blog.pace.gatech.edu/?p=6473) is complete one day ahead of schedule! We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. We are postponing the replacement of CMOS batteries on the servers due to a scheduling conflict with the vendor. As usual, there are a small number of straggling nodes that we will address over the coming days.

Compute

  • (Complete) Upgrade testflight cluster to RHEL 7.6
  • (Complete) Upgrade gemini-gpu and gemini-cpu clusters to RHEL7, which will require user action (only for gemini-cpu/gpu clusters' users)
  • (Complete) Switch nodes between chemx and gemini-cpu queues
  • (Postponed) Replace CMOS batteries on multiple servers

Network

  • (Complete) Replace a faulty InfiniBand switch, which affects a single rack with no impact to the complete fabric
  • (Complete) Migrate Rich to campus connections to 10Gbps

Storage

  • (Complete) Reboot ICE storage servers to correct issues with backup application
  • (Complete)  Perform detailed performance analysis of the GPFS environment, in order to fine tune parameters to improve performance

Other

  • (Postponed) Updates to the submit filters in the schedulers
  • (Complete) Update salt master and minions

 

If you have any questions or concerns, please contact pace-support@oit.gatech.edu