Posts

[Resolved] GPFS outage on Red Hat 7 queues

Around 3:30 AM today, a number of nodes in several queues running the Red Hat 7 operating system failed to mount GPFS, our project (data) and scratch storage system. The affected nodes were taken offline and were unavailable for jobs. We repaired them at approximately 9:30 AM, and all queues should now be functioning normally. Any jobs that were held should have started. Please check your overnight jobs for errors.

The following queues were impacted:
atlas-he
ece-gpu
flamel-gpu
gaanam-gpu
gemini-cpu
gemini-gpu
megatron
ml_gpu
sake
skylake-test
starscream
swarm
swarm-gpu
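
If you track your own jobs in a script, a minimal sketch such as the one below may help you pick out the ones worth reviewing. It only compares queue names against the list above; the function name `jobs_to_review` and the `(job_id, queue_name)` input format are assumptions made for illustration and are not part of any PACE tool.

```python
# Queues named in this notice as impacted by the GPFS mount failure.
IMPACTED_QUEUES = {
    "atlas-he", "ece-gpu", "flamel-gpu", "gaanam-gpu", "gemini-cpu",
    "gemini-gpu", "megatron", "ml_gpu", "sake", "skylake-test",
    "starscream", "swarm", "swarm-gpu",
}


def jobs_to_review(jobs):
    """Return the IDs of jobs whose queue appears in IMPACTED_QUEUES.

    `jobs` is a hypothetical iterable of (job_id, queue_name) pairs that
    you assemble yourself, for example from your own submission records.
    """
    return [job_id for job_id, queue in jobs if queue.strip() in IMPACTED_QUEUES]


if __name__ == "__main__":
    # Made-up job IDs and queue assignments, for illustration only.
    example_jobs = [("1001", "swarm"), ("1002", "some-other-queue"), ("1003", "ml_gpu")]
    print(jobs_to_review(example_jobs))
```

A job flagged this way is not necessarily broken; it is simply one whose output and error files are worth a look.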

Should you notice the problem recur, or if you have any other concerns, please contact us at pace-support@oit.gatech.edu, and we will be happy to help you. We apologize for the inconvenience this morning.

New PACE Team Members and New Team Member Roles

Dear Researchers,

PACE is pleased to announce the newest additions to the PACE team and to recognize team members who have started new roles at PACE.

In the spring, our Software and Collaboration Support team grew with the addition of Dr. Kevin Manalo. Kevin is a proud Georgia Tech graduate who cannot hold back his excitement about joining PACE, whose clusters he relied on heavily during his PhD research! Kevin comes to PACE as an HPC veteran with experience in HPC support and training from Johns Hopkins University and state supercomputer centers in Ohio and Alabama.

Over the summer, our Outreach and Faculty Interaction team has grown by three new members, Drs. Aaron Jezghani, Michael Weiner, and Chris Blanton.  As you may have already noticed, they have all hit the ground running as they have been very active in responding to support inquiries and hosting multiple PACE classes and workshops.  To tell you a little bit about our Outreach team members:

Dr. Aaron Jezghani recently defended his PhD in Physics at the University of Kentucky. His research focused on nuclear physics and involved work at both Los Alamos and Oak Ridge National Labs. Throughout his multi-faceted dissertation work, Aaron developed detector readout electronics as well as techniques for acquiring, processing, and analyzing data from the detectors, which is no easy feat.

Dr. Michael Weiner received his undergraduate degree in physics from Yale University and his doctorate, also in physics, from Cornell University. He completed his doctoral research in computational biophysics in the laboratory of Gerald Feigenson, where he focused on Molecular Dynamics simulations of the biophysical chemistry of lipid bilayers as models of cell membranes.

Dr. Chris Blanton earned his Ph.D. from Syracuse University in Computational and Theoretical Chemistry. During his studies, he became deeply interested in computational research and HPC. After graduation, he joined the Pennsylvania State University’s Institute for CyberScience. He has worked with some of the most exciting and innovative computational researchers, and he looks forward to sharing and applying his experiences with the Georgia Tech research community.

Also, over the summer, our Cyberinfrastructure team has added two members, and it’s our pleasure to reintroduce to you Trever Nightingale and Ken Suda.

Trever has returned to PACE in his position of Sr. Systems Support Engineer. Trever has a bachelor’s degree from Amherst College and a master’s degree from the University of Minnesota. His 20 years of UNIX experience include work in high performance and research computing centers such as the Naval Research Lab, NERSC, and the Centers for Disease Control (CDC).


Ken has been in IT professionally for almost 35 years and has filled most roles found in an IT organization. For the past couple of years, Ken has been a consultant and has run a game development company. As a consultant, Ken has been a generalist, filling whatever role the team or organization needed.

PACE is also pleased to announce new roles for our team members Dan (Ann) Zhou, Andre McNeill, and Ruben Lara.

Dan (Ann) Zhou’s new role is Research Technologist Storage Architect for PACE. Ann has been a PACE team member since August 2014 and has been an integral part of the PACE cyberinfrastructure team, with responsibilities that include the operation and management of the many PACE storage systems and their backups. Ann received her bachelor’s degree in Electrical Engineering in China and her master’s degree in Electrical and Computer Engineering at Tennessee Technological University. She enjoys cooking, running, eating, and traveling.

Andre McNeill’s new role is Research Technologist Cloud Architect for PACE. Andre has been a member of PACE for nearly 10 years and continues to be a vital resource for both PACE staff and PACE customers in delivering a robust and reliable research computing environment, including computing, networking, and software systems. A graduate of Purdue University, Andre has many interests within PACE and many outside the workplace, including being a DJ.

Ruben Lara’s new role is Systems Support Engineer Manager for PACE. Ruben has been a part of the PACE cyberinfrastructure team since February 2017. Ruben has excellent managerial and organizational skills and is currently enrolled in MOR Leadership training. Ruben enjoys baseball, ultimate frisbee, rock climbing, and mountain biking. You can find him by the window at the southwest end of the 10th floor of the Coda building.

Please join us in welcoming our new team members and congratulating our recently promoted team members!

Best,
The PACE Team

Release of Updated PACE User Documentation

PACE is pleased to announce the release of our updated PACE User Documentation. In addition to updating many of our existing guides, we have added new guides with detailed instructions for various tasks, along with examples such as PBS scripts for applications you may want to run as batch or interactive jobs. The documentation is built using GitHub, which helps us maintain it more easily, and this modular design will mitigate any interruption in service should we decide to upgrade or change the web hosting technology for our overall site.

Over the Fall semester, we will begin phasing out the older documentation on the PACE website and redirecting those pages to the new documentation. As some of you may recall from attending our prior PACE Consulting Sessions and Clusters Orientation classes, your feedback about our documentation (which was in Beta at the time) was invaluable, and you will find much of it incorporated in this release. We hope you find this documentation helpful, and if you have any questions or comments, please don’t hesitate to let us know.

You can reach the new documentation from our home page or from either of the following links:

https://pace.gatech.edu/pace-user-documentation

or

https://docs.pace.gatech.edu

 

PACE Ready for Research

Our August 2019 maintenance ( http://blog.pace.gatech.edu/?p=6511 ) is complete one day ahead of schedule!  We have brought compute nodes online and released previously submitted jobs.  Login nodes are accessible and your data are available.

As usual, there are a small number of straggling nodes that we will address over the coming days.

  • (Complete) Network connections to PACE-RTR will be upgraded. Connectivity in and out of the Rich Data Center will be disrupted on Friday morning. VAPOR network will not be affected.
  • (Complete) Additional space will be configured for license server.
  • (Complete) OS and application patches will be applied to Red Hat Enterprise Linux (RHEL) 7 servers, effectively upgrading to RHEL 7.6.
  • (Complete) OS and application patches will be applied to testflight nodes, to begin testing new versions of kernel and libraries.
  • (Complete) PACE management scripts and utilities will be upgraded, to improve reliability and performance.
  • (Complete) The submit filter for jobs on the RHEL 7 clusters will be modified to allow proper formatting of commands. This filter is not needed on RHEL 6 clusters.
  • (Complete) Upgrade DNS appliances; no downtime is expected due to redundant configuration.

[Resolved] Campus-wide network outage impacting PACE

A campus-wide DNS server failure occurred on the morning of Monday, August 5. OIT was able to resolve the issue at 10:06 AM, and all PACE services should now be working normally. A resulting problem with storage caused temporary unavailability of home directories, with symptoms such as hanging codes, commands, and login attempts.
We believe that most jobs have resumed operation after the issue was resolved, but we cannot be sure. Please check to see if you have any crashed jobs, and report any issues to pace-support@oit.gatech.edu.
For details on the DNS failure, please visit the OIT status update.

Thank you for your attention to this, and we apologize for the inconvenience.

Network outages across the GT campus

On the morning of August 1, 2019, a distribution router in the Rich data center failed around 9:22 AM, producing network outages across the GT campus. This outage included the single sign-on server, which prevented login authentication to numerous systems across campus, including PACE. OIT has identified the issue, and connectivity was restored around 9:50 AM, but issues remain.

Logins to PACE should now be possible, though intermittent issues may remain. Running and queued jobs should be unaffected. Please contact us at pace-support@oit.gatech.edu if you have any questions or persisting issues. The login failures also affected our access to view user help requests, and we apologize for any delay in responding to requests this morning.

For details on the OIT issue, please visit the link below.

https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/5d42eda56788b204bf9f11d4

We apologize for the inconvenience.

[Complete] PACE Quarterly Maintenance – August 8-10

[August 9, 2019 Update] Our August 2019 maintenance ( http://blog.pace.gatech.edu/?p=6511 ) is complete one day ahead of schedule!  We have brought compute nodes online and released previously submitted jobs.  Login nodes are accessible and your data are available.

[August 2, 2019 Update]

ITEMS REQUIRING NO USER ACTION:

  • Network connections to PACE-RTR will be upgraded. Connectivity in and out of the Rich Data Center will be disrupted on Friday morning. VAPOR network will not be affected.
  • Additional space will be configured for license server.
  • OS and application patches will be applied to Red Hat Enterprise Linux (RHEL) 7 servers, effectively upgrading to RHEL 7.6.
  • OS and application patches will be applied to testflight nodes, to begin testing new versions of kernel and libraries.
  • PACE management scripts and utilities will be upgraded, to improve reliability and performance.
  • The submit filter for jobs on the RHEL 6 clusters will be modified to allow proper formatting of commands. This filter is already in place on RHEL 7 clusters.
  • Upgrade DNS appliances; no downtime is expected due to redundant configuration.

Please send questions and/or comments to pace-support@oit.gatech.edu

 

[July 23, 2019] We are preparing for a maintenance period on August 8 – 10, 2019. The maintenance is planned for three days, starting on Thursday, August 8, and running through Saturday, August 10.

As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

In general, we will be upgrading all of the RHEL7 production nodes to the latest 7.6 kernel, updating connections to and from PACE routers, and adding disk capacity to our license server. While we are still finalizing the task list and details, none of these tasks are expected to require any user action.

[Resolved] Campus wide Intermittent network outage impacting PACE

Today at around 1:55pm, OIT reported intermittent campus-wide network slowness after one of the DNS servers went down, causing trouble with authentication, GRS, and more. OIT resolved this issue as of 4:12pm, and we have recovered the storage that exports home directories, which was affected by the same issue. The problem with storage caused temporary unavailability of home directories, with symptoms such as hanging codes, commands, and login attempts.

We believe that most jobs have resumed operation after the issue was resolved, but we cannot be sure. Please check to see if you have any crashed jobs, and report any issues to pace-support@oit.gatech.edu

For details on the OIT issue reported, please visit their link 

Thank you for your attention to this, and apologies for the inconvenience.

[Resolved] Dedicated Scheduler – Job Submissions Paused

[Update – July 11, 2019 – 2:45pm] The dedicated scheduler is back online and operational after we corrected the node associations with queues, which had been broken by a faulty configuration. We have taken measures to correct our automated procedure to prevent such an incident in the future. We have removed the pause on job submission, and you may now resume submitting your jobs. Please check on any jobs that were submitted since 3:30pm yesterday (7/10/2019), as many of those jobs were terminated.

Again, apologies for the inconvenience this has caused.

[Original Post – July 11, 2019 – 10:38am] Today, at approximately 10:10am, we paused job submissions to queues that are managed by the dedicated scheduler. Research teams will not be able to submit new jobs to the following queues: kennedy-lab, granulous, atlas-dufek, chow, athena-debug, cochlea, atlas-6, njord-6, atlantis, jabberwocky6, megatron, acceptance, hadoop, aces, drive, complexity, corso, blue, monkeys-k33, athena-6, core, ase1-debug-6, microbio-1, radius, medprint-6, monkeys_gpu, pampa-6, monkeys, keeneland, athena-intel, atlas-intel, apurimac-bg-6, staml, ofed-test, semap-6, martini, skade, tmlhpc-6, atlas-debug, wohler, rozell, mps, prv-5-6, aryabhata-6, hadean-gpu, epictetus, neutrons-6, davenporter, atlas, athena-8core, uranus-6, hadean, ase1-6, atlas-simon, enterprise, pampa-debug-6, skadi

We took this action to address an issue that began on the evening of July 10, in which jobs were erroneously terminated after failing to reach their appropriate nodes. Pausing job submission prevents any new jobs from being terminated while we work to resolve the issue as quickly as possible. In the meantime, we ask that you refrain from submitting jobs to the queues listed above. We will follow up with an update as we work through this issue. Thank you for your attention to this, and we are sorry for the inconvenience.

Scheduled UPS Fan Replacement in Rich Data Center

[June 12, 2019 – 4:45pm] The OIT operations team notified PACE of planned maintenance on Saturday, June 15, from 7:00AM – 2:00PM to replace a fan in one of the UPS units in the Rich Data Center. This work may require that UPS to run in maintenance bypass, which would temporarily leave that room without power backup. No outage is expected from this work; however, if a power outage did occur during the maintenance window, PACE clusters and the jobs running on them would be at risk of interruption.

For further details about this planned OIT maintenance, please visit the following OIT link.

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.