REMINDER & UPDATE: PACE quarterly maintenance – July ’15

First, I’d like to remind folks of our quarterly maintenance activities NEXT WEEK starting at 6:00am Tuesday morning.

Second, we have a little more information regarding some of our high-level tasks. The storage we plan to use for home directories and /usr/local isn’t due to be delivered until Friday of this week. As such, we won’t be able to install and test it before the maintenance period, and will defer this work to a future maintenance period.

Our new data mover servers have been delivered, and we are beginning some tests. We’ll consider these a bonus objective at this point, pending the outcome of testing.

PACE quarterly maintenance – July ’15

Greetings!

The PACE team is again preparing for our quarterly maintenance that will occur Tuesday, July 21 and Wednesday, July 22.  We’re approximately a month away, but I wanted to remind folks of our upcoming activities and give a preview of what we are planning.

  • Updated GPFS client – We are currently testing version 3.5.0-25 for deployment, as recommended by DDN.  Preliminary testing has shown it to have the fix for the problems encountered during our April maintenance.
  • “newrepo” becomes the default software repository – We will make the new PACE software repository (currently referred to as ‘newrepo’) the default. This means you will no longer need to switch to it explicitly using ‘module load newrepo’; all of the modules will point to the new repository by default. The current repository will remain available for as long as needed, accessible by loading the ‘oldrepo’ module, but all new software installations, upgrades and fixes will go into newrepo.
  • Full reset of Infiniband fabric – We will reboot all of our Infiniband switches and subnet managers to ensure we have cleared out all of the gremlins from the Infiniband troubles earlier this month.
  • New storage devices for home directories and /usr/local – We’ve ordered some new storage servers to upgrade the aging servers that are currently providing home directories and /usr/local.  These new servers come in a high-availability configuration so as to better guard against equipment failures.  As a bonus item, we may begin the migration of our virtual machine backing storage to a separate new storage device.  Both of these items are contingent on the new equipment arriving in time to be installed and tested before the maintenance period.
  • New “data mover” servers – Also pending arrival and testing of new equipment, we will replace the “data mover” systems known as iw-dm3 and iw-dm4.  These servers are intended to be used for large data movement activities, and will come with 40-gigabit ethernet and Infiniband connectivity.
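The repository switch described above amounts to a before/after pair of module commands. Here is a minimal sketch; note that the `module` function below is just a stub standing in for the real Environment Modules command on PACE head nodes, so the example is runnable anywhere:

```shell
#!/bin/sh
# Stub standing in for the real "module" command on PACE head nodes,
# so this sketch is self-contained.
module() { echo "(stub) module $*"; }

# Today: the new repository must be requested explicitly.
module load newrepo

# After the maintenance: newrepo is the default, so no explicit load is
# needed; the current repository stays reachable via the oldrepo module.
module load oldrepo
```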

Infiniband problems in PACE

PACE is experiencing problems after an Infiniband (IB) network failure, which affects MPI jobs as well as IB-connected storage, including GPFS (project space) and PanFS (scratch space).  It is possible that this problem caused jobs to crash or hang.

The Infiniband network has been restored at this point, and we are now working to restore the storage mounts. We have also paused job submissions to prevent new jobs from starting, and will resume scheduling once the problems are completely resolved.

Thank you for your patience.
PACE team

GPFS storage troubles

Dear PACE users,

As part of last week’s maintenance activities, we upgraded the GPFS client software on head nodes and compute nodes to a level recommended by the vendor.

Thanks to some troubling reports from PACE users, we have determined that the new client software has a subtle bug that causes writes to fail under certain circumstances. We have identified two reproducible cases so far: CMake failing to compile codes, and LAMMPS silently exiting after printing a single line of text.

We have been in close contact with the vendor for an urgent resolution, and have escalated the incident to the highest executive levels. At this point, we have a couple of paths to resolution: either moving forward to a newer release, or reverting to the version we were running before last week. We are moving quickly to evaluate the merits of both approaches. Implementing either will likely involve a rolling reboot of compute and head nodes. We understand the inconvenience a downtime will cause, and will engage the vendor to find ways to address this problem with minimal interruption.

One way to find out whether you are using GPFS is to run the “pace-quota” command and check whether any of the paths begin with “gpfs”, “pme2” or “pet1”. If you are running on GPFS and having unexplained problems with your codes, please contact pace-support@oit.gatech.edu and try to use other storage locations to which you have access (e.g. ~/scratch).
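The path check above can be scripted with a short grep. The sample output below is invented purely for illustration (on a real head node you would pipe the output of `pace-quota` itself):

```shell
#!/bin/sh
# Invented sample of pace-quota-style output, for illustration only;
# real paths and quotas will differ.
sample='/nv/pz2/gtuser3                    12G of 20G
/gpfs/pace1/project/pme2/gtuser3   1.2T of 5T
/panfs/iw-scratch/gtuser3          800G of 7T'

# Flag any line whose path contains a gpfs, pme2 or pet1 component.
echo "$sample" | grep -E '/(gpfs|pme2|pet1)' && echo "GPFS storage detected"
```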

A more detailed description of this bug and the code we used to replicate it can be found here.

We will continue to keep you updated on the progress.

PACE clusters ready for research

Greetings,

Our quarterly maintenance is now complete.  We have no known outstanding issues affecting general operations, but do have some notes for specific clusters which have been sent separately.

Just a reminder that we have removed the modules for old MPI versions (and applications compiled with them), which are known to be incompatible with the new scheduler servers. Please make sure to check your module lists for compatibility before sending new jobs. Accordingly, we have new default versions of the MPI modules: if you do not explicitly specify a version, mvapich2/1.9 or openmpi/1.6.2 will be loaded by default.

Our new repository is almost ready for testing, but it requires a post-processing step and migration to shared storage, which may take another couple of days.  We will send another communication when this is ready.
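A quick grep can flag job scripts that still pin an explicit MPI version, so they can be checked against the new defaults. The job-script text below is an invented example, and the version cutoffs are assumptions for the sketch:

```shell
#!/bin/sh
# Invented PBS job script fragment; substitute your own scripts.
script='#PBS -l nodes=2:ppn=8
module load mvapich2/1.6
mpirun ./my_app'

# List explicit MPI module loads so they can be compared against the
# new defaults (mvapich2/1.9, openmpi/1.6.2).
echo "$script" | grep -E 'module load (mvapich2|openmpi)/'
```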

Given the delays, we have deferred our stretch goals until a future maintenance window.  The filesystems that did move successfully reached their new locations across campus and are traversing the new dual 40-gigabit path between data centers.

I’ll take another opportunity to apologize for the extended downtime this week.  We will be taking a critical look at the events that led to these delays and learn as much as we can from them.

–Neil Bright

PACE quarterly maintenance – April ’15

Greetings everybody.  It’s again time for the quarterly PACE maintenance.  As usual, we will have all PACE clusters down Tuesday and Wednesday of next week, April 21 and 22.  We’ll get started at 6:00am on Tuesday, and have things back to you as soon as possible.  There are some significant changes this time around, so please read on.

Moab/Torque scheduler:
Last maintenance period, we deployed a new scheduler for the Atlas and Joe clusters.  This time around, we’re continuing that rollout to the rest of our clusters.  Some of the highlights are:

  • increased responsiveness from commands like qsub & showq
  • need to resubmit jobs that haven’t run yet
  • removal of older, incompatible versions of mvapich and openmpi
  • required changes to our /usr/local software repository

Mehmet Belgin has posted a detailed note about the scheduler upgrade on our blog here.  He has also posted a similar note about the related updates to our software repository here.

Additionally, the x2200-6.3 queues on the Atlas cluster will be renamed to atlas-6-sunib and atlas-6-sunge.

Networking:
We’ve deployed new network equipment to upgrade the core of the PACE network to 40-gigabit ethernet, and will transition to this new core during the maintenance period.  This new network brings additional capability to utilize data center space outside of the Rich building, and provides a path for future 100-gigabit external connectivity and ScienceDMZ services.  Stay tuned for further developments. 😉  Additionally, the campus network team will be upgrading the firmware of a number of our existing switches with some security-related fixes.

Storage:
The network upgrades above will allow us to relocate some of our project directory servers to OIT data center space on Marietta Street, as we’re pressed for generator-protected space in Rich.  We will also be doing some security patching, highly recommended updates and performance optimizations on the DDN/GPFS storage.  As a stretch goal, we will also migrate some filesystems to GPFS.  If we are pressed for time, they will move with their old servers.  Details regarding which filesystems are affected are available on our blog here.

Operating System patching:
Last, but not least, we have a couple of OS patches.  We’ll complete the rollout of a glibc patch for the highly publicized “Ghost” vulnerability, as well as deploy an autofs fix that addresses a bug which would sometimes cause a failure to mount /nv filesystems.

Important Changes to PACE storage

During our quarterly maintenance period next week, we will relocate some of our project directory servers to OIT data center space on Marietta Street, as we’re pressed for generator-protected space in Rich.  This is a major undertaking, with over 20 servers moving.  Our intent is that no change will be needed on your part, but we wanted to ensure transparency in our activities.  The list below contains all of the affected filesystems.  The list of filesystems to which you have access can be obtained with the ‘pace-quota’ command.

  • home directories for all clusters except Gryphon and Tardis
  • /nv/pb4, /nv/archive-bio1 (BioCluster)
  • /nv/hchpro1, /nv/pchpro1 (Chemprot)
  • /nv/pas1 (Enterprise)
  • /nv/pase1 (Ase1)
  • /nv/pb2 (Optimus)
  • /nv/pbiobot1 (BioBot)
  • /nv/pc4, /nv/pc5, /nv/pc6 (Cygnus)
  • /nv/pccl2 (Gryphon, legacy)
  • /nv/pcoc1 (Monkeys)
  • /nv/pe1, /nv/pe2, /nv/pe3, /nv/pe4, /nv/pe5, /nv/pe6, /nv/pe7, /nv/pe8, /nv/pe9, /nv/pe10, /nv/pe11, /nv/pe12, /nv/pe13, /nv/pe14 (Atlas)
  • /nv/hp1, /nv/pf1, /nv/pf2 (FoRCE)
  • /nv/pface1 (Faceoff)
  • /nv/pg1 (Granulous)
  • /nv/pggate1 (GGate)
  • /nv/planns (Lanns)
  • /nv/pmart1 (Martini)
  • /nv/pmeg1 (Megatron)
  • /nv/pmicro1 (Microcluster)
  • /nv/pska1 (Skadi)
  • /nv/ptml1 (Tmlhpc)
  • /nv/py2 (Uranus)
  • /nv/pz2 (Athena)
  • /nv/pzo1, /nv/pzo2 (backups for Zohar and NeoZhoar)

Additionally, the following filesystems will be migrated to GPFS:

  • /nv/pcee1 (cee.pace)
  • /nv/pme1, /nv/pme2, /nv/pme3, /nv/pme4, /nv/pme5, /nv/pme6, /nv/pme7, /nv/pme8 (Prometheus)

As a stretch goal, we will also migrate the following filesystems to GPFS.  If we are pressed for time, they will move with their old servers as listed above.

  • /nv/hp1, /nv/pf1, /nv/pf2 (FoRCE)
  • /nv/pas1 (Enterprise)
  • /nv/pbiobot1 (BioBot)
  • /nv/pccl2 (Gryphon, legacy)
  • /nv/pggate1 (GGate)
  • /nv/planns (Lanns)
  • /nv/ptml1 (Tmlhpc)

PACE clusters ready for research

Greetings!
Our quarterly maintenance is now complete, and the clusters are running previously submitted jobs and awaiting new submissions.
No general issues to report, although we do have some notes for the Atlas and Joe users which have been sent separately.  We’ll apply some lessons learned here to the April maintenance.
As always, please contact us (pace-support@oit.gatech.edu) for any problems or concerns you may have. Your feedback is very important to us!