Infiniband problems in PACE

PACE is experiencing problems after an Infiniband (IB) network failure, which affects MPI jobs as well as IB-connected storage, including GPFS (project space) and PanFS (scratch space).  It is possible that this problem caused jobs to crash or hang.

The Infiniband network has been restored, and we are now working to restore the storage mounts. We have also paused job submissions to prevent new jobs from starting, and will resume them once the problems are completely resolved.

Thank you for your patience.
PACE team

GPFS storage troubles

Dear PACE users,

As part of last week’s maintenance activities, we upgraded the GPFS client software on head nodes and compute nodes to a level recommended by the vendor.

Thanks to reports from PACE users, we have determined that the new client software has a subtle bug that causes writes to fail under certain circumstances. We have identified two reproducible cases so far: “CMAKE” failing to compile codes, and “LAMMPS” silently exiting after dumping a single line of text.

We have been in close contact with the vendor for an urgent resolution, and have escalated the incident to the highest executive levels. At this point, we have a couple of paths to resolution: either moving forward to a newer release or reverting to the version we were running before last week. We are moving quickly to evaluate the merits of both approaches. Implementing either will likely involve a rolling reboot of compute and head nodes. We understand the inconvenience a downtime will cause, and will engage the vendor to find ways to address this problem with minimal interruption.

One way to find out if you are using GPFS is to run the “pace-quota” command and check whether any of the paths begin with “gpfs”, “pme2” or “pet1”. If you are running on GPFS and having unexplained problems with your codes, please contact pace-support@oit.gatech.edu and try to use other storage locations to which you have access (e.g. ~/scratch).
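
For example, filtering the pace-quota output for those prefixes is a quick check (the exact output format may differ slightly between clusters):

$ pace-quota | grep -E "gpfs|pme2|pet1"

If this prints nothing, none of your listed storage paths appear to be on GPFS.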

A more detailed description of this bug and the code we used to replicate it can be found here.

We will continue to keep you updated on the progress.

PACE quarterly maintenance – April ’15

Greetings everybody.  It’s again time for the quarterly PACE maintenance.  As usual, we will have all PACE clusters down Tuesday and Wednesday of next week, April 21 and 22.  We’ll get started at 6:00am on Tuesday, and have things back to you as soon as possible.  There are some significant changes this time around, so please read on.

Moab/Torque scheduler:
Last maintenance period, we deployed a new scheduler for the Atlas and Joe clusters.  This time around, we’re continuing that rollout to the rest of our clusters.  Some of the highlights are:

  • increased responsiveness from commands like qsub & showq
  • need to resubmit jobs that haven’t run yet
  • removal of older, incompatible versions of mvapich and openmpi
  • required changes to our /usr/local software repository

Mehmet Belgin has posted a detailed note about the scheduler upgrade on our blog here.  He has also posted a similar note about the related updates to our software repository here.

Additionally, the x2200-6.3 queues on the Atlas cluster will be renamed to atlas-6-sunib and atlas-6-sunge.

Networking:
We’ve deployed new network equipment that upgrades the core of the PACE network to 40-gigabit Ethernet, and will transition to this new core during the maintenance period.  This new network brings additional capability to utilize data center space outside of the Rich building, and provides a path for future 100-gigabit external connectivity and ScienceDMZ services.  Stay tuned for further developments. 😉  Additionally, the campus network team will be upgrading the firmware of a number of our existing switches with some security-related fixes.

Storage:
The network upgrades above will allow us to relocate some of our project directory servers to OIT data center space on Marietta Street, as we’re pressed for generator-protected space in Rich.  We will also be doing some security patching, highly recommended updates, and performance optimizations on the DDN/GPFS storage.  As a stretch goal, we will also migrate some filesystems to GPFS.  If we are pressed for time, they will move with their old servers.  Details regarding which filesystems are affected are available on our blog here.

Operating System patching:
Last, but not least, we have a couple of OS patches.  We’ll complete the rollout of a glibc patch for the highly publicized “Ghost” vulnerability, as well as deploy an autofs fix that addresses a bug which would sometimes cause a failure to mount /nv filesystems.
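
For reference, if you would like to confirm after maintenance that a node has picked up the glibc update, the installed build and its changelog can be queried with rpm; how the vendor records the CVE in the changelog may vary, so treat this only as a sanity check:

# show the installed glibc build
$ rpm -q glibc
# look for the GHOST fix (CVE-2015-0235) in the package changelog
$ rpm -q --changelog glibc | grep -i cve-2015-0235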

Important Changes to PACE storage

During our quarterly maintenance period next week, we will relocate some of our project directory servers to OIT data center space on Marietta Street, as we’re pressed for generator-protected space in Rich.  This is a major undertaking, with over 20 servers moving.  Our intent is that no change will be needed on your part, but we wanted to ensure transparency in our activities.  The list below contains all of the affected filesystems.  The list of filesystems to which you have access can be obtained with the ‘pace-quota’ command.

  • home directories for all clusters except Gryphon and Tardis
  • /nv/pb4, /nv/archive-bio1 (BioCluster)
  • /nv/hchpro1, /nv/pchpro1 (Chemprot)
  • /nv/pas1 (Enterprise)
  • /nv/pase1 (Ase1)
  • /nv/pb2 (Optimus)
  • /nv/pbiobot1 (BioBot)
  • /nv/pc4, /nv/pc5, /nv/pc6 (Cygnus)
  • /nv/pccl2 (Gryphon, legacy)
  • /nv/pcoc1 (Monkeys)
  • /nv/pe1, /nv/pe2, /nv/pe3, /nv/pe4, /nv/pe5, /nv/pe6, /nv/pe7, /nv/pe8, /nv/pe9, /nv/pe10, /nv/pe11, /nv/pe12, /nv/pe13, /nv/pe14 (Atlas)
  • /nv/hp1, /nv/pf1, /nv/pf2 (FoRCE)
  • /nv/pface1 (Faceoff)
  • /nv/pg1 (Granulous)
  • /nv/pggate1 (GGate)
  • /nv/planns (Lanns)
  • /nv/pmart1 (Martini)
  • /nv/pmeg1 (Megatron)
  • /nv/pmicro1 (Microcluster)
  • /nv/pska1 (Skadi)
  • /nv/ptml1 (Tmlhpc)
  • /nv/py2 (Uranus)
  • /nv/pz2 (Athena)
  • /nv/pzo1, /nv/pzo2 (backups for Zohar and NeoZhoar)

Additionally, the following filesystems will be migrated to GPFS:

  • /nv/pcee1 (cee.pace)
  • /nv/pme1, /nv/pme2, /nv/pme3, /nv/pme4, /nv/pme5, /nv/pme6, /nv/pme7, /nv/pme8 (Prometheus)

As a stretch goal, we will also migrate the following filesystems to GPFS.  If we are pressed for time, they will move with their old servers as listed above.

  • /nv/hp1, /nv/pf1, /nv/pf2 (FoRCE)
  • /nv/pas1 (Enterprise)
  • /nv/pbiobot1 (BioBot)
  • /nv/pccl2 (Gryphon, legacy)
  • /nv/pggate1 (GGate)
  • /nv/planns (Lanns)
  • /nv/ptml1 (Tmlhpc)

Important Notes on Coming PACE Scheduler Upgrades

We have been running an upgraded scheduler version on the Joe and Atlas clusters, whose users graciously volunteered to test it out since January. This version brings significant performance and stability improvements, and we are looking forward to rolling out the upgrade to the rest of the PACE universe during this maintenance period. Please note the following important changes, which will apply to all PACE users.

  • The new schedulers are not compatible with MPI versions older than mvapich2/1.9 and openmpi/1.6.2. If you are using one of the older MPI stacks (a warning is printed when you load their modules), you will need to replace it with one of the more recent versions. This motivated the creation of a new and improved software repository, which will be available after the maintenance day. For more details, please see our related post.

 

  • The new version uses a different type of database than the current one, so we will not be able to migrate jobs that have already been submitted.  The scheduler will start with an empty queue, and you will need to resubmit your jobs after the maintenance day. This applies to Joe and Atlas jobs as well, as we are merging exclusive queues onto a new and more powerful server, with the exception of Tardis, Gryphon and Testflight.

 

  • We will start using a “node packing” policy, which places as many jobs as possible on a node before moving on to the next one. With the current version, users can submit many single-core jobs that each land on a separate node, making it more difficult for the scheduler to start jobs that require entire nodes.

 

  • This version fixes a bug that prevented the use of msub for interactive jobs. The vendor’s recommendation is to use “qsub” for everything (we confirmed that it is much faster than msub), but this fix gives you the freedom to pick either tool.

 

  • There will no longer be a discrepancy between job IDs generated by msub (Moab.###) and qsub (####). You will always see a single job ID (in plain number format) regardless of your msub/qsub preference.

 

  • Speed — new versions of Moab and Torque are now multithreaded, making it possible for some query commands (e.g. showq) to return instantly regardless of the load on the scheduler. Currently, when a user submits a large job array, these commands usually time out.

 

  • Introduction of cpusets — when a job is given X cores, it will not be able to use more than that. Currently, users can easily exceed the requested limits by spawning any number of processes/threads, and Torque cannot do much to stop that. The use of cpusets will significantly reduce job interference and allow us to finally use ‘node packing’ as explained above (see the submission sketch after this list).

 

  • Several other bug fixes and improvements bring additional benefits, including (but not limited to) fewer zombie processes, fewer lost output files, and fewer missing array jobs. We also expect visible improvements in job allocation times and less frequent command timeouts.
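
To illustrate the node packing, cpuset and interactive-job points above, here is a minimal sketch that keeps the core request and the thread count in sync, followed by an interactive request via qsub. The queue name, walltimes, script and application names are placeholders only; adapt them to your own queues and workload:

# contents of a hypothetical myjob.pbs: request exactly the cores you will use,
# since cpusets confine the job to the requested allocation
#PBS -N myjob
#PBS -q force-6
#PBS -l nodes=1:ppn=4,walltime=2:00:00
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=4    # match the ppn request
./my_threaded_app

# submit the batch job, or start an interactive session (qsub is recommended over msub)
$ qsub myjob.pbs
$ qsub -I -q force-6 -l nodes=1:ppn=4,walltime=1:00:00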

 

We hope these improvements will provide you with a more efficient and productive computing environment. Please let us know (pace-support@oit.gatech.edu) if you have any concerns or questions!

Important Changes to PACE Scientific Software Repository

As announced earlier, we will remove a set of old MPI stacks (and the applications that use them) from the PACE software repository after the April maintenance day. This is required by the planned upgrade of the schedulers (Torque and Moab), which use libraries that are incompatible with the old MPI stacks. Some MPI-related Python modules (e.g. mpi4py) are built on one of these old MPI versions (namely mvapich2/1.6) and will also stop working with the new scheduler.

The old MPI versions are also known to have significant performance and scalability problems, and they are no longer supported by their developers, so their removal was inevitable regardless of the scheduler upgrades. Specifically, all versions older than “mvapich2/1.9” and “openmpi/1.6.2” are known to be incompatible and will be removed, along with the applications compiled against them. MPI stacks newer than these versions are compatible with the new scheduler, so they will continue to be available. The PACE team is ready to assist with the changes you may need to replace these old MPI versions with new ones, with minimal interruption to your research.
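
As a rough sketch of what the transition looks like for a typical MPI code, the steps below swap in one of the supported stacks and rebuild; the compiler module, file names and exact module versions are placeholders, so check “module avail” for what is actually installed:

# drop any previously loaded modules, then load a supported MPI stack
$ module purge
$ module load gcc mvapich2/1.9
# rebuild the application against the new stack and resubmit the jobs that use it
$ mpicc -O2 -o my_mpi_app my_mpi_app.c
$ qsub my_mpi_job.pbs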

We saw these problems as an opportunity to start creating a new and improved software repository almost from scratch, which not only fixes the MPI problems, but also provides added benefits such as:

* a cleaner MPI versioning scheme without long, confusing subversions such as “1.9rc1” or “2.0ga”: you will see only a single subversion for each major release, e.g.,

mvapich2: 1.9, 2.0, 2.1, …
openmpi: 1.6, 1.7, 1.8, …

* latest software versions: we made a best effort to compile the most recent stable versions, unless they had compilation problems or proved to be buggy.

* a new python that allows parallelization without requiring the InfiniBand (IB) network: the current python uses mvapich2, which requires an IB network. The new python, on the other hand, will employ openmpi, which can run on *any* node regardless of its network connection, while still taking advantage of IB when available (a quick test appears after the “newrepo” example below).

We will start offering this new repository as an alternative after the April maintenance day. Switching between the old and the new repository will be as easy as loading/unloading a module named “newrepo”. E.g.:

# Make sure there are no loaded modules
$ module purge
$ module load newrepo

… You are now using the new repo …

# since newrepo is also a module itself, ‘module purge’ will put you back in the old repo
$ module purge

… You are back in the old repo …
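
Once the new repo is loaded, a quick smoke test of the openmpi-based python mentioned above might look like the following; the python module name is an assumption (check “module avail” after loading newrepo), and we assume it pulls in its openmpi dependency:

$ module purge
$ module load newrepo
$ module load python
# two ranks over openmpi; this should print the ranks 0 and 1
$ mpirun -np 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"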

The current plan is to decommission the old repository after the July maintenance, so we strongly encourage you to try the new repository (which is still in beta) as soon as possible to ensure a smooth transition. If the new repository works for you, continue to use it and never look back. If you notice problems or missing components, you can continue to use the old repository while we work on fixing them.

Please keep in mind that the new repo is created almost from scratch, so expect changes in module names, as well as a new set of dependencies/conflicts between modules. The PACE team is always ready to provide module suggestions for your applications, or to answer any other questions that you may have.

We hope the new repository will make a positive contribution to your research environment with visible improvements in performance, stability and scalability.

Thanks!
PACE Team

Temporary disruption for many home directories

The server housing about half of the PACE-affiliated home directories lost one of its root drives (i.e. the operating system drives, not the data drives), which threw the system into a state where it was effectively doing nothing but processing the interrupts from that failure. We have rectified the situation so that the server can provide access to files again.

Affected:

  • hbiobot1 (biobot)
  • hcee2 (cee)
  • hcfm1 (isabella)
  • hface1 (faceoff)
  • hggate1 (ggate)
  • hmeg1 (megatron)
  • hp10 (biocluster)
  • hp12 (aryabhata)
  • hp14 (atlantis)
  • hp16 (force)
  • hp18 (optimus)
  • hp20 (ece)
  • hp22 (prometheus)
  • hp24 (math)
  • hp26 (cee)
  • hp28 (granulous)
  • hp8 (athena)
  • hpampa1 (pampa)
  • html1 (tmlhpc)

Short duration of unavailability (Feb 3, 2015)

On February 3, at 7:00am, in coordination with our Network Team:

– The network team will reboot one of the main PACE firewall appliances in order to fix some issues we have been having with these appliances. This may cause a disruption in networking from outside of PACE into the head nodes of all PACE clusters.

– PACE will also reboot all of the head nodes for PACE clusters to apply patches for the GHOST vulnerability.

These disruptions will result in a 15-minute period of unavailability for the head nodes. Any login sessions and data transfers into PACE established prior to that time will be terminated, and any processes running on the head nodes themselves will also be terminated. Submitted and running user jobs, however, will remain unaffected, as this only affects connections from outside of PACE to the inside — all internal operations will function as normal.