We’re back up

The maintenance day ran somewhat longer than anticipated, but the clusters are now back in operation and processing jobs. As usual, please send any reports of trouble to pace-support@oit.gatech.edu.

Upcoming quarterly maintenance – 10/18/2011

Reminder, folks: the clusters will be down this coming Tuesday, October 18.

All currently running jobs will have completed by then, and the scheduler has been instructed not to start any new jobs that would not complete before the maintenance begins. Jobs that have been submitted but would not complete by Tuesday morning are being held by the scheduler and will be released as nodes become available after our maintenance activities.
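
For the curious, the hold rule described above can be sketched roughly as follows. This is only an illustration of the idea, not the scheduler's actual implementation: the function names, the job list format, and the 06:00 maintenance start time are all assumptions made for the example (the announcement only says "Tuesday morning").

```python
from datetime import datetime, timedelta

# Assumed maintenance start: the announcement only says "Tuesday morning",
# so 06:00 on October 18, 2011 is a hypothetical value.
MAINTENANCE_START = datetime(2011, 10, 18, 6, 0)


def can_start(now, requested_walltime, maintenance_start=MAINTENANCE_START):
    """Return True if a job started now would finish before maintenance begins.

    The decision uses the walltime the job requested, since the real
    runtime is not known in advance.
    """
    return now + requested_walltime <= maintenance_start


def triage(jobs, now):
    """Split queued (name, walltime) pairs into jobs to start and jobs to hold."""
    start, hold = [], []
    for name, walltime in jobs:
        if can_start(now, walltime):
            start.append(name)
        else:
            hold.append(name)
    return start, hold
```

For example, at noon on Monday a job requesting 4 hours would be allowed to start, while a job requesting 48 hours would be held until after the maintenance.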

Major items on the list this time around are:

  • swap over to redundant network switches for the core of the HPC network
  • Panasas software update to version 4.1
  • routine Solaris and RedHat patching of non-user-facing infrastructure services
  • routine security patches to ssh everywhere
  • migration of infrastructure services to virtual machines
  • migration to new infrastructure-facing LDAP schema
  • reinstating storage quotas missed in our previous maintenance

Some further minor things we’ll take care of as well:

  • load testing on some infrastructure servers
  • migrate the /hp3 filesystem to a different fileserver, since we put it on the wrong one (no user impact expected)
  • OIT/Operations will be performing preventative maintenance on the UPS
  • OIT/Operations will be verifying some electrical circuit locations
  • update ganglia monitoring agents on all RHEL5 machines
  • reboot everything

 

Clusters Are Back!

1530

After days of continuous troubleshooting, we are happy to tell you that the clusters are finally back in a running state. You can now start submitting your jobs. All of your data are safe; however, the jobs that were running during the incident were killed and need to be restarted. We understand how badly this interruption must have affected your research, and we apologize for all the trouble. Please let us (pace-support@oit.gatech.edu) know if there is anything we can do to bring you up to speed once again.

The brief technical explanation of what happened:
At the heart of the problem was a set of fiber optic cables that intermittently interrupted communications among the Panasas storage modules. When a module stopped communicating, the remaining modules would begin moving the services it handled to a backup location. While a service was being moved, one of the other modules (including the one accepting the new service) would send or receive some garbled information, causing the move in progress to be restarted or an additional service to be relocated, depending upon which modules were involved. Interestingly, the cables themselves do not appear to be bad; instead, they interacted badly with the networking components. Thus, when cables were replaced, or switch ports or the network switch itself were swapped, the problems would appear “fixed” for a short while and then return before a full recovery could be completed. The three vendors involved gave us access to their top support and engineering staff, none of whom had seen this kind of behavior before. Our experience has been entered into their knowledge bases to aid future diagnostics.

Thank you once again for your understanding and patience!

Regards,
PACE Team