Disk failure rate spike

Hey everyone,

We’ve noticed an increase in a type of disk failure on some of the storage nodes that ultimately has a severe negative impact on storage performance. In particular, we observe that certain models of drives in certain manufacturing date ranges seem to be more prone to failure.

As a result, we’re looking more closely at our logs to keep an eye on how widespread this is. Most of the older storage seems fine; the failures have tended toward some of the newer storage units using both 2TB and 4TB drives. The 2TB drives are the more surprising to us, as that model line has generally performed as expected, and many older storage units use the same drives without having these issues.

We are also engaging our vendor to see if this is something that they are seeing elsewhere, and making sure we keep a close eye on our stock of replacements to deal with these failures.

Storage slowdowns due to failing disks

CLUSTERS INVOLVED: emory/tardis, ase1

Hey folks,

We’ve gone ahead and replaced some disks in your storage, as the type of failures they are generating right now causes dramatic slowdowns in I/O performance for the disk arrays.

As a result of the replacements, the arrays will remain slow for roughly 5 hours while they rebuild themselves to restore the appropriate redundancy.

We’ll be keeping an eye on this problem, as we have noticed a recent spike in the number of these events.

PACE clusters ready for research

Our quarterly maintenance is now complete, and the clusters are running previously submitted jobs and awaiting new submissions.

We have successfully completed a number of things:

  • Athena has been fully migrated to RedHat 6.3
  • The BioCluster /nv/pb4 filesystem has been migrated to the DDN space
  • All of our Solaris storage servers have been patched
  • Firewall upgrades are complete
  • Electrical distribution to compute node racks has been repaired
  • DDN software/firmware updates have been applied
  • VMware “hardware” level updates have been applied
  • The mathlocal collection of software has been migrated to /nv/pma1

However, we were unable to complete the upgrade of the TestFlight cluster to RedHat 6.5.  At the moment TestFlight is down, and we will complete the upgrade over the next couple of days.

As always, please contact us (pace-support@oit.gatech.edu) with any problems or concerns you may have. Your feedback is very important to us, especially regarding file transfers in and out of the clusters (i.e., between your workstations and the PACE clusters).

PACE quarterly maintenance – April 15-16 2014

PACE Quarterly maintenance has begun

See this space for updates.

PACE Quarterly maintenance notification

It’s time again for our quarterly maintenance.  We will have the clusters down April 15 & 16.

As usual, we’ve instructed the schedulers to avoid running jobs that would cross into a planned maintenance window.  This prevents running jobs from being killed, but it also means that jobs you submit now may not run until after maintenance completes.  I would suggest checking the wall times of the jobs you will be submitting and, if possible, modifying them so the jobs will complete before the maintenance begins. Submitting jobs with longer wall times is still OK, but they will be held by the scheduler and released after maintenance completes.
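If you are unsure where the wall time is set, a minimal Torque/PBS submission script looks roughly like the sketch below. The job name, resource counts, and executable are placeholders, and the queue is only an example (use whatever queue you normally submit to); a 24-hour wall time, for instance, would finish well before April 15 if the job starts now:

    #!/bin/bash
    # Minimal example Torque/PBS submission script; the job name, queue,
    # resource counts, and executable below are placeholders.
    #PBS -N example_job
    #PBS -q iw-shared-6
    #PBS -l nodes=1:ppn=4
    #PBS -l walltime=24:00:00
    #PBS -j oe

    cd $PBS_O_WORKDIR    # run from the directory the job was submitted from
    ./my_program         # placeholder for your actual executable

You would submit it with ‘qsub’, and ‘qstat -f <jobid>’ will show the wall time the scheduler has recorded for the job.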

Much of our activity this time around is not directly visible, with a couple of notable exceptions.

We will be upgrading the operating system on our TestFlight cluster from RedHat 6.3 to RedHat 6.5.  Please do test your codes on this cluster over the coming weeks and months, as we plan to roll it out (along with any needed fixes) to all other RedHat 6 clusters in July.  This update is expected to bring some performance improvements, as well as some critical security fixes.  Additionally, it adds support for the Intel Ivy Bridge platform, which many of you are ordering.  Any new Ivy Bridge platforms will start with RedHat 6.5.

Other user-visible changes include:

  • conclude the migration of the Athena cluster to RedHat 6.3 (we plan to take Athena to 6.5 in July)
  • conclude the migration of the BioCluster /nv/pb4 filesystem to the DDN/GPFS space
  • migrate mathlocal from /nv/hp24 to /nv/pma1 (Math cluster project space)
  • apply recommended and security patches to our Solaris storage systems.  This is a widespread update that will affect filesystems whose names start with /nv.  A rapid reversion process is available should unanticipated events occur.
  • upgrade firewalls to increase bandwidth between PACE and campus

Not so apparent changes include:

  • repairing some electrical distribution to compute node racks
  • minor software/firmware update to DDN to enable support of DDN/WOS evaluation
  • updates to VMware “hardware” levels, enabled by previous migration to VMware 5.1

As always, please follow our blog for communications, especially for announcements during our maintenance activities – and let us know of any concerns via pace-support@oit.gatech.edu.

[RESOLVED] PACE clusters experiencing problems

We’ve identified the source of problems which impacted all of the clusters this (4/7) afternoon.  While making preparations to deploy some firewall upgrades for PACE, one of the campus network team members inadvertently applied a misconfiguration to one of our core network links.  This resulted in widespread packet loss across the PACE internal network.

The head nodes seem to have recovered properly, but please let us know if you see continued issues there.  While it is possible that jobs have been lost, we believe that most things will have recovered without loss.

We’ll continue to monitor the situation and address any remaining problems as soon as we are able.

PACE Team

 

PC1 (Cygnus) filesystem woes

We’ve continued to have issues with the server, and we’ve now identified a networking issue tied to this server as well as a corrupted OS image.

The networking issue has been rectified, and I am installing a new software image onto this machine as I type this.

Despite the nature of the failure, we have not lost any of your already saved data — the drive units which house the OS are separate from the ones storing your data.

We should have this machine back in about a half hour.

Jobs Accidentally Killed

Dear users,

At least 1,600 queued and running jobs were accidentally killed last week by a member of the PACE team who was trying to clear out their own jobs. PACE team accounts have elevated rights to certain commands, and the person who deleted the jobs did not realize that the command they were using would apply to more than just their own jobs.

If you have access to the iw-shared-6 queue, and were running jobs and/or had jobs queued earlier this week, this accident has likely impacted you.

Our deepest apologies for the unexpected and early job terminations. We are re-evaluating our need to grant elevated permissions to our regular accounts in order to prevent this from happening again.

Thank you,
PACE team

Emergency reboot of compute nodes due to power/cooler outage

The Rich data center cooling system experienced a power outage today (2/6/2014) at around 9:20am when both the main and backup power systems failed, requiring an emergency shutdown of all PACE compute nodes. We have since received confirmation from the operations team that the room cooling is now stable, though running on the backup chillers while work proceeds to correct the problem. We are currently bringing the compute nodes back online as quickly as possible.

If you had queued jobs before the incident, they should start running as soon as a sufficient number of compute nodes are brought back online. However, all of the jobs that were running at the time of the failure were killed and will need to be resubmitted. You can monitor node status using the ‘pace-stat’ and ‘pace-check-queue’ commands.
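For example, if your original submission script is still around (the script name below is just a placeholder), resubmitting and checking on node availability might look like:

    qsub my_job.pbs     # resubmit the job that was killed (placeholder script name)
    qstat -u $USER      # confirm the job is queued under your account
    pace-check-queue    # see how many compute nodes are back online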

We are sorry for the inconvenience this failure has caused. Please contact us if you have any concerns or questions.

 

Scratch Quota Policies are Changing

We would like to give you a heads up of some upcoming adjustments to the scratch space quotas.

Current policy is a 10TB soft quota and a 20TB hard quota.  Given the space problems we’ve been having with the scratch, we will be adjusting this to a 5TB soft quota and a 7TB hard quota.  This change should only affect a small handful of users.  Given the proximity of our maintenance next week, we will be making this change at the end of January.  This is an easy first step we can take to start addressing the recent lack of space on scratch storage.  We are looking at a broad spectrum of other policy and technical changes, including changing retention times, improving our detection of “old” files, and increasing capacity.  If you have any suggestions for other adjustments to scratch policy, please feel free to let us know (pace-support@oit.gatech.edu).
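If you are not sure how close you are to the new limits, a quick check along these lines should tell you. The path below is only an assumption about where your scratch directory lives, so adjust it to match your own setup:

    du -sh ~/scratch    # total size of your scratch data (path is an assumption)
    quota -s            # human-readable quota report, if quotas are listed for the scratch filesystem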

Please remember that the scratch space is intended for transient data – not as a long term place to keep things.

January Maintenance is over

January maintenance is complete, and the clusters have started accepting and running jobs. We accomplished all of the primary objectives and even found time to address a few bonus items.

Most importantly, we completed updating the resource and scheduling managers (Torque and Moab) throughout the entire PACE realm. This upgrade should bring visible improvements in speed and reliability. Please note that the job submission process will show some differences after this update, so we strongly encourage you to read the transition guide here: http://www.pace.gatech.edu/job-submissionmanagement-transition-guide-jan-2014

Also, please make sure you check the FAQ for common problems and their solutions by running the following command on your headnode: jan2014-faq (use the spacebar to skip pages).

We had a hardware failure in the DDN storage system, which caused an interruption in the planned BioCluster data transfer. We expect to receive the replacement parts and fix the system in a few days. This failure has not caused any data loss, and the system will remain up and running in the meantime (perhaps with some performance degradation). We have learned that the repairs will require a short downtime, and we will soon get in touch with the users of the Gryphon, BioCluster, and Skadi clusters (the current users of this system) to schedule this work.

Other accomplishments include:

– Optimus is now a shared cluster. All Optimus users now have access to optimusforce-6 and iw-shared-6.

– All of the Atlas nodes have been upgraded to RHEL6.

– Most of the Athena nodes have been upgraded to RHEL6.

– The old scheduler server (repace) has been replaced with an upgraded one (shared-sched). You may notice a difference in the generated job numbers and files.

– Some networking cable cleanup and improvements

– Gryphon has new scheduler and login servers, and the nodes previously used for these purposes have been returned to the computation pool.

– Deployed project file space quotas, as previously agreed with PIs, for users who did not have quotas prior to maintenance, and adjusted quotas for users who were already over their limit to give some headroom before they hit the quota. To check your quotas, use “quota -s”.