It’s that time again. We’ve been working with our scratch storage vendor (Panasas) quite a lot lately, and think we finally have some good news. Addressing the scratch space will be a major thrust of this quarterly maintenance, and we are cautiously optimistic that we will see improvements. We will also be applying some VMware tuning to our RHEL5 virtual machines that should increase responsiveness of those head nodes & servers. Completing upgrades to RHEL6 for a few clusters and a few other minor items round out our activities for the day.
Scratch storage
We have been testing new firmware on our loaner Panasas storage. Despite our best efforts, we have been unable to replicate our current set of problems after upgrading the loaner equipment to this firmware. This is good news! However, the upgrade alone is insufficient to fully resolve our issues, so on maintenance day we will be performing a number of tasks related to the Panasas. After the firmware update, we need to perform some basic file integrity checks (the equivalent of a UNIX fsck) on a couple of volumes. This process requires those volumes to be offline for the duration. After this, we need to read every file on the scratch that was created before the firmware upgrade. Based on our calculations, this will take weeks. Fortunately, this process can happen in the background, with the filesystems online and otherwise operating normally. The net result is that the full impact of our maintenance day improvements to the scratch will likely not be realized for a couple of weeks. If there are files (particularly large ones) that you no longer need and can delete, this process will go faster. We will also be upgrading the Panasas client software on all compute nodes to (hopefully) address performance issues.
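If you would like to look for large files to clean up, something along these lines may help. This is only a minimal sketch: the function name `largest_files` and the default of 20 results are our own illustrative choices, and you would pass in the path to your own scratch directory. It only lists files and never deletes anything.

```python
import heapq
import os
import stat
import sys


def largest_files(root, top_n=20):
    """Return up to top_n (size_in_bytes, path) pairs, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode):  # count regular files only
                sizes.append((info.st_size, path))
    return heapq.nlargest(top_n, sizes)


if __name__ == "__main__":
    # Directory to scan; defaults to the current directory.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for size, path in largest_files(root):
        print(f"{size / 1e9:10.2f} GB  {path}")
```

Run it with your scratch directory as the only argument and review the list before removing anything. Note that walking a very large directory tree generates its own metadata load, so it is best pointed at specific subdirectories rather than everything at once.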
Finally, we will also be instituting a 20TB per user hard quota in addition to the 10TB per user soft quota currently in place. Users that exceed the soft quota will receive warning emails, but writes will succeed. Writes will fail for users that attempt to exceed the hard quota.
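To make the difference between the two limits concrete, here is a small illustrative sketch of the policy described above. It is not how the Panasas enforces quotas; the function name `check_write`, the assumption that 1TB means 10^12 bytes, and the treatment of usage exactly at a limit as "not exceeding" are all our own assumptions for the example.

```python
TB = 10**12  # assumption: decimal terabytes; the storage system may count differently

SOFT_QUOTA = 10 * TB  # exceeding this triggers warning emails
HARD_QUOTA = 20 * TB  # writes that would exceed this fail


def check_write(current_usage, write_size):
    """Classify a write as 'ok', 'warn', or 'fail' under the two quotas.

    Both arguments are in bytes. Hypothetical helper for illustration only.
    """
    resulting_usage = current_usage + write_size
    if resulting_usage > HARD_QUOTA:
        return "fail"  # the write is rejected
    if resulting_usage > SOFT_QUOTA:
        return "warn"  # the write succeeds, but a warning email is sent
    return "ok"
```

For example, a user holding 9TB who writes another 2TB would land in the warning range, while a user holding 19TB who attempts the same write would see it fail.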
VMware tuning
With some assistance from the Architecture and Infrastructure directorate in OIT, we will be making a number of adjustments to our VMware world. The most significant of these is adjusting the filesystem alignment of our RHEL5 virtual machines. Users of RHEL5 head nodes are likely to see the most improvement. We’ll also be installing the VMware tools packages and applying various tuning parameters enabled by these packages.
RHEL6 upgrades
The remaining RHEL5 portions of the clusters below will be upgraded to RHEL6. After maintenance day, RHEL5 will be unavailable to these clusters.
- Uranus
- BioCluster
- Cygnus
Misc items
- Configuration updates to redundant network switches serving some project storage
- Capacity expansion of the ECE file server
- Serial number updates to a small number of compute nodes lacking serial numbers in the BIOS
- Interoperability testing of Mellanox InfiniBand switches
- Finish project directory migration of two remaining Optimus users