January 2013 quarterly maintenance is complete

Greetings!

We have completed our quarterly maintenance activities.  Head nodes are online again and available for use, queued up jobs have been released, and the scheduler is awaiting new submissions.

Our RedHat 6 clusters have received system software updates.  Please keep an eye on your jobs to verify everything is operating correctly.

Our Panasas scratch storage has received another round of updates.  Preliminary testing indicates that our crashes should be resolved, but the quota system is known to be broken.  As advised by Panasas, we have disabled quotas on scratch.  Please do your best to stay below the 20TB threshold.  We will be monitoring usage, and we know where you live.  🙂

We have a new license server providing checkouts of the Portland Group and Intel compilers, Matlab DCS, the Allinea DDT debugger, and Lumerical.  Please let us know if you have problems accessing this software.  The old server is still running, and we will monitor it for a short while to catch any stray checkouts still pointed at it.

More nodes from Joe and the FoRCE have been converted from RHEL5 to RHEL6.  If you are still using the RHEL5 side of the world, please prioritize a transition to RHEL6.  We stand ready to assist you with this transition.

Finally, our new configuration system has been deployed in prototype mode.  We will use this to gather operational information and other data that will facilitate a full transition to this system in a future maintenance day.

As usual, please let us know (via email to pace-support@oit.gatech.edu) if you encounter any issues.

Happy Computing!

–Neil Bright
 

Datacenter modifications

Tomorrow morning (January 9) at 8:30am, facilities management will be performing some work on the power distribution systems in the Rich datacenter.  None of this work is being performed on anything that powers PACE systems; there should be zero impact on any job or computer that PACE manages.  However, because we share space in the datacenter, PACE systems could be affected in the event of a major problem.

Once again, there should be zero impact on PACE systems; no jobs or computers should be affected.

Please let us know (via email to pace-support@oit.gatech.edu) if you have any questions or concerns.

TestFlight upgraded to new 6.3 stack

Here’s our present for the holidays: a new OS stack based on RHEL 6.3, which our tests indicate gives a performance boost across all CPU architectures. Please try your codes on TestFlight now to make sure we haven’t introduced new bugs in this stack, and report any problems you see.

Scheduled Quarterly Maintenance on 01/15/2013

The first quarterly maintenance of 2013 will take place on 01/15. All systems will be taken offline for the entire day. We hope that no jobs will need to be killed, since we have been placing holds on jobs that would still be running on that day. If you submitted jobs with long walltimes (extending past 01/15), you will notice that the scheduler is holding them to protect them from being killed.
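If you would like to confirm whether your own jobs are among those being held, a standard scheduler query will show their state. This is a minimal sketch assuming a Torque/Moab-style environment; the exact state codes may vary:

qstat -u $USER        # lists your jobs; a state of "H" indicates a held job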

Here’s a summary of the tasks that we are planning to accomplish on the maintenance day.

* OS upgrade (6.2 to 6.3): We will upgrade the RHEL OS to version 6.3. This version offers better compatibility with our hardware, with potential performance benefits. We have been testing the existing software stack with this version to verify compatibility and do not expect any problems. We are upgrading the testflight nodes to 6.3 (they should be online very soon), so please submit test jobs to this queue to verify that your codes will continue to run on the new system (see the sample submission script after this list).

* Scratch storage maintenance: As most of you already know, we have been working with Panasas to resolve the ongoing crashes. Panasas has identified the cause, and the fix will require a new release of their system software. We expect to deploy a tested version on this maintenance day.

 Important: The new release will be tested on a separate storage system provided by Panasas, not on our production system. Therefore, we must be prepared for the possibility of unforeseen problems that are only triggered by production runs with real usage patterns. In an effort to shield long-running jobs from such an event, we are placing another reservation that only allows jobs which will complete by 02/17/2013; longer jobs will be held. This way, should we need to declare an emergency downtime, we will be able to do so with minimal impact. This means jobs with more than 31 days of walltime will be held until February 17th, so please keep this in mind when setting walltimes for your jobs. This reservation is contingent upon the stability of the system, and it may be removed earlier than this date if we feel confident enough. We are sorry for this inconvenience.

* Conversion of more RHEL5 nodes to RHEL6: The majority of our users have already made the switch to RHEL6 systems. Therefore, we will migrate more of the FoRCE and Joe nodes to the corresponding RHEL6 queues. We are not getting rid of the RHEL5 queues entirely (just yet), but the number of nodes they contain will be significantly reduced. Please contact us if your jobs still depend on RHEL5, since this version will be deprecated in the near future.

* Deployment of new database-driven configuration builders (dry-run mode only): We are developing a new system to manage user accounts, queues, and many other system management tasks, with the goal of minimizing human error and maximizing efficiency. We will deploy a dry-run-only prototype of this system, which will run alongside the existing mechanisms. This will allow us to test and verify the new system against real usage scenarios to assist the development effort; it will not be used for actual management tasks.

* New license server: We will start using a new license server, since the existing server is getting old. We will migrate the existing licenses to the new server on the maintenance day. We don’t expect any difficulties, but please contact us if you notice any problems with licenses.
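For those who want to start testing right away, here is a minimal sketch of a test submission that covers both of the points above: it targets the testflight queue and requests a walltime well under the temporary limit. It assumes a Torque/Moab-style environment and a queue actually named "testflight"; the script name, resource requests, and executable are placeholders, so adapt them to your own workflow.

#PBS -N rhel63-test           # job name
#PBS -q testflight            # queue with the upgraded RHEL 6.3 nodes (name assumed)
#PBS -l nodes=1:ppn=1         # a single core is enough for a quick smoke test
#PBS -l walltime=04:00:00     # short walltime, comfortably under the reservation limit
cd $PBS_O_WORKDIR
./my_code < my_input.txt      # replace with your own executable and input

Save this as something like test.pbs and submit it with "qsub test.pbs"; if it runs cleanly on the new stack, your production jobs are likely in good shape.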

As always, please let us know if you have any concerns or questions at pace-support@oit.gatech.edu.

TestFlight in process of update

We have temporarily stopped the TestFlight queues to allow them to drain so that we may upgrade the TestFlight machines to a new stack based on RHEL 6.3. Once all machines have been upgraded, we will re-enable the queues so jobs can test the suitability of the new stack.

Should there be no major software issues, this will become the default OS for RHEL6-based clusters on the next maintenance day, scheduled for January 17, 2013.

Maintenance Day (October 16, 2012) – complete

We have completed our maintenance activities.  Head nodes are online again and queued up jobs are being released.

Our filesystem correction activities on the scratch storage found eight “objects” on the v7 volume to be damaged; these were automatically removed.  Unfortunately, the process provides no indication of which files or directories were problematic.

As always, please follow up with pace-support@oit.gatech.edu about any problems you may see, ideally using the pace-support.sh script discussed here: https://pace.gatech.edu/support.

campus network maintenance

The Network team will be performing some scheduled maintenance this Saturday morning.  This may impact connectivity from your workstations, laptops, or home, but should not affect jobs running within PACE.  However, if your job requires access to network services outside of the PACE cluster (e.g. a remote license server), this maintenance may affect it.

For further information please see the maintenance announcement on status.oit.gatech.edu.

Check the status of queue(s) using “pace-check-queue”

Dear PACE Users,

We have a new tool to announce. If you would like to check the status of any PACE queue, you can now run:

pace-check-queue <queuename>

substituting <queuename> with the name of the queue you would like to check. The output includes a column that tells you whether each node is accepting jobs, along with a human-readable explanation when it is not. At a glance, this tool provides the following information:

* Which nodes are included in the queue

* Which nodes accept jobs and which don’t (and if they don’t, why)

* How many cores and how much memory each node has, and what percentage of each is being used

* Overall usage (CPU/Memory) levels for the entire queue.

(This information is refreshed every half an hour)
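For example, to check a queue named "force-6" (a hypothetical name; substitute one of your own queues), you would run:

pace-check-queue force-6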

We recently announced another tool, pace-stat, for checking the status of your queues. These tools complement each other, so feel free to use both. Please report any down or problematic nodes that you see in the list to pace-support@oit.gatech.edu.

We hope these new tools will provide you with a better HPC environment. Happy computing!

PS: These tools are continuously being developed, therefore your feedback and suggestions for improvements are always welcome!

upcoming maintenance day, 10/16 – working on the scratch storage

It’s that time again.  We’ve been working with our scratch storage vendor (Panasas) quite a lot lately, and think we finally have some good news.  Addressing the scratch space will be a major thrust of this quarterly maintenance, and we are cautiously optimistic that we will see improvements.  We will also be applying some VMware tuning to our RHEL5 virtual machines that should increase responsiveness of those head nodes & servers.  Completing upgrades to RHEL6 for a few clusters and a few other minor items round out our activities for the day.

Scratch storage

We have been testing new firmware on our loaner Panasas storage.  Despite our best efforts, we have been unable to replicate our current set of problems after upgrading the loaner equipment to this firmware.  This is good news!  However, simply upgrading is insufficient to fully resolve our issues, so on maintenance day we will be performing a number of tasks related to the Panasas.  After the firmware update, we need to perform some basic file integrity checks – the equivalent of a UNIX fsck – on a couple of volumes.  This process requires those volumes to be offline for the duration.  After that, we need to read every file on the scratch storage that was created before the firmware upgrade.  Based on our calculations, this will take weeks.  Fortunately, this process can happen in the background, with the filesystems online and otherwise operating normally.  The net result is that the full impact of our maintenance day improvements to the scratch storage will likely not be realized for a couple of weeks.  If there are files (particularly large ones) that you no longer need and can delete, this process will go faster.  We will also be upgrading the Panasas client software on all compute nodes to (hopefully) address performance issues.

Finally, we will also be instituting a 20TB per-user hard quota in addition to the 10TB per-user soft quota currently in place.  Users who exceed the soft quota will receive warning emails, but their writes will still succeed.  Writes will fail for users who attempt to exceed the hard quota.
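If you would like to see how close you are to these limits ahead of time, a plain disk-usage summary is a reasonable sketch; the path below is an assumption, so substitute the actual location of your scratch directory:

du -sh ~/scratch        # total size of everything under your scratch directory (path assumed)

Note that du can take a while on large directory trees, so consider running it from within a job rather than interactively on a busy head node.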

VMware tuning

With some assistance from the Architecture and Infrastructure directorate in OIT, we will be making a number of adjustments to our VMware environment.  The most significant of these is adjusting the filesystem alignment of our RHEL5 virtual machines; users of RHEL5 head nodes are likely to see the most improvement.  We’ll also be installing the VMware tools packages and applying various tuning parameters enabled by this package.

RHEL6 upgrades

The remaining RHEL5 portions of the clusters below will be upgraded to RHEL6.  After maintenance day, RHEL5 will be unavailable to these clusters.

  • Uranus
  • BioCluster
  • Cygnus

Misc items

  • Configuration updates to redundant network switches serving some project storage
  • Capacity expansion of the ECE file server
  • Serial number updates to a small number of compute nodes lacking serial numbers in the BIOS
  • Interoperability testing of Mellanox Infiniband switches
  • Finish project directory migration of two remaining Optimus users