Scheduled Quarterly Maintenance on 01/15/2013

The first quarterly maintenance of 2013 will take place on 01/15. All systems will be offline for the entire day. We hope that no jobs will need to be killed, since we have been placing holds on jobs that would still be running on that day. If you submitted jobs with long walltimes (extending past 01/15), you will notice that the scheduler is holding them to protect them from being killed.
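If you would like to check ahead of time whether a job with a given walltime would end before the maintenance day, the arithmetic can be sketched roughly as follows. This is only an illustration and not the scheduler's actual logic: the `HH:MM:SS` walltime format, the midnight start of the maintenance window, and the helper names `parse_walltime` and `finishes_before` are all assumptions made for this example.

```python
from datetime import datetime, timedelta

# Assumed start of the maintenance window (midnight on 01/15/2013); the
# announcement only states that systems are offline for the entire day.
MAINTENANCE_START = datetime(2013, 1, 15)


def parse_walltime(walltime: str) -> timedelta:
    """Convert an HH:MM:SS walltime string (hours may exceed 24) to a timedelta."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def finishes_before(start: datetime, walltime: str,
                    deadline: datetime = MAINTENANCE_START) -> bool:
    """Return True if a job starting at `start` would end by `deadline`."""
    return start + parse_walltime(walltime) <= deadline


if __name__ == "__main__":
    start = datetime(2013, 1, 10, 9, 0)
    # Hypothetical jobs: 100 hours and 120 hours of requested walltime.
    print(finishes_before(start, "100:00:00"))
    print(finishes_before(start, "120:00:00"))
```

A job for which this check fails is the kind that would be placed on hold. The same function can be pointed at a different cutoff by passing another `deadline` value.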

Here’s a summary of the tasks that we are planning to accomplish on the maintenance day.

* OS upgrade (6.2 to 6.3): We will upgrade the RHEL OS to version 6.3. This version offers better compatibility with our hardware, with potential performance benefits. We have been testing our existing software stack with this version to verify compatibility and do not expect any problems. We are upgrading the testflight nodes to 6.3 (they should be online very soon), so please submit test jobs to this queue to verify that your codes will continue to run on the new system.

* Scratch storage maintenance: As most of you already know, we have been working with Panasas to resolve the ongoing crashes. Panasas has identified the cause, and the fix requires a new release of their system software. We expect to deploy a tested version on this maintenance day.

 Important: The new release will be tested on a separate storage system provided by Panasas, not on our production system. Therefore, we must be prepared for the possibility of unforeseen problems that are only triggered by production runs with actual usage patterns. In an effort to shield long-running jobs from such an event, we are placing another reservation that only allows jobs that will complete by 02/17/2013, while holding longer jobs. This way, should we need to declare an emergency downtime on that day, we will be able to do so with minimal impact. This will require jobs with more than 31 days of walltime to be held until February 17th, so please consider this when setting walltimes for your jobs. This reservation is contingent upon the stability of the system, and we may remove it before this date if we are confident enough in the new release. We are sorry for this inconvenience.

* Conversion of more RHEL5 nodes to RHEL6: The majority of our users have already made the switch to RHEL6 systems. Therefore, we will migrate more of the FoRCE and Joe nodes to the corresponding RHEL6 queues. We are not getting rid of the RHEL5 queues entirely (just yet), but the number of nodes they contain will be significantly reduced. Please contact us if your jobs still depend on RHEL5, since this version will be deprecated in the near future.

* Deployment of new database-driven configuration builders (dry-run mode only): We are developing a new system to manage user accounts and queues, along with many other system management tasks, to minimize human error and maximize efficiency. We will deploy a prototype of this system in dry-run mode only, running alongside the existing mechanisms. This will allow us to test and verify the new system against real usage scenarios to assist in the development effort. The prototype will not be used for actual management tasks.

* New license server: We will start using a new license server, since the system on the existing server is getting old. We will migrate the existing licenses to the new server on the maintenance day. We don’t expect any difficulties, but please contact us if you notice any problems with licenses.

As always, please let us know if you have any concerns or questions at pace-support@oit.gatech.edu.