Our maintenance activities are now underway. All PACE clusters are down. Please watch this space for updates.
For details on work to be completed, please see our previous posts, here.
Greetings!
The PACE team is once again preparing for maintenance activities that will occur starting at 6:00am Tuesday, January 26 and continuing through Wednesday, January 27. We have a couple of major items that we hope will provide a much better PACE experience.
Building on the new DDN hardware deployed in October, the migration of scratch storage to the new GPFS system is the dominant activity in this maintenance period. Our old Panasas scratch storage has now exceeded its warranty, so this is a “must do” activity. Given the performance level of the Panasas and the volume of data it contains, we do not believe we will be able to migrate all data during this maintenance period. So, we will optimize the migration to maximize the number of users migrated. Using this approach, we believe we will be able to migrate more than 99% of the PACE user community. After the maintenance window, we will work directly with those individuals whom we are not able to migrate. You will receive an email when your migration begins, and when it is complete. (Delivered to your official GT email address, see previous post!)
After the maintenance period, the old Panasas scratch will still be available, but in read-only mode. All users will have scratch space provisioned on the new GPFS scratch. Files for users who are successfully migrated will not be deleted, but will be rendered inaccessible except to PACE staff. This provides a safety net in the unlikely event that something goes wrong.
For the time being, we will preserve the 5TB soft quota and 7TB hard quota on the new GPFS scratch, as well as the 60-day maximum file age. However, file timestamps will be reset as they migrate, so the 60-day timer restarts for all files.
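As a reminder, the age limit is based on file modification time. If you'd like to see which of your scratch files are approaching that limit, a standard find invocation such as the one below will list them; the 45-day cutoff is just an example, and the path is the usual scratch symlink in your home directory:

    # List scratch files whose modification time is older than 45 days
    find ~/scratch -type f -mtime +45 -ls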
The ~/scratch symlinks in your home directories will also be updated to point to the new locations, so please continue to use these paths to refer to your scratch files. File names beginning with /panfs will no longer work once your migration is complete.
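For example, a job script that refers to scratch through the symlink will keep working across the migration, while one that hard-codes a /panfs path will not. A minimal sketch follows; the job name, resource request, directory, and program are all hypothetical:

    #!/bin/bash
    #PBS -N scratch-example
    #PBS -l nodes=1:ppn=4,walltime=1:00:00

    # Use the ~/scratch symlink rather than an absolute /panfs/... path,
    # so the script keeps working after the migration to GPFS.
    cd ~/scratch/my_run_dir
    ./my_program > output.log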
Pending successful testing, we will also be rolling out a bug-fix update to our Moab & Torque scheduler software and increasing network connectivity for our most heavily used schedulers. Issues addressed in this release include erratic notifications about failures when canceling jobs and incorrect groups being included in reports, along with some performance improvements. Unlike previous scheduler upgrades, all previously submitted jobs will be retained. No user action should be required as a result of this upgrade.
We will be upgrading network connectivity on some of our servers to take advantage of network equipment upgraded in October. No user action required.
We will adjust parameters on some GPFS clients to more appropriately utilize their Infiniband connections. This only affects the old (6+ years) nodes with DDR connections. We will also substitute NFS access for native GPFS access on machines that lack Infiniband connectivity or have otherwise been identified as poorly performing GPFS clients. In particular, this will affect most login nodes. The /gpfs path names on these machines will be preserved, so no user action is needed here either.
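If you are curious whether a given machine is accessing the storage natively via GPFS or over NFS, the filesystem types reported by mount will show it. This is purely informational; the /gpfs paths behave the same either way:

    # The filesystem type shows gpfs for native access, nfs otherwise
    mount | grep -iE 'gpfs|nfs'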
The /nv/pk1 filesystem for the Aryabhata cluster will be migrated to GPFS.
The /usr/local filesystem will be exported read-only. This is a security measure, and should not impact normal operations.
We will continue the transition away from diskless nodes that we started in October. This mainly affects nodes in the 5-6 year old range. Apart from more predictable performance on these nodes, this should be a transparent change.
Greetings!
In order to help ensure the reliability of email communications from PACE, we will be changing how we deliver mail effective Wednesday, January 20 (next week!). From then on, PACE will use only the officially published email addresses as defined in the Georgia Tech Directory.
This is a needed change, as we have many, many messages that we have been unable to deliver due to outdated or incorrect destinations.
The easiest way to determine your official email address is to visit http://directory.gatech.edu and enter your name. If you wish to change your official email address, visit http://passport.gatech.edu.
In particular, this change will affect the address that is subscribed to PACE-related email lists (e.g. pace-availability) as well as job status emails generated automatically by the schedulers.
For the technically savvy: we will be changing our mail servers to look up addresses from GTED. We will no longer use the contents of a user's ~/.forward file.
P.S. Users of the Tardis cluster do not have entries in the Georgia Tech Directory, so this change does not apply to you.
Greetings,
We’ve made substantial progress getting through our activities, and are releasing jobs. We still have a number of compute nodes that need to be brought online; however, all clusters have some amount of resources and are running jobs. We will continue to work through these issues later today. After sleep.
Major upgrade to DDN & a new scratch storage
All data was migrated successfully to the new front ends, and additional disks have been added for the upcoming scratch. We hit substantial delays due to unanticipated long-running processes to join compute nodes to the new GPFS cluster; this work is still ongoing. Benchmarking suggests a slight performance improvement for those of you with project directories in GPFS.
New PACE router and firewall hardware & additional core network capacity
successfully completed without incident.
Panasas scratch filesystem maintenance
successfully completed without incident.
Migration of home directories
successfully completed without incident.
Migration of /usr/local storage
successfully completed without incident.
Begin transition away from diskless compute nodes
migrated approximately 100 compute nodes. Some of these still have issues with GPFS, as above.
We’ve had some unexpected delays and challenges this go-around. The short version is that we will need to extend our maintenance activities into tomorrow. We’ll release clusters to you on a rolling basis as we are able to bring compute nodes online.
The long version:
The storage system that is responsible for /usr/local and our virtual machine infrastructure experienced a hardware failure that cost us a significant amount of time. Some PACE staff have spent 40 of the last 48 hours on site trying to make corrections. We were already planning to transition /usr/local off of this storage and had alternate storage in place. Likewise for the virtual machines, although our plan was to live-migrate those after maintenance activities were complete. The good news is that we don’t have data loss; the bad news is that we’ve had to accelerate the virtual machine migration, resulting in additional unplanned effort.
Also, the DDN work is taking far longer than expected. Part of this work required us to remove all nodes from the GPFS filesystem and add them back in again. Current estimates to bring everything back to full production range from an additional 12 to 24 hours, which means it will be between 10am and 10pm tomorrow before we have everything back up. As mentioned above, we will make things available as soon as we can. Pragmatically, that means that clusters will initially be available at reduced capacity. Look for another post here when we start enabling logins again.
Our maintenance activities are now underway. All PACE clusters are down. Please watch this space for updates.
For details on work to be completed, please see our previous posts, here.
Greetings,
The PACE team is preparing for our quarterly maintenance that will occur Tuesday, October 20 and Wednesday, October 21. We have a number of activities scheduled that should provide improvements across the board.
** NO USERS WILL BE MIGRATED DURING THE MAINTENANCE PERIOD **
After the maintenance period, we will begin migrating users to the new scratch storage. This will be a lengthy process, with some user actions and coordination required. We will do our best to minimize the impact of the migration. We are targeting our January maintenance to retire the Panasas storage, as the service contracts expire at the end of December.
Greetings,
Our quarterly maintenance is now complete. We have no known outstanding issues affecting general operations, but have a few straggling nodes that we will address over the next couple of days.
All compute, login and interactive nodes have been updated to version 3.5.0-25 of the GPFS client per recommendation of DDN. This update addresses the bugs identified in the -20 version that caused problems during our April maintenance. No user changes should be needed.
The “newrepo” software repository has been made the default. Please note that there are a significant number of changes in available software versions relative to the old repository; jobs that reference versions that are no longer available will have difficulty running. If you were already doing a ‘module load newrepo’ before this maintenance, you should not notice any difference.
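If you want to verify what is available now that newrepo is the default, the standard module commands will show it; “gcc” below is just an example package name:

    # List everything visible in the (now default) repository
    module avail
    # Or narrow the listing to a particular package
    module avail gcc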
We’ve reset our Infiniband fabric and it appears to be in good health.
The storage devices for this project finally arrived earlier today. This item will be deferred until a future maintenance period.
We weren’t quite ready to complete this bonus objective, so we’ll try to find a period of inactivity to do it between now and our next maintenance period. Whenever this happens, no user changes will be needed.