We’ve had some unexpected delays and challenges this go around. The short version is that we will need to extend our maintenance activities into tomorrow. We’ll do a rolling release, making compute nodes available to you as we are able to bring them back online.
The long version:
The storage system responsible for /usr/local and our virtual machine infrastructure experienced a hardware failure that cost us a significant amount of time. Some PACE staff have spent 40 of the last 48 hours on site trying to make corrections. We were already planning to transition /usr/local off of this storage and had alternate storage in place; likewise for the virtual machines, although our plan was to live-migrate those after maintenance activities were complete. The good news is that we have no data loss; the bad news is that we’ve had to accelerate the virtual machine migration, resulting in additional unplanned effort.
Also, the DDN work is taking far longer than expected. Part of this work required us to remove all nodes from the GPFS filesystem and add them back in again. Current estimates to bring everything back to full production range from an additional 12 to 24 hours, which puts us somewhere between 10am and 10pm tomorrow before everything is back up. As mentioned above, we will make things available as soon as we can. Pragmatically, that means the clusters will initially be available at reduced capacity. Look for another post here when we start enabling logins again.