Cluster Downtime December 19th for Scratch Space Cancelled

We have been working very closely with Panasas regarding the necessity of emergency downtime for the cluster to address the difficulties with the high-speed scratch storage. At this time, they have located a significant problem in their code base that they believe is responsible for these and other issues. Unfortunately, the full product update will not be ready in time for the December 19th date, so we have cancelled this emergency downtime; all running and scheduled jobs will continue as expected.

We will update you with the latest summary information from Panasas when available. Thank you for your continued patience and cooperation with this issue.

– Paul Manno

Cluster Downtime December 19th for Scratch Space Issues

As many of you have noticed, we have experienced disruptions and poor performance with our high-speed scratch space. We are continuing to work diligently with Panasas to find the root cause of these faults and a repair for them.

As we work toward a final resolution of the product issues, we will need to schedule an additional cluster-wide downtime of the Panasas system to implement a potential resolution. We are scheduling a short downtime (2 hours) for Wednesday, December 19th at 2pm ET. During this window, we expect to install a tested release of the software.

We understand this is an inconvenience to all our users, but we feel this is important enough to the PACE community to warrant the disruption. If this particular date and duration falls at a time that is especially difficult for you, please contact us and we will do our best to negotiate a better date or time.

It is our hope that this will provide a permanent solution to these near-daily disruptions.

– Paul Manno

Scratch storage issues: update

We continue to work with Panasas on the difficulties with our high-speed scratch storage system. Since the last update, we have received and installed two PAS-11 test shelves and have successfully reproduced our problems on them under the current production software version. We then updated to their latest release and re-tested, only to observe a similar problem with the new release as well.

We’re continuing to do what we can to encourage the company to find a solution but are also exploring alternative technologies. We apologize for the inconvenience and will continue to update you with our progress.

Scratch Storage and Scheduler Concerns

The PACE team is urgently working on two ongoing critical issues with the clusters:

Scratch storage
We are aware of access, speed, and reliability issues with the high-speed scratch storage system and are currently working with the vendor to define and implement a solution. We are told that a new version of the storage system firmware, released today, will likely resolve our issues. The PACE team is expecting the arrival of a test unit on which we can verify the vendor's solution. Once we have verified it, we are considering an emergency maintenance window for the entire cluster in order to implement the fix. We welcome your feedback on this approach, especially regarding the impact on your research. We will let you know, and work with you on scheduling, when a known solution is available.

Scheduler
We are presently preparing a new system to host the scheduler software. We expect the more powerful system to alleviate many of the difficulties you are experiencing with the scheduler, especially the delays in job scheduling and the time-outs when requesting information or submitting jobs. Once the new system is ready, we will need to suspend the scheduler for a few minutes while we transition services to it. We do not anticipate that this transition will affect any jobs that are currently running or queued.

In both situations, we will provide you with notice well in advance of any potential interruption and work with you to minimize the impact on your research schedule.

– Paul Manno

We’re back up

The maintenance day ran a bit longer than anticipated, but the clusters are now back in operation and processing jobs. As usual, please send any reports of trouble to pace-support@oit.gatech.edu.

Clusters Are Back!

After days of continuous struggle and troubleshooting, we are happy to tell you that the clusters are finally back in a running state. You can now start submitting your jobs. All of your data are safe; however, jobs that were running during the incident were killed and will need to be restarted. We understand how this interruption must have adversely impacted your research and apologize for all the trouble. Please let us know (pace-support@oit.gatech.edu) if there is anything we can do to bring you back up to speed.

The brief technical explanation of what happened:
At the heart of the problem was a set of fiber optic cables that interacted to intermittently interrupt communications among the Panasas storage modules. When a module stopped communicating, the remaining modules would begin moving the services it handled to a backup location. During the move, one of the other modules (including the one accepting the new service) would send or receive garbled information, causing the move already in progress to be re-recovered or an additional service to be relocated, depending on which modules were involved. Interestingly, the cables themselves appear not to be bad; they simply interacted badly with the networking components. Thus, when cables were replaced, or switch ports or the network switch itself were swapped, the problems would appear "fixed" for a short while, then return before a full recovery could complete. The three vendors involved provided access to their top support and engineering resources, and none of them had seen this kind of behavior before. Our experience has been entered into their knowledge bases to aid future diagnostics.
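For those who want a more concrete picture, the short Python sketch below illustrates the failure mode in the abstract. It is not Panasas code, and the module and service names are made up; it only shows how a failover that is itself disrupted mid-move can trigger further relocations, so the system keeps re-recovering instead of settling.

import random

random.seed(42)

# Hypothetical modules, each hosting a few services (names are illustrative only).
modules = {f"shelf-{i}": {f"svc-{i}{j}" for j in range(3)} for i in range(4)}

def fail_over(down_module, garble_prob=0.5, max_steps=20):
    """Move services off a non-communicating module; each move may be
    disrupted ("garbled"), forcing yet another relocation."""
    pending = list(modules.pop(down_module))   # services that must be re-homed
    steps = 0
    while pending and steps < max_steps:
        steps += 1
        svc = pending.pop()
        target = random.choice(list(modules))  # pick a surviving module
        if random.random() < garble_prob:
            # Garbled traffic during the move: redo the in-flight recovery and
            # relocate one of the target's own services as well (the cascade).
            pending.append(svc)
            if modules[target]:
                pending.append(modules[target].pop())
            print(f"step {steps}: move of {svc} to {target} disrupted; re-recovering")
        else:
            modules[target].add(svc)
            print(f"step {steps}: {svc} relocated to {target}")
    return not pending  # True only if every service found a stable home

print("recovered fully:", fail_over("shelf-0"))

With a high "garble" probability the run rarely converges within the step limit, which is roughly what we observed: recoveries kept being restarted faster than they could complete.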

Thank you once again for your understanding and patience!

Regards,
PACE Team