Scheduler
The new server for the workload scheduler seems to be working well. We haven’t received much user feedback, but what we have received has been positive, which matches our own observations. Presuming things continue to go well, we will relax some of our rate-limiting tuning parameters on Thursday morning. This shouldn’t cause any interruptions (even to submitting new jobs) but should allow the scheduler to start new jobs at a faster rate. The goal is to decrease the wait times some users have been seeing. We’ll increase these parameters slowly and monitor for bad behavior.
Scratch Storage
The news on the Panasas scratch storage is not as good. Last week, we received two “shelves” of storage to test. (For comparison, we have five in production.) Over the weekend, we put these through synthetic tests designed to mimic the behavior that causes the storage to fail. The good news is that we were able to replicate the problem in the testbed. The bad news is that the highly anticipated new firmware provided by the vendor still does not fix the issues. We continue to press Panasas quite aggressively for a resolution and are looking into contingency plans – including alternate vendors. Given that we are five weeks out from our normal maintenance day and have no viable fix, an emergency maintenance period between now and then seems unlikely at this point.