Summary: The Phoenix scheduler became unresponsive on Friday 9/9/2021 between 7:30 PM and 10:00 PM.
Details: The Torque resource manager on the Phoenix scheduler crashed unexpectedly around 7:30 PM. A bad GPU node with the same error message caused a segmentation fault on the server. The crashing scheduler corrupted a handful of queued jobs that had dependencies, and those records had to be pruned from the system. Around 10:00 PM, the node causing the issues was purged from the scheduler and the corrupted jobs were removed, restoring normal operations.
Impact: Running jobs were not interrupted, but no new jobs could be submitted while the scheduler was down, including via Phoenix Open OnDemand. Commands such as “qsub” and “qstat” were affected. Corrupted jobs in the queue were cancelled.
Thank you for your patience this evening. Please contact us at pace-support@oit.gatech.edu with any questions.