IDEaS Storage Maintenance

WHAT’S HAPPENING?

One of the IDEaS IntelliFlash controller cards needs to be reseated. Before reseating the card, we will fail over all resources to controller B, shut down controller A, and pull the whole enclosure out to reseat the card. The activity takes about 2 hours to complete.

WHEN IS IT HAPPENING?

Monday, July 8th, 2024, starting at 9 AM EDT.

WHY IS IT HAPPENING?

We are working with the vendor to resolve an issue discovered while debugging the controllers and to restore the system to a healthy state.

WHO IS AFFECTED?

Users of the IDEaS storage system will notice decreased performance, since all services will be switched over to a single controller. Access may be interrupted while the failover takes place.

WHAT DO YOU NEED TO DO?

During the maintenance, data access should be preserved, and we do not expect downtime. However, there have been cases in the past where storage has become inaccessible. If storage becomes unavailable during the maintenance, jobs accessing the IDEaS storage may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage can be accessed again.

WHO SHOULD YOU CONTACT FOR QUESTIONS?

For any questions, please contact PACE at pace-support@oit.gatech.edu.

Phoenix Scheduler Outage

Summary: The Phoenix scheduler became unresponsive on Friday 9/9/2021 between 7:30 PM and 10 PM.

Details: The Torque resource manager on the Phoenix scheduler crashed unexpectedly around 7:30 PM. A bad GPU node reporting the same error message caused a segmentation fault on the server. The crashing scheduler also corrupted a handful of queued jobs with dependencies, and those records had to be pruned from the system. Around 10 PM, the node causing issues was purged from the scheduler and the corrupted jobs were removed, restoring normal operations.

Impact: Running jobs were not interrupted, but commands such as “qsub” and “qstat” did not work while the scheduler was down, so no new jobs could be submitted, including via Phoenix Open OnDemand. Corrupted jobs in the queue were cancelled.

Thank you for your patience this afternoon. Please contact us at pace-support@oit.gatech.edu with any questions.

Phoenix Scheduler Outage

[Update 7/15/22 5:10 PM]

The PACE team has identified and addressed several known issues to restore scheduler functionality. The scheduler is back up and running, and new jobs can be submitted. We will continue to monitor its performance over the next week, and we appreciate your patience as we work through this. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Post 7/15/22 3:12 PM]

Summary: The Phoenix scheduler’s response has been inconsistent since earlier today. While we work toward a full resolution, we mitigated the issue by restarting the scheduler at approximately 2:00 PM today. The scheduler will be shut down temporarily so that we can restore it to full capacity.

Details: An unexpected scheduler crash earlier this week has resulted in ongoing issues that we are actively working with the vendor to fully resolve. The resource manager has been unable to detect all free resources. As a result, the Torque resource manager on the Phoenix scheduler was not accepting certain interactive jobs this morning, and some jobs have been waiting in the queue for an unusually long time before launching. The PACE team restarted the scheduler and restored some of its function around 2:00 PM. The team is continuing to work on the issue to fully resolve it as soon as possible. To that end, the scheduler will be shut down temporarily for a system-wide restart.

Impact: Interactive job submissions requesting a relatively high number of processors or a large amount of memory were failing and being cancelled without an error message. The wait time for jobs in the queue has also been longer than usual. These issues have been resolved. However, while the scheduler is down, new jobs cannot be submitted and scheduler commands such as qstat will not work. Jobs that are currently running will complete without interruption.

Thank you for your patience this morning. Please contact us at pace-support@oit.gatech.edu with any questions. We will follow up with another status message later today.

Best,

-The PACE Team