Phoenix scheduler outage

Summary: The Phoenix scheduler became nonresponsive yesterday evening and was restored at approximately 11:30 PM last night.

Details: Yesterday evening, the Torque resource manager on the Phoenix scheduler became overloaded, likely shortly after 7:30 PM. The PACE team restarted the scheduler and restored its function just before 11:30 PM last night.

Impact: Commands such as “qsub” and “qstat” would not have worked, so new jobs could not be submitted. Running jobs were not interrupted.
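For scripted workflows, one way to detect this kind of outage before submitting is to check whether a scheduler command returns within a time limit. The following is a minimal sketch, not a PACE-provided tool: the helper name `command_responds` and the 15-second timeout are assumptions chosen for illustration, and only the `qstat` command itself comes from this notice.

```python
import subprocess


def command_responds(cmd, timeout_s=15.0):
    """Return True if `cmd` exits with status 0 within `timeout_s` seconds.

    Hypothetical helper for illustration; the default timeout is an assumption.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # The command hung past the timeout, or it could not be started
        # (for example, it is not installed on this host).
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # "qstat" is the command named in this notice.
    if command_responds(["qstat"]):
        print("Scheduler is responding.")
    else:
        print("Scheduler did not respond; hold off on submitting new jobs.")
```

A check like this only reports whether the command answered; it cannot tell an overloaded scheduler from any other cause of failure, so treat a negative result as a prompt to wait and retry.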

Thank you for your patience last night. Please contact us at pace-support@oit.gatech.edu with any questions.

Hive project & scratch storage cable replacement

Summary: Hive project & scratch storage cable replacement and potential for an outage

Details: A cable connecting the Hive GPFS device, hosting project (data) and scratch storage, to one of its controllers needs to be replaced, beginning around 11:30 AM Tuesday (April 26).

Impact: Since there is a redundant controller, no impact is expected. However, a similar previous replacement caused storage to become unavailable, so this is a possibility. If the redundant controller fails, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored. PACE will monitor Hive GPFS storage throughout this procedure. If a loss of availability occurs, we will update you.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Launch of Open OnDemand Portal for PACE’s Phoenix and Hive Clusters

Dear PACE Researchers, 

We are pleased to announce the official release of the Open OnDemand (OOD) portal for PACE’s Phoenix and Hive clusters! The OOD portal allows you to access PACE compute resources through your browser and provides a seamless interface for several interactive applications, including Jupyter, MATLAB, and a general interactive desktop environment. Each PACE cluster has its own portal, which gives you access to all your data as usual through the web interface.

In-depth documentation on OOD at PACE is available at https://docs.pace.gatech.edu/ood/guide, and links to the portal for each PACE cluster are listed below: 

Please note that you will need to be on the GT VPN in order to access the OOD portals.

Thursday’s PACE clusters orientation will feature a demo using OOD. To register for an upcoming PACE clusters orientation, visit https://b.gatech.edu/3w6ifqO.

Please direct any questions about Open OnDemand to our ticketing system via email to pace-support@oit.gatech.edu or by filling out a help request form.  

Cheers! 

– The PACE Team 

Phoenix scheduler outage

Summary: The Phoenix scheduler stopped launching new jobs on Friday evening and was restored at approximately 9:30 AM on Saturday.

Details: At some point after 8 PM on Friday evening, the node hosting the Moab workload manager of the Phoenix scheduler lost its network connection, leaving it unable to communicate with the rest of the cluster. The PACE team repaired the connection just before 9:30 AM on Saturday morning, and functionality was restored.

Impact: While jobs could be submitted via “qsub” and checked via “qstat”, no new jobs would launch; they would instead remain queued. Moab commands such as “showq” would not have worked. Running jobs were not interrupted.

Thank you for your patience over the weekend. Please contact us at pace-support@oit.gatech.edu with any questions.

Phoenix scheduler outage

Summary: The Phoenix scheduler became nonresponsive overnight and was restored at approximately 9:00 AM today.

Details: Last night, the Phoenix scheduler became nonresponsive, likely shortly after midnight. The PACE team restarted the scheduler and restored its function just before 9:00 this morning.

Impact: Commands such as “qsub” and “qstat” would not have worked, so new jobs could not be submitted early this morning. Running jobs were not interrupted.

Thank you for your patience early this morning. Please contact us at pace-support@oit.gatech.edu with any questions.

Hive project & scratch storage cable replacement

Summary: Hive project & scratch storage cable replacement and potential for an outage

Details: A cable connecting the Hive GPFS device, hosting project (data) and scratch storage, to one of its controllers needs to be replaced, beginning around 10:00 AM Tuesday (April 12).

Impact: Since there is a redundant controller, no impact is expected. However, a similar previous replacement caused storage to become unavailable, so this is a possibility. If the redundant controller fails, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored. PACE will monitor Hive GPFS storage throughout this procedure. If a loss of availability occurs, we will update you.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Resolved] Phoenix Charge Account Authorization

[Update 4/4/22 12:25 PM]

Summary: [Resolved] Free tier charge account balances did not reset on April 1. A manual reset was performed on April 4.

Details: The deleted Perl library that prevented job submissions last Thursday night and Friday morning also caused an error in the monthly reset of free tier charge account balances at midnight on Friday, April 1. Other accounts that reset on a monthly basis were not impacted. PACE manually reset all free tier account balances just before noon today.

Impact: Job submissions to free tier accounts over the last three days would have succeeded only if sufficient leftover balance from March remained. At this time, all free tier accounts have been reset to their full monthly allocation, and jobs run prior to the reset will not count towards April utilization. All faculty and their teams now have access to their full April free tier allocation. Researchers can run the “pace-quota” command to view their available charge accounts and balances.

We apologize for any disruption this may have caused. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Message 4/1/22 8:49 AM]

Summary: [Resolved] Phoenix users attempting to submit jobs have received an error message that they are not authorized for their charge account.

Details: Beginning yesterday evening, Phoenix users attempting to submit jobs have at times received an error message indicating that they are not authorized for charge accounts to which they should have access. PACE deployed a temporary repair at 6:45 PM yesterday. The issue recurred at midnight, and the temporary repair was again made at 8:15 AM today. We have now identified the root cause as a deleted Perl library on the scheduler and deployed a permanent fix.

Impact: At this time, researchers are again able to submit jobs. Please resubmit any rejected jobs with the usual charge account. Researchers can run the “pace-quota” command to view their available charge accounts and balances. No running jobs were impacted.

We apologize for this disruption. Please contact us at pace-support@oit.gatech.edu with any questions.

PACE Data Retention Updates

PACE is nearing the conclusion of our user records and data audit, and we would like to share with you several updates to how PACE handles data stored on our storage systems and the role of schools in managing that data.

PACE takes responsibility for the storage of all data from our old systems in the Rich datacenter that has not been assigned to a current member of the faculty. All of this data has been inaccessible to all users since December 1, 2020. Any school that wishes to do so may request data of a former faculty member and thereby assume responsibility for the data and for the cost of continuing to store it on PACE (unless it is relocated), in accordance with the Data Retention Guidelines provided by the GT administration and shared with you at this time. If the school does not make such a request, PACE will cover the cost of storing this data until July 2024, then delete anything that has not been requested.

All data left on PACE by faculty who departed the Institute after July 1, 2021, will follow the Data Retention Guidelines. Under these guidelines, the faculty member’s former school will be responsible for the data and the cost of storing it on PACE, relocating it to another storage system, or determining it can be deleted while complying with any relevant regulations and contracts. The 1 TB provided to each school at no charge on Phoenix may be used to store these files. Schools also have the option of purchasing PACE Archive storage, which is designed for long-term retention of data that does not need to be used regularly on PACE compute clusters.

If you have any questions about PACE storage, please contact PACE at pace-support@oit.gatech.edu, and we’ll be happy to discuss it with you.

[Resolved] Phoenix scheduler outage

[Update 3/1/22 6:40 PM]

The Phoenix scheduler has been restored after the PACE team rebooted the server hosting the Moab workload manager. Thank you for your patience while we performed this repair. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Original Post 3/1/22 5:20 PM]

Summary: Phoenix scheduler outage prevents new jobs from starting

What’s happening and what are we doing: The Phoenix scheduler is currently experiencing an outage as the server hosting the Moab workload manager is unavailable. The PACE team is actively investigating and working to bring the scheduler back in service.

How does this impact me: New jobs on Phoenix will not be able to start. While you can submit jobs, they will remain in the queue until service is restored. Running jobs should not be impacted and will continue to run. You may find that Moab commands, including “showq” and others, are not responsive.

What we will continue to do: PACE will continue working to restore functionality to the Phoenix scheduler. We will provide updates as they are available.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Battery replacement on Phoenix project & scratch storage will impact performance on Thursday, March 3

Summary: Battery replacement on Phoenix project & scratch storage will impact performance on Thursday, March 3.

What’s happening and what are we doing: Power supply units on the Phoenix Lustre storage device, holding project and scratch storage, need to be replaced. During the replacement, which will begin at approximately 10 AM on Thursday, March 3, storage will shift to write-through mode, and performance will be impacted. Once the UPS batteries in the new units are sufficiently charged, performance will return to normal.

How does this impact me: Phoenix project and scratch performance will be impacted until the fresh batteries have sufficiently charged, which should take several hours. Jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again.

What we will continue to do: PACE will monitor Phoenix storage throughout this procedure.

Thank you for your patience as we complete this replacement. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.