Summary: An outage of the metadata servers on Phoenix project storage (Lustre) is preventing access to that storage and may also prevent login by ssh, access to Phoenix OnDemand, and some Globus access on Phoenix. The PACE team is working to repair the system.
Details: During the afternoon of Saturday, July 19, one of the metadata servers for Phoenix Lustre project storage stopped responding. The failover to the other metadata server was not successful. The PACE team has not yet been able to restore access and has engaged our storage vendor.
Impact: Files on the Phoenix Lustre project storage system are not accessible, and researchers may not be able to log in to Phoenix by ssh or via the OnDemand web interface. Globus on Phoenix may time out, but researchers can type another path into the Path box to bypass the home directory and enter a subdirectory directly (e.g., typing ~/scratch will allow access to the scratch storage). VAST project, scratch, and CEDAR storage may still be reachable this way. Research groups that have already migrated to VAST project storage may not be impacted.
Thank you for your patience as we work to restore access to Phoenix project storage. Please contact us at pace-support@oit.gatech.edu with any questions.
[UPDATE Mon 21 Jul, 11:00]
Phoenix project storage outage is over
The outage on the Lustre project storage is over; the scheduler has been released and is accepting jobs. Access through the head nodes, Globus, and Open OnDemand has been restored.
The diagnostic check of the metadata volumes, performed over the weekend, completed successfully. As a precaution, we are running a thorough check of the data volumes to verify there are no other issues. In the unlikely event of data loss, the affected data will be restored from backups. Scratch, home, VAST, and CEDAR storage systems were not affected by the outage. The cost of jobs that were terminated due to the outage will be refunded.
We are continuing to work with the storage vendors to prevent project storage outages. The ongoing migration of project storage from Lustre to VAST systems will reduce the impact when one of the shared file systems has issues.