Services Restored: PACE Login

Summary: PACE services are restored after an issue with Georgia Tech’s domain name system (DNS) servers caused interruptions for users and jobs accessing PACE resources. The DNS service has been restored and OIT is working on identifying the root cause.

Details: OIT will continue to investigate the origin of this outage.

Impact: PACE users trying to login to Phoenix, Buzzard, Hive, Firebird and ICE clusters and/or Open OnDemand instances were prevented from logging in. Additionally, users and jobs trying to check out software licenses or access CEDAR storage might have been unable to do so. If you continue to experience issues logging into your PACE account or accessing CEDAR, please contact pace-support@oit.gatech.edu.

PACE Login Down


Summary
: Users are currently unable to complete login attempts to PACE clusters via command line or OnDemand web portals, receiving an “unauthorized” or “permission denied” error.

Details: The PACE team is investigating and believe there is an issue with authentication of logins from the central GT access management but do not yet have details. 

Impact: Attempts to access PACE resources may fail at this time.

Thank you for your patience as we continue investigating. Please visit https://status.gatech.edu for updates.

Degraded performance on Phoenix storage

Dear Phoenix users,

Summary: The project storage system on Phoenix (/storage/coda1) is slower than normal, due to heavy use and hard drive failures. The rebuild process to spare hard drives is ongoing; until it is complete, some users might experience slower file access on the project storage.

Details: Two hard drives that support the /storage/coda1 project storage failed on 1-July at 3:30am and 9:20am forcing a rebuild of the data to spare drives. This rebuild usually takes 24-30 hours to complete. We are closely monitoring the rebuilding process, which we expect to complete on July 2 around noon. In addition, we are temporarily moving file services from one metadata server to another and back to rebalance the load across all available systems.

Impact: Access to files is slower than usual during the drive rebuild and metadata server migration process. There is no data loss for any users. For the affected users, the degradation of performance can be observed on the login as well as compute nodes. The file system will continue to be operational while the rebuilds are running in the background. Access to the Phoenix cluster is not impacted, and the scheduler is processing jobs at a normal rate.

We thank you for your patience as we are working to solve the problem.

[Update] [storage] Phoenix Project storage degraded performance

[Updated March 31, 2025 at 414pm]

Dear Phoenix researchers,

As the Phoenix project storage system has stabilized, we have restored login access via ssh and resumed starting jobs.

The cost for the jobs running during the performance degradation will not count towards the March usage.

The Phoenix OnDemand portal can again be used to access project and scratch space. Any user still receiving a “Proxy Error” should contact pace-support@oit.gatech.edu for an individual reset of their OnDemand session.

Globus file transfers have resumed. We have determined that transfers to/from home, scratch, and CEDAR storage were inadvertently paused, and we apologize for any confusion. Any paused transfer should have automatically resumed.

The PACE team continues to monitor the storage system for any further issues. We are working with the vendor to identify the root cause and prevent future performance degradation.

Please contact us at pace-support@oit.gatech.edu with any questions. We appreciate your patience during this unexpected outage.

Best,

The PACE Team

[Updated March 31, 2025 at 12:41pm]

Dear Phoenix Users,

To limit the impact of the current Phoenix project filesystem issues, we have implemented the following changes to expedite troubleshooting and limit impact to currently running jobs:

New Logins to Phoenix Login Nodes are Paused

We have prevented new login attempts to the Phoenix login nodes. Users that are currently logged in will be able to stay logged onto the system.

Phoenix Jobs Prevented from Starting

Jobs that are in the queue but that have not yet started have been paused to prevent them from starting. These submitted jobs will remain in the queue.

Jobs that are currently running may experience decreased performance if using project storage. We are doing our best to prioritize the successful completion of these jobs.

Open OnDemand (OOD)

Users of Phoenix OOD can log in and interact with only their home directory. Project and scratch space are not available.

Some users of Open OnDemand may be unable to reach this service and are experiencing “Proxy Error” messages. We are investigating the root cause of this issue.

Globus File Transfer Paused for Project Space

File transfers to/from project storage on Globus have been paused. Other Globus transfers (Box, DropBox, and OneDrive cloud connectors; scratch; home; and CEDAR) will continue.

The PACE team is working to diagnose the current issues with support from our filesystem vendor. We will continue to share updates as we have them and apologize for this unexpected service outage.

Best,

The PACE Team