Beginning tomorrow morning (9/18/19), there will be urgent maintenance on the cooling system in the Rich data center, which houses all PACE clusters except for Hive. A temporary cooling unit has been installed, but should the secondary cooling unit fail, the room will begin to heat.
Please follow updates on this maintenance and find a full list of affected services across campus at https://status.gatech.edu/pages/maintenance/5be9af0e5638b904c2030699/5d7fd6219012a0316b71ef83.
– PACE team
Author: mweiner3
COMSOL use at PACE
As you may know, the College of Engineering is changing the licensing model for COMSOL on September 16, 2019, and will now restrict access for research use to named users who have purchased access through CoE. Use of COMSOL for research on PACE is licensed through CoE (regardless of your college affiliation). If you or your PI have not yet made arrangements with CoE, please contact Angelica Remolina in CoE IT (angie.remolina@coe.gatech.edu). You will not be able to run COMSOL on PACE without permission from CoE after September 16.
[Resolved] Campus Network Down
[Update] September 5
OIT reports that the campus network is again fully functional.
[Update] September 4 4:28 PM
This is a brief update. OIT Network Services has identified the cause of the campus network issues: one of the campus enterprise routers rebooted unexpectedly, which disrupted the campus network. The network has since been stabilized, and OIT continues to monitor the situation for any further issues. For the latest updates, please check the OIT status page.
You should now be able to access the PACE clusters without issues. If you continue to experience problems, please try disconnecting and reconnecting to restore your connectivity.
As always, if you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.
[Original] September 4 2:30 PM
The campus network is down. OIT is investigating this incident, and you can find details at the link below:
https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/5d6ff4f4daca6a0543918df2
This incident will prevent you from accessing the PACE resources, but your current jobs running on PACE should not be interrupted.
Please check the status link above for up-to-date details. If you have any questions, please send a note to pace-support@oit.gatech.edu. Please note that we are also impacted by the outage, so our responses to your email will be delayed.
Thank you for your patience.
[Resolved] GPFS outage on Red Hat 7 queues
Around 3:30 AM, a number of nodes on several queues running the Red Hat 7 operating system failed to mount GPFS, our project (data) and scratch storage system. As a result, these nodes were taken offline and were unavailable for jobs. We repaired the affected nodes at approximately 9:30 AM today, and all queues should be functioning normally. Any jobs that were held should have started. Please check your overnight jobs for errors.
The following queues were impacted:
atlas-he
ece-gpu
flamel-gpu
gaanam-gpu
gemini-cpu
gemini-gpu
megatron
ml_gpu
sake
skylake-test
starscream
swarm
swarm-gpu
Should you notice the problem recur, or if you have any other concerns, please contact us at pace-support@oit.gatech.edu, and we will be happy to help you. We apologize for the inconvenience this morning.
[Resolved] Campus-wide network outage impacting PACE
A campus-wide DNS server failure occurred on the morning of Monday, August 5. OIT was able to resolve the issue at 10:06 AM, and all PACE services should now be working normally. The resulting storage problem made home directories temporarily unavailable, with symptoms such as hanging codes, commands, and login attempts.
We believe that most jobs resumed operation after the issue was resolved, but we cannot be certain. Please check whether any of your jobs crashed, and report any issues to pace-support@oit.gatech.edu.
For details on the DNS failure, please visit the OIT status update.
Thank you for your attention to this, and we apologize for the inconvenience.