[COMPLETE] PACE quarterly maintenance – (Aug 9-11, 2018)
update (Aug 10, 2018, 8:00pm): Our Aug 2018 maintenance is complete, one day ahead of schedule. All tasks were completed as planned. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are a small number of straggling nodes that we will address over the coming days.
The next PACE maintenance will start on 8/9 (Thu) and may take up to 3 days to complete, as scheduled.
As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs are running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete. If a shorter walltime would still give such a job enough time to complete successfully, you can reduce its walltime so that it finishes before 6am on 8/9 and resubmit it (see the example below).
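As a minimal sketch, assuming a Torque/PBS-style submission script as used on PACE (the job name, resource request, and application command below are placeholders):

```
#PBS -N my_job                     # placeholder job name
#PBS -l nodes=1:ppn=4              # placeholder resource request
#PBS -l walltime=12:00:00          # reduced so the job finishes before 6am on 8/9

cd $PBS_O_WORKDIR
./my_application                   # placeholder for your actual command
```

Resubmit with qsub as usual; jobs whose requested walltime extends past the start of the maintenance window will be held until it completes.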
Planned Tasks
Headnodes
- (some user action needed) Most PACE headnodes (login nodes) are currently virtual machines (VMs) with slow response times and sub-optimal storage performance, which are often the cause of the slowness users experience.
We are in the process of replacing these VMs with more capable physical servers. After the maintenance day, your login attempts to these VMs will be rejected with a message that tells you which hostname you should use instead. In addition, we are in the process of sending each user a customized email with a list of old and new login nodes. Please don’t forget to configure your SSH clients to use these new hostnames (a sample configuration is sketched at the end of this list).
In short, “login-s.pace.gatech.edu” will be used for all shared clusters and “login-d.pace.gatech.edu” for dedicated clusters. Once you log in, you will be redirected automatically to one of several physical nodes (e.g. login-s1, login-d2, …) depending on their current load.
There will be no changes to clusters that already have a dedicated (and physical) login node (e.g. gryphon, asdl, ligo, etc.).
- (some user action needed) As some users have already noticed, user cron jobs can no longer be edited (e.g. via crontab -e) on the headnodes. This is intentional: access to the new login nodes (login-d and login-s) is dynamically routed to different servers depending on their load, so you might not see the cron jobs you installed the next time you log in to one of these nodes. For this reason, only PACE admins can install cron jobs on behalf of users to ensure consistency (only login-d1 and login-s1 will be used for cron jobs). If you need to add or edit cron jobs, please contact pace-support@oit.gatech.edu. If you already have user cron jobs set up on one of the decommissioned VMs, they will be moved over to login-d1 or login-s1 during the maintenance so they continue to run.
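As a sketch of the SSH client change mentioned above, entries like the following in ~/.ssh/config will point existing shortcuts at the new hostnames (the Host aliases and username are placeholders; the hostnames are the new login nodes listed above):

```
# ~/.ssh/config
# The aliases "pace-shared" / "pace-dedicated" and the username are placeholders;
# replace your_gt_username with your GT account name.
Host pace-shared
    HostName login-s.pace.gatech.edu
    User your_gt_username

Host pace-dedicated
    HostName login-d.pace.gatech.edu
    User your_gt_username
```

With these entries in place, `ssh pace-shared` or `ssh pace-dedicated` will land you on one of the physical login nodes according to current load.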
Storage
- (no user action needed) Add a dedicated protocol node to the GPFS system to increase capacity and improve response time for systems not connected via InfiniBand. This node will gradually replace the IB gateway systems that are currently in operation.
- (no user action needed) Replace batteries to DDN/GPFS storage controllers
Network
- (no user action needed) Upgrades to the DNS appliances in both PACE datacenters
- (no user action needed) Add redundant storage links to specific clusters
Other
- (no user action needed) Perform network upgrades
- (no user action needed) Replace devices that are out of support
[Resolved] Shared scheduler problems
The PACE Scratch storage just got faster!
[Resolved] Datacenter cooling problem with potential impact on PACE systems
Update (06/29/2018, 3:30 pm): We’re happy to report that the issues with the cooling systems have largely been addressed without any visible impact on systems and/or running jobs. The schedulers have been resumed and are allocating new jobs as they are submitted. There is more work to be done to resolve the issue fully, but it can be performed without any disruption to services. You may continue to use PACE systems as usual. If you notice any problems, please contact pace-support@oit.gatech.edu.
For a related status update from OIT, please see: https://status.gatech.edu/incidents/0ykh9wwnw50j
Original post:
The operations team notified PACE of cooling problems that started around noon today, impacting the datacenter housing the storage and virtual machine infrastructure. We immediately started monitoring temperatures, turned off some non-critical systems as a precautionary step, and paused the schedulers to prevent new jobs from running. Submitted jobs will be held until the problem is sufficiently addressed.
Depending on how this issue develops, we may need to power down critical systems such as storage and the virtual headnodes, but for now all critical systems remain online.
We will continue to provide updates here on this blog and via the pace-available email list as needed.
Thank you!
Possible Water Service Disruption May Impact PACE Clusters
Impact on PACE Clusters:
-----------------------------------------
Original communication from Georgia Tech Office of Emergency Management:
To the campus community:
Out of an abundance of caution, Georgia Tech Emergency Management and Communications has taken steps to prepare the campus for the possibility of a water outage tonight in light of needed repairs to the City of Atlanta’s water lines.
The City of Atlanta’s Department of Watershed will repair a major water line beginning tonight between 11 p.m. and midnight. The repair is scheduled to be completed this week and should not negatively impact campus. If all goes according to plan, the campus will operate as usual.
In the event the repairs cause a significant loss of water pressure or loss of water service completely, the campus will be closed and personnel will be notified through the Georgia Tech Emergency Notifications System (GTENS).
If GTENS alerts are sent, essential personnel who are pre-identified by department leadership should report even if campus is closed. If the campus loses water, all non-essential activities will be canceled on campus.
Those with specialized research areas need to make arrangements tonight in the event there is a water failure. All lab work and experiments that can be delayed should be planned for later in the week or next week.
In the event of an outage, employees are asked to work with department leadership to work remotely. Employees who can work remotely should prepare before leaving work June 4 to work remotely for several days. Toilets won’t be operational, drinking water will not be available, and air conditioning will not be functioning in buildings on campus and throughout the city.
All who are housed on campus should fill bathtubs and other containers to have water on hand to manually flush toilets should there be a loss in pressure. Plans are underway to relocate campus residents to nearby campuses such as Emory University or Kennesaw State University in the event of a complete loss of water to the campus.
Parking and Transportation Services will continue on-campus transportation as long as the campus is open.
In the event of an outage, additional instructions and information on campus operations will be shared at gatech.edu.
Major Outage of GT network on Sunday, May 27
The OIT Operations team informed us about a service outage on Sunday (5/27, 8am). Their detailed note is copied below.
This outage should not impact running jobs; however, you will not be able to log in to headnodes or use VPN for the duration of the outage.
If you have ongoing data transfers (using SFTP, scp, or rsync), they *will* be terminated. We strongly recommend waiting until this work has completed successfully before starting any large data transfers. Similarly, your active SSH connections will be interrupted, so please save your work and exit all sessions beforehand. (A sketch of how to resume an interrupted transfer follows below.)
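If a large transfer is cut off by the outage, it can usually be resumed rather than restarted from scratch. A minimal sketch with rsync, where the local path, username, and headnode name are placeholders for your own:

```
# -a preserves permissions and timestamps, -v is verbose, and
# -P (--partial --progress) keeps partially transferred files so a rerun
# can reuse the data that already arrived instead of resending everything.
# The path, username, and headnode below are placeholders.
rsync -avP /path/to/local/data/ your_gt_username@pace-headnode:~/data/
```

Re-running the same command after the outage picks the transfer back up.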
The PACE team will be in contact with the Operations team and will provide status updates in this blog post as needed: http://blog.pace.gatech.edu/?p=6259
More details:
Storage (GPFS) slowness impacting pace1 and menon1 systems
update (5/18/2018, 4:15pm): We’ve identified a large number of jobs overloading the storage and worked with their owners to delete them. This resulted in an immediate improvement in performance. Please let us know if you observe any of this slowness returning over the weekend.
original post: PACE is aware of GPFS (storage) slowness that impacts a large fraction of users from the pace1 and menon1 systems. We are actively working, with guidance from the vendor, to identify the root cause and resolve this issue ASAP.
This slowness is observed from all nodes mounting this storage, including headnodes, compute nodes and the datamover.
We believe that we’ve found the culprit, but more investigation is needed for verification. Please continue to report any slowness problems to us.
PACE clusters ready for research
Our May 2018 maintenance (http://blog.pace.gatech.edu/?p=6158) is complete ahead of schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are a small number of straggling nodes we will address over the coming days.
Our next maintenance period is scheduled for Thursday, Aug 9 through Saturday, Aug 11, 2018.
Schedulers
Job-specific temporary directories (may require user action): Complete as planned. Please see the maintenance day announcement (http://blog.pace.gatech.edu/?p=6158) for how this impacts your jobs; a usage sketch follows at the end of this section.
ICE (instructional cluster) scheduler migration to a different server (may require user action): Complete as planned. Users should not notice any differences.
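As a rough sketch of how a job might use its job-specific temporary directory, assuming the scheduler exposes the per-job location via the $TMPDIR environment variable (please consult the linked announcement for the authoritative details; the file names and application command are placeholders):

```
#PBS -N tmpdir-example
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:30:00

# $TMPDIR is assumed here to point at the job-specific temporary directory,
# created when the job starts and removed when it ends.
cd $TMPDIR
cp $PBS_O_WORKDIR/input.dat .              # stage input (placeholder file name)
./my_application input.dat > result.out    # placeholder application
cp result.out $PBS_O_WORKDIR/              # copy results back before the job exits
```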
Systems Maintenance
ASDL cluster (requires no user action): Complete as planned. Bad CMOS batteries were replaced and the fileserver received a replacement CPU. The memory problems turned out to be caused by the bad CPU and were resolved without replacing any memory DIMMs.
Replace PDUs on Rich133 H37 Rack (requires no user action): Deferred at the request of the cluster owner.
LIGO cluster rack replacement (requires no user action): Complete as planned.
Storage
GPFS filesystem client updates on all of the PACE compute nodes and servers (requires no user action): Complete as planned and tested. Please report any missing storage mounts to pace-support (a quick check is sketched below).
Run routine system checks on GPFS filesystems (requires no user action): Complete as planned, no problems found!
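A quick way to verify that the GPFS filesystems are mounted on the node you are using (the mount point shown is a placeholder; your storage path may differ):

```
# List all GPFS filesystems currently mounted on this node
mount -t gpfs

# Confirm a specific mount point is present and reporting capacity
# (the path below is a placeholder; substitute your storage mount point)
df -h /gpfs/pace1
```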
Network
The IB network card firmware upgrades (requires no user action): Complete as planned.
Enable 10GbE on physical headnodes (requires no user action): Complete as planned.
Several improvements on networking infrastructure (requires no user action): Complete as planned.
[Resolved] Large Scale Storage Problems
Current Status (5/3 4:30pm): Storage problems are resolved, and all compute nodes are back online and accepting jobs. Please resubmit crashed jobs and contact pace-support@oit.gatech.edu if there is anything we can assist with.
update (5/3 4:15pm): We found that the storage failure was caused by a series of tasks we had been performing, with guidance from the vendor, in preparation for the maintenance day. These steps were considered safe and no failures were expected. We are still investigating which step(s) led to this cascading failure.
update (5/3 4:00pm): All of the compute nodes will appear offline and will not accept jobs until this issue is resolved.
Original Message:
We received reports of failures of the main PACE storage (GPFS) around 3:30pm today (5/3, Thu), impacting jobs. We found that this issue affects all GPFS systems (pace1, pace2, menon1), with a large-scale impact PACE-wide.
We are actively working with the vendor to resolve this issue urgently and will continue to update this post as we find more about the root cause.
We are sorry for this inconvenience and thank you for your patience.