[Update – 05/09/2019] The finalized task list for our quarterly maintenance is as follows:
Compute
- (no user action needed) Replace CMOS batteries on multiple servers
- (no user action needed) Upgrade testflight cluster to RHEL 7.6
- (some user action needed) Upgrade gemini-gpu and gemini-cpu clusters to RHEL7 (action is required only from users of these two clusters)
- (no user action needed) Switch nodes between chemx and gemini-cpu queues
Network
- (no user action needed) Replace a faulty InfiniBand switch, which affects a single rack with no impact to the complete fabric
- (no user action needed) Migrate the Rich-to-campus connections to 10Gbps
Storage
- (no user action needed) Reboot ICE storage servers to correct issues with backup application
- (no user action needed) Perform detailed performance analysis of the GPFS environment to fine-tune its parameters
Other
- (no user action needed) Update the submit filters in the schedulers
- (no user action needed) Update salt master and minions
[Original Post – May 7, 2019 – 12:32pm] We are preparing for our next maintenance period, which is planned to last three days: it will start on Thursday, May 16, 2019 and run through Saturday, May 18, 2019.
As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.
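To illustrate the hold rule described above, here is a minimal sketch of the check, not the scheduler's actual logic. It assumes a job starts as soon as it is submitted, and the 6:00 AM start time is a hypothetical placeholder, since this post gives only the date. The function name `will_be_held` is also made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: only the date (May 16, 2019) comes from this post; the
# 6:00 AM start time is a hypothetical placeholder.
MAINTENANCE_START = datetime(2019, 5, 16, 6, 0)


def will_be_held(start_time: datetime, walltime: timedelta,
                 maintenance_start: datetime = MAINTENANCE_START) -> bool:
    """Return True if the job could still be running when maintenance begins.

    Simplification: the job is assumed to start at start_time and to use
    its full requested walltime.
    """
    return start_time + walltime > maintenance_start


# A 12-hour job starting at 9:00 AM on May 15 ends before maintenance begins.
short_job = will_be_held(datetime(2019, 5, 15, 9, 0), timedelta(hours=12))

# A 48-hour job starting at the same time would overlap the maintenance window.
long_job = will_be_held(datetime(2019, 5, 15, 9, 0), timedelta(hours=48))

print(short_job, long_job)
```

In practice, this means a job that would otherwise be held may still run before the maintenance period if it is resubmitted with a shorter walltime that ends before the start of maintenance.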
In general, we will perform maintenance on the PACE network and migrate from 10Gbps to 40Gbps connections, analyze GPFS storage performance, upgrade the schedulers, replace CMOS batteries, and upgrade the testflight cluster to the latest RHEL 7 kernel, 3.10.0-957.12.1 (i.e., RHEL 7.6).
While we are still working on finalizing the task list and details, none of these tasks are expected to require any user actions.