PACE clusters ready for research

Our November 2018 maintenance (http://blog.pace.gatech.edu/?p=6360) is complete on schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are a small number of straggling nodes we will address over the coming days, which includes nodes that will need PCIe connectors replaced as a preventative measure.

Completed Tasks

Compute

  • Complete – (no user action needed) Replace power components in a rack in Rich 133
  • Complete(no user action needed) Replace defective PCIe connectors on multiple servers
      • As a precaution, additional identified nodes will have their PCIe connectors replaced  when parts are delivered.  There will be no user action needed.

Network

  • Complete(no user action needed) Stress test new InfiniBand subnet managers, to prepare for the move to Coda
  • Complete(no user action needed) Change uplink connections from management switches

Storage

  • Complete(no user action needed) Verify integrity of GPFS file systems
  • Complete(no user action needed) Upgrade firmware on DDN / GPFS storage systems
  • Complete(no user action needed) Upgrade firmware on TruNAS storage systems

Other

  • Complete (some user action needed) Replaced PACE ICE schedulers with a physical server, to increase capacity and reliability.   Some jobs on PACE ICE cluster need to be re-submitted, and we have contacted the affected users individually.