We have seen a reduction in GPFS filesystem problems over the past weekend and are continuing to work actively with the vendor. We don’t have a complete solution yet, but we have observed greater stability on the GPFS filesystem for compute nodes. Thank you for your patience – we will continue to keep you updated as the situation changes.
Storage (GPFS) Issue Update
While the problem was not widespread and we have improved reliability, we have not yet arrived at a full solution and are still actively working on it. We now believe the problem is due to the recent addition of many compute nodes, which has pushed the filesystem into the next tier of required system-level tuning. Thank you for your patience – we will continue to provide updates as they become available.
Storage (GPFS) Issue
We are experiencing intermittent problems with the GPFS storage system that hosts the scratch and project directories (~/scratch and ~/data). We are working with the vendor to determine whether this failure may be related to the cluster nodes that were recently brought online.
This issue has potential impact on running jobs. We are actively working on the problem, apologize for the inconvenience, and will update as soon as possible.
Storage (GPFS) and datacenter problems resolved
All node and GPFS filesystem issues resulting from the power failure should be resolved as of late Friday evening (June 16). If you are still experiencing problems, please let us know at pace-support@oit.gatech.edu.
PACE is experiencing storage (GPFS) problems
We are experiencing intermittent problems with the GPFS storage system that hosts most of the project directories.
We are working with the vendor to investigate the ongoing issues. At this moment we don’t know whether they are related to yesterday’s power/cooling failures or not, but we will update the PACE community as we find out more.
This issue has potential impact on running jobs, and we are sorry for the inconvenience.
PACE datacenter experienced a power/cooling failure
Large Scale Problem
Update (6/7/2017, 1:20pm): The network issues are now addressed and systems are back in normal operation. Please check your jobs and resubmit failed jobs as needed. If you continue to experience any problems, or need our assistance with anything else, please contact us at pace-support@oit.gatech.edu. We are sorry for this inconvenience and thank you once again for your patience.
Original message: We are experiencing a large-scale network problem impacting multiple storage servers and the software repository, with a potential impact on running jobs. We are actively working to get it resolved and will provide updates as much as possible. We appreciate your patience and understanding, and are committed to resolving the issue as soon as we possibly can.
Infiniband switch failure causing partial network and storage unavailability
The switch is now back online and it’s safe to submit new jobs.
If you are using one or more of the queues (listed below), please check your jobs and re-submit them if necessary. One indication of this issue is “Stale file handle” error messages that may appear in the job output or logs.
Impacted Queues:
=============
athena-intel
atlantis
atlas-6-sunge
atlas-intel
joe-6-intel
test85
apurimacforce-6
b5force-6
bioforce-6
ceeforce
chemprot
cnsforce-6
critcelforce-6
cygnusforce-6
dimerforce-6
eceforce-6
faceoffforce-6
force-6
hygeneforce-6
isblforce-6
iw-shared-6
mathforce-6
mayorlab_force-6
medprint-6
nvidia-gpu
optimusforce-6
prometforce-6
rombergforce
sonarforce-6
spartacusfrc-6
try-6
testflight
novazohar
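As a quick way to spot affected jobs, a log scan along these lines can help (a minimal sketch; the `check_stale` helper and the `*.o*` job-output naming follow common Torque/PBS conventions and are assumptions, not a PACE-provided tool):

```shell
# Hypothetical helper: list job output files that contain the GPFS
# "Stale file handle" error, so the corresponding jobs can be resubmitted.
check_stale() {
  grep -l "Stale file handle" "$@" 2>/dev/null
}

# Example usage against PBS-style job logs in the current directory:
# check_stale myjob.o*
```

Any file the helper prints belongs to a job that likely touched storage behind the failed switch and should be checked and resubmitted.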
PACE clusters ready for research
Our May 2017 maintenance period is now complete, far ahead of schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and data are available. As usual, there are a few straggling nodes that we will address over the coming days.
Our next maintenance period is scheduled for Thursday, August 10 through Saturday, August 12, 2017.
New operating system kernel
- All compute, interactive, and head nodes have received the updated kernel. No user action needed.
DDN firmware updates
- This update brought low level firmware on drives up to date per recommendation from DDN. No user action needed.
Networking
- DNS/DHCP and firewall updates per vendor recommendation applied by OIT Network Engineering.
- IP address reassignments for some clusters completed. No user action needed.
Electrical
- Power distribution repairs completed by OIT Operations. No user action needed.
PACE quarterly maintenance – May 11, 2017
PACE clusters and systems will be taken offline at 6am this Thursday (May 11) through the end of Saturday (May 13). Jobs with long walltimes will be held by the scheduler to prevent them from getting killed when we power off the nodes. These jobs will be released as soon as the maintenance activities are complete.
Planned improvements are mostly transparent to users, requiring no user action before or after the maintenance.
Systems
- We will deploy a recompiled kernel that’s identical to the current version except for a patch that addresses the Dirty COW vulnerability (CVE-2016-5195). Currently, our mitigation prevents the use of debuggers and profilers (e.g., gdb, strace, Allinea DDT). After the patched kernel is deployed, these tools will once again be available on all nodes. Please let us know if you continue to have problems debugging or profiling your codes after the maintenance day.
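After the maintenance, a quick check like the following can confirm what a node is running and whether ptrace-based tools work again (a generic sketch, not a PACE-specific procedure; the kernel version string to compare against is whatever we announce post-maintenance):

```shell
# Print the running kernel release; compare it against the version
# announced for the patched kernel.
uname -r

# Smoke test that ptrace-based tools work again (only if strace is
# installed on the node); falls through gracefully either way.
if command -v strace >/dev/null 2>&1; then
  strace -o /dev/null true && echo "ptrace-based tools working" \
    || echo "ptrace still restricted"
fi
```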
Storage
- Firmware updates on all of the DDN GPFS storage (scratch and most of the project storage)
Network
- Upgrades to DNS servers, as recommended and performed by OIT Network Engineering
- Software upgrades to the PACE firewall appliance to address a known bug
- New subnets and re-assignment of IP addresses for some of the clusters
Power
- PDU fixes affecting 3 nodes in the c29 rack
The date for the next maintenance day is not certain yet, but we will announce it as soon as we have it.