Updated: Network troubles, redux (FIXED)

We’ve got the switch back.  The outage appears to have caused our virtual machine farm to reboot, so connections to head nodes will have been dropped.

This also affected the network path between compute nodes and the file servers.  With a little luck, the NFS traffic should resume, but you may want to check on any running jobs to make sure.
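If you want to confirm that your NFS-mounted directories are responsive before trusting job output, a quick check along the lines of the Python sketch below may help.  This is a rough illustration rather than a PACE tool, and the paths are placeholders for your own home and project directories.

# Rough sketch: check whether NFS-backed paths respond within a timeout.
# The paths below are placeholders -- substitute your own directories.
import subprocess

PATHS_TO_CHECK = ["/path/to/your/home", "/path/to/your/project"]

def nfs_responsive(path, timeout=10):
    # A hung NFS mount can block a direct os.stat() call indefinitely, so we
    # stat the path in a child process and give up after `timeout` seconds.
    try:
        subprocess.check_call(["stat", path],
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL,
                              timeout=timeout)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False

if __name__ == "__main__":
    for p in PATHS_TO_CHECK:
        print("%-40s %s" % (p, "OK" if nfs_responsive(p) else "NOT RESPONDING"))

If a path shows up as not responding, that is a good reason to take a closer look at any jobs reading or writing through it.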

Word from the network team is that they were following published instructions from the switch vendor to integrate the two switches when the failure occurred.  We’ll be looking into this pretty intensely, as these switches are seeing a lot of deployments in other OIT functions.

Network troubles, redux – 11/10 3:00pm

Hi folks,

In an attempt to restore the network redundancy lost in the switch failure on 10/31, the Campus Network team has experienced some trouble connecting the new switch.  At this point, the core of our HPC network is non-functional.  Senior experts from the network team are working on restoring connectivity as soon as possible.

Full filesystems this morning

This morning, we found the hp8, hp10, hp12, hp14, hp16, hp18, hp20, hp22, hp24, and hp26 filesystems full.  All of these filesystems reside on the same fileserver and share capacity.  The root cause was an oversight on our part – a lack of quota enforcement on a particular user’s home directory.  The proper 5GB home directory quotas have been reinstated, and we are working with this user to move their data to their project directory.  We’ve managed to free up a little space for the moment, but it will take a little time to move a couple of TB of data.  We’re also doing an audit to ensure that all appropriate storage quotas are in place.
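If you are curious how close your own home directory is to the 5GB quota, a rough tally along the lines of the Python sketch below can give you a quick estimate.  It simply walks the directory tree; the authoritative accounting lives on the fileserver, so treat the number as approximate.

# Rough sketch: estimate home directory usage against the 5GB quota.
# This walks the tree and sums file sizes; it is not the fileserver's
# quota accounting, just an approximation.
import os

QUOTA_BYTES = 5 * 1024 ** 3  # 5GB home directory quota

def directory_usage(root):
    # Sum the sizes of regular files under root, skipping anything unreadable.
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass
    return total

if __name__ == "__main__":
    used = directory_usage(os.path.expanduser("~"))
    print("Home directory usage: %.2f GB of %.2f GB"
          % (used / 1024.0 ** 3, QUOTA_BYTES / 1024.0 ** 3))

If the total is near the limit, moving large datasets to your project directory (as described above) is the way to go.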

This would have affected users on the following clusters:

  • Athena
  • BioCluster
  • Aryabhata
  • Atlantis
  • FoRCE
  • Optimius (not production yet)
  • ECE (not production yet)
  • Prometheus (not production yet)
  • Math (not production yet)
  • CEE (not production yet)

Updated: network troubles this morning (FIXED)

All head nodes and critical servers are back online (some required an emergency reboot).  The network link to PACE equipment in TSRB is restored as well.

We do not believe any jobs were lost.

All Inchworm clusters should be back to normal.

Please let us know via pace-support@oit.gatech.edu if you notice anything out of the ordinary at this point.

network troubles this morning – 0908

Looks like we have a problem with a network switch this morning.  Fortunately, our resiliency improvements have mitigated some of this, but not all, as we haven’t yet extended those improvements down to the individual server level.  We’re working with the OIT network team to get things back to normal as soon as possible.

UPDATED: Cygnus/FoRCE: Second failure of new VM Storage (FIXED)

————————————————————

UPDATE: At 8:45pm EDT, FoRCE resumed normal function. The normal computing environment is now restored.

————————————————————

UPDATE: At 7:35pm EDT, Cygnus resumed normal function. FoRCE is still under repair.

————————————————————

5:30pm:

Well folks, I hate to do this to you again, but it looks like I need
to take Cygnus and FoRCE down again, thanks to problems with the storage.

Again, I’ll take Cygnus and FoRCE down at 7pm EDT. Please begin the
process of saving your work.

At this point, I’m moving these back to the old storage system, which,
while slow (and it did impact the responsiveness of these machines), at
least stayed running without issues. The new machine has not shown any
issues in its prior use, so I admit to being a bit flummoxed as to what
is going on.

This downtime will be longer, as I need to scrub a few things clean and
make sure the VMs are intact and usable.

I’ll let you know when things are back online. I don’t have good
estimates this time.

No scheduled compute jobs will be impacted.

I, and the rest of the PACE team, apologize for the continued
interruption in service, and we hope to rectify these issues within a
couple of hours.

Thanks for your patience.

bnm

Urgent: Cygnus & FoRCE head nodes reboot at 7pm due to Storage issues

Hey folks,

We suffered a temporary loss of connectivity to the backend storage
serving our VM farm earlier this afternoon. As such, several running
VMs moved their OS filesystems to a read-only state.
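For reference, a read-only remount is easy to spot from the affected node itself.  The Python sketch below is a generic way to list Linux mounts whose options include 'ro' by reading /proc/mounts; it is illustrative only and not part of our monitoring.

# Rough sketch: list Linux filesystems currently mounted read-only by
# reading /proc/mounts (each line: device, mountpoint, fstype, options, ...).
def read_only_mounts(proc_mounts="/proc/mounts"):
    ro = []
    with open(proc_mounts) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 4:
                continue
            device, mountpoint, _fstype, options = fields[:4]
            if "ro" in options.split(","):
                ro.append((device, mountpoint))
    return ro

if __name__ == "__main__":
    for device, mountpoint in read_only_mounts():
        print("read-only: %s on %s" % (device, mountpoint))

On an affected head node, the OS filesystems would typically show up in that list until the reboot clears the condition.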

The filesystems on which your data is stored are fine, however.

Unfortunately, the head nodes for the Cygnus and FoRCE clusters were
affected, and judging by our previous experience with this, we need to
reboot these nodes soon. As such, we ask any currently logged-in users
to please save their data now and log out.

We are scheduling a reboot of these systems at 7:00pm EDT. A few
minutes after that, the nodes should be available and fully functional.

No jobs have been lost, nor will any be lost in this process.

We are sorry for the inconvenience, and plan to keep you up to date
with any further issues with these, as well as the rest of the machines.