Cygnus FS pc5 online…mostly.

We have brought /nv/pc5 back online, but at a cost to redundancy. One of the network interfaces, cables, or switches is misbehaving; while disconnecting various combinations of cables, we found one combination that made the filesystem immediately available to all nodes.

Considering how close maintenance day is (10/16/12), spending time isolating the cable/switch/interface problem now would only keep this filesystem offline longer while equipment is retested. Waiting until maintenance day will cause the least disruption for Cygnus pc5 users who want to get a last run of jobs in, and it takes some time pressure off of us to make sure we have resolved the issue in its entirety before bringing all resources back online.

Despite the loss of redundancy, functionality is NOT affected. Only an additional switch or cable failure between now and October 16 would impact functionality.

Cygnus File System pc5 offline

It appears that we have an issue with the server housing the /nv/pc5 filesystem, which holds data for a subset of Cygnus cluster users. We’re trying to isolate the source of the problem, but we have yet to find a pattern explaining why the filesystem is available on some nodes and not on others.

Joe Cluster Status

Between about 8:00 and 8:30pm on September 28, 2012, a power event took down the TSRB data center, knocking a significant fraction of the Joe cluster offline.

With assistance from Operations, we are now bringing these nodes back online, having determined that several of the management switches for these nodes did not recover gracefully from the event. Because these switches control our ability to manage the nodes, we had to wait for them to become available; nodes are coming back online as of about 4pm on September 29, 2012.

Jobs that were running on these nodes (iw-a2-* and iw-a3-*) at the time of the outage may have terminated abnormally. Jobs scheduled but not running should be fine.

UPDATE @ 4:40pm, 2012-09-29: All nodes are online.

New and Updated Software: GCC, Maxima, OpenCV, Boost, ncbi_blast

Software Installation and Updates

We have had several requests for new or updated software since the last post on August 14.
Here are the details about the updates.
All of this software is installed on the RHEL6 clusters (including force-6, uranus-6, ece, math, apurimac, joe-6, etc.).

GCC 4.7.2

The GNU Compiler Collection (GCC) includes compilers for many languages (C, C++, Fortran, Java, and Go).
This latest version of GCC supports advanced optimizations for the latest compute nodes in PACE.

Here is how to use it:

$ module load gcc/4.7.2
$ gcc <source.c>
$ gfortran <source.f>
$ g++ <source.cpp>

Versions of GCC already installed on the RHEL6 clusters are gcc/4.4.5, gcc/4.6.2, and gcc/4.7.0.

Maxima 5.28.0

Maxima is a system for the manipulation of symbolic and numerical expressions, including differentiation, integration, Taylor series, Laplace transforms, ordinary differential equations, systems of linear equations, polynomials, and sets, lists, vectors, matrices, and tensors. Maxima yields high precision numeric results by using exact fractions, arbitrary precision integers, and variable precision floating point numbers. Maxima can plot functions and data in two and three dimensions.

Here is how to use it:

$ module load clisp/2.49.0 maxima/5.28.0
$ maxima
#If you have X-Forwarding turned on, "xmaxima" will display a GUI with a tutorial
$ xmaxima
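As a quick illustration of the symbolic capabilities described above, a Maxima session might look like the following (prompt and output formatting vary slightly by version):

```
(%i1) diff(sin(x), x);
(%o1) cos(x)
(%i2) integrate(1/(1 + x^2), x);
(%o2) atan(x)
```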

OpenCV 2.4.2

OpenCV (Open Source Computer Vision) is a library of programming functions for real time computer vision.

OpenCV is released under a BSD license, making it free for both academic and commercial use. It has C++, C, and Python interfaces (with Java coming soon) running on Windows, Linux, Android, and Mac. The library has more than 2500 optimized algorithms.

This installation of OpenCV includes support for Python and NumPy, but not for Intel TBB, Intel IPP, or CUDA.

Here is how to use it:

$ module load gcc/4.4.5 opencv/2.4.2
$ g++ <source.cpp> $(pkg-config --libs opencv)

Boost

Boost provides free peer-reviewed portable C++ source libraries.
Boost libraries are intended to be widely useful, and usable across a broad spectrum of applications.

Here is how to use it:

$ module load boost/1.51.0
$ g++ <source.cpp>

NCBI BLAST

Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences.

Here is how to use it:

$ module load gcc/4.4.5 ncbi_blast/2.2.27
$ blastn
$ blastp
$ blastx
...

Joe Fileserver fixed

The fileserver that houses Joe users’ data (hp3/pj1) started acting squirrelly this morning, finding itself unable to connect to the PACE LDAP server. That, in turn, caused Joe users to have problems logging in, and some jobs hung because the fileserver could not authenticate users/jobs.

Restarting all the services on the fileserver rectified the problem.

Scratch storage issues: update

Scratch storage status update:

We continue to work with Panasas on the difficulties with our high-speed scratch storage system. Since the last update, we have received and installed two PAS-11 test shelves and have successfully reproduced our problems on them under the current production software version. We then updated to their latest release and re-tested only to observe a similar problem with this new release as well.

We’re continuing to do what we can to encourage the company to find a solution but are also exploring alternative technologies. We apologize for the inconvenience and will continue to update you with our progress.

[updated] Scratch Storage and Scheduler Concerns

Scheduler

The move of the workload scheduler to its new server seems to have gone well. We haven’t received much user feedback, but what we have received has been positive, which matches our own observations. Presuming things continue to go well, we will relax some of our rate-limiting tuning parameters on Thursday morning. This shouldn’t cause any interruptions (even to submitting new jobs), but it should allow the scheduler to start new jobs at a faster rate; the net effect is to decrease the wait times some users have been seeing. We’ll slowly increase this parameter and monitor for bad behavior.

Scratch Storage

The story of the Panasas scratch storage is not going as well. Last week, we received two “shelves” worth of storage to test. (For comparison, we have five in production.) Over the weekend, we put these through synthetic tests designed to mimic the behavior that causes them to fail. The good news is that we were able to replicate the problem in the testbed. The bad news is that the highly anticipated new firmware provided by the vendor still does not fix the issues. We continue to press Panasas quite aggressively for a resolution and are looking into contingency plans, including alternate vendors. Given that we are five weeks out from our normal maintenance day and have no viable fix, an emergency maintenance between now and then seems unlikely at this point.

FoRCE project server outage (pf2)

At about 4:30pm, one of the network interfaces for the server hosting the /nv/pf2 filesystem was knocked offline, making the resources it hosts unavailable. Normally, this shouldn’t have caused a complete failure, but the loss of the network exposed a configuration error in the fail-over components.

At 5:10pm, the misconfiguration was corrected and the failed interface was restored, which brought all resources provided by this server back online.

This affected some FoRCE users’ access to project storage. Please double check to see if jobs may have failed because of this outage. Data should not have been lost, as any transactions in progress should have been held up until connectivity was restored.