
Massive network outage requires head node restarts!

Earlier today, a campus-wide network outage disrupted communication between the Head Node VMs and their storage. Some of these may appear to be working; however, nothing done on them is being saved properly. We will be restarting these machines shortly, after which everything will return to normal.

This should not cause already scheduled jobs to fail, but any scripts running on the head nodes will surely fail.

We will send an “all-clear” once we have finished restarting all of the head nodes.

VASP Calculation Errors

UPDATE: The VASP binaries that generate incorrect results have been DELETED.

One of the versions of VASP installed on all RHEL6 clusters can generate incorrect answers.
The calculated DFT energies are correct, but the forces may not be.

The affected vasp binaries are located here:
/usr/local/packages/vasp/5.2.12/mvapich2-1.6/intel-12.0.0.084/bin/vasp
/usr/local/packages/vasp/5.2.12/mvapich2-1.7/intel-12.0.0.084/bin/vasp
/usr/local/packages/vasp/5.2.12/openmpi-1.4.3/intel-12.0.0.084/bin/vasp
/usr/local/packages/vasp/5.2.12/openmpi-1.5.4/intel-12.0.0.084/bin/vasp

All affected binaries were compiled with the intel/12.0.0.084 compiler.

Solution:
Use a different vasp binary: versions compiled with the intel/10.1.018 and intel/11.1.059 compilers have been checked for correctness.
Neither of those compilers generates incorrect answers on the test cases that uncovered the error.

Here is an excerpt from a job script that uses a correct vasp binary:

###########################################################

#PBS -q force-6
#PBS -l walltime=8:00:00

cd $PBS_O_WORKDIR

module load intel/11.1.059 mvapich2/1.6 vasp/5.2.12
which vasp
#This "which vasp" command should print this:
#/usr/local/packages/vasp/5.2.12/mvapich2-1.6/intel-11.1.059/bin/vasp
#If it prints anything other than this, the modules loaded are not as expected, and you are not using the correct vasp.

mpirun -rmk pbs vasp
##########################################################

We now have a test case with known correct results that will be run every time a new vasp binary is installed.
This step will prevent this particular error from occurring again.
Unless there are strenuous objections, this version of vasp will be deleted from the module that loads it (today) and the binaries will be removed from /usr/local/packages/ (in one week).
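
For reference, the check works along the lines of the sketch below. This is a minimal illustration only: the test-case directory, the reference OUTCAR, the launch line, and the exact comparison are hypothetical stand-ins for the actual install procedure.

###########################################################

#!/bin/bash
# Minimal sketch of the post-install check (paths and comparison are
# hypothetical).  Run the known test case with the newly installed vasp
# binary, then compare the forces it reports against a reference OUTCAR.

set -e
cd /path/to/vasp-testcase          # hypothetical test-case directory

module load intel/11.1.059 mvapich2/1.6 vasp/5.2.12
mpirun -np 4 vasp                  # launch line depends on the MPI stack

# Extract the force blocks from the new OUTCAR and from the reference.
# 64 lines is enough to cover the atoms in this hypothetical test case.
grep -A 64 "TOTAL-FORCE" OUTCAR           > forces.new
grep -A 64 "TOTAL-FORCE" OUTCAR.reference > forces.ref

# An exact textual comparison is deliberately strict; a production check
# would allow for round-off, but any real force error shows up here.
if ! diff -q forces.new forces.ref > /dev/null; then
    echo "WARNING: forces differ from the reference -- do not deploy this binary"
    exit 1
fi
echo "Forces match the reference test case."
##########################################################

The same check applies to each compiler/MPI combination, since each one produces a separate binary under /usr/local/packages/vasp/.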

Thank you, Ambarish, for reporting this issue.

Let us know if you have any questions, concerns, or comments.

Regarding the jobs failing around 2am

We received multiple reports of jobs getting killed around 2:00am. After further investigation, we have found the cause and made the corrections required to prevent this from happening again. Here’s a detailed explanation of what caused the job failures:

Each individual machine in PACE, including workstations in some cases, has an OS and software stack that is maintained by a single-sourced service called GTSWD. During maintenance periods we often use the GTSWD service to push out new OS updates, firmware updates, and system service updates.

One of the updates we pushed out during the last maintenance window was a new panfs client, which is responsible for mounting the /scratch filesystem. The process used to update the panfs client came in two stages:

#1 Initiate the client installation under the assumption that the node was free (a valid assumption at the time, since this was done during the maintenance window).

#2 Replace the installation process from #1 with another update process that would first check whether the change had already been made and, more importantly, would not assume that the node was free.

On some of the compute nodes, #2 did not get applied, so every day at 2am since maintenance day, process #1 has been attempting to umount and remount /scratch, killing several user jobs in the process. The reason #2 was not applied is that the GTSWD service source is now overwhelmed by the number of PACE nodes trying to use it, and it had not been able to automatically apply the update to a small subset of our nodes.
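
Conceptually, the stage-#2 payload behaves like the sketch below. The package name, version string, and idle test are hypothetical (the real GTSWD payload is internal); the sketch only illustrates the two guards that stage #1 lacked: check whether the change is already in place, and never assume the node is free.

###########################################################

#!/bin/bash
# Hypothetical sketch of the stage-#2 behaviour: update the panfs client
# only if it is out of date AND the node is idle.

WANTED="5.0.1"                     # hypothetical target client version

# Guard 1: skip the work entirely if the client is already up to date.
CURRENT=$(rpm -q --qf '%{VERSION}' panfs 2>/dev/null)
if [ "$CURRENT" = "$WANTED" ]; then
    exit 0
fi

# Guard 2: do not touch /scratch while user jobs are running on this node
# (crude test: any process owned by a non-system account, UID >= 500 on
# RHEL 5/6).
if ps -eo uid= | awk '$1 >= 500' | grep -q .; then
    echo "node is busy; deferring panfs update" >&2
    exit 0
fi

# Node is idle and out of date: safe to remount /scratch with the new client.
umount /scratch
yum -y update panfs                # hypothetical package name
mount /scratch
##########################################################

Stage #1, by contrast, performed the umount/remount unconditionally, which is exactly what was killing jobs at 2am on the nodes that never received stage #2.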

For this particular problem, we have manually updated all of the RHEL5 nodes that were still using process #1; that will stop any more jobs in the RHEL5 queues from being killed at 2am. To address the capacity problem, we will be adding more capacity to the GTSWD service so that it can support all of the PACE nodes. On the RHEL6 nodes, we have updated the execution instruction sets on our distribution system so that, once updated, they will not attempt to use process #1.

We are sorry for the time this has cost you, and for not correcting the problem sooner. The failure rate of GTSWD has historically been very low, so it is usually one of the last things we look at when trying to determine the source of a problem.