Posts

[updated] Scratch Storage and Scheduler Concerns

Scheduler

The migration to the new server for the workload scheduler seems to have gone well.  We haven't received much user feedback, but what we have received has been positive, and this matches our own observations.  Presuming things continue to go well, we will relax some of our rate-limiting tuning parameters on Thursday morning.  This shouldn't cause any interruptions (even to submitting new jobs), but should allow the scheduler to start new jobs at a faster rate.  The net effect should be to decrease the wait times some users have been seeing.  We'll slowly increase this parameter and monitor for bad behavior.

Scratch Storage

The story of the Panasas scratch storage has not gone as well.  Last week, we received two "shelves" worth of storage to test.  (For comparison, we have five in production.)  Over the weekend, we put these through synthetic tests designed to mimic the behavior that causes them to fail.  The good news is that we were able to replicate the problem in the testbed.  The bad news is that the highly anticipated new firmware provided by the vendor still does not fix the issues.  We continue to press Panasas quite aggressively for a resolution and are looking into contingency plans, including alternate vendors.  Given that we are five weeks out from our normal maintenance day and have no viable fix, an emergency maintenance between now and then seems unlikely at this point.

RFI-2012, a competitive vendor selection process

Greetings GT community,

PACE is in the midst of our annual competitive vendor selection process. As outlined on the "Policy" page of our web site, we have issued a set of documents to various state contract vendors; this time around, those are Dell, HP, IBM, and Penguin Computing. These documents contain general specifications based on the computing demand we anticipate from the faculty over the next year. I've included a link to the documents (GT login required) below. Please bear in mind that these specs are not intended to limit the configurations you may wish to purchase, but rather to normalize vendor responses and help us choose a vendor for the next year.

The document I'm sure you will be most interested in is the timeline. The overall timeline has not been published to the vendors, and I would appreciate it if it were kept confidential. The first milestone, which obviously has been published, is that responses are due to us by 5:00pm today. The next step is for us to evaluate those responses. If any of you are interested in commenting on those responses, please let me know; your feedback is appreciated.

Please watch this blog, as we will post updates as we move through the process.  We already have a number of people interested in a near-term purchase.  If you are as well, or you know somebody who is, now is the time to get the process started.  Please contact me at your convenience.


--
Neil Bright
Chief HPC Architect
neil.bright@oit.gatech.edu

FoRCE project server outage (pf2)

At about 4:30pm, one of the network interfaces on the server hosting the /nv/pf2 filesystem was knocked offline, making the resources it hosts unavailable. Normally, this shouldn't have caused a complete failure, but the loss of the network exposed a configuration error in the fail-over components.

At 5:10, both the misconfiguration and the failed interface were corrected, which should have brought all resources provided by this server back online.

This affected some FoRCE users' access to project storage. Please double-check whether any of your jobs failed because of this outage. Data should not have been lost, as any transactions in progress should have been held up until connectivity was restored.

pace-stat

In response to the many requests for insight into the status of your queues, we've developed a new tool for you called 'pace-stat' (/opt/pace/bin/pace-stat).

When you run pace-stat, it displays a summary of all available queues, reporting for each queue:

– The number of jobs you have running, and the total number of running jobs
– The number of jobs you have queued, and the total number of queued jobs
– The total number of cores that all of your running jobs are using
– The total number of cores that all of your queued jobs are requesting
– The current number of unallocated cores free on the queue
– The approximate amount of memory/core that your running jobs are using
– The approximate amount of memory/core that your queued jobs are requesting
– The approximate amount of memory/core currently free in the queue
– The current percentage of the queue that has been allocated (by all running jobs)
– The total number of nodes in the queue
– The maximum wall-time for the queue

Please use pace-stat to help determine resource availability, and where best to submit jobs.
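
For example, from any login node you can run it directly (using the full path in case /opt/pace/bin is not already on your PATH):

$ /opt/pace/bin/pace-stat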

[updated] new server for job scheduler

As of about 3:00 this afternoon, we're back up on the new server. Things look to be performing much better. Please let us know if you have trouble. Positive reports on scheduler performance would be appreciated as well.

Thanks!

–Neil Bright

——————————————————————-

[update: 2:20pm, 8/30/12]

We've run into a last-minute issue with the scheduler migration.  Rather than rush things going into a long weekend, we will reschedule for next week, 2:30pm Tuesday afternoon.

——————————————————————-

We have made our preparations to move the job scheduler to new hardware, and plan to do so this Thursday (8/30) afternoon at 2:30pm.  We expect this to be a very low-impact, low-risk change.  All queued jobs should move to the new server, and all executing jobs should continue to run without interruption.  What you may notice is a period during which you will be unable to submit new jobs and job queries will fail.  You'll see the usual 'timeout' messages from commands like msub and showq.

As usual, please direct any concerns to pace-support@oit.gatech.edu.

–Neil Bright

Scratch Storage and Scheduler Concerns

The PACE team is urgently working on two ongoing critical issues with the clusters:

Scratch storage
We are aware of access, speed, and reliability issues with the high-speed scratch storage system. We are currently working with the vendor to define and implement a solution. We are told that a new version of the storage system firmware, released just today, will likely resolve our issues. The PACE team is expecting the arrival of a test unit on which we can verify the vendor's solution. Once we have verified it, we are considering an emergency maintenance for the entire cluster in order to implement the fix. We appreciate your feedback on this approach, and especially on the impact to your research. We will let you know, and work with you on scheduling, when a known solution is available.

Scheduler
We are presently preparing a new system to host the scheduler software. We expect this more powerful system will alleviate many of the difficulties you have been experiencing with the scheduler, especially the delays in job scheduling and the time-outs when requesting information or submitting jobs. Once the system is ready, we will have to suspend the scheduler for a few minutes while we transition services to it. We do not anticipate the loss of any currently running or queued jobs with this transition.

In both situations, we will provide you with notice well in advance of any potential interruption and work with you to provide the least impact to your research schedule.

– Paul Manno

[Resolved] Unexpected downtime on compute nodes

[update]   We think we’re back up at this point. If you see odd behavior, please send a support request directly to the PACE team via email to pace-support@oit.gatech.edu.

The issue appears to have been a circuit breaker inadvertently switched off by an electrician, and is not expected to recur.

====================

We’ve had a power problem in the data center this afternoon that caused a loss of power to three of our racks.  This has affected some (or all) portions of the following clusters:

Apurimac
Prometheus
Cygnus
Granulous
ECE
Monkeys
Isabella
CEE
Aryabhata
Optimus
Atlas
BioCluster

We’re looking into the cause of the problem, and have already started bringing up compute nodes.

[Resolved] Campus DNS Problems

Update:  We believe that the DNS issues have been resolved. We have checked that all affected servers are functioning as expected. The scheduler has been unpaused and is now scheduling jobs.

Thank you for your patience.

==================

At this time, the campus DNS server is experiencing problems.

The effect on PACE is that some storage servers and compute nodes cannot be accessed since their DNS names cannot be found. No currently running jobs should be affected. Any job currently executing has already succeeded in accessing all needed storage and compute nodes. The scheduler has been paused so that no new jobs can be started. We are working with the campus DNS administrators to resolve this as quickly as possible.

When the issue is resolved, the scheduler will be allowed to execute jobs.

We apologize for any problems this has caused you.

New Software: HDF5(1.8.9), OBSGRID (April 2, 2010), ABINIT(6.12.3), VMD(1.9.1), and NAMD(2.9)

Several new software packages have been installed on all RHEL6 clusters.

HDF5

HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data.
A previous version of HDF5 (1.8.7) has existed on the RHEL6 clusters for many months.
The 1.8.9 version includes many bug fixes and some new utilities.

The hdf5/1.8.9 module is used differently from the 1.8.7 module.
The 1.8.9 module detects whether an MPI module has already been loaded and provides the matching serial or MPI version of the library.
The 1.8.7 module could not make this distinction automatically.

Here are two examples of how to use the new HDF5 module (note that all compilers and MPI installations are usable with HDF5):

$ module load hdf5/1.8.9

or

$ module load intel/12.1.4 mvapich2/1.6 hdf5/1.8.9
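
As a quick, optional check that the module picked up the variant you intended, you can try building a small test program with the HDF5 compiler wrappers (h5cc for the serial build, h5pcc for the MPI build). The test program name below is just an illustration, and whether the wrappers are placed on your PATH by the module is an assumption about this particular install:

$ module load intel/12.1.4 mvapich2/1.6 hdf5/1.8.9
$ h5pcc -o h5test h5test.c    # h5pcc wraps the MPI compiler; h5test.c is a hypothetical test program
$ mpirun -np 2 ./h5test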

OBSGRID

OBSGRID is an objective re-analysis package for WRF, designed to lower the error of the analyses that are used to nudge the model toward the observed state.
The analyses input to OBSGRID as the first guess are the analyses output by the METGRID part of the WPS package.
Here is how to use obsgrid:

$ module load intel/12.1.4 hdf5/1.8.7/nompi netcdf/4.1.3 ncl/6.1.0-beta obsgrid/04022010
$ obsgrid.exe

ABINIT

ABINIT is a package whose main program allows one to find the total energy, charge density and electronic structure of systems made of electrons and nuclei (molecules and periodic solids) within Density Functional Theory (DFT), using pseudopotentials and a planewave or wavelet basis.
ABINIT 6.8.1 is already installed on the RHEL6 clusters.
There are many changes from 6.8.1 to 6.12.3. See the 6.12.3 release notes for more information.

Here is an example of how to use ABINIT in a job script:

#PBS ...
#PBS -l walltime=8:00:00
#PBS -l nodes=64:ib

cd $PBS_O_WORKDIR
module load intel/12.1.4 mvapich2/1.6 hdf5/1.8.9 netcdf/4.2 mkl/10.3 fftw/3.3 abinit/6.12.3
mpirun -rmk pbs abinit < abinit.input.file > abinit.output.file    # -rmk pbs: take the host list from the PBS resource manager
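
To run this, save the script to a file (for example, abinit.pbs; the name is just an illustration) and submit it to the scheduler with msub:

$ msub abinit.pbs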

VMD

VMD is a molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting.
VMD has been installed with support for the GCC compilers (versions 4.4.5, 4.6.2, and 4.7.0), NetCDF, Python+NumPy, TCL, and OpenGL.
Here is an example of how to use it:

  1. Log in to a RHEL6 login node (joe-6, biocluster-6, atlas-6, etc.) with X-Forwarding enabled (X-Forwarding is critical for VMD to work).
  2. Load the needed modules:
    $ module load gcc/4.6.2 python/2.7.2 hdf5/1.8.7/nompi netcdf/4.1.3 vmd/1.9.1
  3. Execute “vmd” to start the GUI
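
Put together, a VMD session might look like the following; the hostname and username shown are placeholders, so substitute whichever RHEL6 login node you normally use:

$ ssh -X someuser@biocluster-6    # -X enables X-Forwarding; hostname and username are placeholders
$ module load gcc/4.6.2 python/2.7.2 hdf5/1.8.7/nompi netcdf/4.1.3 vmd/1.9.1
$ vmd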

NAMD

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems.
Version 2.9 of NAMD has been installed with support for the GNU and Intel compilers, MPI, and FFTW3.
CUDA support in NAMD has been disabled.

Here is an example of how to use it in a job script in a RHEL6 queue (biocluster-6, atlas-6, ece, etc.):

#PBS -N NAMD-test
#PBS -l nodes=32
#PBS -l walltime=8:00:00
...
module load gcc/4.6.2 mvapich2/1.7 fftw/3.3 namd/2.9
cd $PBS_O_WORKDIR

mpirun -rmk pbs namd2 input.file

Call for Proposals for Allocations on the Blue Waters High Performance Computing System

FYI – for anybody interested in applying for time on the petaflop Cray being installed at NCSA.

Begin forwarded message:

From: “Gary Crane” <gcrane@sura.org>
To: ITCOMM@sura.org
Sent: Thursday, August 9, 2012 10:51:37 AM
Subject: Call for Proposals for Allocations on the Blue Waters High Performance Computing System
The Great Lakes Consortium for Petascale Computation (GLCPC) has issued a call for proposals for allocations on the Blue Waters system. Principal investigators affiliated with a member of the Great Lakes Consortium for Petascale Computation are eligible to submit a GLCPC allocation proposal. SURA is a member of the GLCPC, and PIs from SURA member schools are eligible to submit proposals. Proposals are due October 31, 2012.

The full CFP can be found here: http://www.greatlakesconsortium.org/bluewaters.html

–gary

Gary Crane
Director, SURA IT Initiatives
phone: 315-597-1459
fax: 315-597-1459
cell: 202-577-1272