pace-stat

In response to the many requests for insight into the status of your queues, we’ve developed a new tool for you called ‘pace-stat’ (/opt/pace/bin/pace-stat).

When you run pace-stat, it displays a summary of all available queues and, for each queue, values for:

– The number of jobs you have running, and the total number of running jobs
– The number of jobs you have queued, and the total number of queued jobs
– The total number of cores that all of your running jobs are using
– The total number of cores that all of your queued jobs are requesting
– The current number of unallocated cores free on the queue
– The approximate amount of memory/core that your running jobs are using
– The approximate amount of memory/core that your queued jobs are requesting
– The approximate amount of memory/core currently free in the queue
– The current percentage of the queue that has been allocated (by all running jobs)
– The total number of nodes in the queue
– The maximum wall-time for the queue

Please use pace-stat to help determine resource availability, and where best to submit jobs.
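For example, from any head node (use the full path shown above if /opt/pace/bin is not already on your PATH):

$ /opt/pace/bin/pace-stat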

[updated] new server for job scheduler

As of about 3:00 this afternoon, we’re back up on the new server. Things look to be performing much better. Please let us know if you have troubles; positive reports on scheduler performance would be appreciated as well.

Thanks!

–Neil Bright

——————————————————————-

[update: 2:20pm, 8/30/12]

We’ve run into a last-minute issue with the scheduler migration. Rather than rush things going into a long weekend, we will reschedule for next week: 2:30pm Tuesday afternoon.

——————————————————————-

We have made our preparations to move the job scheduler to new hardware, and plan to do so this Thursday (8/30) afternoon at 2:30pm. We expect this to be a very low-impact, low-risk change. All queued jobs should move to the new server, and all executing jobs should continue to run without interruption. What you may notice is a period during which you will be unable to submit new jobs and job queries will fail; you’ll see the usual ‘timeout’ messages from commands like msub and showq.
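Once we announce that the migration is complete, a quick way to confirm the scheduler is responding again is to query your own jobs (a minimal check using the Moab tools mentioned above):

$ showq -u $USER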

As usual, please direct any concerns to pace-support@oit.gatech.edu.

–Neil Bright

Scratch Storage and Scheduler Concerns

The PACE team is urgently working on two ongoing critical issues with the clusters:

Scratch storage
We are aware of access, speed, and reliability issues with the high-speed scratch storage system, and we are working with the vendor to define and implement a solution. We are told that a new version of the storage system firmware, released today, will likely resolve our issues. The PACE team is expecting the arrival of a test unit on which we can verify the vendor’s solution. Once we have verified it, we are considering an emergency maintenance window for the entire cluster in order to implement the fix. We would appreciate your feedback on this approach, especially regarding the impact on your research. We will let you know, and work with you on scheduling, when a known solution is available.

Scheduler
We are presently preparing a new system to host the scheduler software. We expect the more powerful system will alleviate many of the difficulties you are experiencing with the scheduler, especially the delays in job scheduling and the time-outs when requesting information or submitting jobs. Once the system is ready, we will have to suspend the scheduler for a few minutes while we transition services to it. We do not anticipate the loss of any currently running or queued jobs during this transition.

In both situations, we will provide notice well in advance of any potential interruption and work with you to minimize the impact to your research schedule.

– Paul Manno

[Resolved] Unexpected downtime on compute nodes

[update]   We think we’re back up at this point. If you see odd behavior, please send a support request directly to the PACE team via email to pace-support@oit.gatech.edu.

The issue appears to have been a circuit breaker inadvertently switched off by an electrician, and it is not expected to recur.

====================

We’ve had a power problem in the data center this afternoon that caused a loss of power to three of our racks.  This has affected some (or all) portions of the following clusters:

  • Apurimac
  • Prometheus
  • Cygnus
  • Granulous
  • ECE
  • Monkeys
  • Isabella
  • CEE
  • Aryabhata
  • Optimus
  • Atlas
  • BioCluster

We’re looking into the cause of the problem, and have already started bringing up compute nodes.

[Resolved] Campus DNS Problems

Update:  We believe that the DNS issues have been resolved. We have checked that all affected servers are functioning as expected. The scheduler has been unpaused and is now scheduling jobs.

Thank you for your patience.

==================

At this time, the campus DNS server is experiencing problems.

The effect on PACE is that some storage servers and compute nodes cannot be accessed since their DNS names cannot be found. No currently running jobs should be affected. Any job currently executing has already succeeded in accessing all needed storage and compute nodes. The scheduler has been paused so that no new jobs can be started. We are working with the campus DNS administrators to resolve this as quickly as possible.

When the issue is resolved, the scheduler will be allowed to execute jobs.

We apologize for any problems this has caused you.

New Software: HDF5(1.8.9), OBSGRID (April 2, 2010), ABINIT(6.12.3), VMD(1.9.1), and NAMD(2.9)

Several new software packages have been installed on all RHEL6 clusters.

HDF5

HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data.
A previous version of HDF5 (1.8.7) has existed on the RHEL6 clusters for many months.
The 1.8.9 version includes many bug fixes and some new utilities.

The hdf5/1.8.9 module is used differently from the 1.8.7 module.
The 1.8.9 module detects whether an MPI module has already been loaded and provides the matching serial or MPI version of the library.
The 1.8.7 module was not able to make this distinction automatically.

Here are two examples of how to use the new HDF5 module (note that all compilers and MPI installations are usable with HDF5):

$ module load hdf5/1.8.9

or

$ module load intel/12.1.4 mvapich2/1.6 hdf5/1.8.9
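Once the appropriate module is loaded, you can compile against the library. If this installation includes the standard HDF5 compiler wrapper scripts (an assumption; they are not always packaged), a parallel build might look like the following sketch, where the source file name is illustrative:

$ module load intel/12.1.4 mvapich2/1.6 hdf5/1.8.9
$ h5pcc my_parallel_io.c -o my_parallel_io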

OBSGRID

OBSGRID is an objective re-analysis package for WRF designed to lower the error of the analyses that are used to nudge the model toward the observed state.
The analyses input to OBSGRID as the first guess are the analyses output by the METGRID part of the WPS package.
Here is an example of how to use OBSGRID:

$ module load intel/12.1.4 hdf5/1.8.7/nompi netcdf/4.1.3 ncl/6.1.0-beta obsgrid/04022010
$ obsgrid.exe

ABINIT

ABINIT is a package whose main program allows one to find the total energy, charge density and electronic structure of systems made of electrons and nuclei (molecules and periodic solids) within Density Functional Theory (DFT), using pseudopotentials and a planewave or wavelet basis.
ABINIT 6.8.1 is already installed on the RHEL6 clusters.
There are many changes from 6.8.1 to 6.12.3. See the 6.12.3 release notes for more information.

Here is an example of how to use ABINIT in a job script:

#PBS ...
#PBS -l walltime=8:00:00
#PBS -l nodes=64:ib

cd $PBS_O_WORKDIR
module load intel/12.1.4 mvapich2/1.6 hdf5/1.8.9 netcdf/4.2 mkl/10.3 fftw/3.3 abinit/6.12.3
mpirun -rmk pbs abinit < abinit.input.file > abinit.output.file

VMD

VMD is a molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting.
VMD has been installed with support for the GCC compilers (versions 4.4.5, 4.6.2, and 4.7.0), NetCDF, Python+NumPy, TCL, and OpenGL.
Here is an example of how to use it:

  1. Login to a RHEL6 login node (joe-6, biocluster-6, atlas-6, etc.) with X-Forwarding enabled (X-Forwarding is critical for VMD to work).
  2. Load the needed modules:
    $ module load gcc/4.6.2 python/2.7.2 hdf5/1.8.7/nompi netcdf/4.1.3 vmd/1.9.1
  3. Execute “vmd” to start the GUI (a combined session example follows).
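Putting the steps together, a typical session might look like this; the head node and username are illustrative, and any RHEL6 login node from step 1 will do:

$ ssh -X yourusername@joe-6.pace.gatech.edu
$ module load gcc/4.6.2 python/2.7.2 hdf5/1.8.7/nompi netcdf/4.1.3 vmd/1.9.1
$ vmd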

NAMD

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems.
Version 2.9 of NAMD has been installed with support for the GNU and Intel compilers, MPI, and FFTW3.
CUDA support in NAMD has been disabled.

Here is an example of how to use it in a job script in a RHEL6 queue (biocluster-6, atlas-6, ece, etc.):

#PBS -N NAMD-test
#PBS -l nodes=32
#PBS -l walltime=8:00:00
...
module load gcc/4.6.2 mvapich2/1.7 fftw/3.3 namd/2.9
cd $PBS_O_WORKDIR

mpirun -rmk pbs namd2 input.file

maintenance day complete, ready for jobs

We are done with maintenance day; however, some automated nightly processes still need to run before jobs can flow again. So, I’ve set an automated timer to release jobs at 4:30am today, a little over two hours from now. The scheduler will accept new jobs now, but will not start executing them until 4:30am.

 

With the exception of the following two items, all of the tasks listed at our previous blog post have been accomplished.

  • Firmware updates on the scratch servers were deferred per the strong recommendation of the vendor.
  • An experimental software component of the scratch system was not tested due to the lack of a test plan from the vendor.

 

SSH host keys have changed on the following head nodes.  Please accept the new keys into your preferred SSH client; an example for OpenSSH users follows the list.

  • atlas-6
  • atlas-post5
  • atlas-post6
  • atlas-post7
  • atlas-post8
  • atlas-post9
  • atlas-post10
  • apurimac
  • biocluster-6
  • cee
  • critcel
  • cygnus-6
  • complexity
  • cns
  • ece
  • granulous
  • optimus
  • math
  • prometheus
  • uranus-6
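If your SSH client warns about a changed host key for one of these nodes, you can clear the cached key and accept the new one on your next login. A minimal sketch for OpenSSH users (substitute the head node name exactly as you normally connect to it; atlas-6 is just an illustration):

$ ssh-keygen -R atlas-6.pace.gatech.edu
$ ssh yourusername@atlas-6.pace.gatech.edu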

10TB soft quota per user on scratch storage

One of the many benefits of using the PACE clusters is the scratch storage, which provides a fast filesystem for I/O-bound jobs. The scratch system is designed for high speed rather than large capacity. So far, a weekly script that deletes all files older than 60 days has allowed us to sustain this service without the need for disk quotas. However, this situation has been changing as the PACE clusters have grown to roughly 750 active users, about 300 of whom have been added since Feb 2011. Consequently, it has become common for scratch utilization to reach 98%-100% on several volumes, which is alarming for the health of the entire system.

We are planning to address this issue with a 2-step transition plan for enabling file quotas. The first step will be applying 10TB “soft” quotas for all users for the next 3 months. A soft quota means that you will receive warning emails from the system if you exceed 10TB, but your writes will NOT be blocked. This will help you adjust your data usage and get prepared for the second step, which is the 10TB “hard” quotas that will block writes when the quota is exceeded.

Considering that the total scratch capacity is 260TB, a 10TB quota for 750 users is a very generous limit. Looking at current statistics, no more than 10 users currently exceed this limit. If you are one of these users (you can check your usage with the command ‘du -hs ~/scratch’) and have concerns that the 10TB quota will adversely impact your research, please contact us (pace-support@oit.gatech.edu).
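As a related housekeeping check, you can also list the scratch files that are already past the 60-day age limit mentioned above and are therefore candidates for the next weekly cleanup. This is a minimal sketch that assumes the cleanup is based on modification time and that ~/scratch is your usual per-user scratch link:

$ du -hs ~/scratch
$ find ~/scratch -type f -mtime +60 -ls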

REMINDER – upcoming maintenance day, 7/17

The major activity for maintenance day is the RedHat 6.1 to RedHat 6.2 software update. (Please test your codes!) This will affect a significant portion of our user base. We’re also instituting soft quotas on the scratch space. Please see the details below.

The following are running RedHat 5, and are NOT affected:

  • Athena
  • Atlantis

The following have already been upgraded to the new RedHat 6.2 stack.  We would appreciate reports on any problems you may have:

  • Monkeys
  • MPS
  • Isabella
  • Joe-6
  • Aryabhata-6

If I didn’t mention your cluster above, you are affected by this software update. Please test using the ‘testflight’ queue; jobs are limited to 48 hours in this queue. If you would like to recompile your software with the 6.2 stack, please log in to the ‘testflight-6.pace.gatech.edu’ head node. A sample test job script follows.
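This is a minimal sketch of a testflight job; the module list, resource request, and executable name are illustrative, so substitute your own code and stay under the 48-hour limit:

#PBS -N testflight-check
#PBS -q testflight
#PBS -l nodes=1:ppn=8
#PBS -l walltime=4:00:00

cd $PBS_O_WORKDIR
module load intel/12.1.4 mvapich2/1.6
mpirun -rmk pbs ./my_code

Submit it with msub and compare the results against a run on your current cluster.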

Other activities we have planned are:

Relocating some project directory servers to an alternate data center on campus. We have strong network connectivity, so this should not change the performance of these filesystems. No user modifications are needed.

  • /nv/hp3 – Joe
  • /nv/pb1 – BioCluster
  • /nv/pb3 – Apurimac
  • /nv/pc1 – Cygnus
  • /nv/pc2 – Cygnus
  • /nv/pc3 – Cygnus
  • /nv/pec1 – ECE
  • /nv/pj1 – Joe
  • /nv/pma1 – Math
  • /nv/pme1 – Prometheus
  • /nv/pme2 – Prometheus
  • /nv/pme3 – Prometheus
  • /nv/pme4 – Prometheus
  • /nv/pme5 – Prometheus
  • /nv/pme6 – Prometheus
  • /nv/pme7 – Prometheus
  • /nv/pme8 – Prometheus
  • /nv/ps1 – Critcel
  • /nv/pz1 – Athena

Activities on the scratch space (no user-visible changes are expected from any of these):

  • We need to balance some users on volumes v3, v4, v13 and v14.  This will involve moving users from one volume to another, but we will place links in the old locations.
  • Run a filesystem consistency check on the v14 volume.  This has the potential to take a significant amount of time.  Please watch the pace-availability email list (or this blog) for updates if this will take longer than expected.
  • Firmware updates on the scratch servers to resolve some crash & failover events that we’ve been seeing.
  • Institute soft quotas. Users exceeding 10TB of usage on the scratch space will receive automated warning emails, but writes will be allowed to proceed. Currently, this will affect 6 of 750+ users. The 10TB space represents about 5% of a rather expensive shared 215TB resource, so please be cognizant of the impact to other users.

Retirement of old filesystems.  User data will be moved to alternate filesystems.  Affected filesystems are:

  • /nv/hp6
  • /nv/hp7

Performance upgrades (hardware RAID) for NFSroot servers for the Athena cluster. Previous maintenance activities have upgraded other clusters already.

Moving some filesystems off of temporary homes and onto new servers. Affected filesystems are:

  • /nv/pz2 – Athena
  • /nv/pb2 – Optimus

If time permits, we have a number of other “targets of opportunity”:

  • relocate some compute nodes and servers, removing retired systems
  • reworking a couple of Infiniband uplinks for the Uranus cluster
  • add resource tags to the scheduler so that users can better select compute node features/capabilities from their job scripts
  • relocate a DNS/DHCP server for geographic redundancy
  • fix system serial numbers in the BIOS for asset tracking
  • test a new Infiniband subnet manager to gather data for future maintenance day activities
  • rename some ‘twin nodes’ for naming consistency
  • apply BIOS updates to some compute nodes in the Optimus cluster to facilitate remote management
  • test an experimental software component of the scratch system.  Panasas engineers will be onsite to do this and revert before going back into production.  This will help gather data and validate a fix for some other issues we’ve been seeing.

upcoming maintenance day, 7/17 – please test your codes

It’s that time of the quarter again, and all PACE-managed clusters will be taken offline for maintenance on July 17 (Tuesday). All jobs that would not complete by then will be held by the scheduler. They will be released once the clusters are up and running again, requiring no further action on your end. If you find that your job does not start running, you may want to check its requested walltime to make sure it does not extend past this date.
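If you are not sure what walltime a queued job has requested, you can check with the Moab client tools already available on the clusters; this is a minimal sketch, and the job ID is illustrative:

$ showq -u $USER
$ checkjob 123456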

With this maintenance, we are upgrading our RedHat 6 clusters to RedHat 6.2, which includes many bugfixes and performance improvements. This version is known to provide better software and hardware integration with our systems, particularly with the 64-core nodes we have been adding over the last year.

We are doing our best to test existing codes with the new RedHat 6.2 stack. In our experience, codes currently running on our RedHat 6 systems continue to run without problems. However, we strongly recommend that you test your critical codes on the new stack. For this purpose, we renovated the “testflight” cluster to include RedHat 6.2 nodes, so all you need to do for testing is submit your RedHat 6 jobs to the “testflight” queue. If you would like to recompile your code, please log in to the testflight-6.pace.gatech.edu head node. Please try to keep problem sizes small, since this cluster only includes ~14 nodes with core counts varying from 16 to 48, plus a single 64-core node. We have limited this queue to two jobs at a time per user. We hope the testflight cluster will be sufficient to test-drive your codes, but if you have any concerns, or notice any problems with the new stack, please let us know at pace-support@oit.gatech.edu.

We will also upgrade the software on the Panasas scratch storage. We have observed many ‘failover’ events under high load, resulting in brief interruptions of service and potentially incurring performance penalties on running codes. The new version is expected to help address these issues.

We have new storage systems for Athena (/nv/pz2) and Optimus (/nv/pb2). During maintenance day, we will move these filesystems off of temporary storage, and onto their new servers.

More details will be forthcoming on other maintenance day activities, so please keep an eye on our blog at http://blog.pace.gatech.edu/

Thank you for your cooperation!

-PACE Team