[Resolved] Campus DNS Problems

Update: We believe the DNS issues have been resolved. We have verified that all affected servers are functioning as expected. The scheduler has been unpaused and is now scheduling jobs.

Thank you for your patience.

==================

At this time, the campus DNS server is experiencing problems.

The effect on PACE is that some storage servers and compute nodes cannot be accessed since their DNS names cannot be found. No currently running jobs should be affected. Any job currently executing has already succeeded in accessing all needed storage and compute nodes. The scheduler has been paused so that no new jobs can be started. We are working with the campus DNS administrators to resolve this as quickly as possible.

When the issue is resolved, the scheduler will be allowed to execute jobs.

We apologize for any problems this has caused you.

New Software: HDF5(1.8.9), OBSGRID (April 2, 2010), ABINIT(6.12.3), VMD(1.9.1), and NAMD(2.9)

Several new software packages have been installed on all RHEL6 clusters.

HDF5

HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data.
A previous version of HDF5 (1.8.7) has been available on the RHEL6 clusters for many months.
The 1.8.9 version includes many bug fixes and some new utilities.

The hdf5/1.8.9 module is used differently from the 1.8.7 module.
The 1.8.9 module detects whether an MPI module has already been loaded and provides the appropriate serial or MPI version of the library.
The 1.8.7 module could not automatically distinguish MPI from non-MPI.

Here are two examples of how to use the new HDF5 module (note that all compilers and MPI installations are usable with HDF5):

$ module load hdf5/1.8.9

or

$ module load intel/12.1.4 mvapich2/1.6 hdf5/1.8.9

OBSGRID

OBSGRID is an objective re-analysis package for WRF designed to lower the error of analyses that are used to nudge the model toward the observed state.
The analyses input to OBSGRID as the first guess are the analyses output from the METGRID step of the WPS package.
Here is how to use OBSGRID:

$ module load intel/12.1.4 hdf5/1.8.7/nompi netcdf/4.1.3 ncl/6.1.0-beta obsgrid/04022010
$ obsgrid.exe

ABINIT

ABINIT is a package whose main program allows one to find the total energy, charge density and electronic structure of systems made of electrons and nuclei (molecules and periodic solids) within Density Functional Theory (DFT), using pseudopotentials and a planewave or wavelet basis.
ABINIT 6.8.1 is already installed on the RHEL6 clusters.
There are many changes from 6.8.1 to 6.12.3. See the 6.12.3 release notes for more information.

Here are a few examples of how to use ABINIT in a job script:

#PBS ...
#PBS -l walltime=8:00:00
#PBS -l nodes=64:ib

cd $PBS_O_WORKDIR
module load intel/12.1.4 mvapich2/1.6 hdf5/1.8.9 netcdf/4.2 mkl/10.3 fftw/3.3 abinit/6.12.3
mpirun -rmk pbs abinit < abinit.input.file > abinit.output.file

VMD

VMD is a molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting.
VMD has been installed with support for the GCC compilers (versions 4.4.5, 4.6.2, and 4.7.0), NetCDF, Python+NumPy, TCL, and OpenGL.
Here is an example of how to use it:

  1. Log in to a RHEL6 login node (joe-6, biocluster-6, atlas-6, etc.) with X-Forwarding enabled (X-Forwarding is critical for VMD to work).
  2. Load the needed modules:
    $ module load gcc/4.6.2 python/2.7.2 hdf5/1.8.7/nompi netcdf/4.1.3 vmd/1.9.1
  3. Execute “vmd” to start the GUI
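
Since the GUI cannot start without X forwarding, a quick sanity check before launching VMD can save some confusion. This is only a local convenience sketch, not a PACE-provided tool:

```shell
# Sketch: warn if X forwarding is not active before starting VMD.
# (Not a PACE tool; a hypothetical helper for your own shell profile.)
check_x() {
  # ssh -X sets $DISPLAY on the login node; empty means no X forwarding.
  [ -n "${DISPLAY:-}" ]
}

check_x || echo "DISPLAY is not set; reconnect with 'ssh -X' before running vmd" >&2
```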

NAMD

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems.
Version 2.9 of NAMD has been installed with support for GNU and Intel compilers, MPI, and FFTW3.
CUDA support in NAMD has been disabled.

Here is an example of how to use it in a job script in a RHEL6 queue (biocluster-6, atlas-6, ece, etc.):

#PBS -N NAMD-test
#PBS -l nodes=32
#PBS -l walltime=8:00:00
...
module load gcc/4.6.2 mvapich2/1.7 fftw/3.3 namd/2.9
cd $PBS_O_WORKDIR

mpirun -rmk pbs namd2 input.file

10TB soft quota per user on scratch storage

One of the many benefits of using PACE clusters is the scratch storage, which provides a fast filesystem for I/O-bound jobs. The scratch server is designed for high speed rather than large capacity. Until now, a weekly script that deletes all files older than 60 days has allowed us to sustain this service without disk quotas. However, this situation began to change as the PACE clusters grew to roughly 750 active users, with ~300 of them added since Feb 2011 alone. Consequently, it became common for scratch utilization to reach 98%-100% on several volumes, which is alarming for the health of the entire system.
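
The cleanup policy described above can be sketched with a single find invocation. The actual PACE script and mount points are internal; only the 60-day threshold comes from this announcement:

```shell
# Sketch of the weekly scratch cleanup: delete regular files whose
# modification time is more than 60 days old. The path argument is a
# placeholder, not the real scratch mount point.
purge_old_files() {
  find "$1" -type f -mtime +60 -delete
}

# Hypothetical usage:
# purge_old_files "/scratch/$USER"
```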

We are planning to address this issue with a 2-step transition plan for enabling file quotas. The first step will be applying 10TB “soft” quotas for all users for the next 3 months. A soft quota means that you will receive warning emails from the system if you exceed 10TB, but your writes will NOT be blocked. This will help you adjust your data usage and prepare for the second step: 10TB “hard” quotas that will block writes once the limit is exceeded.

Considering that the total scratch capacity is 260TB, a 10TB quota for 750 users is a very generous limit. Current statistics show that fewer than 10 users exceed this capacity. If you are one of them (you can check with the command ‘du -hs ~/scratch’) and are concerned that the 10TB quota will adversely impact your research, please contact us (pace-support@oit.gatech.edu).
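
To see where you stand against the soft quota, you can compare the output of du with the 10TB limit. A minimal sketch (the actual quota enforcement happens server-side; this is just a self-check):

```shell
# Sketch: compare your scratch usage against the 10TB soft quota.
# du -sk reports usage in 1 KB blocks; 10 TB = 10 * 1024^3 KB.
QUOTA_KB=$((10 * 1024 * 1024 * 1024))

over_quota() {
  # $1: usage in KB; succeeds when usage exceeds the soft quota.
  [ "$1" -gt "$QUOTA_KB" ]
}

usage_kb=$(du -sk "$HOME/scratch" 2>/dev/null | cut -f1)
if over_quota "${usage_kb:-0}"; then
  echo "scratch usage exceeds the 10TB soft quota" >&2
fi
```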

Scheduler Problems

The job scheduler is currently under heavy load (heavier than any we have seen so far).

Any commands you run to query the scheduler (showq, qstat, msub, etc.) will probably fail because the scheduler can’t respond in time.
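
Until the load subsides, a small retry wrapper can make interactive queries more tolerant of scheduler timeouts. This is a local convenience sketch, not a PACE-provided tool:

```shell
# Sketch: retry a command a few times with a pause between attempts,
# useful when showq/qstat time out under heavy scheduler load.
retry() {
  # retry <attempts> <delay-seconds> <command...>
  attempts=$1
  delay=$2
  shift 2
  n=0
  while [ "$n" -lt "$attempts" ]; do
    "$@" && return 0
    n=$((n + 1))
    sleep "$delay"
  done
  return 1
}

# Hypothetical usage:
# retry 5 30 qstat -u "$USER"
```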

We are working feverishly to correct the problem.

Upcoming Quarterly Maintenance on 4/17

The first quarter of the year has already passed, and it’s time for quarterly maintenance once again!

Our team will take all the clusters offline for regular maintenance and improvements on 04/17, for the entire day. We have a scheduler reservation in place to hold any jobs that would not complete before the maintenance day, so no jobs should need to be killed. Jobs with such long wallclock times will remain queued, but they will not be released until the maintenance is over.
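
If you want to estimate whether a job fits before the maintenance window yourself, you can convert its requested walltime to seconds and compare against the time remaining. A sketch (the 04/17 date comes from this announcement; the year is an assumption):

```shell
# Sketch: will a job with a given walltime finish before the 04/17 maintenance?
walltime_to_seconds() {
  # Convert an HH:MM:SS walltime string to seconds.
  # The 10# prefix prevents leading zeros from being parsed as octal.
  IFS=: read -r h m s <<EOF
$1
EOF
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Assumed maintenance date; adjust as needed.
remaining=$(( $(date -d "2012-04-17 00:00" +%s) - $(date +%s) ))
if [ "$(walltime_to_seconds 08:00:00)" -gt "$remaining" ]; then
  echo "an 8-hour job submitted now would run into the maintenance window"
fi
```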

Please direct your concerns/questions to PACE support at pace-support@oit.gatech.edu.

Thanks!

New rhel6 shared/hybrid queues are ready!

We are happy to announce the availability of shared/hybrid queues for all sharing rhel6 clusters. Please run “/opt/pace/bin/pace-whoami” to see which of these queues you have access to. We did our best to test and validate these queues, but some issues may have been overlooked. Please contact us at pace-support@oit.gatech.edu if you notice any problems.

Here’s a list of these queues:

  • mathforce-6
  • critcelforce-6
  • apurimacforce-6
  • prometforce-6 (prometheusforce-6 was too long for the scheduler)
  • eceforce-6
  • cygnusforce-6
  • iw-shared-6

Happy computing!

 

Webinar: Parallel Computing with MATLAB on Multicore Desktops and GPUs

Mathworks is offering us a very interesting webinar:

“Parallel Computing with MATLAB on Multicore Desktops and GPUs”

Friday, March 30, 2012

2:00 PM EDT

REGISTER NOW

In this webinar, we show how Parallel Computing Toolbox lets you fully leverage the computing power available on your desktop through multicore processors and GPUs.

Through demonstrations, you will learn how minimal changes to your code can speed up your MATLAB-based data analysis, design, and simulation work.

The webinar will last approximately 60 minutes. A Q&A session will follow the presentation and demos.

Mathworks contact:

Jamie Winter

508-647-7463

jamie.winter@mathworks.com

Regarding the job scheduler problems over the weekend

We experienced a major problem with one of our file servers over the weekend, which caused some of your jobs to fail. We would like to apologize for this inconvenience and provide you with more details on the issue.

In a nutshell, the management blade of the file server we use for scratch space (iw-scratch) crashed for a reason we are still investigating. This system has a failover mechanism that allows another blade to take over and continue operations, so you were still able to see your files and use the software stack that resides on this fileserver.

Our node that runs the Moab server (job scheduler), on the other hand, mounts this fileserver through a separate mechanism that uses a static IP. After the new blade took over operations, our Moab node kept trying to mount iw-scratch using the IP of the failed blade, needless to say without success.

As a result, some jobs failed with messages similar to “file not found”. This problem also rendered the Moab server unresponsive until we rebooted it Saturday night. Even after the reboot, some problems persisted until we fixed the server this morning. We will keep you updated as we learn more about the nature of the problem. We are also in contact with the vendor to prevent this from happening again.

Thank you once again for your understanding and patience. Please contact us at pace-support@oit.gatech.edu for any questions and concerns.

Possible network outage on 12/13

The network team will perform maintenance next Tuesday (12/13) at 7:30am. This is not expected to affect any systems or running jobs, but there is still a ~20% chance that a network outage could occur and last for about an hour. The team will be on site, prepared to intervene immediately should that happen.
Please note that a network outage would affect running jobs, so you may want to wait until the maintenance is over before submitting large and/or critical jobs. As always, please contact us if you have any concerns or questions.