Panasas problems, impacting all PACE clusters

The Panasas storage server started responding slowly approximately an hour ago. This server hosts the entire software stack as well as the “scratch” directory in your home folders.

No jobs have been killed, but you will notice significant performance degradation. Starting new jobs or commands will also be slow, although they should still run.

We are actively working with the vendor to resolve these issues and will keep you updated via this blog and the “pace-availability” email list.

Thank you for your patience.

PACE Team

Collapsing nvidiagpu and nvidia-gpu queues

PACE has several nodes with NVidia GPUs installed.
There are currently two queues (nvidiagpu and nvidia-gpu) that have GPU nodes assigned to them.
It is confusing to have two queues with the same purpose and slightly different names, so PACE will be collapsing both queues into the “nvidia-gpu” queue.
That means that the nvidiagpu queue will disappear, and the nvidia-gpu queue will have all of the resources contained by both queues.
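
For reference, here is a minimal submission sketch that targets the consolidated queue, assuming the Torque/Moab-style PBS scripts used on PACE; the job name, resource counts, walltime, and program are placeholders, and the exact GPU resource syntax may differ on your cluster:

#PBS -N gpu-job
#PBS -q nvidia-gpu
#PBS -l nodes=1:ppn=1:gpus=1
#PBS -l walltime=1:00:00

cd $PBS_O_WORKDIR
./my_gpu_program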

Please send any questions or concerns to pace-support@oit.gatech.edu

Jobs failing to start due to scheduler problems (~10am this morning)

We experienced scheduler-related problems this morning (around 10am), which caused jobs to terminate immediately after they were allocated to compute nodes. The system is back to normal; however, we are still investigating the root cause.

If any of your jobs were affected by this issue, please resubmit them. If you continue to have problems, please contact us as soon as possible.
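
As a quick sketch, assuming the usual Torque-style tools on PACE (the script name is a placeholder), you can list the jobs you still have in the system and resubmit any that were lost:

$ qstat -u $USER
$ qsub my_job.pbs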

We are sorry for the inconvenience.

Cluster Downtime December 19th for Scratch Space Cancelled

We have been working very closely with Panasas regarding the necessity of emergency downtime for the cluster to address the difficulties with the high-speed scratch storage. At this time, they have located a significant problem in their code base that they believe is responsible for this and other issues. Unfortunately, the full product update will not be ready in time for the December 19th date, so we have cancelled this emergency downtime; all running and scheduled jobs will continue as expected.

We will update you with the latest summary information from Panasas when available. Thank you for your continued patience and cooperation with this issue.

– Paul Manno

TSRB Connectivity Restored

Network access to the RHEL-5 Joe cluster compute nodes has been restored.

The problem was caused by a UPS power disruption to a network switch in the building. In addition to recovering the switch and UPS, the backbone team added power redundancy by installing a second PDU in the switch and connecting it to a different UPS.

New Software: VASP 5.3.2

VASP 5.3.2 – Normal, Gamma, and Non-Collinear versions

Version 5.3.2 of VASP has been installed.
The newly installed versions have been checked against our existing tests; results agree with the expected values to within a small numerical tolerance.
Please check this new version against your own known-correct results!

Using it

#First, load the required compiler 
$ module load intel/12.1.4
#Load all the necessary support modules
$ module load mvapich2/1.6 mkl/10.3 fftw/3.3
#Load the vasp module
$ module load vasp/5.3.2
#Run vasp
$ mpirun vasp
#Run the gamma-only version of vasp
$ mpirun vasp_gamma
#Run the noncollinear version of vasp
$ mpirun vasp_noncollinear
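
For batch use, here is a minimal PBS job script sketch that combines the module loads above with an MPI launch; it assumes a Torque/Moab-style scheduler, and the job name, node counts, and walltime are placeholders to adjust for your own runs:

#PBS -N vasp-job
#PBS -l nodes=2:ppn=8
#PBS -l walltime=12:00:00

cd $PBS_O_WORKDIR
module load intel/12.1.4 mvapich2/1.6 mkl/10.3 fftw/3.3 vasp/5.3.2
mpirun vasp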

Compilation Notes

  • Only the Intel compiler generated MPI-enabled vasp binaries that correctly executed the test suite.
  • The “vasp” binary was compiled with these preprocessor flags: -DMPI -DHOST=\"LinuxIFC\" -DIFC -DCACHE_SIZE=12000 -DMINLOOP=1 -DPGF90 -Davoidalloc -DNGZhalf -DMPI_BLOCK=8000
  • The “vasp_gamma” binary was compiled with these preprocessor flags: -DMPI -DHOST=\"LinuxIFC\" -DIFC -DCACHE_SIZE=12000 -DMINLOOP=1 -DPGF90 -Davoidalloc -DNGZhalf -DwNGZhalf -DMPI_BLOCK=8000
  • The “vasp_noncollinear” binary was compiled with these preprocessor flags: -DMPI -DHOST=\"LinuxIFC\" -DIFC -DCACHE_SIZE=12000 -DMINLOOP=1 -DPGF90 -Davoidalloc -DMPI_BLOCK=8000

TSRB Connectivity Problem

All of the RHEL-5 Joe nodes are currently unavailable due to an unspecified connectivity problem at TSRB. This problem does not impact any joe-6 nodes or nodes from any other group.

Since connectivity between Joe and the rest of PACE is required for home, project, and scratch storage access, all of the jobs currently running on Joe will eventually get stuck in an I/O-wait state, but they should resume once connectivity has been restored.

Cluster Downtime December 19th for Scratch Space Issues

As many of you have noticed, we have experienced disruptions and undesirable performance with our high-speed scratch space. We are continuing to work diligently with Panasas to discover the root cause of these faults and repair them.

As we work toward a final resolution of the product issues, we need to schedule an additional cluster-wide downtime of the Panasas storage to implement a potential fix. We have scheduled a short (2-hour) downtime for Wednesday, December 19th at 2pm ET. During this window, we expect to install a tested software release.

We understand this is an inconvenience to all our users, but we feel it is important enough to the PACE community to warrant the disruption. If this particular date and time is especially difficult for you, please contact us and we will do our best to negotiate a better date or time.

It is our hope that this will provide a permanent solution to these near-daily disruptions.

– Paul Manno

New and Updated Software: BLAST, COMSOL, Mathematica, VASP

All of the software detailed below is available through the “modules” system installed on all PACE-managed Red Hat Enterprise Linux 6 computers.
For basic usage instructions on PACE systems, see the Using Software Modules page.

NCBI BLAST 2.2.25 – Added multithreading in new GCC 4.6.2 version

The 2.2.25 version of BLAST that was compiled with GCC 4.4.5 has multithreading (i.e. multi-CPU execution) disabled.
A new version of BLAST with multithreading enabled has been compiled with the GCC 4.6.2 compiler.

Using it

#First, load the required compiler 
$ module load gcc/4.6.2
#Now load BLAST
$ module load ncbi_blast/2.2.25
#Setup the environment so that blast can find the database
$ export BLASTDB=/path/to/db
#Run a nucleotide-nucleotide search
$ blastn -query /path/to/query/file -db <db_name> -num_threads <number of CPUS allocated to job>
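
In a batch job, the -num_threads value should match the number of CPUs you actually requested. Here is a minimal PBS sketch, assuming a Torque/Moab-style scheduler; the job name, resource counts, and paths are placeholders:

#PBS -N blastn-job
#PBS -l nodes=1:ppn=8
#PBS -l walltime=4:00:00

cd $PBS_O_WORKDIR
module load gcc/4.6.2 ncbi_blast/2.2.25
export BLASTDB=/path/to/db
#Use 8 threads to match the ppn=8 request above
blastn -query /path/to/query/file -db <db_name> -num_threads 8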

COMSOL 4.3a – Student and Research versions

COMSOL Multiphysics version 4.3a contains many new functions and additions to the COMSOL product suite.
See the COMSOL Release Notes for an overview of the new products and details on new functionality in existing products.

Using it

#Load the research version of comsol 
$ module load comsol/4.3a-research
$ comsol ...
#Use the matlab livelink
$ module load matlab/r2011b
$ comsol -mlroot ${MATLAB}
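
COMSOL can also run a saved model non-interactively, which is convenient for batch jobs. A minimal sketch using COMSOL’s batch mode; the model file names are placeholders:

#Run a saved model without the GUI
$ module load comsol/4.3a-research
$ comsol batch -inputfile my_model.mph -outputfile my_model_solved.mph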

Mathematica 9.0

Mathematica 9 is a major update to the Mathematica software.

Using it

$ module load mathematica/9.0 
$ mathematica
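
If you want to run Mathematica non-interactively (for example, from a batch job), here is a minimal sketch using the command-line kernel; the script name is a placeholder:

#Run a Mathematica script without the GUI
$ module load mathematica/9.0
$ math -script my_script.m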

VASP 5.2.12

The pre-calculated kernel for the vdW-DF functional has been installed in the same directory as the vasp binary.
This pre-calculated kernel is contained in the file “vdw_kernel.bindat”.

Using it

#First, load the vasp module (and all the prerequisites) 
$ module load intel/12.1.4 mvapich2/1.6 mkl/10.2 fftw/3.3 vasp/5.2.12
#Copy the kernel to where vasp expects (normally the working directory)
$ cp ${VDW_KERNEL} .
# Run vasp
$ mpirun vasp

Profiling tools available: PAPI and TAU

The Performance API (PAPI) and TAU are two of the most common open source profiling tools, and they are now available for PACE users, including support for hardware counters and threading.

PAPI description, from their website:

The PAPI project specifies a standard application programming interface (API) for accessing hardware performance counters available on most modern microprocessors. PAPI provides two interfaces to the underlying counter hardware; a simple, high level interface for the acquisition of simple measurements and a fully programmable, low level interface directed towards users with more sophisticated needs.

TAU description, from their website:

TAU Performance System is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, Java, Python. This tool is capable of gathering performance information through instrumentation of functions, methods, basic blocks, and statements.

 

The TAU toolkit uses PAPI for event collection and provides two visualization tools: the text-based pprof and the graphical paraprof.

A *very* short guide to using TAU on PACE clusters

* First, you need to recompile your code with TAU wrappers.

  • Load the modules your code needs (compiler, MPI, etc)
module load gcc/4.4.5 mvapich2/1.6
  • Load the latest tau module (currently tau/2.22-p1, older versions are known to have bugs)
module load tau/2.22-p1

(This will load PDT and PAPI modules too, if you don’t have them loaded already)

  • The TAU module sets the correct TAU Makefile in your environment. Check that it is set correctly:
$ echo $TAU_MAKEFILE
/usr/local/packages/tau/2.22-p1/mvapich2-1.6/gcc-4.4.5/x86_64/lib/Makefile.tau-papi-mpi-pthread-pdt-openmp

  • Compile your code using one of the compiler wrapper scripts.

E.g., for a f90 code:

tau_f90.sh -L${PAPIDIR}/lib -lpfm loop_test.f90 -o loop_test

Note that the “-L${PAPIDIR}/lib -lpfm” part is necessary on PACE clusters to avoid the system-default libpfm, which is not compatible with TAU. If you do not specify it, you will see this warning:

Error: Reverting to a Regular Make
To suppress this message and revert automatically, please add -optRevert to your TAU_OPTIONS environment variable
Press Enter to continue
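
As the message suggests, you can also set -optRevert in TAU_OPTIONS so the wrapper falls back automatically instead of prompting; a one-line sketch for a bash shell:

export TAU_OPTIONS="$TAU_OPTIONS -optRevert"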

* Run the code as usual (not on the headnode!!) 

 mpirun -np 4 ./loop_test

You will see profile files named “profile.A.B.C” in the same folder, which indicates that TAU ran and collected profiling data.

* Finally, run pprof or paraprof from the same directory to see the results!

    • pprof -ea   (sort by exclusive time and show all details)
    • paraprof

Remember, these are very brief instructions. Please refer to PAPI and TAU documentation for more details:

PAPI Reference

TAU User Guide

Enjoy!