Posts – Page 47 – Partnership for an Advanced Computing Environment

Login Node Storage Server Problems

Last night (2013/06/30), one of the storage servers that is responsible for many of the cluster login nodes encountered some major problems.
These issues are preventing the login nodes from allowing any user to login or use the server.
Following is a list of the affected login nodes:
cee
chemprot
cns
cygnus-6
force-6
force
math
mokeys
optimus
testflight-6

We are aware of the problem and we are working as quickly as possible to fix this.
Please let us know of any problems you are having that may be related to this.
We will keep you posted about our progress.

Grace 5.1.23 installed

Grace

Grace is a WYSIWYG 2D plotting tool for the X Window System and M*tif.
Grace is a descendant of ACE/gr, also known as Xmgr.

Example Usage

$ module load grace/5.1.23

$ xmgrace

Newest LAMMPS 17Jun13 is Installed

The new version is built with various compiler combinations and with fftw 3.3

after loading compiler’s modules, do

module load mkl/10.3

module load fftw/3.3

module load lammps/17Jun13

Intel Cluster Studio 2013 XE Installed

The Intel Cluster Studio 2013 XE software suite installation adds several new and useful tools for PACE users.

VTune: Intel® VTune™ Amplifier XE 2013 is a serial and parallel performance profiler for C, C++, C#, Fortran, Assembly and Java.
Inspector: Intel® Inspector XE is an easy to use memory debugger and thread debugger for serial and parallel applications.
Advisor: Intel® Advisor XE is a threading prototyping tool for C, C++, C# and Fortran.

This installation includes updated versions of many currently installed packages. The updates include:

MKL – updated to 11.0.1
TBB – updated to 4.1
IPP – updated to 7.1.1
Compilers (C, C++, Fortran) – updated to 13.2.146

To use the new or updated software, please load whichever modules are appropriate:

intel/13.2.146 (loads the C, C++, and Fortran compilers)
vtune/2013xe (loads VTune)
advisor/2013xe (loads Advisor)
inspector/2013xe (loads Inspector)
tbb/4.1 (loads the Thread Building Blocks)
ipp/7.1.1 (loads the Performance Primitives)
mkl/11.0.1 (load the Math Kernel Library)

For information on using VTune, Inspector, Advisor, or any of the Intel tools, see the Intel Cluster Studio XE site.

New 128-procs Allinea DDT license on PACE clusters

Allinea DDT is a powerful parallel debugger with an easy-to-use GUI. You can run it by loading its module (module load ddt/3.2) and entering “ddt”. Some introduction level information can be found in “https://pace.gatech.edu/workshop/DebuggingProfiling.pdf“.

We extended our single-user 32-procs license to multi-user 128 procs. Aside from the increased number of processors, this license allows multiple users to use the software at the same time, as long as the total number of processors do not exceed 128. E.g., two users can use the software with 64procs run each.

Happy debugging!

PACE Systems Back Online

The fileserver has recovered, and all headnodes are now accessible. The jobs running off scratch should continue from where they left. You have access to all files, including the scratch. The server is still performing reconstruction of data, which may slow down the system (especially on volumes v0 and v3) for a few more hours. This slowness will go away when the reconstruction is complete.

We are expecting to receive the failed part tomorrow (6/6). The fileserver can function without this part and its installation will not cause any interruptions.

Once again, thank you for bearing with us while we were working on this problem. If you have jobs that you think crashed due to this problem, please send us an email at pace-support@oit.gatech.edu.

Login Problems, current situation

The Panasas fileserver (scratch storage) crashed today while recovering from a hardware problem. This causes the headnodes (that mount Panasas) to hang, and they are not accessible via SSH now.

We do have a way to disable Panasas and give you access to headnodes right away, without the panasas storage. However, doing so will crash all of the jobs using the scratch space. We do not want that, especially considering that some jobs have been running for days.

We are now running a filesystem check on the system, which will take 3 to 4 hours. This is required to prevent data corruption. After this process, Panasas should recover and the jobs will continue running. At the point, the headnodes will become accessible again.

If you urgently need to access your data in your home or project directories, please contact us at pace-support@oit.gatech.edu. We might be able to help you access your files via a headnode that does not mount Panasas.

The filesystem check has been running for 40 minutes and current at 26% (by 12:25pm EST).

Thank you once again for your understanding and patience, and we apologize for this inconvenience,

Login Problems

With the exception of RHEL-5 Atlas users, it is currently not possible for regular users to log into PACE, due to a problem with the PANFS storage system. We are working to get the problem resolved as quickly as possible.

PC1 back online, troublesome process identified

Hey Cygnus users!

It looks like we have finally been able to identify the cause of recent file server crashes and tracked it down to a particular job run and how it hands file I/O. We’re in contact with the user now to try to improve the job’s I/O behavior to prevent this from happening again (at least, with this job).

Thank you for your patience, we know this has been inconvenient.

PC1 & PB1 filesystems back online

Hey folks,

It looks like we may have finally found the issue tying up the PB1 file server and the occasional lock up of the PC1 file server. We’ve isolated the compute nodes that seemed to be generating the bad traffic, and have even isolated the processes which appear to have compounded the problem on a pair of shared nodes (thus linking the two server failures). With any luck, we’ll get those nodes online once their other jobs complete or are cancelled.

Thank you for the patience you have given us while we tracked this problem down. We know it was quite inconvenient, but we have a decent picture of what occurred and thankfully it was something that is very unlikely to repeat itself.

Partnership for an Advanced Computing Environment

Recent Posts