Posts

PACE Systems Back Online

The fileserver has recovered, and all headnodes are now accessible. Jobs running off scratch should continue from where they left off. You have access to all files, including those on scratch. The server is still reconstructing data, which may slow down the system (especially on volumes v0 and v3) for a few more hours. This slowness will go away once the reconstruction is complete.

We expect to receive the replacement for the failed part tomorrow (6/6). The fileserver can function without this part, and its installation will not cause any interruptions.

Once again, thank you for bearing with us while we were working on this problem. If you have jobs that you think crashed due to this problem, please send us an email at pace-support@oit.gatech.edu.

Login Problems, current situation

The Panasas fileserver (scratch storage) crashed today while recovering from a hardware problem. This caused the headnodes that mount Panasas to hang, and they are currently not accessible via SSH.

We do have a way to disable Panasas and give you access to the headnodes right away, without the Panasas storage. However, doing so would crash all of the jobs using the scratch space. We do not want that, especially considering that some jobs have been running for days.

We are now running a filesystem check on the system, which will take 3 to 4 hours. This is required to prevent data corruption. After this process, Panasas should recover and the jobs will continue running. At that point, the headnodes will become accessible again.

If you urgently need to access your data in your home or project directories, please contact us at pace-support@oit.gatech.edu. We might be able to help you access your files via a headnode that does not mount Panasas.

The filesystem check has been running for 40 minutes and is currently at 26% (as of 12:25pm EST).

Thank you once again for your understanding and patience, and we apologize for this inconvenience.

Login Problems

With the exception of RHEL-5 Atlas users, it is currently not possible for regular users to log into PACE, due to a problem with the PANFS storage system. We are working to get the problem resolved as quickly as possible.

PC1 back online, troublesome process identified

Hey Cygnus users!

It looks like we have finally been able to identify the cause of the recent file server crashes and have tracked it down to a particular job and how it handles file I/O. We’re in contact with the user now to try to improve the job’s I/O behavior to prevent this from happening again (at least, with this job).

Thank you for your patience; we know this has been inconvenient.

RESOLVED (again…): PC1 server back online

Hey folks, it’s me again.

As of this post, I have been able to keep the system running for 3 solid hours doing the catch-up backup runs with no issues. The previous announcement and subsequent embarrassment made me wary of announcing this too early again, but I think the system really is stable now, so have at it.

Compute away…

bnm

PC1 file server still inaccessible…

*sigh*

I made sure to let the system get loaded down for a while with the backups and such before I made that announcement, but sure enough, something is still wrong here as now the replacement file server has crashed.

Looking into it now, but now I have to suspect something on the OS level has gone terribly wrong in the past few days.

Sorry folks.

RESOLVED: Hardware Failure for /PC1 filesystem users

Hey folks,

The PC1 file system is now back online. No data was lost (no disks were harmed in the incident), though you probably need to check the status of any running or scheduled jobs.

We had to use a different (though equivalent) system to get you back online. If we need to switch to the actual replacement hardware provided by Penguin, we will do so on the next Maintenance Day (July 16); otherwise, you should be ready to rock and roll.

Sorry about the delays, as some of the needed parts were not available.

New and Updated Software: Portland Group Compiler and ANSYS

Two new sets of software have been installed on PACE-managed systems – PGI 12.10 and ANSYS 14.5 service pack 1.

PGI 12.10

The Portland Group, Inc. (a.k.a. PGI) makes software compilers and tools for parallel computing. The Portland Group offers optimizing parallel FORTRAN 2003, C99 and C++ compilers and tools for workstations, servers and clusters running Linux, MacOS or Windows operating systems based on the following microprocessors:

This version of the compiler supports the OpenACC GPU programming directives.
More information can be found at The Portland Group website.
Information about using this compiler with the OpenACC directives can be found at PGI Insider and OpenACC.

Usage Example

$ module load pgi/12.10
$ pgfortran example.f90
$ ./a.out
Hello World

ANSYS 14.5 Service Pack 1

ANSYS develops, markets and supports engineering simulation software used to foresee how product designs will behave and how manufacturing processes will operate in real-world environments.

Usage Example

$ module load ansys/14.5
$ ansys145

Panasas problems, impacting all PACE clusters

The Panasas storage server started responding slowly approximately an hour ago. We use this server to host the entire software stack, as well as the “scratch” directory in your home folders.

No jobs have been killed, but you will notice significant performance degradation. Starting new jobs or commands will also be slow, although they should run.

We are actively working with the vendor to resolve these issues and will keep you updated via this blog and the “pace-availability” email list.

Thank you for your patience.

PACE Team