PC1 back online, troublesome process identified

Hey Cygnus users!

It looks like we have finally identified the cause of the recent file server crashes: a particular job run and how it handles file I/O. We’re in contact with the user now to improve the job’s I/O behavior and prevent this from happening again (at least with this job).

Thank you for your patience, we know this has been inconvenient.

RESOLVED (again…): PC1 server back online

Hey folks, it’s me again.

As of this post, I have been able to keep the system running for 3 solid hours of catch-up backup runs with no issues. The previous announcement and subsequent embarrassment made me wary of declaring victory too early again, but I think the system really is stable now, so have at it.

Compute away…

bnm

PC1 file server still inaccessible…

*sigh*

I made sure to let the system get loaded down for a while with the backups and such before making that announcement, but sure enough, something is still wrong: the replacement file server has now crashed.

I’m looking into it, but I now have to suspect that something at the OS level has gone terribly wrong in the past few days.

Sorry folks.

RESOLVED: Hardware Failure for /PC1 filesystem users

Hey folks,

The PC1 file system is now back online. No data was lost (no disks were harmed in the incident), though you probably need to check the status of any running or scheduled jobs.

We had to use a different (though equivalent) system to get you back online. On the next Maintenance Day (July 16), we will switch to the actual replacement hardware provided by Penguin if needed; otherwise, you should be ready to rock and roll.

Sorry about the delays; some of the needed parts were not available.

New and Updated Software: Portland Group Compiler and ANSYS

Two new software packages have been installed on PACE-managed systems – PGI 12.10 and ANSYS 14.5 Service Pack 1.

PGI 12.10

The Portland Group, Inc. (a.k.a. PGI) makes software compilers and tools for parallel computing. The Portland Group offers optimizing parallel Fortran 2003, C99 and C++ compilers and tools for workstations, servers and clusters running Linux, MacOS or Windows operating systems.

This version of the compiler supports the OpenACC GPU programming directives.
More information can be found at The Portland Group website.
Information about using this compiler with the OpenACC directives can be found at PGI Insider and OpenACC.

Usage Example

$ module load pgi/12.10
$ pgfortran example.f90
$ ./a.out
Hello World

ANSYS 14.5 Service Pack 1

ANSYS develops, markets and supports engineering simulation software used to foresee how product designs will behave and how manufacturing processes will operate in real-world environments.

Usage Example

$ module load ansys/14.5
$ ansys145

Panasas problems, impacting all PACE clusters

The Panasas storage server started responding slowly approximately an hour ago. We use this server to host the entire software stack, as well as the “scratch” directory in your home folders.

No jobs have been killed, but you will notice significant degradation in performance. Starting new jobs/commands will also be slow, although they should run.

We are actively working with the vendor to resolve these issues and will keep you updated via this blog and the “pace-availability” email list.

Thank you for your patience.

PACE Team

Collapsing nvidiagpu and nvidia-gpu queues

PACE has several nodes with NVidia GPUs installed.
There are currently two queues (nvidiagpu and nvidia-gpu) that have GPU nodes assigned to them.
It is confusing to have two queues with the same purpose and slightly different names, so PACE will be collapsing both queues into the “nvidia-gpu” queue.
That means that the nvidiagpu queue will disappear, and the nvidia-gpu queue will have all of the resources contained by both queues.
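If your submission scripts name the old queue, the only change needed should be the `-q` line. Here is a hedged sketch of a PBS job-script header targeting the merged queue; the job name, resource request, and executable are illustrative, and the exact resource syntax may differ on your cluster.

```shell
#PBS -N gpu-job              # job name (illustrative)
#PBS -q nvidia-gpu           # merged GPU queue; replaces "-q nvidiagpu"
#PBS -l nodes=1:ppn=1        # resource request; GPU syntax varies by site
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR
./my_gpu_program             # hypothetical executable
```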

Please send any questions or concerns to pace-support@oit.gatech.edu

Jobs failing to start due to scheduler problems (~10am this morning)

We experienced scheduler-related problems this morning (around 10am), which caused jobs to terminate immediately after they were allocated to compute nodes. The system is back to normal; however, we are still investigating what caused the issue.

If you have jobs that are affected by this issue, please resubmit them. If you continue to have problems, please contact us as soon as possible.

We are really sorry for this inconvenience.


Cluster Downtime December 19th for Scratch Space Cancelled

We have been working very closely with Panasas regarding the necessity of emergency downtime for the cluster to address the difficulties with the high-speed scratch storage. At this time, they have located a significant problem in their code base that, they believe, is responsible for this and other issues. Unfortunately, the full product update will not be ready in time for the December 19th date so we have cancelled this emergency downtime and all jobs running or scheduled will continue as expected.

We will update you with the latest summary information from Panasas when available. Thank you for your continued patience and cooperation with this issue.

– Paul Manno