PC1 back online, troublesome process identified

Hey Cygnus users!

It looks like we have finally been able to identify the cause of the recent file server crashes and track it down to a particular job and how it handles file I/O. We’re in contact with the user now to try to improve the job’s I/O behavior and prevent this from happening again (at least, with this job).

Thank you for your patience, we know this has been inconvenient.

PC1 & PB1 filesystems back online

Hey folks,

It looks like we may have finally found the issue tying up the PB1 file server and occasionally locking up the PC1 file server. We’ve isolated the compute nodes that seemed to be generating the bad traffic, and have even isolated the processes which appear to have compounded the problem on a pair of shared nodes (thus linking the two server failures). With any luck, we’ll get those nodes back online once their other jobs complete or are cancelled.

Thank you for your patience while we tracked this problem down. We know it was quite inconvenient, but we now have a decent picture of what occurred, and thankfully it was something that is very unlikely to repeat itself.

RESOLVED (again…): PC1 server back online

Hey folks, it’s me again.

As of this post, I have been able to keep the system running for 3 solid hours doing the catch-up backup runs with no issue. The previous announcement and subsequent embarrassment made me wary of announcing this too early again, but I think the system really is stable now, so have at it.

Compute away…

bnm

PC1 file server still inaccessible…

*sigh*

I made sure to let the system get loaded down for a while with the backups and such before I made that announcement, but sure enough, something is still wrong here, as now the replacement file server has crashed.

Looking into it now, but I have to suspect that something at the OS level has gone terribly wrong in the past few days.

Sorry folks.

RESOLVED: Hardware Failure for /PC1 filesystem users

Hey folks,

The PC1 file system is now back online. No data was lost (no disks were harmed in the incident), though you probably need to check the status of any running or scheduled jobs.

We had to use a different (though equivalent) system to get you back online. On the next Maintenance Day (July 16), we will switch to the actual replacement hardware provided by Penguin if we need to; otherwise you should be ready to rock and roll.

Sorry about the delays; some of the needed parts were not available.

PACE Sponsors High Performance Computing Townhall!

What could you do with over 25,000 computer cores? Join faculty and students at the April 30 High Performance Computing Town Hall to find out. The event will be held in the MaRC auditorium and is sponsored by PACE, Georgia Tech’s Advanced Computing Environment program.

When: April 30, 3-5pm
Where: MaRC Auditorium

Overview

PACE provides researchers with a robust computing platform that enables faculty and students to carry out research initiatives without the burden of maintaining infrastructure, software, and dedicated technicians. The program’s services are managed by OIT’s Academic & Research Technologies department and include physical hosting, system management infrastructure, high-speed scratch storage, home directory space, commodity networking, and common HPC software such as RedHat Enterprise Linux, VASP, LAMMPS, BLAST, Matlab, Mathematica, and Ansys Fluent. Various compilers, math libraries, and other middleware are available for those who author their own codes. All of these resources are designed and offered with the specific intention of combining intellect with efficiency, advancing the research presence here at Tech to the peak of its abilities.

There are many ways to participate in PACE. With a common infrastructure, we support clusters dedicated to individual PIs or research groups, clusters that are shared amongst participants, and our FoRCE Research Computing Environment (aka “The FoRCE”). The FoRCE is available to all campus users via a merit-based proposal mechanism.

The April 30 HPC Town Hall is open to members of the Tech research community and will feature presentations on the successes and challenges that PACE is currently experiencing, followed by a panel discussion and Q&A.

For more information on the PACE program, visit the official website at www.pace.gatech.edu and the program’s blog at blog.pace.gatech.edu.

Agenda (To Be Finalized Soon)

  • Message from Georgia Tech’s CTO Ron Hutchins
  • Message from PACE’s director Neil Bright
  • Lightning Talks By Faculty
  • Discussion around technologies and capabilities currently under investigation by PACE
  • Panel Discussion regarding future directions for PACE
  • Question and Answer Session

Account related problems on 03/14/2013

We experienced some account management difficulties today (03/14/2013), mostly caused by exceeding the capacity of our database. We found the cause and fixed all of the issues. 

This problem might have affected you in two different ways: first, temporary problems logging in to the headnodes, and second, failure of some recently allocated jobs on compute nodes. As far as we know, none of the running jobs were affected.

We apologize for any inconvenience this might have caused. If you have experienced any problems, please send us a note (pace-support@oit.gatech.edu).

PACE Debugging and Profiling Workshop on 03/21/2013

Dear PACE community,

We are happy to announce the first Debugging and Profiling Workshop, which will take place on 03/21/2013, 1pm-5pm, in the Old Rich Building Conference Room (ITDC 242).

If your code is crashing, hanging, producing inaccurate results, or running unbearably slowly, you *do* want to be there. We will go over text- and GUI-based tools that are available on the PACE clusters, including gdb, valgrind, DDT, gprof, PAPI, and TAU. There will be hands-on examples, so bring your laptop if you can, although it is not mandatory.
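
To give you a taste of the hands-on part, here is a minimal sketch (not the actual course material; the file name buggy.c and the commands in the comments are just illustrative) of the kind of bug we will chase down with the tools above:

    /* buggy.c -- a small program with a deliberate off-by-one error.
     *
     * Compile with debugging symbols so gdb and valgrind can map errors to source lines:
     *     gcc -g -O0 buggy.c -o buggy
     *
     * Then, for example:
     *     gdb ./buggy          (run, backtrace, inspect variables)
     *     valgrind ./buggy     (reports the invalid write and where it happened)
     *     gcc -pg -g buggy.c -o buggy; ./buggy; gprof ./buggy gmon.out
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int n = 10;
        int i;
        int *data = malloc(n * sizeof *data);

        /* Off-by-one: the loop writes one element past the end of the buffer,
         * which valgrind flags as an invalid write and which may crash later. */
        for (i = 0; i <= n; i++)
            data[i] = i * i;

        printf("data[5] = %d\n", data[5]);
        free(data);
        return 0;
    }

During the workshop we will run gdb and valgrind on examples like this one and see how they point to the exact line of the problem.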

If you bring a laptop to follow the hands-on examples, please make sure that you have:

  • An active PACE account with access to one of the RHEL6 queues
  • Access to “GTwifi”
  • A terminal client to log in (PuTTY for Windows, Terminal for Mac)
  • A text editor that you are comfortable with (Vim, Emacs, nano, …)

Don’t worry if your laptop is not configured to access the PACE clusters. I will be in the conference room half an hour early to help you prepare for the session. Just show up a bit early with your laptop, and we will take care of the rest together 🙂

Please RSVP (to mehmet.belgin@oit.gatech.edu) by 03/19/2013 and include your GT username. Your RSVP will guarantee a seat and printed copies of the course material. You will also be able to fetch an electronic copy (including all the slides and codes) anytime by running a simple command on the cluster (we will do that during the class).

Here’s the full schedule:

  • 12:30pm -> 1:00pm : (Optional) Help session to make sure your laptop is ready for the workshop
  •  1:00pm -> 2:45pm : Debugging session (gdb, valgrind, DDT)
  •  2:45pm -> 3:15pm : Break
  •  3:15pm -> 5:00pm : Profiling session (gprof, PAPI, TAU )

The location is the Old Rich Building, ATDC conference room, #242. The google knows us as “258 4th Street”. We are right across from the Clough Commons Building.

We look forward to seeing you there!