Posts

RESOLVED: Hardware Failure for /PC1 filesystem users

Hey folks,

The PC1 file system is now back online. No data was lost (no disks were harmed in the incident), though you probably need to check the status of any running or scheduled jobs.

We had to use a different (though equivalent) system to get you back online. Should we need to switch to the actual replacement hardware provided by Penguin, we will do so on the next Maintenance Day (July 16); otherwise, you should be ready to rock and roll.

Sorry about the delays; some of the needed parts were not available.

Hardware Failure for /PC1 filesystem users

Hey folks,

The fileserver providing access to the filesystems hosted under (/nv)/pc1 has suffered a severe failure, requiring replacement parts before we can bring it online again. We are in contact with the vendor to try and resolve this as quickly as possible.


PACE Maintenance day complete

We have completed our maintenance day activities, and are now back into regular operation.  Please let us know (via email to pace-support@oit.gatech.edu) if you encounter problems.

 

–Neil Bright

PACE maintenance day – NEXT WEEK 4/16

The next maintenance day (4/16, Tuesday) is just around the corner and we would like to remind you that all systems will be powered off for the entire day. You will not be able to access the headnodes, compute nodes or your data until the maintenance tasks are complete.

None of your jobs will be killed: the job scheduler knows about the planned downtime and will not start any job that would still be running when it begins. If possible, please check the walltimes of the jobs you will be submitting and adjust them so the jobs complete before the maintenance day. Submitting jobs with longer walltimes is still OK, but they will be held by the scheduler and released right after the maintenance day.
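For example, a Torque submission with a walltime chosen to end before 4/16 will run as usual. Here is a minimal sketch only; the queue name, resource request and program are placeholders, not a specific PACE configuration:

#PBS -N myjob
#PBS -q myqueue                  # placeholder: use the queue you normally submit to
#PBS -l nodes=1:ppn=4
#PBS -l walltime=24:00:00        # pick a value that lets the job finish before 4/16
cd $PBS_O_WORKDIR
./my_program

$ qsub myjob.pbs

If the requested walltime would overlap the maintenance window, the scheduler simply holds the job and releases it after maintenance is complete.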

We have many tasks to complete, and here’s a summary:

1) Job Resource Manager/Scheduler maintenance

Contrary to the initial plan, we decided NOT to upgrade the resource manager (torque) and job scheduler (moab) software yet. We have been testing the new versions of this software (with your help) and, unfortunately, identified significant bugs/problems along the way. Despite being old, the current versions are known to be robust, so we will maintain the status quo until we resolve all of the problems with the vendor.

2) Interactive login prevention mechanism

Ideally, compute nodes should not allow interactive logins unless the user has active jobs on the node. We noticed, however, that some users can ssh directly to compute nodes and start jobs there, which may lead to resource conflicts and unfair use of the cluster. We identified the problem and will apply the fix on this maintenance day.
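If you need interactive access to a compute node, request it through the scheduler instead; with Torque this looks like the following (a sketch only; the queue name and resource request are placeholders for whatever you normally use):

$ qsub -I -q myqueue -l nodes=1:ppn=4,walltime=02:00:00

The -I flag gives you an interactive shell on a node the scheduler has allocated to you, so your work does not collide with other users' jobs.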

3) Continued RHEL-6 migration

We are planning to convert all of the remaining Joe nodes to RHEL6 in this cycle. We will also convert 25% of the remaining RHEL5 FoRCE nodes. We are holding off on the migration for the Aryabhata and Atlas clusters at the request of those communities.

4) Hardware installation and configuration

We noticed that some of the nodes in the Granulous, Optimus and FoRCE clusters are still running diskless even though they have local disks, and some nodes are not using the optimal location for their /tmp. We will fix these problems.

We received (and tested) a replacement for the fileserver for the Apurimac project storage (pb3), since we have been experiencing problems there. We will install the new system and swap the disks. This is just a mechanical process and your data is safe. As an extra precaution, we have been taking incremental backups (in addition to the regular backups) of this storage since it first started showing signs of failure.

5) Software/Configurations

We will also patch/update/add software, including:

  • Upgrade the node health checker scripts
  • Deploy new database-based configuration makers (in dry-run mode for testing)
  • Reconfigure licensing mechanism so different groups can use different sources for licenses

6) Electrical Work

We will also perform some electrical work to better facilitate the recent and future additions to the clusters. We will replace some problematic PDUs and redistribute the power among racks.

7) New storage from Data Direct Networks (DDN)

Last, but not least! In concert with a new participant, we have procured a new high performance storage system from DDN. In order to make use of this multi-gigabyte-per-second monster, we are installing the GPFS filesystem, a commercial filesystem which PACE is funding. We will continue to operate the Panasas in parallel with the DDN, and both storage systems can be used at the same time from any compute node. We are planning a new storage offering that allows users to purchase additional capacity on this system, so stay tuned.

As always, please contact us at pace-support@oit.gatech.edu with any questions/concerns you may have.

Thank you!

PACE Team

PACE Sponsors High Performance Computing Townhall!


What could you do with over 25,000 computer cores? Join faculty and students at the April 30 High Performance Computing Town Hall to find out. The event will be held in the MaRC auditorium and is sponsored by PACE, Georgia Tech’s Advanced Computing Environment program.

When: April 30, 3-5pm
Where: MaRC Auditorium (Map to location)

Overview

PACE provides researchers with a robust computing platform that enables faculty and students to carry out research initiatives without the burden of maintaining infrastructure, software, and dedicated technicians. The program’s services are managed by OIT’s Academic & Research Technologies department and include physical hosting, system management infrastructure, high-speed scratch storage, home directory space, commodity networking, and common HPC software such as RedHat Enterprise Linux, VASP, LAMMPS, BLAST, Matlab, Mathematica, and Ansys Fluent. Various compilers, math libraries and other middleware are available for those who author their own codes. All of these resources are designed and offered with the specific intention of combining intellect with efficiency, in order to advance the research presence here at Tech.

There are many ways to participate with PACE.  With a common infrastructure, we support clusters dedicated to individual PIs or research groups, clusters that are shared amongst participants and our FoRCE Research Computing Environment (aka “The FoRCE”).  The FoRCE is available to all campus users via a merit-based proposal mechanism.

The April 30 HPC Town Hall is open to members of the Tech research community and will feature presentations on the successes and challenges that PACE is currently experiencing, followed by a panel discussion and Q&A.

For more information on the PACE program, visit the official website at www.pace.gatech.edu, and also the program’s blog at blog.pace.gatech.edu.

Agenda (To Be Finalized Soon)

  • Message from Georgia Tech’s CTO Ron Hutchins
  • Message from PACE’s director Neil Bright
  • Lightning Talks By Faculty
  • Discussion around technologies and capabilities currently under investigation by PACE
  • Panel Discussion regarding future directions for PACE
  • Question and Answer Session

Account related problems on 03/14/2013

We experienced some account management difficulties today (03/14/2013), mostly caused by exceeding the capacity of our database. We found the cause and fixed all of the issues. 

This problem might have affected you in two ways: first, temporary problems logging in to the headnodes, and second, failures of some recently allocated jobs on compute nodes. As far as we know, no running jobs were affected.

We apologize for any inconvenience this might have caused. If you have experienced any problems, please send us a note (pace-support@oit.gatech.edu).

PACE Debugging and Profiling Workshop on 03/21/2013

Dear PACE community,

We are happy to announce the first Debugging and Profiling Workshop, which will take place on 03/21/2013, 1pm-5pm, in the Old Rich Building Conference Room (ITDC 242).

If your code is crashing, hanging, producing inaccurate results, or running unbearably slow, you *do* want to be there. We will go over text and GUI based tools that are available on the PACE clusters, including gdb, valgrind, DDT, gprof, PAPI and TAU. There will be hands-on examples, so bring your laptop if you can, although it is not mandatory.
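As a taste of the workflow we will cover, here is a minimal command-line sketch of debugging a small program (the module name and program are placeholders, not the exact examples we will use in class):

$ module load gcc                        # placeholder: load your usual compiler module
$ gcc -g -O0 myprog.c -o myprog          # -g adds debug symbols, -O0 keeps the code easy to step through
$ gdb ./myprog                           # step through the program interactively
$ valgrind --leak-check=full ./myprog    # check for memory errors and leaks

The workshop will cover the GUI-based DDT debugger and the profiling tools (gprof, PAPI, TAU) in the same hands-on style.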

If you bring a laptop to follow the hands-on examples, please make sure that you have:

  • An active PACE account with access to one of the RHEL6 queues
  • Access to “GTwifi”
  • A terminal client to login (PuTTy for windows, Terminal for Mac)
  • A text editor that you are comfortable with (Vim, Emacs, nano, …)

Don’t worry if your laptop is not configured to access the PACE clusters. I will be in the conference room half an hour early to help you prepare for the session. Just show up a bit early with your laptop, and we will take care of the rest together 🙂

Please RSVP (to mehmet.belgin@oit.gatech.edu) by 03/19/2013 and include your GT username. Your RSVP will guarantee a seat and a printed copy of the course material. You will also be able to fetch an electronic copy (including all the slides and code) anytime by running a simple command on the cluster (we will do that during the class).

Here’s the full schedule:

  • 12:30pm -> 1:00pm : (Optional) Help session to make sure your laptop is ready for the workshop
  •  1:00pm -> 2:45pm : Debugging session (gdb, valgrind, DDT)
  •  2:45pm -> 3:15pm : Break
  •  3:15pm -> 5:00pm : Profiling session (gprof, PAPI, TAU)

The location is the Old Rich Building, ATDC conference room #242. Google Maps knows the building as “258 4th Street”. We are right across from the Clough Commons Building.

We look forward to seeing you there!

Breaking news from NSF

Looks like Dr. Subra Suresh will be stepping down from his position as Director of NSF, effective late March to become the next President of Carnegie Mellon.

Click the link here to download a copy of his letter to the NSF community: Staff Letter 2-4-13.

Interesting times are ahead for both NSF and DOE.

New and Updated Software: Portland Group Compiler and ANSYS

Two new sets of software have been installed on PACE-managed systems – PGI 12.10 and ANSYS 14.5 service pack 1.

PGI 12.10

The Portland Group, Inc. (a.k.a. PGI) makes software compilers and tools for parallel computing, including optimizing parallel FORTRAN 2003, C99 and C++ compilers and tools for workstations, servers and clusters running Linux, MacOS or Windows operating systems.

This version of the compiler supports the OpenACC GPU programming directives.
More information can be found at The Portland Group website.
Information about using this compiler with the OpenACC directives can be found at PGI Insider and OpenACC.

Usage Example

$ module load pgi/12.10
$ pgfortran example.f90
$ ./a.out
Hello World
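For OpenACC code, the GPU directives are enabled at compile time; here is a minimal sketch (saxpy.f90 stands in for any Fortran source containing OpenACC directives):

$ module load pgi/12.10
$ pgfortran -acc -Minfo=accel saxpy.f90 -o saxpy
$ ./saxpy

The -acc flag turns on the OpenACC directives, and -Minfo=accel asks the compiler to report which loops it was able to offload to the accelerator.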

ANSYS 14.5 Service Pack 1

ANSYS develops, markets and supports engineering simulation software used to foresee how product designs will behave and how manufacturing processes will operate in real-world environments.

Usage Example

$ module load ansys/14.5
$ ansys145