Posts

Update on widespread drive failures

After consulting with members of the GT IT community (special thanks to Didier Contis for raising awareness of the issue) and with our vendor, we have identified the cause of the high rate of disk failures plaguing storage units purchased a little more than a year ago.

The firmware running on the internal backplanes of the storage arrays needed an update, and performance and availability improved greatly as soon as the update was applied. This firmware is normally maintained by the manufacturer only and isn’t readily available to the public the way controller firmware is, which caused some additional delay before repairs.

That said, we have retained the firmware and the software used to apply it, in case other units have issues in the future.

Physical host failure for VMs – potential job impact

This morning (between approximately 3am and 8am), one of the physical hosts that make up our VM farm failed. This failure took several head nodes offline, as well as one of the PACE-run license servers for software.

**********
For ALL PACE-run clusters, it would be wise to double-check any jobs that started this morning or were running during this window, in case they lost their connection to the license server.
**********
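
If you want a quick first pass over your job output, something like the sketch below may help. It assumes that your jobs write their output to files under a single directory and that a failed license checkout leaves the word “license” somewhere in that output. Both are assumptions, since the exact message depends on the application, and the function name find_license_errors is made up for this example.

```shell
#!/bin/sh
# Sketch: list job output files that mention a license problem.
# Assumptions (not from PACE): output lives under one directory, and the
# application prints the word "license" when a checkout fails.

find_license_errors() {
    dir="${1:-.}"            # directory to search (default: current directory)
    pattern="${2:-license}"  # search term (default is only a guess)
    # -r: recurse into subdirectories
    # -i: ignore case
    # -l: print only the names of files that contain a match
    grep -ril -- "$pattern" "$dir"
}

# Example call (the path is a placeholder for your own job directory):
#   find_license_errors "$HOME/my_job_dir"
```

Any file it lists is only a candidate. Open it and read the message before deciding whether the job needs to be rerun.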

The following head nodes went offline, but have returned:
cygnus-6
granulous
megatron
microcluster
mps
rozell
testflight-6

The following license server went offline, but has returned:
license-gt

In the case of the head nodes, no jobs should have been affected and no data should have been lost while the nodes were offline.

PACE Lecture Series: New Courses Scheduled

PACE is once again offering its lecture and training courses.
Over the next two months, we are offering 4 courses:

  1. Introduction to Parallel Programming with MPI and OpenMP (July 22)
  2. Introduction to Parallel Application Debugging and Profiling (June 17)
  3. A Quick Introduction To Python (June 24)
  4. Python For Scientific Computing (July 8)

For details about where and when, or to register your attendance (each class is limited to 30 seats), visit our PACE Training page.

Disk failure rate spike

Hey everyone,

We’ve noticed an increase in a type of disk failure on some of the storage nodes that ultimately has a severe negative impact on storage performance. In particular, we observe that certain models of drives in certain manufacturing date ranges seem to be more prone to failure.

As a result, we’re looking more closely at our logs to track how widespread this is. Most of the older storage seems fine; the failures are concentrated in some of the newer storage using both 2TB and 4TB drives. The 2TB failures are the more surprising to us, since that model line has generally performed as expected, and many older storage units use the same drives without these issues.

We are also engaging our vendor to see if this is something that they are seeing elsewhere, and we are keeping a close eye on our stock of replacements to deal with these failures.

Storage slowdowns due to failing disks

CLUSTERS INVOLVED: emory/tardis, ase1

Hey folks,

We’ve gone ahead and replaced some disks in your storage, as the type of failure they are generating right now causes dramatic slowdowns in I/O performance for the disk arrays.

As a result of the replacements, the arrays will remain slow for roughly 5 hours while they rebuild to restore the appropriate redundancy.

We’ll be keeping an eye on this problem, as we have recently noticed a spike in the number of these events.

Big Data Week 2014 – Atlanta

A big Big Data event with many interesting speakers. Food and drinks will be served!
You can RSVP following the link:

http://meetup.com/Atlantas-Big-Data-Week-2014/

(needs a meetup account)

Monday, May 5, 2014 to

250 14th Street, NW , Atlanta, GA 30361

Parking is free in the deck.

Panelists

• Moderator – Bloomberg’s Duane Stanford

• Delta’s Russell Pierce, Managing Director of Customer Data & Analytics

• Home Depot’s Steven Einbender, Lead Advanced Analytics Architect

• Children’s Healthcare of Atlanta’s Michael Thompson, VP of Business Intelligence

• AirWatch’s John Marshall, CEO

• Weather Company’s Eli Phetteplace, Director of Enterprise Data

Agenda

• 5-7 – networking and sponsor showcase

• 7-7:30 – introductions and intro to Big Data Week

• 7:30-8:30 – Keynote panel

• 8:30-9 – Q&A and closing remarks

SC14 Program Offers Immersive Program in HPC for Undergrads

Applications are now being accepted for Experiencing HPC for Undergraduates, a program designed to introduce high performance computing (HPC) research topics and techniques to undergraduate students at the sophomore level and above. The program introduces various aspects of HPC research at the SC14 Conference to increase awareness of opportunities to perform research as an undergraduate and potentially in graduate school or in a job related to HPC topics in computer science and computational science.

SC14 will be held Nov. 16-21, 2014 in New Orleans. Complete conference information can be found at: http://sc14.supercomputing.org

The Experiencing HPC for Undergraduates Program contains selected parts of the main SC Technical Program, with several additional elements. Special sessions include panels with current graduate students in HPC areas to discuss graduate school and research, and panels with senior HPC researchers from universities, government and industrial labs to discuss career opportunities in HPC fields.

Prof. Jeff Hollingsworth, co-chair of Experiencing HPC for Undergraduates, discusses the program and the need to develop the next generation of HPC professionals in an HPCwire podcast at: http://www.hpcwire.com/soundbite/toward-next-generation-hpc-professionals/

Applications must be submitted using the SC14 submission site at https://submissions.supercomputing.org/. The deadline to apply is Sunday, June 15.

ANSYS version 15 and Matlab R2014a installed

ANSYS version 15 and Matlab version R2014a have been installed on PACE clusters.
To see examples of how to properly load and use the new versions, execute the following commands and follow the instructions provided.

$ module help ansys/15.0

$ module help matlab/r2014a

If you have any problems executing the examples given by “module help”, please contact pace-support@oit.gatech.edu

Mvapich2 2.0rc1 available in PACE repository

We have installed the most recent Mvapich2 stack (2.0rc1), which is available via module “mvapich2/2.0rc1”. Please see this changelog if you would like to know more about the improvements this version provides.
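
As a rough sketch, switching to the new stack in your own session could look like the snippet below. It assumes the “module” command is set up in your shell, as it normally is on the head nodes. The function name use_mpi_stack is made up for this example; only the module name mvapich2/2.0rc1 comes from this post.

```shell
#!/bin/sh
# Sketch: load an MPI stack through the "module" command and show what is loaded.
# Assumption: "module" is available in the current shell.

use_mpi_stack() {
    stack="$1"
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not found; run this on a cluster head node" >&2
        return 1
    fi
    # Load the requested stack and stop if that fails.
    module load "$stack" || return 1
    # Show the currently loaded modules so you can confirm the switch.
    module list
}

# Example call:
#   use_mpi_stack mvapich2/2.0rc1
```

If you have another MPI module loaded already, you may need to unload it first so that the two stacks do not conflict.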

Also, please note that we have not started rebuilding any applications with this stack yet. If you think it will provide significant benefits for an existing application, please email us at pace-support@oit.gatech.edu and we will be happy to recompile that application for you.

Another quick note: versions mvapich1.6 to mvapich1.8 are known to have performance problems, which are fixed with 1.8 (hint: search for “Georgia Institute of Technology” in the changelog). We are still keeping them in the repository for backwards compatibility, but please refrain from using these old versions if you can.

Happy computing!

Linux Cluster Institute Workshop

FYI –

Please pass this along to anybody you think may be interested. You may see some familiar faces there! 😉
We haven’t fleshed out all of the details yet, but registration is likely to be somewhere in the $200-$300 range, which is pretty reasonable for a week’s worth of training. The official announcement follows below.

–Neil Bright

Save the date and plan to attend!
Linux Cluster Institute (LCI) Workshop
August 4-5, 2014
National Center for Supercomputing Applications (NCSA)
Urbana, Illinois

If you are a user of HPC or are responsible for maintaining an HPC resource, this is the workshop for you!  In just four days you will learn:

  • How to be an HPC cluster system administrator
  • How to be an effective HPC cluster user
  • The key issues of HPC
  • Current and emerging HPC hardware and software technologies

All sessions are taught by some of the world’s leading experts in HPC.
Program details and registration information coming soon!
www.linuxclusterinstitute.org