Posts

PACE clusters ready for research

Greetings!

Our quarterly maintenance is now complete, and the clusters are running previously submitted jobs and awaiting new submissions.

Overall, the maintenance tasks were completed successfully. However, we do have some compute nodes that are still experiencing issues, and we will continue to work through those tomorrow.

As always, please contact us (pace-support@oit.gatech.edu) for any problems or concerns you may have. Your feedback is very important to us!

 

PACE quarterly maintenance – July ’14

2014-07-15 at 6am: Maintenance has begun

Hi folks,

It is time again for our quarterly maintenance. We have a bit of a situation this time around and will need to extend our activities into a third day – starting at 6:00am Tuesday, July 15 and ending at or before 11:59pm Thursday, July 17. This is a one-time event, and I do not expect to move to three-day maintenance as a norm. Continue reading below for more details.

Over the years, we’ve grown quite a bit and filled up one side of our big InfiniBand switch. This is a good thing! The good news is that there is plenty of expansion room on the other side of the switch. The bad news is that we didn’t leave a hole in the raised floor to get the cables to the other side. To rectify this, and to install all of the equipment ordered in June, we need to move the rack that contains the switch as well as some HVAC units on either side. That means unplugging a couple hundred InfiniBand connections and some ethernet fiber. Facilities will be on hand to handle the HVAC. After all the racks are moved, we’ll swap in some new raised-floor tiles and put everything back together. This is a fair bit of work, and it is the impetus for the extra day.

In addition, we will be upgrading all of the RedHat 6 compute nodes and login nodes from RHEL6.3 to RHEL6.5 – this represents nearly all of the clusters that PACE manages. This image has been running on the TestFlight cluster for some time now – if you haven’t taken the opportunity to test your codes there, please do so. This important update contains some critical security fixes to go along with the usual assortment of bug fixes.

We are also deploying updates to the scheduler prologue and epilogue scripts to more effectively combat “leftover” processes from jobs that don’t completely clean up after themselves. This should help reduce situations where jobs aren’t started because compute nodes incorrectly appear busy to the scheduler.
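For the curious, the general idea behind that kind of cleanup is straightforward: after a job finishes, look for processes on the node that still belong to the job’s owner and terminate them before the node is handed back to the scheduler. Below is a minimal Python sketch of the idea, not the actual PACE prologue/epilogue code; the system-account cutoff and the signal handling are assumptions made for illustration.

    # Minimal sketch of epilogue-style cleanup: find surviving processes owned
    # by a given user on this node and ask them to exit. This is an illustration
    # only, not PACE's real epilogue script.
    import os
    import signal

    SYSTEM_UID_CUTOFF = 1000  # assumption: UIDs below this are system accounts

    def stray_pids_for_uid(uid):
        """Return PIDs on this node owned by `uid`, excluding this process."""
        pids = []
        for entry in os.listdir("/proc"):
            if not entry.isdigit() or int(entry) == os.getpid():
                continue
            try:
                if os.stat("/proc/" + entry).st_uid == uid:
                    pids.append(int(entry))
            except OSError:
                pass  # the process exited while we were scanning
        return pids

    def cleanup(uid):
        """Send SIGTERM to leftover processes owned by one (non-system) user."""
        if uid < SYSTEM_UID_CUTOFF:
            return
        for pid in stray_pids_for_uid(uid):
            try:
                os.kill(pid, signal.SIGTERM)
            except OSError:
                pass  # already gone or not permitted

A production epilogue would run as root, map the job owner to a UID, and typically follow up with SIGKILL for anything that ignores the initial request.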

We will also be relocating some storage servers to prepare for incoming equipment. There should be no noticeable impact from this move, but just in case, the following filesystems are involved:

  • /nv/pase1
  • /nv/pb2
  • /nv/pc6
  • /nv/pcoc1
  • /nv/pface1
  • /nv/pmart1
  • /nv/pmeg1
  • /nv/pmicro1
  • /nv/py2

Among other things, these moves will pave the way for another capacity expansion of our DDN project storage, as well as a new scratch filesystem. Stay tuned for more details on the new scratch filesystem; we are planning a significant capacity and performance increase. The projected timeframe is to enter limited production during our October maintenance window and ramp up from there.

We will also be implementing some performance tuning changes for the ethernet networks that should primarily benefit the non-GPFS project storage.

The /nv/pase1 filesystem will be moved back to its old storage server, which is now repaired and tested.

The tardis-6 head node will have some additional memory allocated.

And finally, a few other minor changes:

  • Firmware updates to our DDN/GPFS storage, as recommended by DDN, along with the installation of additional disks for increased capacity.
  • The OIT Network Backbone team will be upgrading the appliances that provide DNS & DHCP services for PACE. The impact on us should be negligible, as they have already rolled out new appliances for most of campus.
  • Replacement of a fuse for the in-rack power distribution in rack H33.

— Neil Bright

Overset Grid Symposium October 6-9

A unique opportunity to meet in an intimate setting with the grid generation, solver, and post-processing tool developers prominent in the field! Past attendees include developers of FUN3D, OpenFoam, Overflow, Overgrid, Overture, SUGGAR++, and technical representatives from Pointwise, Intelligent Light, Celeritas, and more!

For more information, visit: http://www.2014.oversetgridsymposium.org/index.php

Flyer

XSEDE14 comes to Atlanta (July 13-18)

XSEDE 2014 is coming to town, and here’s their announcement (https://www.xsede.org/web/conference/xsede14)

Mark your calendars and join us in Atlanta for XSEDE14, July 13-18, 2014!

The annual XSEDE conference brings together the extended community of individuals interested in advancing research cyberinfrastructure and integrated digital services for the benefit of science and society. XSEDE14 will place a special emphasis on recruiting and engaging under-represented minorities, women, and students as well as encouraging participation by people from domains of study that do not traditionally use high-performance computing. Sessions will be structured to engage people who are new to computational science and engineering, as well as providing in-depth tutorials and high-quality peer-reviewed papers that will allow the most experienced researchers to gain new insights and knowledge.

Hotel and Registration deadlines extended!

The XSEDE14 Conference is shaping up to be an excellent event! We are pleased to announce that the hotel has extended our room block rate until June 27. To align with the extended room block, conference registration will remain at $500 for full conference participation through June 27. After June 27, the late registration fee of $600 will apply.

Update on widespread drive failures

After some consultation with members of the GT IT community (thank you in particular, Didier Contis, for raising awareness of the issue), as well as with our vendor, we have identified the cause of the high rate of disk failures plaguing storage units purchased a little more than a year ago.

An update to the firmware running on the internal backplanes of the storage arrays was necessary; performance and availability improved greatly immediately after it was applied to the arrays. This backplane firmware is normally manufacturer-maintained material and isn’t readily available to the public the way controller firmware is, which led to some additional delay before the repairs.

That said, we have retained the firmware and the software used to apply it, for future use should other units have issues.

Physical host failure for VMs – potential job impact

This morning (approximately between 3am and 8am), we suffered a failure in one of the physical hosts that make up part of our VM farm. This failure caused several head nodes to go offline, as well as one of the PACE-run license servers for software.

**********
For ALL PACE-run clusters, it would be wise to double-check your jobs: a job may have lost its license server if it kicked off this morning or was already running during this window. One quick way to check is sketched below.
**********
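If you would like a quick way to see which of your jobs were running, and when they started, something like the following Python sketch can help. It assumes a Torque/Moab-style scheduler whose qstat -f output includes Job_Owner and start_time attributes; this is only an illustration, not an official PACE tool.

    # List your jobs and their recorded start times so you can spot any that
    # overlapped the ~3am-8am outage. Assumes Torque-style `qstat -f` output;
    # attribute names may differ on other schedulers.
    import getpass
    import subprocess

    def jobs_with_start_times(user):
        out = subprocess.check_output(["qstat", "-f"], universal_newlines=True)
        jobs, current, mine = {}, None, False
        for raw in out.splitlines():
            line = raw.strip()
            if line.startswith("Job Id:"):
                current, mine = line.split(":", 1)[1].strip(), False
            elif line.startswith("Job_Owner") and current:
                mine = line.split("=", 1)[1].strip().startswith(user + "@")
            elif line.startswith("start_time") and current and mine:
                jobs[current] = line.split("=", 1)[1].strip()
        return jobs

    if __name__ == "__main__":
        for job_id, started in sorted(jobs_with_start_times(getpass.getuser()).items()):
            print("%s started %s" % (job_id, started))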

The following head nodes went offline, but have returned:
cygnus-6
granulous
megatron
microcluster
mps
rozell
testflight-6

The following license server went offline, but has returned:
license-gt

In the case of the head nodes, no jobs should have been affected and no data lost as a result of the nodes being offline.

PACE Lecture Series: New Courses Scheduled

PACE is resuming its lecture and training course offerings.
Over the next two months, we are offering four courses:

  1. Introduction to Parallel Programming with MPI and OpenMP (July 22)
  2. Introduction to Parallel Application Debugging and Profiling (June 17)
  3. A Quick Introduction To Python (June 24)
  4. Python For Scientific Computing (July 8)

For details about where and when, or to register your attendance (each class is limited to 30 seats), visit our PACE Training page.

Disk failure rate spike

Hey everyone,

We’ve noticed an increase in a type of disk failure on some of the storage nodes that ultimately has a severe negative impact on storage performance. In particular, we observe that certain models of drives in certain manufacturing date ranges seem to be more prone to failure.

As a result, we’re looking more closely at our logs to gauge how widespread this is. Most of the older storage seems fine; the failures have been concentrated in some of the newer storage using both 2TB and 4TB drives. The 2TB drives are the more surprising to us, as that model line has generally performed as expected, and many older storage units use the same drives without having these issues.
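As a rough illustration of the kind of log watching mentioned above, a small script can tally suspicious I/O error messages per device. The log path and patterns below are assumptions for the sketch; real monitoring relies on smartctl, controller logs, and vendor tools rather than syslog alone.

    # Toy sketch: count kernel I/O error lines per disk device in syslog.
    # The log location and patterns are assumptions, not PACE's actual tooling.
    import collections
    import re

    LOG_FILE = "/var/log/messages"  # assumed location
    ERROR_RE = re.compile(r"(sd[a-z]+)\b.*(I/O error|medium error)", re.IGNORECASE)

    def error_counts(path=LOG_FILE):
        counts = collections.Counter()
        with open(path) as log:
            for line in log:
                match = ERROR_RE.search(line)
                if match:
                    counts[match.group(1).lower()] += 1
        return counts

    if __name__ == "__main__":
        for device, count in error_counts().most_common():
            print("%s: %d error lines" % (device, count))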

We are also engaging our vendor to see if this is something that they are seeing elsewhere, and making sure we keep a close eye on our stock of replacements to deal with these failures.

Storage slowdowns due to failing disks

CLUSTERS INVOLVED: emory/tardis, ase1

Hey folks,

We’ve gone ahead and replaced some disks in your storage, as the type of failure they are generating right now causes dramatic slowdowns in I/O performance for the disk arrays.

As a result of the replacements, the arrays will remain slow for roughly five hours while they rebuild themselves to restore the appropriate redundancy.

We’ll be keeping an eye on this problem, as we have noticed a spike in the number of these events recently.

Big Data Week 2014 – Atlanta

A big Big Data event with many interesting speakers. Food and drinks will be served!
You can RSVP by following the link:

http://meetup.com/Atlantas-Big-Data-Week-2014/

(requires a Meetup account)

Monday, May 5, 2014, 5:00 PM to 9:00 PM

250 14th Street NW, Atlanta, GA 30361

Parking is free in the deck.

Panelists

• Moderator – Bloomberg’s Duane Stanford

• Delta’s Russell Pierce, Managing Director of Customer Data & Analytics

• Home Depot’s Steven Einbender, Lead Advanced Analytics Architect

• Children’s Healthcare of Atlanta’s Michael Thompson, VP of Business Intelligence

• AirWatch’s John Marshall, CEO

• Weather Company’s Eli Phetteplace, Director of Enterprise Data

Agenda

• 5-7 – Networking and sponsor showcase

• 7-7:30 – Introductions and intro to Big Data Week

• 7:30-8:30 – Keynote panel

• 8:30-9 – Q&A and closing remarks