[updated] Scratch Storage and Scheduler Concerns

Scheduler

The move to the new server for the workload scheduler seems to have gone well.  We haven’t received much user feedback, but what we have received has been positive, and it matches our own observations.  Presuming things continue to go well, we will relax some of our rate-limiting tuning parameters on Thursday morning.  This shouldn’t cause any interruptions (not even to submitting new jobs), but should allow the scheduler to start new jobs at a faster rate.  The net effect should be to reduce the wait times some users have been seeing.  We’ll increase these parameters slowly and monitor for bad behavior.

Scratch Storage

The story of the Panasas scratch storage is not going as well.  Last week, we received two “shelves” worth of storage to test.  (For comparison, we have five in production.)  Over the weekend, we put these through synthetic tests designed to mimic the workload that causes the production system to fail.  The good news is that we were able to replicate the problem in the testbed.  The bad news is that the highly anticipated new firmware provided by the vendor still does not fix the issues.  We continue to press Panasas aggressively for a resolution and are looking into contingency plans, including alternate vendors.  Given that we are five weeks out from our normal maintenance day and have no viable fix to deploy, an emergency maintenance between now and then seems unlikely at this point.

RFI-2012, a competitive vendor selection process

Greetings GT community,

PACE is in the midst of our annual competitive vendor selection process. As outlined on the “Policy” page of our web site, we have issued a set of documents to various state contract vendors; this time around, they are Dell, HP, IBM and Penguin Computing. These documents contain general specifications based on the computing demand we anticipate from the faculty over the next year. I’ve included a link to the documents (GT login required) below. Please bear in mind that these specs are not intended to limit configurations you may wish to purchase, but rather to normalize vendor responses and help us choose a vendor for the next year.

The document I’m sure you will be most interested in is the timeline. The overall timeline has not been published to the vendors, and I would appreciate it if it were kept confidential. The first milestone, which obviously has been published, is that responses are due to us by 5:00pm today. The next step is for us to evaluate those responses. If any of you are interested in commenting on them, please let me know; your feedback is appreciated.

Please watch this blog, as we will post updates as we move through the process.  We already have a number of people interested in a near-term purchase.  If you are as well, or you know somebody who is, now is the time to get the process started.  Please contact me at your convenience.

 

--
Neil Bright
Chief HPC Architect
neil.bright@oit.gatech.edu

[updated] new server for job scheduler

As of about 3:00 this afternoon, we’re back up on the new server. Things look to be performing much better. Please let us know if you have trouble; positive reports on scheduler performance are appreciated as well.

Thanks!

–Neil Bright

——————————————————————-

[update: 2:20pm, 8/30/12]

We’ve run into a last-minute issue with the scheduler migration.  Rather than rush things going into a long weekend, we will reschedule for next week, at 2:30pm on Tuesday afternoon.

——————————————————————-

We have made our preparations to move the job scheduler to new hardware, and plan to do so this Thursday (8/30) afternoon at 2:30pm.  We expect this to be a very low-impact, low-risk change.  All queued jobs should move to the new server, and all executing jobs should continue to run without interruption.  What you may notice is a brief period during which you will be unable to submit new jobs and job queries will fail; you’ll see the usual ‘timeout’ messages from commands like msub and showq.
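
Once the move is complete, a quick way to confirm the scheduler is responding again is sketched below; ‘myjob.pbs’ is just a placeholder for your own submission script.

    showq             # a normal queue listing, rather than a timeout message, means the scheduler is back
    msub myjob.pbs    # job submission then works as before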

As usual, please direct any concerns to pace-support@oit.gatech.edu.

–Neil Bright

Call for Proposals for Allocations on the Blue Waters High Performance Computing System

FYI – for anybody interested in applying for time on the petaflop Cray being installed at NCSA.

Begin forwarded message:

From: “Gary Crane” <gcrane@sura.org>
To: ITCOMM@sura.org
Sent: Thursday, August 9, 2012 10:51:37 AM
Subject: Call for Proposals for Allocations on the Blue Waters High Performance Computing System
The Great Lakes Consortium for Petascale Computation (GLCPC) has issued a call for proposals for allocations on the Blue Waters system. Principal investigators affiliated with a member of the Great Lakes Consortium for Petascale Computation are eligible to submit a GLCPC allocations proposal. SURA is a member of the GLCPC, and PIs from SURA member schools are eligible to submit proposals. Proposals are due October 31, 2012.

The full CFP can be found here: http://www.greatlakesconsortium.org/bluewaters.html

–gary

Gary Crane
Director, SURA IT Initiatives
phone: 315-597-1459
fax: 315-597-1459
cell: 202-577-1272

maintenance day complete, ready for jobs

We are done with maintenance day; however, some automated nightly processes still need to run before jobs can flow again.  So, I’ve set an automated timer to release jobs at 4:30am today, a little over two hours from now.  The scheduler will accept new jobs immediately, but will not start executing them until 4:30am.

 

With the exception of the following two items, all of the tasks listed in our previous blog post have been accomplished.

  • Firmware updates on the scratch servers were deferred per the strong recommendation of the vendor.
  • An experimental software component of the scratch system was not tested, due to the lack of a test plan from the vendor.

 

SSH host keys have changed on the following head nodes.  Please accept the new keys into your preferred SSH client; an example for OpenSSH users follows the list.

  • atlas-6
  • atlas-post5
  • atlas-post6
  • atlas-post7
  • atlas-post8
  • atlas-post9
  • atlas-post10
  • apurimac
  • biocluster-6
  • cee
  • critcel
  • cygnus-6
  • complexity
  • cns
  • ece
  • granulous
  • optimus
  • math
  • prometheus
  • uranus-6
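
For OpenSSH users on the command line, one way to clear the stale key is sketched below; use the host name exactly as you normally connect to it, and repeat for each head node you use.  Graphical clients handle this through their own prompts for accepting a changed key.

    ssh-keygen -R atlas-6    # remove the old entry from ~/.ssh/known_hosts
    ssh atlas-6              # reconnect and accept the new host key when prompted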

REMINDER – upcoming maintenance day, 7/17

The major activity for maintenance day is the RedHat 6.1 to RedHat 6.2 software update.  (Please test your codes!)   This will affect a significant portion of our user base.  We’re also instituting soft quotas on the scratch space.  Please see the details below.

The following are running RedHat 5, and are NOT affected:

  • Athena
  • Atlantis

The following have already been upgraded to the new RedHat 6.2 stack.  We would appreciate reports on any problems you may have:

  • Monkeys
  • MPS
  • Isabella
  • Joe-6
  • Aryabhata-6

If I didn’t mention your cluster above, you are affected by this software update.  Please test using the ‘testflight’ queue.  Jobs are limited to 48 hours in this queue.  If you would like to recompile your software with the 6.2 stack, please login to the ‘testflight-6.pace.gatech.edu’ head node.
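
As a rough sketch, an existing job script can usually be pointed at the test queue just by changing the queue line.  Everything below other than the queue name and the 48-hour cap is a placeholder, not a recommended configuration.

    #!/bin/bash
    #PBS -q testflight              # run on the RedHat 6.2 test nodes
    #PBS -l nodes=1:ppn=8           # placeholder request; please keep it modest
    #PBS -l walltime=12:00:00       # must not exceed the 48-hour limit on this queue
    cd $PBS_O_WORKDIR
    ./my_application input.dat      # placeholder for your own code and inputs

Submit it with msub as usual and compare the results against a run on your production cluster.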

Other activities we have planned are:

Relocating some project directory servers to an alternate data center on campus.  We have strong network connectivity, so this should not change the performance of these filesystems.  No user modifications are needed.  The affected filesystems are:

  • /nv/hp3 – Joe
  • /nv/pb1 – BioCluster
  • /nv/pb3 – Apurimac
  • /nv/pc1 – Cygnus
  • /nv/pc2 – Cygnus
  • /nv/pc3 – Cygnus
  • /nv/pec1 – ECE
  • /nv/pj1 – Joe
  • /nv/pma1 – Math
  • /nv/pme1 – Prometheus
  • /nv/pme2 – Prometheus
  • /nv/pme3 – Prometheus
  • /nv/pme4 – Prometheus
  • /nv/pme5 – Prometheus
  • /nv/pme6 – Prometheus
  • /nv/pme7 – Prometheus
  • /nv/pme8 – Prometheus
  • /nv/ps1 – Critcel
  • /nv/pz1 – Athena

Activities on the scratch space (no user-visible changes are expected from any of these):

  • We need to balance some users on volumes v3, v4, v13 and v14.  This will involve moving users from one volume to another, but we will place links in the old locations.
  • Run a filesystem consistency check on the v14 volume.  This has the potential to take a significant amount of time.  Please watch the pace-availability email list (or this blog) for updates if this will take longer than expected.
  • Firmware updates on the scratch servers, to resolve some crash & failover events that we’ve been seeing.
  • Institute soft quotas.  Users exceeding 10TB of usage on the scratch space will receive automated warning emails, but writes will be allowed to proceed.  Currently, this will affect 6 of 750+ users.  The 10TB threshold represents about 5% of a rather expensive shared 215TB resource, so please be cognizant of the impact on other users (a quick way to check your own usage follows this list).
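
For those who want to check their usage: the sketch below assumes your scratch directory is reachable at ~/scratch, so adjust the path if yours lives elsewhere.

    du -sh ~/scratch    # rough total of your scratch usage; can take a while on large directory trees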

Retirement of old filesystems.  User data will be moved to alternate filesystems.  Affected filesystems are:

  • /nv/hp6
  • /nv/hp7

Performance upgrades (hardware RAID) for NFSroot servers for the Athena cluster. Previous maintenance activities have upgraded other clusters already.

Moving some filesystems off of temporary homes and onto new servers.  Affected filesystems are:

  • /nv/pz2 – Athena
  • /nv/pb2 – Optimus

If time permits, we have a number of other “targets of opportunity”:

  • relocate some compute nodes and servers, removing retired systems
  • reworking a couple of Infiniband uplinks for the Uranus cluster
  • add resource tags to the scheduler so that users can better select compute node features/capabilities from their job scripts (a sketch of what this could look like follows this list)
  • relocate a DNS/DHCP server for geographic redundancy
  • fix system serial numbers in the BIOS for asset tracking
  • test a new Infiniband subnet manager to gather data for future maintenance day activities
  • rename some ‘twin nodes’ for naming consistency
  • apply BIOS updates to some compute nodes in the Optimus cluster to facilitate remote management
  • test an experimental software component of the scratch system.  Panasas engineers will be onsite to do this and revert before going back into production.  This will help gather data and validate a fix for some other issues we’ve been seeing.
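
Regarding the resource-tag item above: once tags are in place, requesting a node feature from a job script would look something like the line below.  The tag name ‘bigmem’ is purely hypothetical; actual tag names will be announced when the feature is available.

    #PBS -l nodes=1:ppn=8:bigmem    # 'bigmem' stands in for a real tag name to be announced later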

upcoming maintenance day, 7/17 – please test your codes

It’s that time of the quarter again, and all PACE-managed clusters will be taken offline for maintenance on July 17 (Tuesday). All jobs that would not complete by then will be held by the scheduler, and will be released once the clusters are up and running again, requiring no further action on your end. If you find that your job does not start running, you may want to check its walltime to make sure it does not extend past this date.
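
Two standard scheduler commands can help with that check; the job ID shown is a placeholder, and the exact output columns may vary.

    showq -u $USER     # lists your jobs along with their requested wallclock limits
    checkjob 123456    # detailed view of a single job, including its walltime request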

With this maintenance, we are upgrading our RedHat 6 clusters to RedHat 6.2, which includes many bugfixes and performance improvements. This version is known to provide better software and hardware integration with our systems, particularly with the 64-core nodes we have been adding over the last year.

We are doing our best to test existing codes with the new RedHat 6.2 stack. In our experience, codes currently running on our RedHat 6 systems continue to run without problems. However, we strongly recommend that you test your critical codes on the new stack. For this purpose, we have renovated the “testflight” cluster to include RedHat 6.2 nodes, so all you need to do for testing is submit your RedHat 6 jobs to the “testflight” queue. If you would like to recompile your code, please login to the testflight-6.pace.gatech.edu head node. Please try to keep problem sizes small, since this cluster only includes ~14 nodes with core counts varying from 16 to 48, plus a single 64-core node. We have limited this queue to two jobs at a time from a given user. We hope the testflight cluster will be sufficient to test drive your codes, but if you have any concerns, or notice any problems with the new stack, please let us know at pace-support@oit.gatech.edu.

We will also upgrade the software on the Panasas scratch storage. We have observed many ‘failover’ events that result in brief interruptions of service under high load, potentially incurring performance penalties on running codes. The new version is expected to help address these issues.

We have new storage systems for Athena (/nv/pz2) and Optimus (/nv/pb2). During maintenance day, we will move these filesystems off of temporary storage, and onto their new servers.

More details will be forthcoming on other maintenance day activities, so please keep an eye on our blog at http://blog.pace.gatech.edu/

Thank you for your cooperation!

-PACE Team

scratch space improvements

While looking into some reports of lower-than-expected performance from the scratch space, we have found and addressed some issues.  We enlisted the help of a support engineer from Panasas, who helped us identify a few configuration improvements.  These were applied last week, and we expect to see improvements in read/write speed.

If you notice differences in the scratch space performance (positive or negative!) please let us know by sending a note to pace-support@oit.gatech.edu.

reminder – electrical work in the data center

Just a quick reminder that Facilities will be doing some electrical work in the data center, unrelated to PACE, tomorrow.  We’re not expecting any issues, but there is a remote possibility that this work could interrupt electrical power to various PACE servers, storage and network equipment.

FYI – upcoming datacenter electrical work

In addition to our previously scheduled maintenance day activities next Tuesday, the datacenter folks are scheduling another round of electrical work during the morning of Saturday, 4/21.  Like last time, this should not affect any PACE-managed equipment, but just in case….