PACE quarterly maintenance – January ’15

Hi everybody,

Our January maintenance window is upon us.  We’ll have PACE clusters down Tuesday and Wednesday next week, January 13 and 14.  We’ll get started at 6:00am on Tuesday, and have things back to you as soon as possible.

Major items this time around include:

  • routine patching on our DDN system that serves project directories for many users.
  • routine patching on the file server that provides the storage for /usr/local and our virtual machine infrastructure (including most head nodes)
  • firmware updates on some NFS project directory servers to address stability issues

Additionally, the Joe and Atlas cluster users have graciously offered to test out an upgraded version of the Moab/Torque scheduler software.  Presuming we have success with these two clusters, we will look to roll out the upgrades to the rest of the PACE universe during our April maintenance period.  If you use clusters other than Atlas and Joe, the rest of this announcement will not affect you next week. Users of Atlas and Joe can expect the following:

  • The upgraded scheduler uses a different database than the current version, so we will not be able to migrate already-submitted jobs.  The scheduler will start with an empty queue, and you will need to resubmit your jobs after the maintenance day (sorry for this inconvenience).
  • We will start using “node packing”, which allocates as many jobs as possible on a node before moving to the next one. With the current version, users can submit many single-core jobs that each land on a separate node, making it more difficult for the scheduler to start jobs that require entire nodes.
  • You will be able to use msub for interactive jobs (which is currently broken due to a bug), although the vendor’s recommendation is to use “qsub” for everything (we confirmed that it’s much faster than msub); see the example after this list.
  • There will no longer be a discrepancy between job IDs generated by msub (Moab.###) and qsub (####). You will always see a single job ID (in plain number format) regardless of your msub/qsub preference.
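
As an illustration of the qsub recommendation above, here is how an interactive session could be requested after the upgrade. This is only a sketch; the queue name, resource limits, and script name are placeholders, so substitute the values appropriate for your cluster:

    # Request an interactive session with qsub (recommended over msub after the upgrade).
    # The queue name "joe" and the resource limits below are placeholders only.
    qsub -I -q joe -l nodes=1:ppn=4,walltime=02:00:00

    # Jobs that were queued before maintenance will need to be resubmitted, e.g.:
    qsub my_job.pbs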

Other improvements included in the scheduler upgrade:

  • Speed – new versions of Moab and Torque are multithreaded, making it possible for some query commands (e.g. showq) to return instantly regardless of the load on the scheduler. Currently, these commands usually time out when a user submits a large job array.
  • Introduction of cpusets. When a user is given X cores, they will not be able to use more than that. Currently, users can easily exceed their requested limits by spawning extra processes and threads, and Torque cannot do much to stop that. This will significantly reduce job interference and allow us to finally use “node packing” as explained above (see the sketch after this list).
  • Several other benefits from bug fixes and improvements, including (but not limited to) fixes for zombie processes, lost output files, missing array jobs, and long job allocation times.
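
To illustrate how cpusets interact with your resource request, the sketch below keeps a program’s thread count within the cores it asked for. The script contents and the use of the $PBS_NUM_PPN environment variable are assumptions on our part; if that variable is not available on your cluster, set the thread count by hand to match your ppn request:

    #PBS -l nodes=1:ppn=8
    # With cpusets, this job is confined to the 8 cores it requested; extra threads
    # would share those 8 cores rather than spill onto the rest of the node.
    export OMP_NUM_THREADS=$PBS_NUM_PPN
    ./my_program    # placeholder for your own executable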

We hope these improvements will provide you with a more efficient and productive computing environment. Please let us know (pace-support@oit.gatech.edu) if you have any concerns or questions regarding this maintenance period.

PACE clusters ready for research

Greetings!

Our quarterly maintenance is now complete, and the clusters are running previously submitted jobs and awaiting new submissions.

In general, all tasks were successfully completed.  However, we do have some compute nodes that have not successfully applied the kernel update.  We will keep those offline for the moment and continue to work through those tomorrow.

As always, please contact us (pace-support@oit.gatech.edu) for any problems or concerns you may have. Your feedback is very important to us!

Georgia Tech’s HPCC Initiative Planning – Second Industry/Research Partnership Meeting

For most of you receiving this email, the Technology Square Phase Two – High Performance Computing Center (HPCC) is not a new initiative. Following up on a successful first meeting this past March, where Georgia Tech hosted over 100 GT faculty and industry partners, today I’m very happy to invite you to participate in the second planning meeting for the Tech Square Phase II/HPCC.  GT faculty and researchers who work in cloud computing, smart grid, building information modeling, big data and secure storage, and networking (data centers as well as community networking, network virtualization, etc.) will be in attendance, along with researchers working on the model of using the data center as a key part of urban sustainability in our community (heat reuse, analytics capabilities for startup companies). Researchers and current industry partners in these areas will present their interests and capabilities in a tight 5-minute presentation format. You will have an opportunity to participate in our discussion and review the ideas that have been proposed, helping to guide us in this endeavor.

Continuing on the momentum from our first planning meeting, we are hosting this second meeting at Georgia Tech on November 11th, from 8 AM until 12 PM.  This meeting will immediately precede Georgia Tech’s People and Technology Forum, which you are invited to attend as well.

An RSVP for the meeting is requested. To RSVP for this planning session, click here.

If you would like to attend the IPaT Forum as well, you can register here.

As we finalize the agenda for this meeting, we will follow up with more details.  If you have any questions, please don’t hesitate to reach out to me or the GT Corporate Relations team directly.

See you in November!
ron

Ron Hutchins, PhD
Associate Vice Provost for Research and Technology and
Chief Technology Officer
Office of the Executive Vice President for Research

A Bold New Vision For Tech Square

You may have seen or heard reference to this in other places, but I wanted to highlight some exciting things coming to Tech Square.

–Neil Bright

 

http://www.news.gatech.edu/2014/09/29/bold-new-vision-tech-square

Ron Hutchins is a man on a mission. He wants to raise the visibility of Information Technology on a university campus in ways we’ve seldom seen. Hutchins, Tech’s Associate Vice Provost for Research & Technology and Chief Technology Officer, is the visionary behind the plan to build a data center in the heart of Midtown Atlanta. He’s quick to point out though that the High Performance Computing Center is more than just a building to store equipment and disseminate data. Construction of the HPCC marks the beginning of a new phase in the expansion of Tech Square.

PACE quarterly maintenance – October ’14

Hi everybody,

Our October maintenance window is rapidly approaching.  We’ll be back to the normal two-day event this time around – Tuesday, October 21 and Wednesday, October 22.

Major items this time around include the continued expansion of our DDN storage system.  This will complete the build out of the infrastructure portions of this storage system, allowing for the addition of another 600 disk drives as capacity is purchased by the faculty.

Also, we have identified a performance regression in the kernel deployed with RedHat 6.5.  With some assistance from one of our larger clusters, we have been testing an updated kernel that does not exhibit these performance problems, and will be rolling out the fix everywhere.  If you’ve noticed your codes taking longer to run since the maintenance in July, this is very likely the cause.
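
If you’d like to confirm which kernel a node is running once the fix is in place, a quick check from a login node or inside a job is:

    # Prints the running kernel release; compare the value before and after maintenance.
    uname -r

We haven’t listed the exact version string of the updated kernel here; contact pace-support if you are unsure whether a particular node has been updated.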

We will also be migrating components of our server infrastructure to RedHat 6.5.  This should not be a user-visible event, but it is worth mentioning just in case.

Over the last few months, we’ve identified a few bad links in our network.  Fortunately, we have redundancy in place that allowed us to simply disable those links.  We will be taking corrective action to bring those links back to full redundancy.

Recent staff changes in PACE

I’m sorry to report that Dr. Wesley Emeneker has left the team for a position in industry. We are sad to see him leave, and wish him and his family the best in his future endeavors. We will be posting a Research Scientist position soon to fill this vacancy.

Ann Zhou <dzhou62@mail.gatech.edu> has joined the team as a Systems Support Engineer II. Ann joins us from Columbus State University and will be initially focused on user and hardware support, and taking over some of the system administration work that Wes had been doing.

We are concluding a search to fill the Senior System Support Engineer position vacated by Adam Munro earlier this year. An offer is pending, and I’m hopeful this person will start soon.

Finally, we have a search currently underway for an Applications Developer II. The position description is available at https://pace.gatech.edu/application-developer-ii. Please pass the word along to anybody who may have interest.

PACE clusters ready for research

Greetings!

Our quarterly maintenance is now complete, and the clusters are running previously submitted jobs and awaiting new submissions.

In general, all tasks were successfully completed.  However, we do have some compute nodes that are still having various issues.  We will continue to work through those tomorrow.

As always, please contact us (pace-support@oit.gatech.edu) for any problems or concerns you may have. Your feedback is very important to us!

 

PACE quarterly maintenance – July ’14

2014-07-15 at 6am: Maintenance has begun

Hi folks,

It is time again for our quarterly maintenance. We have a bit of a situation this time around and will need to extend our activities into a third day – starting at 6:00am Tuesday, July 15 and ending at or before 11:59pm Thursday, July 17. This is a one-time event, and I do not expect to move to three-day maintenance as a norm. Continue reading below for more details.

Over the years, we’ve grown quite a bit and filled up one side of our big InfiniBand switch. This is a good thing! The good news is that there is plenty of expansion room on the other side of the switch. The bad news is that we didn’t leave a hole in the raised floor to get the cables to the other side of the switch. To rectify this, and to install all of the equipment that was ordered in June, we need to move the rack that contains the switch as well as some HVAC units on either side. Doing so requires unplugging a couple hundred InfiniBand connections and some ethernet fiber. Facilities will be on hand to handle the HVAC. After all the racks are moved, we’ll swap in some new raised-floor tiles and put everything back together. This is a fair bit of work, and is the impetus for the extra day.

In addition, we will be upgrading all of the RedHat 6 compute nodes and login nodes from RHEL6.3 to RHEL6.5 – this represents nearly all of the clusters that PACE manages. This image has been running on the TestFlight cluster for some time now – if you haven’t taken the opportunity to test your codes there, please do so. This important update contains some critical security fixes to go along with the usual assortment of bug fixes.
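
If you’d like to try your codes on TestFlight before the upgrade reaches your cluster, a minimal submission looks something like the sketch below. The queue name “testflight”, the script name, and the resource values are assumptions; check with pace-support for the exact queue to use:

    # Submit a short test job to the TestFlight cluster (queue name is an assumption).
    qsub -q testflight -l nodes=1:ppn=4,walltime=01:00:00 my_test_job.pbs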

We are also deploying updates to the scheduler prologue and epilogue scripts to more effectively combat “leftover” processes from jobs that don’t completely clean up after themselves. This should help reduce situations where jobs aren’t started because compute nodes incorrectly appear busy to the scheduler.
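
For the curious, an epilogue script runs on each compute node as a job ends and can clean up anything the job left behind. The sketch below is purely illustrative and is not the actual PACE script; it only shows the general idea, and it assumes the node is not shared with other jobs from the same user:

    #!/bin/bash
    # Illustrative sketch only -- not the actual PACE epilogue script.
    # Torque passes the job id as $1 and the job owner's username as $2.
    JOBID=$1
    JOBUSER=$2
    # Terminate any processes still owned by the job's user so the node does not
    # appear busy to the scheduler after the job has finished.
    pkill -9 -u "$JOBUSER" || true
    exit 0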

We will also be relocating some storage servers to prepare for incoming equipment. There should be no noticeable impact from this move, but just in case, the following filesystems are involved:

  • /nv/pase1
  • /nv/pb2
  • /nv/pc6
  • /nv/pcoc1
  • /nv/pface1
  • /nv/pmart1
  • /nv/pmeg1
  • /nv/pmicro1
  • /nv/py2

Among other things, these moves will pave the way for another capacity expansion of our DDN project storage, as well as a new scratch filesystem. Stay tuned for more details on the new scratch, but we are planning a significant capacity and performance increase. Projected timeframe is to go into limited production during our October maintenance window, and ramp up from there.

We will also be implementing some performance tuning changes for the ethernet networks that should primarily benefit the non-GPFS project storage.

The /nv/pase1 filesystem will be moved back to its old storage server, which is now repaired and tested.

The tardis-6 head node will have some additional memory allocated.

And finally, some other minor changes – Firmware updates to our DDN/GPFS storage, as recommended by DDN, as well as installation of additional disks for increased capacity.

The OIT Network Backbone team will be upgrading the appliances that provide DNS & DHCP services for PACE. This should have negligible impact on us, as they have already rolled out new appliances for most of campus.

We will also replace a fuse for the in-rack power distribution in rack H33.

— Neil Bright

Linux Cluster Institute Workshop

FYI –

Please pass along to anybody you think may be interested.  You may see some familiar faces there!  😉
We haven’t fleshed out all of the details yet, but registration is likely to be somewhere in the $200-$300 range – pretty reasonable for a week’s worth of training. The official announcement follows below.

–Neil Bright

Save the date and plan to attend!
Linux Cluster Institute (LCI) Workshop
August 4-5, 2014
National Center for Supercomputing Applications (NCSA)
Urbana, Illinois

If you are a user of HPC or are responsible for maintaining an HPC resource, this is the workshop for you!  In just four days you will learn:

  • How to be an HPC cluster system administrator
  • How to be an effective HPC cluster user
  • The key issues of HPC
  • Current and emerging HPC hardware and software technologies

All sessions taught by some of the world’s best experts in HPC.
Program details and registration information coming soon!
www.linuxclusterinstitute.org