PACE clusters ready for research

Our quarterly maintenance is now complete, and the clusters are running previously submitted jobs and awaiting new submissions.

We successfully completed the following:

  • Athena has been fully migrated to RedHat 6.3
  • The BioCluster /nv/pb4 filesystem has been migrated to the DDN space
  • All of our Solaris storage servers have been patched
  • Firewall upgrades are complete
  • Electrical distribution repairs to the compute node racks are complete
  • The DDN software/firmware update is complete
  • VMware updates are complete
  • The mathlocal collection of software has been migrated to /nv/pma1

However, we were unable to complete the upgrade of the TestFlight cluster to RedHat 6.5.  At the moment TestFlight is down, and we will complete the upgrade over the next couple of days.

As always, please contact us (pace-support@oit.gatech.edu) with any problems or concerns you may have. Your feedback is very important to us, especially regarding file transfers between your workstations and the PACE clusters.

PACE quarterly maintenance – April 15-16 2014

PACE Quarterly maintenance has begun

See this space for updates.

PACE Quarterly maintenance notification

It’s time again for our quarterly maintenance.  We will have the clusters down April 15 & 16.

As usual, we’ve instructed the schedulers to avoid running jobs that would cross into a planned maintenance window.  This will prevent running jobs from being killed, but it also means that jobs you submit now may not run until after maintenance completes.  I would suggest checking the wall times of the jobs you will be submitting and, if possible, modifying them so they complete before the maintenance window begins. Submitting jobs with longer wall times is still OK, but they will be held by the scheduler and released after maintenance completes.
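
For reference, the wall time is set in the job script itself.  Below is a minimal Torque/Moab-style sketch; the job name, queue, resource counts, and program are placeholders chosen for illustration, not PACE-specific recommendations:

    #!/bin/bash
    #PBS -N my_analysis
    #PBS -q force-6
    #PBS -l nodes=1:ppn=8
    #PBS -l walltime=24:00:00

    # The walltime above is only an example: pick a value small enough that the
    # job can finish before maintenance begins on April 15, or expect the job to
    # be held until maintenance completes.
    cd $PBS_O_WORKDIR
    ./my_program    # placeholder for your actual executable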

Much of our activity this time around is not directly visible, with a couple of notable exceptions.

We will be upgrading the operating system on our TestFlight cluster from RedHat 6.3 to RedHat 6.5.  Please do test your codes on this cluster over the coming weeks and months, as we plan to roll it out (along with any needed fixes) to all other RedHat 6 clusters in July.  This update is expected to bring some performance improvements, as well as some critical security fixes.  Additionally, it adds support for the Intel Ivy Bridge platform, which many of you are ordering.  Any new Ivy Bridge platforms will start with RedHat 6.5.

Other user-visible changes include:

  • conclude the migration of the Athena cluster to RedHat 6.3.  We’ll plan to take Athena to 6.5 in July.
  • conclude the migration of the BioCluster /nv/pb4 filesystem to the DDN/GPFS space.
  • migrate mathlocal from /nv/hp24 to /nv/pma1 (Math cluster project space)
  • application of recommended and security patches to our Solaris storage systems.  This is a widespread update that will affect filesystems that start with /nv.  A rapid reversion process is available should unanticipated events occur.
  • firewall upgrades to increase bandwidth between PACE and campus

Not so apparent changes include:

  • repairing some electrical distribution to compute node racks
  • minor software/firmware update to DDN to enable support of DDN/WOS evaluation
  • updates to VMware “hardware” levels, enabled by previous migration to VMware 5.1

As always, please follow our blog for communications, especially for announcements during our maintenance activities – and let us know of any concerns via pace-support@oit.gatech.edu.

images requested for annual CASC brochure

The time has come again to gather images for the annual CASC brochure. CASC is the Coalition for Academic Scientific Computation, and GT is a member institution. We use the brochure in our advocacy efforts at the funding agencies and in D.C. Previous brochures are online at http://casc.org/research-publications.

If you have something you would be interested in sharing, please let me know. Below is some text from the CASC regarding what they are looking for.

This year marks the 25th anniversary of CASC and we want to recognize that milestone in the new brochure. If you have historical pictures, scientific visualizations and/or stories that can help us illustrate how CASC and HPC have evolved over the years, please start gathering those now. We will set up a website soon where you can upload your images and text. We hope to have everything we need by June 1, 2014.

As always, we are looking for high-quality images and stories that illustrate the impact of HPC and related technologies. The more we have the better, but we are especially interested in images and stories about research and accomplishments in Energy, Health and Medicine, Industrial Innovation, Environment and Natural Resources, Matter and the Universe, Education and Outreach, and Big Data. More information about how to upload your images and text will be sent shortly. The deadline will be earlier this year: June 15, 2014.

Call For Papers – XSEDE14

Greetings all,

XSEDE14 is coming up soon and has issued its call for participation.  Please note that this conference is being held in Atlanta!

Selected papers from all tracks will be invited to extend the manuscripts to be considered for publication in a special issue of the journal Concurrency and Computation: Practice and Experience.  Papers accepted for the “Education, Outreach, and Training” track will be invited to extend the manuscripts for publication in the Journal of Computational Science Education.

Abstracts are due March 15.  Please see https://www.xsede.org/xsede14 for further information.

Reminder – January Maintenance

Hi folks,

Just a reminder of our upcoming maintenance activities next week. Please see my previous blog post here: http://blog.pace.gatech.edu/?p=5449 for details.

In addition to the items described in the previous post, we will also be fixing up quotas on home and project directories for some users who have no quotas applied. Per policy, all users should have a 5GB quota on their home directory.  A preliminary look through our accounts indicates that only one or two users have no quota applied here and are over the 5GB limit.  We will be in touch with those users shortly to address the issue.  Project directory quotas are sized at the discretion of the faculty.  For those users without a quota on their project directory, we will apply a quota sufficiently sized that all users remain under it.  After the maintenance day, we will provide a report to faculty detailing the project directory usage of their users, and work with them to make any adjustments needed.  Remember, the project directory quotas are simply intended to prevent accidental consumption of space that would negatively impact the work of other users of that storage.
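
If you would like to see where your usage stands before these quotas take effect, the standard tools are sufficient.  The commands below are a generic sketch; the project path shown is a made-up example, so substitute your own home and project directories:

    # Show quota limits and current usage on filesystems that enforce quotas
    quota -s

    # Or total up usage directly; this can take a while on large directory trees
    du -sh $HOME
    du -sh /nv/pf2/$USER    # hypothetical project path; substitute your own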

Related to the home & project quotas, I’d also like to give you a heads up about some upcoming adjustments to the scratch space quotas.  Current policy is a 10TB soft quota and a 20TB hard quota.  Given the space problems we’ve been having with the scratch, we will be adjusting this to a 5TB soft quota and a 7TB hard quota.  This change should only affect a small handful of users.  Given the close proximity to our maintenance next week, we will be making this change at the end of January, NOT next week.  This is an easy first step that we can take to start addressing the recent lack of space on scratch storage.  We are looking at a broad spectrum of other policy and technical changes, including changing retention times, improving our detection of “old” files, as well as increasing capacity.  If you have any suggestions for other adjustments to scratch policy, please feel free to let me know.  Please remember that the scratch space is intended for transient data – not as a long term place to keep things.
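
If you want a rough idea of how much of your scratch usage is old data, something like the commands below works.  The ~/scratch path and the 60-day threshold are assumptions chosen for illustration, not a statement of PACE retention policy:

    # List your largest scratch files that have not been accessed in 60+ days
    find ~/scratch -type f -atime +60 -printf '%12s  %p\n' 2>/dev/null | sort -rn | head -20

    # Rough total of that old data, in GB
    find ~/scratch -type f -atime +60 -printf '%s\n' 2>/dev/null \
        | awk '{ total += $1 } END { printf "%.1f GB\n", total / 1e9 }'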

Finally, we will also be completing the upgrade of the remaining RHEL5 portions of the Atlas cluster to RHEL6.  Likewise, we will continue the migration of the Athena cluster from RHEL5 to RHEL6, leaving only a few nodes as RHEL5.

 

–Neil Bright

PACE quarterly maintenance – 2 days; January 2014

…back to regularly scheduled events.

Our next maintenance window is fast approaching.  We will continue the 2-day downtimes, with the next one occurring Tuesday, January 14 and Wednesday, January 15.  The list of major changes is small this time around, but impactful.

The largest change, affecting all clusters, is a major update to the Moab & Torque scheduling system that is used to schedule and manage your jobs.  The upgraded versions fix a number of long-standing problems and scaling issues with command-timeouts, stability, and processing large job-sets.

The testflight cluster has been updated and is available to anyone who wishes to test their submission processes against the new versions.  In many cases, the processes used to submit and query your jobs will remain the same.  For some, a change in the way you use the system may be required.  You will still be able to accomplish the same things, but you may need to use different commands to do so.

We have updated our usage documentation to include a simple transition guide here.

In addition to the guide, we have also written a FAQ, which can be viewed by running the command ‘jan2014-faq’ after logging in.

Because of the version differences between the old software and the new software, we will unfortunately not be able to preserve any jobs that are still in a queued state once maintenance begins. If you have any queued jobs going into maintenance, then you will need to resubmit them after maintenance.
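
One way to make resubmission easier is to record what you have queued before the maintenance window starts.  The sketch below assumes the standard Torque qstat/qsub commands and that you still have your original submission scripts; the exact qstat column layout can vary between versions, so treat it as a starting point:

    # Record which of your jobs are still queued (state "Q") before maintenance
    qstat -u $USER | awk '$10 == "Q" { print $1, $4 }' > queued_jobs_jan2014.txt

    # After maintenance completes, resubmit from your original job scripts, e.g.
    qsub my_job.pbs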

The fixes planned for January also include the following:

Infrastructure:

  • Operating System upgrades to the server running scheduling software for the “shared” clusters.  This will bring it up to the same level as the other scheduler servers.
  • Adjustments to scalability & performance parameters on our GPFS filesystem.

Optimus cluster:

  • Optimus users will have access to a new queue, ‘optimus-force-6’, as well as to the iw-shared-6 queue.

Gryphon cluster:

  • The current (temporary) head node and scheduler server will return to their roles as compute nodes for the cluster.
  • New servers will be brought into production for the head node & scheduler servers.

BioCluster:

  • Data migrations between the pb1, pb4 and DDN filesystems.  This should be transparent to users, and ease the space crunch everybody has been experiencing.

October 2013 PACE maintenance complete

Greetings!  We have completed our maintenance activities for October.  All clusters are open, and jobs are flowing.  We came across (and dealt with) a few minor glitches, but I’m very happy to say that no major problems were encountered.  As such, we were able to accomplish all of our goals for this maintenance window.

  • All project storage servers have had their operating systems updated.  This should protect against failures during high load.  Between these fixes and the networking fixes below, we believe all of the root causes of the storage problems we’ve been having recently are resolved.
  • All of our redundancy changes and code upgrades to network equipment have been completed.
  • The decentralization of job scheduling services has been completed.  You should see significantly improved responsiveness when submitting jobs or checking on the status of existing jobs.
    • Please note that you will likely need to resubmit jobs that did not have a chance to run before Tuesday.  Contrary to previously announced and intended designs, this affects the shared clusters as well.  We apologize for the inconvenience.
    • Going forward, the scheduler decentralization has a notable side effect.  Previously, any login node could submit jobs to any queue, as long as the user had access to do so.  Now, this may no longer be the case.
    • For instance, a user of the dedicated cluster “Optimus” who also had access to the FoRCE could previously submit jobs to the force queue from the Optimus head node.  Now, that user will no longer be able to do so, as Optimus and FoRCE are scheduled by different servers.
    • We believe that these cases should be quite uncommon.  If you do encounter this situation, you should be able to simply login to the other head node and submit your jobs from there.  You will have the same home, project and scratch directories from either place.  Please let us know if you have problems.
  • All RHEL6 clusters now have access to our new GPFS filesystem.  Additionally, all of the applications in /usr/local (matlab, abacus, PGI compilers, etc.) have been moved to this storage.  This should provide performance improvements for these applications as well as the Panasas scratch storage, which was the previous location of this software.
  • Many of our virtual machines have been moved to different storage.  This should provide an improvement in the responsiveness of your login nodes.  Please let us know (via pace-support@oit.gatech.edu) if you see undesirable performance from your login nodes.
  • The Atlantis cluster has been upgraded from RHEL5 to RHEL6 (actually, this happened before this week), and 31 Infiniband-connected nodes from the RHEL5 side of the Atlas cluster have been upgraded to RHEL6.  (The 32nd has hardware problems and has been shut down.)
  • The /nv/pf2 project filesystem has been migrated to a server with more breathing room.

Additionally, we were able to complete a couple of bonus objectives.

  • You’ll notice a new message when logging in to your clusters.  Part of this message is brought to you from our Information Security department, and the rest is intended to give a high-level overview of the specific cluster and the queue commonly associated with it.
  • Infiniband network redundancy for the DDN/GPFS storage.
  • The /nv/pase1 filesystem was moved off of temporary storage, and onto the server purchased for the Ase1 cluster.

PACE maintenance day, October 2013

The time has again come to discuss our upcoming quarterly maintenance.  As you may recall, our activities lasted well into the night during our last maintenance in July.  Since then, I’ve been talking with various stakeholders and the HPC Faculty Governance Committee.  Starting with our October maintenance and going forward, we will be extending our quarterly maintenance periods from one to two days.  In October, this will be Tuesday the 15th & Wednesday the 16th.  I’ve updated our schedule on the PACE web page.  Upcoming maintenance periods for January, April and July will be posted shortly.

Please continue reading below; there are a couple of items that will require user action after maintenance day.

Scheduled fixes for October include the following:

  • Project storage: We will deploy fixes for the remaining storage servers.  This will complete the roll-out of the fixes initiated during our last maintenance period.  These fixes incorporate our best known stable Solaris platform at this point.  Between these fixes and the networking fixes below, we believe this will resolve most, if not all, of the storage issues we’ve been having lately.
  • Networking updates:  We have three categories of work here.  The first is to upgrade the firmware on all of the switches in our gigabit ethernet fabric.  This should solve the switch rebooting problem.  The second item is to finish the ethernet redundancy work we didn’t complete in July.  While this redundancy work will not ensure that individual compute nodes won’t suffer an ethernet failure, it nearly eliminates single points of failure in the network itself.  No user-visible impact is expected.  We’re also planning to update the firmware on some of our smaller Infiniband switches to bring them in line with the version of software we’re running elsewhere.
  • Moab/Torque job scheduler: In order to mitigate some response issues with the scheduler, we are transitioning from a centralized scheduler server that controls all clusters (well, almost all) to a set of servers.  Shared clusters will remain on the old server, and all of the dedicated clusters [1] will be distributed across a series of new schedulers.  In all instances, we will still run the same _version_ of the software; we’ll just have a fair bit more aggregate horsepower for scheduling.  This should provide a number of benefits, primarily in the responsiveness you see when submitting new jobs and querying the status of queued or running jobs.  Provided this phase goes well, we will look to upgrade the version of the Moab/Torque software in January.
      • There are some actions needed from the user community.  Users of dedicated clusters will need to resubmit jobs that did not get started before the maintenance.  The scheduler will ensure that it does not schedule a job that would not complete before maintenance, so this will only affect jobs that were submitted but never started.  You are affected if you _do not_ have access to the iw-shared-6 queue.
  • New storage platform:  We have been enabling access to the DDN storage platform via the GPFS filesystem on all RHEL6 clusters.  This is now complete, and we are opening up the DDN for investment.  Faculty may purchase drives on this platform to expand project spaces.  Please contact me directly if you are interested in a storage purchase.  Our maintenance activities will include an update to the GPFS software which provides finer-grained options for user quotas.
  • Filesystem balancing:  We will be moving the /nv/pf2 project filesystem for the FoRCE cluster to a different server.  This will allow some room for expansion and guard against it filling completely.  We expect no user-visible changes here either: all data will be preserved, no paths will change, etc.  The data will just reside on a different physical server.
  • VMware improvements: We will be rebalancing the storage used by our virtual machine infrastructure (i.e. head nodes), and performing other related tasks aimed at improving performance and stability for these machines.  This is still an active area of preparation, so the full set of fixes and improvements remains to be fully tested.
  • Cluster upgrades:  We will be upgrading the Atlantis cluster from RHEL5 to RHEL6.  Also, we will be upgrading 32 infiniband-connected nodes in Atlas-5 to Atlas-6.

 

[1] Specifically, jobs submitted to the following queues will be affected:

  • aryabhata, aryabhata-6
  • ase1-6
  • athena, athena-6, athena-8core, athena-debug
  • atlantis
  • atlas-6, atlas-ge, atlas-ib
  • complexity
  • cssbsg
  • epictetus
  • granulous
  • joe-6, joe-6-vasp, joe-fast, joe-test
  • kian
  • martini
  • microcluster
  • monkeys, monkeys_gpu
  • mps
  • optimus
  • rozell
  • skadi
  • uranus-6

Head node problems

Head nodes to many PACE clusters are currently down due to problems with our virtual machines.  This should not affect running jobs, but users are unable to login.  PACE staff are actively working to restore services as soon as possible.

The head nodes affected are:

  • apurimac-6 – BACK ONLINE 2013/08/03 00:30
  • aryabhata-6 – BACK ONLINE 2013/08/03 00:30
  • ase1-6 – BACK ONLINE 2013/08/03 03:10
  • athena – BACK ONLINE 2013/08/03 04:45
  • atlantis – BACK ONLINE 2013/08/03 04:45
  • atlas-6 – BACK ONLINE 2013/08/03 00:40
  • cee – BACK ONLINE 2013/08/03 01:40
  • chemprot – BACK ONLINE 2013/08/03 01:40
  • complexity – BACK ONLINE 2013/08/03 01:40
  • critcel – BACK ONLINE 2013/08/03 02:00
  • ece – BACK ONLINE 2013/08/03 02:00
  • emory-6 – BACK ONLINE 2013/08/03 02:20
  • faceoff – BACK ONLINE 2013/08/03 03:10
  • granulous – BACK ONLINE 2013/08/03 03:10
  • isabella – BACK ONLINE 2013/08/03 03:10
  • kian – BACK ONLINE 2013/08/03 03:10
  • math – BACK ONLINE 2013/08/03 03:10
  • megatron – BACK ONLINE 2013/08/03 03:10
  • microcluster – BACK ONLINE 2013/08/03 03:10
  • optimus-6 – BACK ONLINE 2013/08/03 00:30
  • testflight-6 – BACK ONLINE 2013/08/03 03:10
  • uranus-6 – BACK ONLINE 2013/08/03 00:30

The following nodes will likely generate SSH host key errors upon connection, as the key-saving processes had not run on them.  Please edit your ~/.ssh/known_hosts file (Linux/Mac/Unix), remove any host entries with these names, and accept the new keys the next time you connect; one way to do this is sketched after the list.

  • ase1-6
  • chemprot
  • faceoff
  • kian
  • microcluster
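
On Linux/Mac, you can also clear a stale entry with ssh-keygen rather than editing known_hosts by hand.  This is a general OpenSSH technique, not a PACE-specific tool; if you connect using a fully qualified hostname, run the same command for that form of the name as well:

    # Remove the stale saved key for each affected host from ~/.ssh/known_hosts
    ssh-keygen -R ase1-6
    ssh-keygen -R chemprot
    ssh-keygen -R faceoff
    ssh-keygen -R kian
    ssh-keygen -R microcluster

    # The new host key will be offered (and can be saved) the next time you connect.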

Additionally, the following user-facing web services are offline:

  • galaxy

PACE maintenance – complete

We’ve finished.  Feel free to login and compute.  Previously submitted jobs are running in the queues.  As always, if you see odd issues, please send a note to pace-support@oit.gatech.edu.

We were able to complete our transition to the database-driven configuration, and apply the Panasas code upgrade.  Some of you will be seeing warning messages stemming from your utilization of the scratch space.  Please remember that this is a shared, and limited, resource.  The RHEL5 side of the FoRCE cluster was also retired, and reincorporated into the RHEL6 side.

We were able to achieve some of the network redundancy work, but this took substantially longer than planned and we didn’t get as far as we would have liked.  We’ll complete this during future maintenance window(s).

We spent a lot of time today trying to address the storage problems, but time was just too short to fully implement the fixes.  We were able to do some work to address the storage for the virtual machine infrastructure (you’ll notice this as the head/login nodes).  Over the next days and weeks, we will work on a robust way to deploy these updates to our storage servers and come up with a more feasible implementation schedule.

Among the less time-consuming items, we increased the amount of memory the Infiniband cards are able to allocate.  This should help those of you with codes that send very large messages.  We also increased the size of the /nv/pz2 filesystem; for those of you on the Athena cluster, that filesystem is now nearly 150TB.  We found some Infiniband cards that had outdated firmware and brought those into line with what is in use elsewhere in PACE.  We also added a significant amount of capacity to one of our backup servers, added some redundant links to our Infiniband fabric, and added some additional 10-gigabit ports for our growing server & storage infrastructure.

In all of this, we have been reminded that PACE has grown quite a lot over the last few years – from only a few thousand cores, to upwards of 25,000.  As we’ve grown, it’s become more difficult to complete our maintenance in four days a year.  Part of our post-mortem discussions will be around ways we can more efficiently use our maintenance time, and possibly increasing the amount of scheduled downtime.  If you have thoughts along these lines, I’d really appreciate hearing from you.

Thanks,

Neil Bright