Hi folks,

 

Just a quick reminder that our maintenance activities are coming up on Tuesday of next week.  All PACE-managed clusters will be down for the day.  For further details, please see our blog post here.

 

Thanks!

Neil Bright

PACE maintenance day – July 16

Dear PACE cluster users,

The time has come again for our quarterly maintenance day, and we would like to remind you that all systems will be powered off starting at 6:00am on Tuesday, July 16, and will be down for the entire day.

None of your jobs will be killed; the job scheduler knows about the planned downtime and will not start any job that would still be running when it begins. If possible, please check the walltimes of the jobs you plan to submit and adjust them so they will complete before the maintenance day. Submitting jobs with longer walltimes is still fine, but the scheduler will hold them and release them right after the maintenance day.
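As an illustration only (the walltime and resource values below are hypothetical, not PACE recommendations), a Torque/Moab submission with an explicit walltime looks like this:

    # request 48 hours of walltime so the job can finish before the outage begins
    qsub -l walltime=48:00:00,nodes=1:ppn=8 my_job.pbs

The same walltime can also be set inside the job script with a #PBS -l directive.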

We have many tasks to complete; here are the highlights:

  1. transition to a new method of managing our configuration files – We’ve referred to this in the past as ‘database-based configuration makers’. We’ve done a lot of testing on this over the last few months and have things ready to go. I don’t expect this to cause any visible change to your experience; it just gives us greater capability to manage more and more equipment.
  2. network redundancy – We’re beefing up our ethernet network core for compute nodes. Again, not an item I expect to change your experience; these are infrastructure improvements.
  3. Panasas code upgrade – This work will complete the series of bug fixes from Panasas and allow us to reinstate the quotas on scratch space. We’ve been testing this code for many weeks and have not observed any detrimental behavior. This is potentially a visible change to you. We will reinstate the 10TB soft and 20TB hard quotas. If you are using more than 20TB of our 215TB scratch space, you will not be able to add new files or modify existing files in scratch (see the note after this list for a quick way to check your current usage).
  4. decommissioning of the RHEL5 version of the FoRCE cluster – This will allow us to add 240 CPU cores to the RHEL6 side of the FoRCE cluster, pushing force-6 over 2,000 CPU cores. We have been gradually shrinking this resource for some time now; this change finishes the process. Users with access to FoRCE currently have access to both the RHEL5 and RHEL6 sides; access to RHEL6 via the force-6 head node will not change as part of this process.
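If you are unsure how much scratch space you are using, a recursive disk-usage summary will tell you. The path below is only an assumption about where your scratch directory lives; substitute your actual scratch location:

    # summarize total usage under your scratch directory (path is illustrative)
    du -sh ~/scratch

Note that du can take a while on directories with many files, so running it occasionally is usually enough to stay ahead of the quota.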

As always, please contact us via pace-support@oit.gatech.edu for any questions/concerns you may have.

PACE Maintenance day complete

We have completed our maintenance day activities, and are now back into regular operation.  Please let us know (via email to pace-support@oit.gatech.edu) if you encounter problems.

 

–Neil Bright

PACE maintenance day – NEXT WEEK 4/16

The next maintenance day (4/16, Tuesday) is just around the corner and we would like to remind you that all systems will be powered off for the entire day. You will not be able to access the head nodes, compute nodes, or your data until the maintenance tasks are complete.

None of your jobs will be killed; the job scheduler knows about the planned downtime and will not start any job that would still be running when it begins. If possible, please check the walltimes of the jobs you plan to submit and adjust them so they will complete before the maintenance day. Submitting jobs with longer walltimes is still fine, but the scheduler will hold them and release them right after the maintenance day.

We have many tasks to complete, and here’s a summary:

1) Job Resource Manager/Scheduler maintenance

Contrary to the initial plan, we decided NOT to upgrade the resource manager (torque) and job scheduler (moab) software yet. We have been testing the new versions of these packages (with your help) and, unfortunately, identified significant bugs and problems along the way. Although old, the current versions are known to be robust, so we will maintain the status quo until we resolve all of the problems with the vendor.

2) Interactive login prevention mechanism

Ideally, compute nodes should not allow interactive logins unless the user has active jobs on the node. However, we noticed that some users can ssh directly to compute nodes and start jobs there. This can lead to resource conflicts and unfair use of the cluster. We have identified the problem and will apply the fix on this maintenance day.
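For reference, one common way to enforce this on Torque-based clusters is a PAM module such as pam_pbssimpleauth, which permits ssh logins only for users who have an active job on that node. The snippet below is a generic sketch of that approach, not necessarily the exact fix we will deploy:

    # /etc/pam.d/sshd (excerpt) – reject logins from users without an active job on this node
    account    required    pam_pbssimpleauth.so
    account    required    pam_access.so

In this sketch, pam_access would be configured to still allow administrative accounts to log in.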

3) continued RHEL-6 migration

We are planning to convert all of the remaining Joe nodes to RHEL6 in this cycle. We will also convert 25% of the remaining RHEL5 FoRCE nodes. We are holding off on migrating the Aryabhata and Atlas clusters at the request of those communities.

4) Hardware installation and configuration

We noticed that some of the nodes in the Granulous, Optimus, and FoRCE clusters are still running diskless, even though they have local disks. Some nodes are also not using the optimal location for their /tmp. We will fix these problems.

We received (and tested) a replacement for the fileserver for the Apurimac project storage (pb3), since we have been experiencing problems there. We will install the new system and swap the disks. This is just a mechanical process and your data is safe. As an extra precaution, we have been taking incremental backups (in addition to the regular backups) of this storage since it first started showing signs of failure.

5) Software/Configurations

We will also patch/update/add software, including:

  • Upgrade the node health checker scripts
  • Deploy new database-based configuration makers (in dry-run mode for testing)
  • Reconfigure licensing mechanism so different groups can use different sources for licenses

6) Electrical Work

We will also perform some electrical work to better facilitate the recent and future additions to the clusters. We will replace some problematic PDUs and redistribute the power among racks.

7) New storage from Data Direct Networks (DDN)

Last, but not least!  In concert with a new participant, we have procured a new high-performance storage system from DDN.  In order to make use of this multi-gigabyte-per-second system, we are installing the GPFS filesystem.  This is a commercial filesystem which PACE is funding.  We will continue to operate the Panasas in parallel with the DDN, and both storage systems can be used at the same time from any compute node.  We are planning a new storage offering that will allow users to purchase additional capacity on this system, so stay tuned.

 

 

As always, please contact us via pace-support@oit.gatech.edu for any questions/concerns you may have.

Thank you!

PACE Team

Breaking news from NSF

Looks like Dr. Subra Suresh will be stepping down from his position as Director of NSF, effective late March to become the next President of Carnegie Mellon.

Click the link here: Staff Letter 2-4-13 to download a copy of his letter to the NSF community.

Interesting times are ahead for both NSF and DOE.

January 2013 quarterly maintenance is complete

Greetings!

We have completed our quarterly maintenance activities.  Head nodes are online again and available for use, queued up jobs have been released, and the scheduler is awaiting new submissions.

Our RedHat 6 clusters have received system software updates.  Please keep an eye on your jobs to verify everything is operating correctly.

Our Panasas scratch storage has received another round of updates.  Preliminary testing indicates that we should have a resolution to our crashes, but the quota system is known to be broken.  As advised by Panasas, we have disabled quotas on scratch.  Please do your best to stay below the 20TB threshold.  We will be monitoring usage and know where you live.  🙂

We have a new license server providing checkouts of the Portland Group and Intel compilers, Matlab DCS, the Allinea DDT debugger and Lumerical.  Please let us know if you have problems accessing this software.  The old server is still running and we will be monitoring it for a short while for extraneous activity.

More nodes from Joe and the FoRCE have been converted from RHEL5 to RHEL6.  If you are still using the RHEL5 side of the world, please prioritize a transition to RHEL6.  We stand ready to assist you with this transition.

Finally, our new configuration system has been deployed in prototype mode.  We will use this to gather operational information and other data that will facilitate a full transition to this system in a future maintenance day.

As usual, please let us know (via email to pace-support@oit.gatech.edu) if you encounter any issues.

Happy Computing!

–Neil Bright
 

Datacenter modifications

Tomorrow morning (January 9) at 8:30am, facilities management will be performing some work on the power distribution systems in the Rich datacenter.  None of this work is being performed on anything that powers PACE systems; there should be zero impact on any job or computer that PACE manages.  However, because we share space in the datacenter, PACE systems could be affected in the event of a major problem.

Once again, there should be zero impact on PACE systems; no jobs or computers should be affected.

Please let us know (via email to pace-support@oit.gatech.edu) if you have any questions or concerns.

Maintenance Day (October 16, 2012) – complete

We have completed our maintenance activities.  Head nodes are online again and queued up jobs are being released.

Our filesystem correction activities on the scratch storage found eight “objects” on the v7 volume to be damaged; these were automatically removed.  Unfortunately, the process provides no indication of which files or directories were problematic.

As always, please follow up with pace-support@oit.gatech.edu about any problems you may see, ideally using the pace-support.sh script discussed here: https://pace.gatech.edu/support.

campus network maintenance

The Network team will be performing some scheduled maintenance this Saturday morning.  This may impact connectivity from your workstations, laptops, or home, but should not affect jobs running within PACE.  However, if your job requires access to network services outside of the PACE cluster (e.g. a remote license server), this maintenance may affect your jobs.

For further information please see the maintenance announcement on status.oit.gatech.edu.

upcoming maintenance day, 10/16 – working on the scratch storage

It’s that time again.  We’ve been working with our scratch storage vendor (Panasas) quite a lot lately, and think we finally have some good news.  Addressing the scratch space will be a major thrust of this quarterly maintenance, and we are cautiously optimistic that we will see improvements.  We will also be applying some VMware tuning to our RHEL5 virtual machines that should increase responsiveness of those head nodes & servers.  Completing upgrades to RHEL6 for a few clusters and a few other minor items round out our activities for the day.

Scratch storage

We have been testing new firmware on our loaner Panasas storage.  Despite our best efforts, we have been unable to replicate our current set of problems after upgrading our loaner equipment to this firmware.  This is good news!  However, simply upgrading is insufficient to fully resolve our issues.  So, on maintenance day, we will be performing a number of tasks related to the Panasas.  After the firmware update, we need to perform some basic file integrity checks – the equivalent of a UNIX fsck – on a couple of volumes.  This process requires those volumes to be offline for the duration.  After that, we need to read every file on the scratch that was created before the firmware upgrade.  Based on our calculations, this will take weeks.  Fortunately, this process can happen in the background, with the filesystems online and otherwise operating normally.  The net result is that the full impact of our maintenance day improvements to the scratch will likely not be realized for a couple of weeks.  If there are files (particularly large ones) that you no longer need and can delete, this process will go faster.  We will also be upgrading the Panasas client software on all compute nodes to (hopefully) address performance issues.
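If you would like to locate large files in your scratch space that might be candidates for deletion, a simple search like the one below can help.  The path and size threshold are illustrative only; adjust them to match your own scratch directory and needs:

    # list files larger than 10GB under your scratch directory, with sizes
    find ~/scratch -type f -size +10G -exec ls -lh {} \;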

Finally, we will also be instituting a 20TB per user hard quota in addition to the 10TB per user soft quota currently in place.  Users that exceed the soft quota will receive warning emails, but writes will succeed.  Writes will fail for users that attempt to exceed the hard quota.

VMware tuning

With some assistance from the Architecture and Infrastructure directorate in OIT, we will be making a number of adjustments to our VMware environment.  The most significant of these is adjusting the filesystem alignment of our RHEL5 virtual machines.  Users of RHEL5 head nodes are likely to see the most improvement.  We’ll also be installing the VMware Tools packages and applying the various tuning parameters they enable.

RHEL6 upgrades

The remaining RHEL5 portions of the clusters below will be upgraded to RHEL6.  After maintenance day, RHEL5 will be unavailable to these clusters.

  • Uranus
  • BioCluster
  • Cygnus

Misc items

  • Configuration updates to redundant network switches serving some project storage
  • Capacity expansion of the ECE file server
  • Serial number updates to a small number of compute nodes lacking serial numbers in the BIOS
  • Interoperability testing of Mellanox Infiniband switches
  • Finish project directory migration of two remaining Optimus users