PACE maintenance – complete

We’ve finished.  Feel free to log in and compute.  Previously submitted jobs are running in the queues.  As always, if you see odd issues, please send a note to pace-support@oit.gatech.edu.

We were able to complete our transition to the database-driven configuration, and apply the Panasas code upgrade.  Some of you will be seeing warning messages stemming from your utilization of the scratch space.  Please remember that this is a shared, and limited, resource.  The RHEL5 side of the FoRCE cluster was also retired, and reincorporated into the RHEL6 side.

We were able to complete some of the network redundancy work, but this took substantially longer than planned and we didn’t get as far as we would have liked.  We’ll complete this during future maintenance window(s).

We spent a lot of time today trying to address the storage problems, but time was just too short to fully implement a fix.  We were able to do some work to address the storage for the virtual machine infrastructure (you’ll notice this as the head/login nodes).  Over the coming days and weeks, we will work on a robust way to deploy these updates to our storage servers and come up with a more feasible implementation schedule.

We also accomplished some of the less time-consuming items.  We increased the amount of memory the Infiniband cards are able to allocate, which should help those of you with codes that send very large messages.  We also increased the size of the /nv/pz2 filesystem; for those of you on the Athena cluster, that filesystem is now nearly 150TB.  We found some Infiniband cards that had outdated firmware and brought those into line with what is in use elsewhere in PACE.  We also added a significant amount of capacity to one of our backup servers, added some redundant links into our Infiniband fabric, and added some additional 10-gigabit ports for our growing server & storage infrastructure.

In all of this, we have been reminded that PACE has grown quite a lot over the last few years, from only a few thousand cores to upwards of 25,000.  As we’ve grown, it’s become more difficult to complete our maintenance in four days a year.  Part of our post-mortem discussions will be around ways we can use our maintenance time more efficiently, and whether to increase the amount of scheduled downtime.  If you have thoughts along these lines, I’d really appreciate hearing from you.

Thanks,

Neil Bright

Hi folks,

 

Just a quick reminder here of our maintenance activities coming up on Tuesday of next week.  All PACE managed clusters will be down for the day.  For further details, please see our blog post here.

 

Thanks!

Neil Bright

PACE maintenance day – July 16

Dear PACE cluster users,

The time has come again for our quarterly maintenance day, and we would like to remind you that all systems will be powered off starting at 6:00am on Tuesday, July 16, and will be down for the entire day.

None of your jobs will be killed, because the job scheduler knows about the planned downtime and does not start any jobs that would still be running by then. You may want to check the walltimes of the jobs you will be submitting and modify them so they complete before the maintenance day, if possible. Submitting jobs with longer walltimes is still OK, but they will be held by the scheduler and released right after the maintenance day.
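If you want to work out how long a walltime you can request and still have the job start before the downtime, the arithmetic can be sketched roughly as follows. This is only an illustration, not a PACE tool: the function name `max_walltime` and the 30-minute safety margin are hypothetical, and the year 2013 is an assumption (this post only gives the date as Tuesday, July 16 at 6:00am).

```python
from datetime import datetime, timedelta


def max_walltime(now, maintenance_start, margin=timedelta(minutes=30)):
    """Return the longest walltime (as HH:MM:SS) that still finishes before
    the maintenance window, or None if there is no time left.

    The margin is a hypothetical safety buffer, not a scheduler setting.
    """
    remaining = maintenance_start - now - margin
    if remaining <= timedelta(0):
        return None
    total_seconds = int(remaining.total_seconds())
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Assumption: the maintenance day is July 16, 2013, starting at 6:00am.
maintenance = datetime(2013, 7, 16, 6, 0)

# Example: submitting on the evening of July 14.
print(max_walltime(datetime(2013, 7, 14, 18, 0), maintenance))
```

The resulting value can be used in the usual `-l walltime=HH:MM:SS` resource request. A job asking for more than this is not rejected; per the note above, it is simply held until after the maintenance day.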

We have many tasks to complete; here are the highlights:

  1. transition to a new method of managing our configuration files – We’ve referred to this in the past as ‘database-based configuration makers’. We’ve been doing a lot of testing on this the last few months and have things ready to go. I don’t expect this to cause any visible change to your experience, just give us a greater capability to manage more and more equipment.
  2. network redundancy – we’re beefing up our ethernet network core for compute nodes. Again, not an item I expect to be a change to your experience, just improvements to the infrastructure.
  3. Panasas code upgrade – This work will complete the series of bug fixes from Panasas, and allow us to reinstate the quotas on scratch space. We’ve been testing this code for many weeks and have not observed any detrimental behavior. This is potentially a visible change to you. We will reinstate the 10TB soft and 20TB hard quotas. If you are using more than 20TB of our 215TB scratch space, you will not be able to add additional files or modify existing files in scratch.
  4. decommissioning of the RHEL5 version of the FoRCE cluster – This will allow us to add 240 CPU cores to the RHEL6 side of the FoRCE cluster, pushing force-6 over 2,000 CPU cores. We’ve been dwindling this resource for some time now; this just finishes it off. Users with access to FoRCE currently have access to both the RHEL5 and RHEL6 sides. Access to RHEL6 via the force-6 head node will not change as part of this process.
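To make the scratch quota thresholds from item 3 concrete, here is a minimal sketch that classifies a usage figure against the 10TB soft and 20TB hard limits. It is an illustration only: the function name `quota_status` is hypothetical, treating 1TB as 10^12 bytes is an assumption, and the exact boundary behavior of the storage system is not described in this post.

```python
# Assumption: decimal terabytes (10**12 bytes); the post does not say which unit is used.
TB = 10**12

SOFT_QUOTA = 10 * TB  # soft quota stated in the announcement
HARD_QUOTA = 20 * TB  # hard quota stated in the announcement


def quota_status(used_bytes):
    """Classify scratch usage against the announced quotas (hypothetical helper)."""
    if used_bytes > HARD_QUOTA:
        # Per the announcement: no new files, no modifying existing files.
        return "over hard quota"
    if used_bytes > SOFT_QUOTA:
        return "over soft quota"
    return "ok"


for used in (5 * TB, 15 * TB, 25 * TB):
    print(used // TB, "TB:", quota_status(used))
```

How you measure your own usage is up to you; the sketch only shows where a given figure falls relative to the two limits.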

As always, please contact us via pace-support@oit.gatech.edu for any questions/concerns you may have.

PC1 & PB1 filesystems back online

Hey folks,

It looks like we may have finally found the issue tying up the PB1 file server and the occasional lock up of the PC1 file server. We’ve isolated the compute nodes that seemed to be generating the bad traffic, and have even isolated the processes which appear to have compounded the problem on a pair of shared nodes (thus linking the two server failures). With any luck, we’ll get those nodes online once their other jobs complete or are cancelled.

Thank you for the patience you have given us while we tracked this problem down. We know it was quite inconvenient, but we have a decent picture of what occurred and thankfully it was something that is very unlikely to repeat itself.

RESOLVED: Hardware Failure for /PC1 filesystem users

Hey folks,

The PC1 file system is now back online. No data was lost (no disks were harmed in the incident), though you probably need to check the status of any running or scheduled jobs.

We had to use a different (though equivalent) system to get you back online. If we need to switch to the actual replacement hardware provided by Penguin, we will do so on the next Maintenance Day (July 16); otherwise you should be ready to rock and roll.

Sorry about the delays; some of the needed parts were not available.

PACE Maintenance day complete

We have completed our maintenance day activities, and are now back into regular operation.  Please let us know (via email to pace-support@oit.gatech.edu) if you encounter problems.

 

–Neil Bright

PACE maintenance day – NEXT WEEK 4/16

The next maintenance day (4/16, Tuesday) is just around the corner and we would like to remind you that all systems will be powered off for the entire day. You will not be able to access the headnodes, compute nodes or your data until the maintenance tasks are complete.

None of your jobs will be killed, because the job scheduler knows about the planned downtime and does not start any jobs that would still be running by then. You may want to check the walltimes of the jobs you will be submitting and modify them so they complete before the maintenance day, if possible. Submitting jobs with longer walltimes is still OK, but they will be held by the scheduler and released right after the maintenance day.

We have many tasks to complete, and here’s a summary:

1) Job Resource Manager/Scheduler maintenance

Contrary to the initial plan, we decided NOT to upgrade the resource manager (torque) and job scheduler (moab) software yet. We have been testing the new versions of this software (with your help) and, unfortunately, identified significant bugs/problems along the way. Despite being old, the current versions are known to be robust, so we will maintain the status quo until we resolve all of the problems with the vendor.

2) Interactive login prevention mechanism

Ideally, compute nodes should not allow for interactive logins, unless the user has active jobs on the node. We noticed that some users can directly ssh to compute nodes and start jobs, however. This may lead to resource conflicts and unfair use of the cluster. We identified the problem and will apply the fix on this maintenance day.

3) continued RHEL-6 migration

We are planning to convert all of the remaining Joe nodes to RHEL6 in this cycle. We will also convert the 25% of the remaining RHEL5 FoRCE nodes. We are holding off the migration for Aryabhata and Atlas clusters per request of those communities.

4) Hardware installation and configuration

We noticed that some of the nodes in the Granulous, Optimus and FoRCE clusters are still running diskless, although they have local disks. Some nodes are also not using the optimal choice for their /tmp. We will fix these problems.

We received (and tested) a replacement for the fileserver for the Apurimac project storage (pb3), since we have been experiencing problems there. We will install the new system and swap the disks. This is just a mechanical process and your data is safe. As an extra precaution, we have been taking incremental backups (in addition to the regular backups) of this storage since it first started showing signs of failure.

5) Software/Configurations

We will also patch/update/add software, including:

  • Upgrade the node health checker scripts
  • Deploy new database-based configuration makers (in dry-run mode for testing)
  • Reconfigure licensing mechanism so different groups can use different sources for licenses

6) Electrical Work

We will also perform some electrical work to better facilitate the recent and future additions to the clusters. We will replace some problematic PDUs and redistribute the power among racks.

7) New storage from Data Direct Networks (DDN)

Last, but not least!  In concert with a new participant, we have procured a new high performance storage system from DDN.  In order to make use of this multi-gigabyte/sec performing monster, we are installing the GPFS filesystem.  This is a commercial filesystem which PACE is funding.  We will continue to operate the Panasas in parallel with DDN, and both storage systems can be used at the same time from any compute node.  We are planning a new storage offering that allows users to purchase additional capacity on this system, so stay tuned.

 

 

As always, please contact us at pace-support@oit.gatech.edu for any questions/concerns you may have.

Thank you!

PACE Team

Account related problems on 03/14/2013

We experienced some account management difficulties today (03/14/2013), mostly caused by exceeding the capacity of our database. We found the cause and fixed all of the issues. 

This problem might have affected you in two different ways: temporary login problems on the headnodes, and failure of some recently allocated jobs on compute nodes. As far as we know, none of the running jobs were affected.

We apologize for any inconvenience this might have caused. If you have experienced any problems, please send us a note (pace-support@oit.gatech.edu).