Posts

PACE maintenance day, October 2013

The time has again come to discuss our upcoming quarterly maintenance.  As you may recall, our activities lasted well into the night during our last maintenance in July.  Since then, I’ve been talking with various stakeholders and the HPC Faculty Governance Committee.  Starting with our October maintenance and going forward, we will be extending our quarterly maintenance periods from one to two days.  In October, this will be Tuesday the 15th and Wednesday the 16th.  I’ve updated our schedule on the PACE web page.  Upcoming maintenance periods for January, April and July will be posted shortly.

Please continue reading below; a couple of items will require user action after the maintenance.


Scheduled fixes for October include the following:

  • Project storage: We will deploy fixes for the remaining storage servers, completing the rollout initiated during our last maintenance period.  These fixes incorporate the most stable Solaris platform we currently know of.  Between these fixes and the networking fixes below, we believe we will resolve most, if not all, of the storage issues we’ve been having lately.
  • Networking updates:  We have three categories of work here.  The first is to upgrade the firmware on all of the switches in our gigabit ethernet fabric, which should solve the switch rebooting problem.  The second is to finish the ethernet redundancy work we didn’t complete in July.  While this work will not ensure that individual compute nodes never suffer an ethernet failure, it nearly eliminates single points of failure in the network itself.  No user-visible impact is expected.  Finally, we plan to update the firmware on some of our smaller Infiniband switches to bring them in line with the version of software we’re running elsewhere.
  • Moab/Torque job scheduler: To mitigate some response issues with the scheduler, we are transitioning from a single centralized scheduler server that controls (nearly) all clusters to a set of servers.  Shared clusters will remain on the old server, and the dedicated clusters [1] will be distributed across a series of new schedulers.  In all instances we will still run the same _version_ of the software; we’ll just have a fair bit more aggregate horsepower for scheduling.  This should provide a number of benefits, primarily in the response you see when submitting new jobs and querying the status of queued or running jobs.  Provided this phase goes well, we will look to upgrade the version of the Moab/Torque software in January.
      • There are some actions needed from the user community.  Users of dedicated clusters will need to resubmit jobs that did not start before the maintenance.  The scheduler will not start a job that cannot complete before the maintenance, so this only affects jobs that were submitted but never started.  You are affected if you _do not_ have access to the iw-shared-6 queue.  (A small sketch for listing your still-queued jobs appears at the end of this post.)
  • New storage platform:  We have been enabling access to the DDN storage platform via the GPFS filesystem on all RHEL6 clusters.  This is now complete, and we are opening up the DDN for investment.  Faculty may purchase drives on this platform to expand project spaces; please contact me directly if you are interested in a storage purchase.  Our maintenance activities will include an update to the GPFS software which provides finer-grained options for user quotas.
  • Filesystem balancing:  We will be moving the /nv/pf2 project filesystem for the FoRCE cluster to a different server.  This will allow some room for expansion and guard against it filling completely.  We expect no user-visible changes here either: all data will be preserved and no paths will change; the data will simply reside on a different physical server (or servers).
  • vmWare improvements: We will be rebalancing the storage used by our virtual machine infrastructure (i.e. head nodes), along with other related tasks aimed at improving performance and stability for these machines.  This is still an active area of preparation, so the full set of fixes and improvements has yet to be fully tested.
  • Cluster upgrades:  We will be upgrading the Atlantis cluster from RHEL5 to RHEL6.  We will also be upgrading 32 Infiniband-connected nodes from Atlas-5 to Atlas-6.

 

[1] Specifically, jobs submitted to the following queues will be affected:

  • aryabhata, aryabhata-6
  • ase1-6
  • athena, athena-6, athena-8core, athena-debug
  • atlantis
  • atlas-6, atlas-ge, atlas-ib
  • complexity
  • cssbsg
  • epictetus
  • granulous
  • joe-6, joe-6-vasp, joe-fast, joe-test
  • kian
  • martini
  • microcluster
  • monkeys, monkeys_gpu
  • mps
  • optimus
  • rozell
  • skadi
  • uranus-6
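
If you would like a quick way to see which of your jobs are still queued (and would therefore need to be resubmitted on a dedicated cluster), a rough Python sketch along these lines may help.  It simply wraps Torque’s qstat; the exact output format varies a bit between versions, so treat it as a starting point rather than a supported tool.

    #!/usr/bin/env python
    # Rough sketch: list your jobs that are queued but not yet running,
    # i.e. the ones that would need resubmission after the maintenance.
    # Assumes Torque's qstat is on your PATH; output parsing is approximate.
    import getpass
    import subprocess

    user = getpass.getuser()
    output = subprocess.check_output(["qstat", "-u", user]).decode()

    for line in output.splitlines():
        fields = line.split()
        # Job lines begin with a numeric job id; the state column
        # ("Q" for queued, "R" for running) is second from the end.
        if fields and fields[0][0].isdigit():
            job_id, state = fields[0], fields[-2]
            if state == "Q":
                print("queued (will need resubmission): %s" % job_id)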

COMSOL Workshop in Atlanta (9/10)

Here’s a note from Siva Hariharan of COMSOL, Inc., which we thought you might be interested in:

You’re invited to a free workshop focusing on the simulation capabilities of COMSOL Multiphysics. Two identical workshops will take place on Tuesday, September 10th in Atlanta, GA. There will be one AM session and one PM session. All attendees will receive a free two-week trial of the software.

During the workshop you will:

– Learn the fundamental modeling steps in COMSOL Multiphysics

– Experience a live multiphysics simulation example

– Set up and solve a simulation through a hands-on exercise

– Learn about the capabilities of COMSOL within your application area

 

Programs:

AM Session

9:30am – 10:45am An Overview of the Software

10:45am – 11:00am Coffee Break

11:00am – 12:30pm Hands-on Tutorial

 

PM Session

1:30pm – 2:45pm An Overview of the Software

2:45pm – 3:00pm Coffee Break

3:00pm – 4:30pm Hands-on Tutorial

 

Event details and registration: http://comsol.com/c/tt1

 

Seating is limited, so advance registration is recommended. 

Feel free to contact me with any questions.

 

Best regards,

Siva Hariharan

COMSOL, Inc.
1 New England Executive Park
Suite 350
Burlington, MA 01803
781-273-3322
siva@comsol.com

PB1 bad news, good news

This is not a repeat from yesterday. Well, it is, just a different server 🙂

UPDATE 2013-08-08 2:23pm

/pb1 is now online, and should not fall over under heavy loads any more.

Have at it folks. Sorry it has taken this long to get to the final
resolution of this problem.

---- Earlier Post ----
Bad news:

If you haven’t been able to tell, the /pb1 filesystem has failed again.

Good news:

We’ve been working on a new OS load for all storage boxes, which we had hoped to roll out on the last maintenance day (July 17), but we ran out of time to verify whether it:

  • was deployable
  • resolved the actual issue

Memo (Mehmet Belgin) greatly assisted me in testing this by finding some of the cases we know to cause failures and replicating them against our test installs.  Those loads reproduced the failures on the current image, confirming our suspicions, and also confirmed the new image, which takes heavy loads a LOT better than before.

With verification done, we have been planning to switch all Solaris-based storage over to this new load by the end of the next maintenance day (October 15).

However, out of necessity, this will be going onto the PB1 fileserver in just a little bit.  We’ve verified the process for doing this without impacting any data stored on the server, so we anticipate having this fileserver back up and running at 2:30PM, with the bugs that have been causing this problem since April removed.

I’ll follow up with progress messages.

PC1 bad news, good news

UPDATE: 2013-08-07, 13:34 –

BEST NEWS OF ALL: /pc1 is now online, and should not fall over under heavy loads anymore.

Have at it folks. Sorry it has taken this long to get to the final
resolution of this problem.

Earlier Status:
Bad news:

If you haven’t been able to tell, the /pc1 filesystem has failed again.

Good news:

We’ve been working on a new OS load for all storage boxes, which we had hoped to roll out on the last maintenance day (July 17), but we ran out of time to verify whether it:

  • was deployable
  • resolved the actual issue

Memo (Mehmet Belgin) greatly assisted me in testing this by finding some of the cases we know to cause failures and replicating them against our test installs.  Those loads reproduced the failures on the current image, confirming our suspicions, and also confirmed the new image, which takes heavy loads a LOT better than before.

With verification done, we have been planning to switch all Solaris-based storage over to this new load by the end of the next maintenance day (October 15).

However, out of necessity, this will be going onto the PC1 fileserver in just a little bit.  We’ve verified the process for doing this without impacting any data stored on the server, so we anticipate having this fileserver back up and running at 1:30pm, with the bugs that have been causing this problem since April removed.

I’ll follow up with progress messages.

Head node problems

Head nodes for many PACE clusters are currently down due to problems with our virtual machines.  This should not affect running jobs, but users are unable to log in.  PACE staff are actively working to restore services as soon as possible.

The head nodes affected are:

  • apurimac-6 – BACK ONLINE 2013/08/03 00:30
  • aryabhata-6 – BACK ONLINE 2013/08/03 00:30
  • ase1-6 – BACK ONLINE 2013/08/03 03:10
  • athena – BACK ONLINE 2013/08/03 04:45
  • atlantis – BACK ONLINE 2013/08/03 04:45
  • atlas-6 – BACK ONLINE 2013/08/03 00:40
  • cee – BACK ONLINE 2013/08/03 01:40
  • chemprot – BACK ONLINE 2013/08/03 01:40
  • complexity – BACK ONLINE 2013/08/03 01:40
  • critcel – BACK ONLINE 2013/08/03 02:00
  • ece – BACK ONLINE 2013/08/03 02:00
  • emory-6 – BACK ONLINE 2013/08/03 02:20
  • faceoff – BACK ONLINE 2013/08/03 03:10
  • granulous – BACK ONLINE 2013/08/03 03:10
  • isabella – BACK ONLINE 2013/08/03 03:10
  • kian – BACK ONLINE 2013/08/03 03:10
  • math – BACK ONLINE 2013/08/03 03:10
  • megatron – BACK ONLINE 2013/08/03 03:10
  • microcluster – BACK ONLINE 2013/08/03 03:10
  • optimus-6 – BACK ONLINE 2013/08/03 00:30
  • testflight-6 – BACK ONLINE 2013/08/03 03:10
  • uranus-6 – BACK ONLINE 2013/08/03 00:30

The following nodes will likely generate SSH key errors when you connect, because the key-saving processes had not yet run on them.  Please edit your ~/.ssh/known_hosts file (Linux/Mac/Unix), remove any host entries with these names, and save the new keys when you reconnect.  (A small sketch for cleaning up these entries follows the list below.)

  • ase1-6
  • chemprot
  • faceoff
  • kian
  • microcluster
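
The usual one-liner for this is ssh-keygen -R <hostname>, run once per affected name.  If you prefer to script it, here is a minimal Python sketch that strips the listed hosts out of the default known_hosts file.  It assumes unhashed entries (if yours are hashed, stick with ssh-keygen -R) and writes a backup copy first.

    #!/usr/bin/env python
    # Minimal sketch: remove stale entries for the affected head nodes from
    # ~/.ssh/known_hosts so that ssh will offer to save the new keys.
    # Assumes unhashed known_hosts entries; a backup copy is written first.
    import os
    import shutil

    stale_hosts = {"ase1-6", "chemprot", "faceoff", "kian", "microcluster"}
    path = os.path.expanduser("~/.ssh/known_hosts")
    shutil.copy(path, path + ".bak")  # keep a backup, just in case

    with open(path) as f:
        lines = f.readlines()

    def is_stale(line):
        # The first field of a known_hosts entry is a comma-separated
        # list of host names / addresses for that key.
        if not line.strip():
            return False
        names = line.split(None, 1)[0].split(",")
        return any(h in name for name in names for h in stale_hosts)

    with open(path, "w") as f:
        f.writelines(line for line in lines if not is_stale(line))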

Additionally, the following user-facing web services are offline:

  • galaxy

PACE maintenance – complete

We’ve finished.  Feel free to log in and compute.  Previously submitted jobs are running in the queues.  As always, if you see odd issues, please send a note to pace-support@oit.gatech.edu.

We were able to complete our transition to the database-driven configuration and apply the Panasas code upgrade.  Some of you will see warning messages stemming from your utilization of the scratch space; please remember that this is a shared, and limited, resource.  The RHEL5 side of the FoRCE cluster was also retired and reincorporated into the RHEL6 side.

We were able to complete some of the network redundancy work, but it took substantially longer than planned and we didn’t get as far as we would have liked.  We’ll finish this during future maintenance windows.

We spent a lot of time today trying to address the storage problems, but time was just too short to fully implement the fixes.  We were able to do some work to address the storage for the virtual machine infrastructure (which you see as the head/login nodes).  Over the coming days and weeks, we will work on a robust way to deploy these updates to our storage servers and come up with a more feasible implementation schedule.

Among the less time-consuming items, we increased the amount of memory the Infiniband cards are able to allocate, which should help those of you with codes that send very large messages.  We also increased the size of the /nv/pz2 filesystem; for those of you on the Athena cluster, that filesystem is now nearly 150TB.  We found some Infiniband cards with outdated firmware and brought them into line with what is in use elsewhere in PACE.  Finally, we added a significant amount of capacity to one of our backup servers, added some redundant links to our Infiniband fabric, and added additional 10-gigabit ports for our growing server & storage infrastructure.

In all of this, we have been reminded that PACE has grown quite a lot over the last few years, from only a few thousand cores to upwards of 25,000.  As we’ve grown, it has become more difficult to complete our maintenance in four days a year.  Part of our post-mortem discussions will be about ways we can use our maintenance time more efficiently, and possibly about increasing the amount of scheduled downtime.  If you have thoughts along these lines, I’d really appreciate hearing from you.

Thanks,

Neil Bright

Hi folks,

 

Just a quick reminder of our maintenance activities coming up on Tuesday of next week.  All PACE-managed clusters will be down for the day.  For further details, please see our blog post here.

 

Thanks!

Neil Bright

PACE maintenance day – July 16

Dear PACE cluster users,

The time has come again for our quarterly maintenance day, and we would like to remind you that all systems will be powered off starting at 6:00am on Tuesday, July 16, and will be down for the entire day.

None of your jobs will be killed: the job scheduler knows about the planned downtime and will not start any jobs that would still be running by then.  You may want to check the walltimes of the jobs you submit and, where possible, adjust them so that the jobs complete before the maintenance day.  Submitting jobs with longer walltimes is still OK, but they will be held by the scheduler and released right after the maintenance day.
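
For a quick back-of-the-envelope check, the sketch below computes the longest walltime you could request right now and still finish before the 6:00am, July 16 cutoff mentioned above.  This is only an illustration of the arithmetic; the scheduler makes this determination itself when it holds or releases your job.

    #!/usr/bin/env python
    # Back-of-the-envelope sketch: longest walltime a job submitted right now
    # could request and still finish before the maintenance window opens.
    # Assumes the 6:00am, July 16 start time from this announcement.
    from datetime import datetime

    maintenance_start = datetime(2013, 7, 16, 6, 0)
    remaining = maintenance_start - datetime.now()

    if remaining.total_seconds() <= 0:
        print("Maintenance has already started; held jobs will run afterwards.")
    else:
        hours, rest = divmod(int(remaining.total_seconds()), 3600)
        minutes = rest // 60
        print("Max walltime to start now and still finish in time: "
              "%d:%02d:00" % (hours, minutes))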

We have many tasks to complete; here are the highlights:

  1. transition to a new method of managing our configuration files – We’ve referred to this in the past as ‘database-based configuration makers’.  We’ve done a lot of testing on this over the last few months and have things ready to go.  I don’t expect this to cause any visible change to your experience; it just gives us a greater capability to manage more and more equipment.
  2. network redundancy – We’re beefing up our ethernet network core for compute nodes.  Again, this is not an item I expect to change your experience; it’s just an improvement to the infrastructure.
  3. Panasas code upgrade – This work will complete the series of bug fixes from Panasas and allow us to reinstate the quotas on scratch space.  We’ve been testing this code for many weeks and have not observed any detrimental behavior.  This is potentially a visible change for you: we will reinstate the 10TB soft and 20TB hard quotas.  If you are using more than 20TB of our 215TB scratch space, you will not be able to add new files or modify existing files in scratch.
  4. decommissioning of the RHEL5 version of the FoRCE cluster – This will allow us to add 240 CPU cores to the RHEL6 side of the FoRCE cluster, pushing force-6 over 2,000 CPU cores.  We’ve been winding this resource down for some time now; this just finishes it off.  Users with access to FoRCE currently have access to both the RHEL5 and RHEL6 sides; access to RHEL6 via the force-6 head node will not change as part of this process.

As always, please contact us via pace-support@oit.gatech.edu for any questions/concerns you may have.

Login Node Storage Server Problems

Last night (2013/06/30), one of the storage servers responsible for many of the cluster login nodes encountered some major problems.  These issues are preventing the login nodes from allowing any user to log in or use the server.  The affected login nodes are:

  • cee
  • chemprot
  • cns
  • cygnus-6
  • force-6
  • force
  • math
  • monkeys
  • optimus
  • testflight-6

We are aware of the problem and are working as quickly as possible to fix it.  Please let us know of any issues you are experiencing that may be related.  We will keep you posted on our progress.