
Power loss in Rich Datacenter

UPDATE: All clusters are up and ready for service.

At this time, all PACE-managed clusters are believed to be working.
You should be able to log in to your clusters and submit and run jobs.

Any jobs that were running before the power outage have failed, so please resubmit them.
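
If you are not sure which of your jobs were lost, a quick check like the following should show what the scheduler still knows about (a minimal sketch; myjob.pbs stands in for your own submission script):

# Jobs that were running at the time of the outage will no longer be listed
$ qstat -u $USER

# Resubmit anything that is missing
$ qsub myjob.pbs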

Please let us know immediately if anything is still broken.

PACE Team

What happened

At around 0810 Thursday morning, Rich lost its N6 feed, one of the two feeds powering the Rich building and the Rich chiller plant. This also caused multiple failures in the high-voltage vault in the Rich back alley, so Rich lost its other feed, N5, as well. The N5 feed remained up in the chiller plant, however. Although the chillers still had power, operators transferred cooling over to the campus loop as a precaution. Rich office space was without power, but the machine rooms failed over to the generator and UPSes.

PACE systems were powered down gracefully to prevent a hard-shutdown that would make recovery more difficult.

Original Post

This morning (December 19), the Rich datacenter suffered a power loss.
We had to perform an emergency shutdown of all nodes.

As we receive new information we will update this blog and the pace-availability email list.

COMSOL 4.4 Installed

COMSOL 4.4 – Student and Research versions

COMSOL Multiphysics version 4.4 contains many new functions and additions to the COMSOL product suite. See the COMSOL Release Notes for information on new functionality in existing products and an overview of new products in this version.

Using the research version of COMSOL

# Load the research version of COMSOL
$ module load comsol/4.4-research
$ comsol ...
# Use the MATLAB LiveLink
$ module load matlab/r2013b
$ comsol -mlroot ${MATLAB}

Using the classroom/student version of COMSOL

# Load the classroom/student version of COMSOL
$ module load comsol/4.4
$ comsol ...
# Use the MATLAB LiveLink
$ module load matlab/r2013b
$ comsol -mlroot ${MATLAB}
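
If you need to run COMSOL non-interactively on the compute nodes, a batch job along the lines below should work. This is a minimal sketch only: the queue, resource requests, and model file names are placeholders that you will need to adapt to your own cluster and model.

#PBS -N comsol-example
#PBS -l nodes=1:ppn=4
#PBS -l walltime=4:00:00
#PBS -q force-6                  # placeholder queue; use one you can submit to

cd $PBS_O_WORKDIR
module load comsol/4.4-research
# Run COMSOL in batch mode on a hypothetical model file
comsol batch -np 4 -inputfile mymodel.mph -outputfile mymodel_out.mph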

CDA Lecture: Python and the Future of Data Analysis

Speaker: Peter Wang, co-founder of Continuum Analytics

Date: Friday, October 18, 2013

Location: Klaus 1447

Time: 2-3pm

Abstract:
While Python has been a popular and powerful language for scientific computing for a while now, its future in the broader data analytics realm is less clear, especially as market forces and technological innovation are rapidly transforming the field.

In this talk, Peter will introduce some new perspectives on “Big Data” and the evolution of programming languages, and present his thesis that Python has a central role to play in the future of not just scientific computing, but in analytics and even computing in general. As part of the discussion, many new libraries, tools, and technologies will be discussed (both Python and non-Python), both to understand why they exist and where they are driving technical evolution.

Bio:
Peter holds a B.A. in Physics from Cornell University and has been developing applications professionally using Python since 2001. Before co-founding Continuum Analytics in 2011, Peter spent seven years at Enthought designing and developing applications for a variety of companies, including investment bankers, high-frequency trading firms, oil companies, and others. In 2007, Peter was named Director of Technical Architecture and served as client liaison on high-profile projects. Peter also developed Chaco, an open-source, Python-based toolkit for interactive data visualization. Peter’s roles at Continuum Analytics include product design and development, software management, business strategy, and training.

October 2013 PACE maintenance complete

Greetings! We have completed our maintenance activities for October. All clusters are open, and jobs are flowing. We came across (and dealt with) a few minor glitches, but I’m very happy to say that no major problems were encountered. As such, we were able to accomplish all of our goals for this maintenance window.

  • All project storage servers have had their operating systems updated.  This should protect against failures under high load.  Between these fixes and the networking fixes below, we believe all of the root causes of the storage problems we’ve been having recently are resolved.
  • All of our redundancy changes and code upgrades to network equipment have been completed.
  • The decentralization of job scheduling services has been completed.  You should see significantly improved responsiveness when submitting jobs or checking on the status of existing jobs.
    • Please note that you will likely need to resubmit jobs that did not have a chance to run before Tuesday.  Contrary to previously announced and intended designs, this affects the shared clusters as well.  We apologize for the inconvenience.
    • Going forward, the scheduler decentralization has a notable side effect.  Previously, any login node could submit jobs to any queue, as long as the user had access to do so.  Now, this may no longer be the case.
    • For instance, a user of the dedicated cluster “Optimus” who also had access to the FoRCE could previously submit jobs to the force queue from the Optimus head node.  Now, that user will no longer be able to do so, as Optimus and FoRCE are scheduled by different servers.
    • We believe that these cases should be quite uncommon.  If you do encounter this situation, you should be able to simply log in to the other head node and submit your jobs from there (see the example after this list).  You will have the same home, project and scratch directories from either place.  Please let us know if you have problems.
  • All RHEL6 clusters now have access to our new GPFS filesystem.  Additionally, all of the applications in /usr/local (matlab, abaqus, PGI compilers, etc.) have been moved to this storage.  This should provide performance improvements for these applications as well as for the Panasas scratch storage, which was the previous location of this software.
  • Many of our virtual machines have been moved to different storage.  This should provide an improvement in the responsiveness of your login nodes.  Please let us know (via pace-support@oit.gatech.edu) if you see undesirable performance from your login nodes.
  • The Atlantis cluster has been upgraded from RHEL5 to RHEL6 (actually, this happened before this week), and 31 Infiniband-connected nodes from the RHEL5 side of the Atlas cluster have been upgraded to RHEL6.  (The 32nd has hardware problems and has been shut down.)
  • The /nv/pf2 project filesystem has been migrated to a server with more breathing room.
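
For the head node/queue situation described above, a quick way to see which queues your current head node’s scheduler still serves is a sketch like the following (the head node name and myjob.pbs are placeholders for illustration):

# List the queues known to the scheduler behind your current head node
$ qstat -q

# If the queue you need is not listed, log in to that cluster's head node
# and submit from there; your home, project and scratch directories are
# the same in both places
$ ssh force.pace.gatech.edu      # example head node name; use your own
$ qsub -q force myjob.pbs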

Additionally, we were able to complete a couple of bonus objectives.

  • You’ll notice a new message when logging in to your clusters.  Part of this message is brought to you from our Information Security department, and the rest is intended to give a high-level overview of the specific cluster and the queue commonly associated with it.
  • We added Infiniband network redundancy for the DDN/GPFS storage.
  • The /nv/pase1 filesystem was moved off of temporary storage, and onto the server purchased for the Ase1 cluster.

PACE maintenance day, October 2013

The time has again come to discuss our upcoming quarterly maintenance.  As you may recall, our activities lasted well into the night during our last maintenance in July.  Since then, I’ve been talking with various stakeholders and the HPC Faculty Governance Committee.  Starting with our October maintenance and going forward, we will be extending our quarterly maintenance periods from one to two days.  In October, this will be Tuesday the 15th & Wednesday the 16th.  I’ve updated our schedule on the PACE web page.  Upcoming maintenance periods for January, April and July will be posted shortly.

Please continue reading below; there are a couple of things that will require user action after maintenance day.


Scheduled fixes for October include the following:

  • Project storage: We will deploy fixes for the remaining storage servers, completing the roll out of the fixes initiated during our last maintenance period.  These fixes incorporate our best known stable Solaris platform at this point.  Between these fixes and the networking fixes below, we believe most, if not all, of the storage issues we’ve been having lately will be resolved.
  • Networking updates:  We have three categories of work here.  The first is to upgrade the firmware on all of the switches in our gigabit ethernet fabric; this should solve the switch rebooting problem.  The second is to finish the ethernet redundancy work we didn’t complete in July.  While this redundancy work will not ensure that individual compute nodes won’t suffer an ethernet failure, it nearly eliminates single points of failure in the network itself.  No user-visible impact is expected.  Third, we plan to update the firmware on some of our smaller Infiniband switches to bring them in line with the version of software we’re running elsewhere.
  • Moab/Torque job scheduler: In order to mitigate some response issues with the scheduler, we are transitioning from a centralized scheduler server that controls all clusters (well, almost all) to a set of servers.  Shared clusters will remain on the old server, and all of the dedicated clusters [1] will be distributed across a series of new schedulers.  In all instances, we will still run the same _version_ of the software; we will just have a fair bit more aggregate horsepower for scheduling.  This should provide a number of benefits, primarily in the responsiveness you see when submitting new jobs and querying the status of queued or running jobs.  Provided this phase goes well, we will look to upgrade the version of the moab/torque software in January.
      • There are some actions needed from the user community.  Users of dedicated clusters will need to resubmit jobs that did not get started before the maintenance.  The scheduler will not start a job that cannot complete before maintenance, so this only affects jobs that were submitted but never started.  You are affected if you _do not_ have access to the iw-shared-6 queue.  (See the example after the queue list below for a quick way to check on and resubmit your jobs.)
  • New storage platform:  We have been enabling access to the DDN storage platform via the GPFS filesystem on all RHEL6 clusters.  This is now complete, and we are opening up the DDN for investment.  Faculty may purchase drives on this platform to expand project spaces.  Please contact me directly if you are interested in a storage purchase.  Our maintenance activities will include an update to the GPFS software, which provides finer-grained options for user quotas.
  • Filesystem balancing:  We will be moving the /nv/pf2 project filesystem for the FoRCE cluster to a different server.  This will allow some room for expansion and guard against it filling completely.  We expect no user-visible changes here either; all data will be preserved, no paths will change, etc.  The data will just reside on different physical hardware.
  • vmWare improvements: We will be rebalancing the storage used by our virtual machine infrastructure (i.e. head nodes), along with other related tasks aimed at improving performance and stability for these machines.  This is still an active area of preparation, so the full set of fixes and improvements remains to be fully tested.
  • Cluster upgrades:  We will be upgrading the Atlantis cluster from RHEL5 to RHEL6.  Also, we will be upgrading 32 infiniband-connected nodes in Atlas-5 to Atlas-6.

 

[1] Specifically, jobs submitted to the following queues will be affected:

  • aryabhata, aryabhata-6
  • ase1-6
  • athena, athena-6, athena-8core, athena-debug
  • atlantis
  • atlas-6, atlas-ge, atlas-ib
  • complexity
  • cssbsg
  • epictetus
  • granulous
  • joe-6, joe-6-vasp, joe-fast, joe-test
  • kian
  • martini
  • microcluster
  • monkeys, monkeys_gpu
  • mps
  • optimus
  • rozell
  • skadi
  • uranus-6
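
If your jobs were submitted to one of the queues above and had not started before the maintenance, a check along these lines should confirm what needs to be resubmitted (a minimal sketch; myjob.pbs stands in for your own submission script):

# List your jobs as the scheduler sees them
$ qstat -u $USER
# or, with the Moab tools
$ showq -u $USER

# Jobs that were queued but never started will be gone; resubmit them
$ qsub myjob.pbs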

COMSOL Workshop in Atlanta (9/10)

Here’s a note from Siva Hariharan of COMSOL, Inc., which we thought you might be interested in:

You’re invited to a free workshop focusing on the simulation capabilities of COMSOL Multiphysics. Two identical workshops will take place on Tuesday, September 10th in Atlanta, GA. There will be one AM session and one PM session. All attendees will receive a free two-week trial of the software.

During the workshop you will:

– Learn the fundamental modeling steps in COMSOL Multiphysics

– Experience a live multiphysics simulation example

– Set up and solve a simulation through a hands-on exercise

– Learn about the capabilities of COMSOL within your application area

 

Programs:

AM Session

9:30am – 10:45am An Overview of the Software

10:45am – 11:00am Coffee Break

11:00am – 12:30pm Hands-on Tutorial

 

PM Session

1:30pm – 2:45pm An Overview of the Software

2:45pm – 3:00pm Coffee Break

3:00pm – 4:30pm Hands-on Tutorial

 

Event details and registration: http://comsol.com/c/tt1

 

Seating is limited, so advance registration is recommended. 

Feel free to contact me with any questions.

 

Best regards,

Siva Hariharan

COMSOL, Inc.
1 New England Executive Park
Suite 350
Burlington, MA 01803
781-273-3322
siva@comsol.com

PB1 bad news, good news

This is not a repeat from yesterday. Well, it is, just a different server 🙂

UPDATE 2013-08-08 2:23pm

/pb1 is now online, and should not fall over under heavy loads any more.

Have at it, folks. Sorry it has taken this long to get to the final resolution of this problem.

—- Earlier Post —-
Bad news:

If you haven’t been able to tell, the /pb1 filesystem has failed again.

Good news:

We’ve been working on a new OS load for all storage boxes, which we had hoped to roll out on the last maintenance day (July 17), but we ran out of time to verify whether it was

  • deployable
  • a fix for the actual issue

Memo (Mehmet Belgin) greatly assisted me in testing this by finding some of the cases we know to cause failures and replicating them against our test installs. Many of those loads caused failures, confirming our suspicions, and the same testing also confirmed our new image, which will take heavy loads a LOT better than before.

With verification done, we have been planning to have all Solaris-based storage switched to this new load by the end of the next maintenance day (October 15).

However, given the need, this new load will be going onto the PB1 fileserver in just a little bit. We’ve verified the process for doing this without impacting any data stored on the server, so we anticipate having this fileserver back up and running at 2:30PM, with the bugs that have been causing this problem since April removed.

I’ll follow up with progress messages.

PC1 bad news, good news

UPDATE: 2013-08-07, 13:34 –

BEST NEWS OF ALL: /pc1 is now online, and should not fall over under heavy loads anymore.

Have at it, folks. Sorry it has taken this long to get to the final resolution of this problem.

Earlier Status:
Bad news:

If you haven’t been able to tell, the /pc1 filesystem has failed again.

Good news:

We’ve been working on a new OS load for all storage boxes, which we had hoped to roll out on the last maintenance day (July 17), but we ran out of time to verify whether it was

  • deployable
  • a fix for the actual issue

Memo (Mehmet Belgin) greatly assisted me in testing this by finding some of the cases we know to cause failures and replicating them against our test installs. Many of those loads caused failures, confirming our suspicions, and the same testing also confirmed our new image, which will take heavy loads a LOT better than before.

With verification done, we have been planning to have all Solaris-based storage switched to this new load by the end of the next maintenance day (October 15).

However, given the need, this new load will be going onto the PC1 fileserver in just a little bit. We’ve verified the process for doing this without impacting any data stored on the server, so we anticipate having this fileserver back up and running at 1:30pm, with the bugs that have been causing this problem since April removed.

I’ll follow up with progress messages.

Head node problems

Head nodes for many PACE clusters are currently down due to problems with our virtual machines.  This should not affect running jobs, but users are unable to log in.  PACE staff are actively working to restore services as soon as possible.

The head nodes affected are:

  • apurimac-6 – BACK ONLINE 2013/08/03 00:30
  • aryabhata-6 – BACK ONLINE 2013/08/03 00:30
  • ase1-6 – BACK ONLINE 2013/08/03 03:10
  • athena – BACK ONLINE 2013/08/03 04:45
  • atlantis – BACK ONLINE 2013/08/03 04:45
  • atlas-6 – BACK ONLINE 2013/08/03 00:40
  • cee – BACK ONLINE 2013/08/03 01:40
  • chemprot – BACK ONLINE 2013/08/03 01:40
  • complexity – BACK ONLINE 2013/08/03 01:40
  • critcel – BACK ONLINE 2013/08/03 02:00
  • ece – BACK ONLINE 2013/08/03 02:00
  • emory-6 – BACK ONLINE 2013/08/03 02:20
  • faceoff – BACK ONLINE 2013/08/03 03:10
  • granulous – BACK ONLINE 2013/08/03 03:10
  • isabella – BACK ONLINE 2013/08/03 03:10
  • kian – BACK ONLINE 2013/08/03 03:10
  • math – BACK ONLINE 2013/08/03 03:10
  • megatron – BACK ONLINE 2013/08/03 03:10
  • microcluster – BACK ONLINE 2013/08/03 03:10
  • optimus-6 – BACK ONLINE 2013/08/03 00:30
  • testflight-6 – BACK ONLINE 2013/08/03 03:10
  • uranus-6 – BACK ONLINE 2013/08/03 00:30

The following nodes will likely generate SSH key errors upon connection, as the key-saving processes had not run on them. Please edit your ~/.ssh/known_hosts file (Linux/Mac/Unix), remove any host entries with these names, and accept the new keys when you reconnect (see the example after the list below).

  • ase1-6
  • chemprot
  • faceoff
  • kian
  • microcluster
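
Rather than editing known_hosts by hand, you can clear the stale entries with ssh-keygen’s -R option. A minimal sketch, using kian as the example; repeat for each affected head node, including the fully qualified name if that is what appears in your known_hosts file:

# Remove the old host key for the affected head node
$ ssh-keygen -R kian
$ ssh-keygen -R kian.pace.gatech.edu    # hypothetical fully qualified name
# Reconnect and accept the new key when prompted
$ ssh kian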

Additionally, the following user-facing web services are also offline:

  • galaxy

PACE maintenance – complete

We’ve finished.  Feel free to log in and compute.  Previously submitted jobs are running in the queues.  As always, if you see odd issues, please send a note to pace-support@oit.gatech.edu.

We were able to complete our transition to the database-driven configuration, and apply the Panasas code upgrade.  Some of you will be seeing warning messages stemming from your utilization of the scratch space.  Please remember that this is a shared, and limited, resource.  The RHEL5 side of the FoRCE cluster was also retired, and reincorporated into the RHEL6 side.

We were able to achieve some of the network redundancy work, but this took substantially longer than planned and we didn’t get as far as we would have liked.  We’ll complete this during future maintenance window(s).

We spent a lot of time today trying to address the storage problems, but time was just too short to fully implement the fixes.  We were able to do some work to address the storage for the virtual machine infrastructure (you’ll notice this as the head/login nodes).  Over the next days and weeks, we will work on a robust way to deploy these updates to our storage servers and come up with a more feasible implementation schedule.

Among the less time-consuming items, we increased the amount of memory the Infiniband cards are able to allocate.  This should help those of you with codes that send very large messages.  We also increased the size of the /nv/pz2 filesystem – for those of you on the Athena cluster, that filesystem is now nearly 150TB.  We found some Infiniband cards that had outdated firmware and brought those into line with what is in use elsewhere in PACE.  We also added a significant amount of capacity to one of our backup servers, added some redundant links to our Infiniband fabric, and added some additional 10-gigabit ports for our growing server & storage infrastructure.

In all of this, we have been reminded that PACE has grown quite a lot over the last few years – from only a few thousand cores, to upwards of 25,000.  As we’ve grown, it’s become more difficult to complete our maintenance in four days a year.  Part of our post-mortem discussions will be around ways we can more efficiently use our maintenance time, and possibly increasing the amount of scheduled downtime.  If you have thoughts along these lines, I’d really appreciate hearing from you.

Thanks,

Neil Bright