Posts

maintenance day complete, ready for jobs

We are done with maintenance day; however, some automated nightly processes still need to run before jobs can flow again.  So, I’ve set an automated timer to release jobs at 4:30am today, a little over two hours from now.  The scheduler will accept new jobs now, but will not start executing them until 4:30am.


With the exception of the following two items, all of the tasks listed in our previous blog post have been accomplished.

  • Firmware updates on the scratch servers were deferred per the strong recommendation of the vendor.
  • An experimental software component of the scratch system was not tested due to the lack of a test plan from the vendor.


SSH host keys have changed on the following head nodes.  Please accept the new keys into your preferred SSH client; a short example of clearing the old keys follows the list below.

  • atlas-6
  • atlas-post5
  • atlas-post6
  • atlas-post7
  • atlas-post8
  • atlas-post9
  • atlas-post10
  • apurimac
  • biocluster-6
  • cee
  • critcel
  • cygnus-6
  • complexity
  • cns
  • ece
  • granulous
  • optimus
  • math
  • prometheus
  • uranus-6
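
If you use OpenSSH, the old keys are cached in ~/.ssh/known_hosts, and your next login will produce a host-key-changed warning.  Here is a minimal sketch of clearing a stale entry, assuming the usual pace.gatech.edu hostnames (atlas-6 is just one example from the list above):

    # remove the cached key for a head node whose host key changed
    ssh-keygen -R atlas-6.pace.gatech.edu

    # the new key is offered on the next connection; verify it before accepting
    ssh yourusername@atlas-6.pace.gatech.edu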

10TB soft quota per user on scratch storage

One of the many benefits of using PACE clusters is the scratch storage, which provides a fast filesystem for I/O-bound jobs. The scratch server is designed for high speed rather than large storage capacity. So far, a weekly script that deletes all files older than 60 days has allowed us to sustain this service without the need for disk quotas. However, this situation has started to change as the PACE clusters have grown to a whopping ~750 active users, ~300 of whom have been added since Feb 2011 alone. Consequently, it has become common for scratch utilization to reach 98%-100% on several volumes, which is alarming for the health of the entire system.

We are planning to address this issue with a two-step transition plan for enabling file quotas. The first step will be applying 10TB “soft” quotas for all users for the next 3 months. A soft quota means that you will receive warning emails from the system if you exceed 10TB, but your writes will NOT be blocked. This will help you adjust your data usage and prepare for the second step: 10TB “hard” quotas that will block writes when the quota is exceeded.

Considering that the total scratch capacity is 260TB, a 10TB quota for 750 users is a very generous limit. Looking at some current statistics, the number of users using more than this capacity does not exceed 10. If you are one of these users (you can check using the command ‘du -hs ~/scratch’) and have concerns that the 10TB quota will adversely impact your research, please contact us (pace-support@oit.gatech.edu).
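
For reference, here is a minimal sketch of checking where you stand.  It assumes your scratch directory is at ~/scratch as in the command above; the second command simply lists files already old enough for the weekly cleanup, as an illustration of finding candidates for deletion:

    # total size of everything under your scratch directory (may take a while)
    du -hs ~/scratch

    # list files older than 60 days, i.e. already eligible for the weekly deletion script
    find ~/scratch -type f -mtime +60 -exec ls -lh {} +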

REMINDER – upcoming maintenance day, 7/17

The major activity for maintenance day is the RedHat 6.1 to RedHat 6.2 software update.  (Please test your codes!!)  This will affect a significant portion of our user base.  We’re also instituting soft quotas on the scratch space.  Please see the details below.

The following are running RedHat 5, and are NOT affected:

  • Athena
  • Atlantis

The following have already been upgraded to the new RedHat 6.2 stack.  We would appreciate reports on any problems you may have:

  • Monkeys
  • MPS
  • Isabella
  • Joe-6
  • Aryabhata-6

If I didn’t mention your cluster above, you are affected by this software update.  Please test using the ‘testflight’ queue; jobs are limited to 48 hours in this queue.  If you would like to recompile your software with the 6.2 stack, please log in to the ‘testflight-6.pace.gatech.edu’ head node.
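
As a rough sketch of a test submission (Torque/Moab-style directives, matching the msub/qstat tools mentioned elsewhere on this blog; the job name, resource request, and executable are placeholders for your own):

    # minimal test job for the RedHat 6.2 stack (names and sizes are placeholders)
    #PBS -N testflight-check
    #PBS -q testflight
    #PBS -l nodes=1:ppn=4
    #PBS -l walltime=24:00:00

    # run from the directory the job was submitted from
    cd $PBS_O_WORKDIR
    ./my_code   # placeholder for your own executable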

Other activities we have planned are:

Relocating some project directory servers to an alternate data center on campus.  We have strong network connectivity, so this should not change performance of these filesystems.  No user modifications needed.

  • /nv/hp3 – Joe
  • /nv/pb1 – BioCluster
  • /nv/pb3 – Apurimac
  • /nv/pc1 – Cygnus
  • /nv/pc2 – Cygnus
  • /nv/pc3 – Cygnus
  • /nv/pec1 – ECE
  • /nv/pj1 – Joe
  • /nv/pma1 – Math
  • /nv/pme1 – Prometheus
  • /nv/pme2 – Prometheus
  • /nv/pme3 – Prometheus
  • /nv/pme4 – Prometheus
  • /nv/pme5 – Prometheus
  • /nv/pme6 – Prometheus
  • /nv/pme7 – Prometheus
  • /nv/pme8 – Prometheus
  • /nv/ps1 – Critcel
  • /nv/pz1 – Athena

Activities on the scratch space – no user changes are expected for any of these.

  • Balance some users across volumes v3, v4, v13 and v14.  This will involve moving users from one volume to another, but we will place links in the old locations.
  • Run a filesystem consistency check on the v14 volume.  This has the potential to take a significant amount of time.  Please watch the pace-availability email list (or this blog) for updates if this will take longer than expected.
  • Apply firmware updates on the scratch servers to resolve some crash & failover events that we’ve been seeing.
  • Institute soft quotas.  Users exceeding 10TB of usage on the scratch space will receive automated warning emails, but writes will be allowed to proceed.  Currently, this will affect 6 of 750+ users.  The 10TB space represents about 5% of a rather expensive shared 215TB resource, so please be cognizant of the impact to other users.

Retirement of old filesystems.  User data will be moved to alternate filesystems.  Affected filesystems are:

  • /nv/hp6
  • /nv/hp7

Performance upgrades (hardware RAID) for NFSroot servers for the Athena cluster. Previous maintenance activities have upgraded other clusters already.

Moving some filesystems off of temporary homes and onto new servers.  Affected filesystems are:

  • /nv/pz2 – Athena
  • /nv/pb2 – Optimus

If time permits, we have a number of other “targets of opportunity” –

  • relocate some compute nodes and servers, removing retired systems
  • reworking a couple of Infiniband uplinks for the Uranus cluster
  • add resource tags to the scheduler so that users can better select compute node features/capabilities from their job scripts (see the sketch after this list)
  • relocate a DNS/DHCP server for geographic redundancy
  • fix system serial numbers in the BIOS for asset tracking
  • test a new Infiniband subnet manager to gather data for future maintenance day activities
  • rename some ‘twin nodes’ for naming consistency
  • apply BIOS updates to some compute nodes in the Optimus cluster to facilitate remote management
  • test an experimental software component of the scratch system.  Panasas engineers will be onsite to do this and revert before going back into production.  This will help gather data and validate a fix for some other issues we’ve been seeing.
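
On the resource-tag item above: once tags are published, feature-based selection in a Torque-style job script would look roughly like the sketch below.  The tag name ‘bigmem’ is purely hypothetical, used only to illustrate the syntax:

    # hypothetical example of requesting nodes by feature tag in a job script
    # (‘bigmem’ stands in for whatever tags end up being defined)
    #PBS -l nodes=2:ppn=8:bigmem

    # the same request from the command line
    qsub -l nodes=2:ppn=8:bigmem myjob.pbs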

upcoming maintenance day, 7/17 – please test your codes

It’s that time of the quarter again, and all PACE-managed clusters will be taken offline for maintenance on July 17 (Tuesday). All jobs that would not complete by then will be held by the scheduler, and they will be released once the clusters are up and running again, requiring no further action on your end. If you find that your job does not start running, you may want to check its walltime to make sure it does not run past that date.
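
To check the walltime request of a queued job, something along these lines should work with the Torque/Moab tools mentioned elsewhere on this blog (the job ID is a placeholder):

    # show the full record for a job and pull out its walltime request
    qstat -f 123456 | grep -i walltime

    # Moab’s explanation of why a job is not starting
    checkjob 123456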

With this maintenance, we are upgrading our RedHat 6 clusters to RedHat 6.2, which includes many bugfixes and performance improvements. This version is known to provide better software and hardware integration with our systems, particularly with the 64-core nodes we have been adding over the last year.

We are doing our best to test existing codes with the new RedHat 6.2 stack. In our experience, codes currently running on our RedHat 6 systems continue to run without problems. However, we strongly recommend that you test your critical codes on the new stack. For this purpose, we renovated the “testflight” cluster to include RedHat 6.2 nodes, so all you need to do for testing is submit your RedHat 6 jobs to the “testflight” queue. If you would like to recompile your code, please log in to the testflight-6.pace.gatech.edu head node. Please try to keep the problem sizes small, since this cluster only includes ~14 nodes with core counts varying from 16 to 48, plus a single 64-core node. We have limited this queue to two jobs at a time from a given user. We hope the testflight cluster will be sufficient to test drive your codes, but if you have any concerns, or notice any problems with the new stack, please let us know at pace-support@oit.gatech.edu.

We will also upgrade the software on the Panasas scratch storage. We have observed many ‘failover’ events that result in brief interruptions of service under high loads, potentially incurring performance penalties on running codes. The new version is expected to help address these issues.

We have new storage systems for Athena (/nv/pz2) and Optimus (/nv/pb2). During maintenance day, we will move these filesystems off of temporary storage, and onto their new servers.

More details will be forthcoming on other maintenance day activities, so please keep an eye on our blog at http://blog.pace.gatech.edu/

Thank you for your cooperation!

-PACE Team

Scheduler Problems

The job scheduler is currently under heavy load (heavier than any we have seen so far).

Any commands you run to query the scheduler (showq, qstat, msub, etc.) will probably fail because the scheduler can’t respond in time.

We are working feverishly to correct the problem.

scratch space improvements

While looking into some reports of less-than-desired performance from the scratch space, we have found and addressed some issues.  We were able to enlist the help of a support engineer from Panasas, who helped us identify a few places to improve configurations.  These were applied last week, and we expect to see improvements in read/write speed.

If you notice differences in the scratch space performance (positive or negative!) please let us know by sending a note to pace-support@oit.gatech.edu.

reminder – electrical work in the data center

Just a quick reminder that Facilities will be doing some electrical work in the data center, unrelated to PACE, tomorrow.  We’re not expecting any issues, but there is a remote possibility that this work could interrupt electrical power to various PACE servers, storage and network equipment.

Upcoming Quarterly Maintenance on 4/17

The first quarter of the year has already passed, and it’s time for the quarterly maintenance once again!

Our team will take all the clusters offline for regular maintenance and improvements on 04/17, for the entire day. We have a scheduler reservation in place to hold jobs that would not complete before the maintenance day, so hopefully no jobs will need to be killed. Jobs with such long wallclock times will still be queued, but they will not be released until the maintenance is over.

Please direct your concerns/questions to PACE support at pace-support@oit.gatech.edu.

Thanks!

FYI – upcoming datacenter electrical work

In addition to our previously scheduled maintenance day activities next Tuesday, the datacenter folks are scheduling another round of electrical work during the morning of Saturday 4/21.  Like last time, this should not affect any PACE-managed equipment, but just in case….

New rhel6 shared/hybrid queues are ready!

We are happy to announce the availability of shared/hybrid queues for all rhel6 clusters that participate in sharing. Please run “/opt/pace/bin/pace-whoami” to see which of these queues you have access to. We did our best to test and validate these queues, but there could still be some issues we have overlooked. Please contact us at pace-support@oit.gatech.edu if you notice any problems. A short usage sketch follows the queue list below.

Here’s a list of these queues:

  • mathforce-6
  • critcelforce-6
  • apurimacforce-6
  • prometforce-6 (prometheusforce-6 was too long for the scheduler)
  • eceforce-6
  • cygnusforce-6
  • iw-shared-6
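
As a minimal sketch of using the new queues (the job script name is a placeholder, and which queues you may submit to depends on what pace-whoami reports for your account):

    # list the queues your account can submit to
    /opt/pace/bin/pace-whoami

    # submit an existing job script to one of the new shared queues
    qsub -q iw-shared-6 myjob.pbs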

Happy computing!