Posts

New PACE website launched

Welcome to our updated website! We’ve transitioned all of our content to a new website, available at pace.gatech.edu. Please be sure to check out the updated user support section, available via the front page link ‘Current User Support‘. While we aim to keep our content as up to date as possible, if you notice anything that seems outdated, please let us know.

If you miss our old website or need content that isn’t present on our new website, please let us know – it’s temporarily available at prev.pace.gatech.edu.

As always, thanks for choosing PACE.

PACE clusters ready for research

Our April maintenance window is now complete.  As usual, we have a number of compute nodes that still need to be brought back online; however, we are substantially online and processing jobs at this point.

We did run into an unanticipated maintenance item with the GPFS storage – no data has been lost.  As we added disks to the DDN storage system, we neglected to perform a required rebalancing operation to spread load across all of the disks.  The rebalancing operation has been running for the majority of our maintenance window, but the task is large and progress has been much slower than expected.  We will continue to perform the rebalancing during off-peak times in order to mitigate the impact on storage performance as best we are able.

Removal of /nv/gpfs-gateway-* mount points

Task complete as described.  The system should no longer generate these paths.  If you have used these paths explicitly, your jobs will likely fail.  Please continue to use paths relative to your home directory (e.g. ~/data, ~/scratch) for future compatibility.
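As a quick check, you can print where these links actually resolve before relying on them in a job script. A minimal sketch (the resolved targets shown are illustrative and will differ per user):

readlink -f ~/scratch   # e.g. a path under /gpfs/scratch1
readlink -f ~/data      # e.g. a path under /gpfs/pace1/project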

New GPFS gateway

Task complete as described.

GPFS server and client tuning

Task complete as described.

Decommission old Panasas scratch

Task complete as described.  Paths starting with /panfs no longer work.  Everybody should have been transitioned to the new scratch long ago, so we do not expect anybody to have issues here.

Enabling debug mode

Task complete as described.  You may see additional warning messages if your code is not well behaved with regard to memory utilization.  This is a hint that you may have a bug.

Removal of compatibility links for migrated storage 

Task complete as described.  Affected users (on the Prometheus and CEE clusters) were contacted before maintenance day.  No user impact is expected, but please send in a ticket if you think there is a problem.

Scheduler updates

Task complete as described.

Networking Improvements

Task complete as described.

Diskless node transition

Task complete as described.

Security updates

Task complete as described.

PACE quarterly maintenance – April ’16

Greetings!

The PACE team is once again preparing for maintenance activities that will occur starting at 6:00am Tuesday, April 19 and continuing through Wednesday, April 20.  We are planning several improvements that hopefully will provide a much better PACE experience.

GPFS storage improvements

Removal of all /nv/gpfs-gateway-* mount points (user action recommended): In the past, we had noticed performance and reliability problems with mounting GPFS natively on machines with slow network connections (including most headnodes, some compute nodes, and some system servers). To address this problem, we deployed a physical ‘gateway’ machine that mounts GPFS natively and serves its content via NFS to machines with slow network (see http://blog.pace.gatech.edu/?p=5842).

We have been mounting this gateway on *all* of the machines using these locations:

/nv/gpfs-gateway-pace1
/nv/gpfs-gateway-scratch1
/nv/gpfs-gateway-menon1

Unfortunately, these mount points caused some problems in the longer run, especially when a system variable (PBS_O_WORKDIR) was assigned these locations as the “working directory” for jobs, even on machines with fast network connections. As a result, a large fraction of the data operations went through the gateway server instead of the GPFS server, causing significant slowness.

We partially addressed this problem by fixing the root cause for unintended PBS_O_WORKDIR assignment, and also with user communication/education.

On this maintenance day, we are getting rid of these mount points completely. Instead, GPFS will always be mounted on:

/gpfs/pace1
/gpfs/scratch1
/gpfs/menon1

These paths will be valid regardless of how a particular node mounts GPFS (natively or via the gateways).

User action: We would like to ask all of our users to check their scripts to ensure that the old locations are not being used. Jobs that try to use these locations will fail after the maintenance day (including jobs that have already been submitted).
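A minimal sketch of such a check, assuming your job scripts live under your home directory (the script name below is hypothetical):

# list files that still reference the old gateway mount points
grep -rl '/nv/gpfs-gateway-' ~/

# example fix for a script named myjob.pbs
sed -i 's|/nv/gpfs-gateway-pace1|/gpfs/pace1|g' myjob.pbs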

A new GPFS gateway (no user action required): We increasingly rely on the GPFS filesystem for multiple storage needs, including scratch, the majority of project directories, and some home directories.  While the gateway provided some benefits, some users continued to report unresponsive/slow commands on headnodes due to a combination of high levels of activity and limited NFS performance.
During this maintenance, we are planning to deploy a second gateway server to separate headnodes from other functions (compute nodes and backup processes). This will improve the responsiveness of headnodes, providing our users with better interactivity. In other words, you will see much less slowness when running system commands such as “ls”.

GPFS server and client tuning (no user action required): We identified several configuration tuning parameters to improve the performance and reliability of GPFS, in light of vendor recommendations and our own analysis. We are planning to apply these configuration changes on this maintenance day as a fine-tuning step.

Decommissioning old Panasas scratch (no user action required)

When we made the switch to the new scratch space (GPFS) during the January maintenance, we kept the old (Panasas) system accessible as read-only. Some users received a link to their old data if their migration had not completed within the maintenance window. We are finally ready to pull the plug on this Panasas system. You should have no remaining dependencies on this system, but please contact PACE support as soon as possible if you have any concerns or questions regarding its decommissioning.

Enabling debug mode (limited user visibility)

RHEL6, which has been used on all PACE systems for a long while, optionally comes with an implementation of the memory-allocation functions that performs additional heap error/consistency checks at runtime. We’ve had this functionality installed, but memory errors have been silently ignored per our configuration, which is not ideal. We are planning to change the configuration to print diagnostics on stderr when an error is detected. Please note that you should not see any differences in the way your codes run; this only changes how memory errors are reported.  This behavior is controlled by the MALLOC_CHECK_ environment variable. A simple example is when a dynamically allocated array is freed twice (e.g. using the ‘free’ statement in C). Here’s a demo of the behavior for three different values of MALLOC_CHECK_ when an array is freed twice:

MALLOC_CHECK_=0

(no output)

MALLOC_CHECK_=1

*** glibc detected *** ./malloc_check: free(): invalid pointer: 0x0000000000601010 ***

MALLOC_CHECK_=2

Aborted (core dumped)

We currently have this value set to “0” and will make “1” the new default, so that a description of the error(s) is printed. If this change causes any problems for you, or you simply don’t want any changes in your environment, you can assign “0” to this variable in your “~/.bashrc” to override the new default.
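If you would like to reproduce the behavior shown above, a minimal double-free demo (our own sketch, not PACE-provided code) can be built and run as follows:

cat > malloc_check.c <<'EOF'
#include <stdlib.h>

int main(void)
{
    char *buf = malloc(16);   /* dynamically allocated array */
    free(buf);
    free(buf);                /* freed twice: a heap error */
    return 0;
}
EOF
gcc -o malloc_check malloc_check.c

MALLOC_CHECK_=0 ./malloc_check   # errors silently ignored (old default)
MALLOC_CHECK_=1 ./malloc_check   # diagnostic printed to stderr (new default)
MALLOC_CHECK_=2 ./malloc_check   # program aborts and dumps core

And the override in “~/.bashrc” to keep the old behavior is a single line:

export MALLOC_CHECK_=0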

Removal of compatibility links for migrated storage (some user action may be required)

We had migrated some of the NFS project storage areas (namely pcee1 and pme[1-8]) to GPFS in the past. When we did that, we placed links in the older storage locations (paths starting with /nv/…) that point to the new GPFS locations (starting with /gpfs/pace1/project/…) to protect active jobs from crashing. This was only a temporary measure to facilitate the transition.

As a part of this maintenance day, we are planning to remove these links completely. We have already contacted all of the users whose projects are in these locations and confirmed that their ~/data links are updated accordingly, so we expect no user impact. That said, if you are one of these users, please make sure that none of your scripts reference the old locations mentioned in our email.
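For instance, a quick scan of your scripts for the old locations could look like the following (the directory to search is hypothetical, and the exact paths to look for are the ones listed in our email):

grep -rn -e '/nv/pcee1' -e '/nv/pme' ~/jobscripts/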

Scheduler updates (no user action required)

A patched version of the resource manager (Torque) was deployed on the scheduler servers shortly after the January maintenance day. This patch addresses a bug in the administration functions only. While it’s not critical for compute nodes, we will go ahead and update all compute nodes to bring their version on par with the schedulers for consistency. This update will not cause any visible differences for users. No user action required.

Networking Improvements (no user action required)

Spring is here and it’s time for some cleanup. We will get rid of unused cables in the datacenter and remove some unused switches from the racks. We are also planning some recabling to take better advantage of existing switches and improve redundancy. We will continue to test and enable jumbo frames (where possible) to lower networking overhead. None of these tasks requires user action.

Diskless node transition (no user action required)

We will continue the transition away from diskless nodes that we started in October 2015.  This mainly affects nodes in the 5-6 year old range.  Apart from the more predictable performance of these nodes, this should be a transparent change.

Security updates (no user action required)

We are also planning to update some system packages and libraries to address known security vulnerabilities and bugs. There should be no user impact.

Free XSEDE/NCSI Summer Workshops

SUMMARY:

*FREE* REGISTRATION IS OPEN!
XSEDE/National Computational Science Institute Workshops
Summer 2016

(1) Computing MATTERS: Inquiry-Based Science and Mathematics
Enhanced by Computational Thinking

(1a) May 16-18 2016, Oklahoma State U, Stillwater OK
(1b) July 18-20 2016, Boise State U, Boise ID
(1c) Aug 1-3 2016, West Virginia State U, Institute WV

(2) LittleFe Curriculum Module Buildout
June 20-22 2016, Shodor, Durham NC

Contact: Kate Cahill (kcahill@osc.edu)
http://computationalscience.org/workshops2016

DETAILS:

The XSEDE project is pleased to announce the opening of
registrations for faculty computational science education
workshops for 2016.

There are no fees for participating in the workshops.

The workshops also cover local accommodations and food during the
workshop hours for those outside of commuting distance to the host
sites.

This year there are three workshops at various locations focused
on Inquiry-Based Science and Mathematics Enhanced by Computational
Thinking and one workshop on the LittleFe Curriculum Module
Buildout.

The computational thinking workshops are hosted at Oklahoma State
University on May 16-18, 2016, at Boise State University on
July 18-20, 2016, and at West Virginia State University on
August 1-3.

The LittleFe curriculum workshop will be held on June 20-22 at
Shodor Education Foundation.

To register for the workshop, go to

http://computationalscience.org/workshops2016

and begin the registration process by clicking on the Register
through XSEDE button for the relevant workshop.

Participants will be asked to create an XSEDE portal account
if they do not yet have one.

Following that registration, participants will be directed back to

http://computationalscience.org/workshops2016

to provide additional information on their background and travel
plans.

A limited number of travel scholarships may also be available as
reimbursements for receipted travel to more distant faculty
interested in attending the workshops.

The scholarships will provide partial or full reimbursement of
travel costs to and from the workshops.

Preference will be given to faculty from institutions that are
formally engaged with the XSEDE education program and to those
who can provide some matching travel funds.

Recipients are expected to be present for the full workshop.

The travel scholarship application is available via a link at

http://computationalscience.org/workshops2016

For questions about the summer workshops please contact:

Kate Cahill (kcahill@osc.edu)

Large Scale Slowness Issues

Update: (3/10/2016, 5:00pm) Most issues are resolved, back to normal operation

The GPFS storage is back to normal performance and has now been stable for several days. However, we will continue to explore additional steps with DDN to improve the performance of the GPFS storage and will schedule any recommended upgrades for our April maintenance window. Please continue to let us know if you observe any difficulties with this critical component of the PACE clusters.

What happened:

As with almost any significant storage issue, there were multiple contributing difficulties. We identified several factors behind the GPFS performance problems: uncommonly high user activity, a bad cable connection, a memory misconfiguration introduced on most systems when we added the new GPFS scratch file system, and a scheduler configuration issue affecting correct use of the new scratch space.

What was impacted:

Performance of all GPFS file systems suffered greatly during the event. Compute, login, and interactive nodes, as well as scheduler servers, temporarily lost their mount points, impacting some of the running jobs. There was never any loss of user data or data integrity.

What we did:

We contacted the storage vendor’s support and worked with them via several phone and screen-sharing sessions to isolate and correct each of the problems. We have added storage and node monitoring to detect the memory and file system conditions that were contributing factors to this failure, and we have discussed operation and optimization steps with the affected users.

What is ongoing:

We continue to work with the vendor to resolve any remaining issues and will strive to further improve performance of the file system.


Update: (3/4/2016, 6:30pm) GPFS storage still stable, albeit with intermittent slowness 

GPFS storage has been mostly stable. While not back to previous levels, the performance of the GPFS storage continued to improve today. We identified multiple factors contributing to the problem, including uncommonly high user activity. There are almost half a billion files in the system, and bandwidth usage has approached the design peak a few times, which is unprecedented. While it’s great that the system is utilized at those levels, the impact of problems inevitably gets amplified under high load. We continue to work with the vendor to resolve the remaining issues and really appreciate your patience with us during this long battle.

You can help us a great deal by avoiding large data operations (e.g. cp, scp, rsync) on the headnodes. The headnodes are low-capacity VMs that do not mount GPFS using native clients; instead, all of their traffic goes through a single NFS fileserver. The proper location for all data operations is the datamover node (iw-dm-4.pace.gatech.edu), which is a physical machine with fast network connections to all storage servers. Please limit your activity on the datamover machine strictly to data operations. We noticed several users running regular computations on this node and had to kill those processes.
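As an illustrative sketch (the username and paths below are hypothetical), transfers should target the datamover directly:

# copy a local results directory into your PACE data space
# (remote paths are relative to your home directory, i.e. the ~/data link)
rsync -av ./results/ gtuser123@iw-dm-4.pace.gatech.edu:data/results/

# or a one-off recursive copy with scp
scp -r ./results gtuser123@iw-dm-4.pace.gatech.edu:data/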

Update: (3/1/2016, 7:30pm) GPFS storage has stabilized and schedulers are resumed.

GPFS storage appears to have stabilized, without any data loss, and we resumed scheduling to allow new jobs on the cluster.

It seems our system had outgrown an important storage configuration parameter (tokenMemLimit), which scales roughly with the number of open files, times the number of file systems, times the number of nodes across the whole storage system. The storage system gave no warning message of the impending failure. We had observed and were investigating some symptoms, which we, of course, now understand more clearly. We have asked the vendor to review the remaining parameters and recommend any additional changes.

Update: (3/1/2016, 4:30pm) Schedulers paused, new jobs will not start

Unfortunately, we lost GPFS storage on the majority of compute nodes, potentially impacting running jobs that use this storage system (most project directories and all scratch). To prevent further problems, we have temporarily paused the schedulers. Your submissions (qsub) will appear to hang until we resume the scheduling functions.

What’s happening
Many users have noticed that GPFS storage has slowed down recently. In some cases, this causes unresponsive commands (e.g. ‘ls’) on headnodes.

Who’s impacted

GPFS storage includes some project space (data), the new scratch, and Tardis-6 queue home directories.

How PACE is responding

We are taking the first of several steps to address this issue. Instead of taking an unplanned downtime, we are planning to submit jobs that request entire nodes to facilitate the fix. This way, the solution can be applied when there are no jobs actively running on the node.

These jobs will be submitted by the “pcarey8” user and will run on all of the queues. You will continue to see these jobs until all nodes are fixed, which may span a long time period, depending on when nodes that are already running long-walltime jobs become available. Once a node is acquired, however, the fix itself does not take long to apply.

How can you help

* Replace “$PBS_O_WORKDIR” with the actual path to your working directory in submission (PBS) scripts (see the sketch after this list).

* Avoid running concurrent data transfers and operations involving very large numbers of files.
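Here is a minimal before/after sketch for the first item, using a hypothetical job script and project path:

#PBS -N myjob
#PBS -l nodes=1:ppn=4,walltime=1:00:00

# before: cd $PBS_O_WORKDIR
# after: hard-code the actual working directory (this path is illustrative)
cd /gpfs/pace1/project/myproject/mydir
./my_program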

PACE clusters ready for research

Our January maintenance window is now complete.  As usual, we have a number of compute nodes that still need to be brought back online; however, we are substantially online and processing jobs at this point.

Transition to new scratch storage
Of approximately 1,700 PACE users, we were unable to migrate fewer than 35.  All users should have received an email as to their status.  Additionally, those users who were not migrated will have support tickets created on their behalf so we can track their migrations through to completion.  We expect about 25 of those 35 users to complete within the next 72 hours.  The remaining 10 have data in excess of the allowable quota and will be handled on a case-by-case basis.

Scheduler update
The new schedulers are in place and processing jobs.

Server networking
Task is complete as described.

GPFS tuning
Task is complete as described.

Filesystem migration – /nv/pk1
Task is complete as described.

Read-Only /usr/local
Task is complete as described.

Diskless node transition
We upgraded approximately 65 diskless nodes with local operating system storage.

PACE quarterly maintenance – January ’16

Greetings!

The PACE team is once again preparing for maintenance activities that will occur starting at 6:00am Tuesday, January 26 and continuing through Wednesday, January 27.  We have a couple of major items that hopefully will provide a much better PACE experience.

Transition to new scratch storage

Building on the new DDN hardware deployed in October, this item is the dominant activity in this maintenance period.  Our old Panasas scratch storage has now exceeded its warranty, so this is a “must do” activity.  Given the performance level of the Panasas and the volume of data it contains, we do not believe we will be able to migrate all data during this maintenance period.  So, we will optimize the migration to maximize the number of users migrated.  Using this approach, we believe we will be able to migrate more than 99% of the PACE user community.  After the maintenance window, we will work directly with those individuals we were not able to migrate.  You will receive an email when your migration begins, and when it is complete.  (Delivered to your official GT email address; see the previous post!)

After the maintenance period, the old Panasas scratch will still be available, but in read-only mode.  All users will have scratch space provisioned on the new GPFS scratch.  Files for users who are successfully migrated will not be deleted, but will be rendered inaccessible except to PACE staff.  This provides a safety net in the unlikely event that something goes wrong.

For the time being, we will preserve the 5TB soft quota and 7TB hard quota on the new GPFS scratch, as well as the 60 day maximum file age.  However, the timestamps of the files will get reset as they migrate, so the 60 day timer gets reset for all files.

The ~/scratch symlinks in your home directories will also be updated to point to the new locations, so please continue to use these paths to refer to your scratch files.  File names beginning with /panfs will no longer work once your migration is complete.

Scheduler update

Pending successful testing, we will also be rolling out a bug-fix update to our Moab & Torque scheduler software and increasing network connectivity for our most heavily used schedulers.  Among the issues addressed in this release are a bug that produced erratic notifications about failures when canceling jobs, incorrect groups being included in reports, and some performance problems.  Unlike previous scheduler upgrades, all previously submitted jobs will be retained.  No user action should be required as a result of this upgrade.

Server networking

We will be upgrading network connectivity on some of our servers to take advantage of network equipment upgraded in October.  No user action required.

GPFS tuning

We will adjust parameters on some GPFS clients to more appropriately utilize their Infiniband connections.  This only affects the old (6+ years) nodes with DDR connections.  We will also substitute NFS access for native GPFS access on machines that lack Infiniband connectivity or have otherwise been identified as poorly performing GPFS clients.  In particular, this will affect most login nodes.  The /gpfs path names on these machines will be preserved, so no user action is needed here either.

Filesystem migration – /nv/pk1

The /nv/pk1 filesystem for the Aryabhata cluster will be migrated to GPFS.

Read-Only /usr/local

The /usr/local filesystem will be exported read-only.  This is a security measure, and should not impact normal operations.

Diskless node transition

We will continue the transition away from diskless nodes that we started in October.  This mainly affects nodes in the 5-6 year old range.  Apart from the more predictable performance of these nodes, this should be a transparent change.

Changing the way PACE handles email

Greetings!

In order to help ensure the reliability of email communications from PACE, we will be changing how we deliver mail effective Wednesday, January 20. (next week!) From this time forward, PACE will use only the officially published email addresses as defined in the Georgia Tech Directory.

This is a needed change, as we have many, many messages that we have been unable to deliver due to outdated or incorrect destinations.

The easiest way to determine your official email address is to visit http://directory.gatech.edu and enter your name. If you wish to change your official email address, visit http://passport.gatech.edu.

In particular, this change will affect the address which is subscribed to PACE-related email lists (e.g. pace-availability) as well as job status emails generated automatically by the schedulers.

For the technically savvy: we will be changing our mail servers to look up addresses from GTED. We will no longer use the contents of a user’s ~/.forward file.

p.s. Users of the Tardis cluster do not have entries in the Georgia Tech directory, so this change does not apply to you.