PACE clusters (mostly) ready for research
Greetings,
We’ve made substantial progress getting through our activities and are releasing jobs. A number of compute nodes still need to be brought online, but all clusters have some amount of resources available and are running jobs. We will continue to work through these issues later today, after some sleep.
Major upgrade to DDN & a new scratch storage
All data migrated successfully to the new front ends, and additional disks have been added for the upcoming scratch storage. We encountered substantial delays due to unanticipated long-running processes to join compute nodes to the new GPFS cluster; this work is still ongoing. Benchmarking suggests a slight performance improvement for those of you with project directories in GPFS.
New PACE router and firewall hardware & additional core network capacity
Successfully completed without incident.
Panasas scratch filesystem maintenance
Successfully completed without incident.
Migration of home directories
Successfully completed without incident.
Migration of /usr/local storage
Successfully completed without incident.
Begin transition away from diskless compute nodes
We migrated approximately 100 compute nodes. Some of these still have GPFS issues, as noted above.
ONGOING: PACE quarterly maintenance – October ’15
We’ve had some unexpected delays and challenges this time around. The short version is that we will need to extend our maintenance activities into tomorrow. We’ll do a rolling release, making resources available to you as we bring compute nodes online.
The long version:
The storage system responsible for /usr/local and our virtual machine infrastructure experienced a hardware failure that cost us a significant amount of time. Some PACE staff have spent 40 of the last 48 hours on site trying to make repairs. We were already planning to transition /usr/local off of this storage and had alternate storage in place; likewise for the virtual machines, although our plan was to live-migrate those after maintenance activities were complete. The good news is that we don’t have data loss; the bad news is that we’ve had to accelerate the virtual machine migration, resulting in additional unplanned effort.
Also, the DDN work is taking far longer than expected. Part of this work required us to remove all nodes from the GPFS filesystem and add them back in again. Current estimates to bring everything back to full production range from an additional 12 to 24 hours, meaning it will be between 10am and 10pm tomorrow before we have everything back up. As mentioned above, we will make things available as soon as we can; pragmatically, that means clusters will initially be available at reduced capacity. Look for another post here when we start enabling logins again.
UNDERWAY: PACE quarterly maintenance – October ’15
Our maintenance activities are now underway. All PACE clusters are down. Please watch this space for updates.
For details on the work to be completed, please see our previous posts below.
PACE quarterly maintenance – October ’15
Greetings,
The PACE team is preparing for our quarterly maintenance that will occur Tuesday, October 20 and Wednesday, October 21. We have a number of activities scheduled that should provide improvements across the board.
- Major upgrade to DDN & new scratch storage. This is the flagship activity in this maintenance period. We have negotiated a no-cost upgrade of the DDN infrastructure that adds performance and the ability to expand our DDN storage. In particular, we will be adding a dedicated set of drives to serve as a replacement for our aging Panasas scratch storage. This should more than double the scratch storage available to campus and provide a substantial performance increase as well. We’ve heard your concerns about scratch, and are doing our best to make improvements in this area.
** NO USERS WILL BE MIGRATED DURING THE MAINTENANCE PERIOD **
After the maintenance period, we will begin migrating users to the new scratch storage. This will be a lengthy process, with some user actions and coordination required. We will do our best to minimize the impact of the migration. We are targeting our January maintenance to retire the Panasas storage, as the service contracts expire at the end of December.
- New PACE router and firewall hardware. This replaces the stalwart router and firewalls that have been the core of our network for the better part of 10 years. Additional redundancy will provide increased protection from datacenter failures, and greater firewall capacity should result in faster file transfers in and out of PACE. Our dual 10-gigabit link to the rest of campus remains unchanged, but the new firewalls should allow us to actually use more of that capacity.
- Additional core network capacity. Upgrades to 40-gigabit switching in the core of our network provide additional capacity and allow 40-gigabit upgrades to various infrastructure services.
- Panasas scratch filesystem maintenance. We need to do a filesystem check on a couple of the scratch storage volumes. This should be an innocuous operation, but may take a long time to complete.
- Migration of home directories. We are replacing the aging servers providing home directories with new high-availability NFS storage. This should be a transparent change. Home directory quotas will not change.
- Migration of /usr/local storage. We are migrating the location of the /usr/local software repository to a new storage device as the company from whom we purchased the old storage has gone out of business. This should also be a transparent change.
- Begin transition away from diskless compute nodes. Many of our older nodes currently operate without any local storage. Using old, but tested, disks reclaimed from retired equipment, we will be transitioning as many as possible away from a diskless mode of operation. This is the beginning of a long-running project to fully transition away from diskless nodes. Apart from more predictable performance of these nodes, this should also be a transparent change.
Changes in qstat format
Before the July maintenance, the “qstat” command did not allow querying jobs belonging to other users. The only way to list cluster/scheduler-wide information was the “showq” command. However, we received (and confirmed) multiple reports that showq may get out of sync from time to time.
For this reason, we configured qstat to display all of the jobs managed by the scheduler (regardless of users or queues).
You will notice two differences:
(1) qstat, when run without any parameters, lists all of the jobs in the scheduler (not just yours).
(2) You can still filter the results to show only your jobs using “qstat -u <username>”, but the output format will be slightly different.
If you have scripts that parse the qstat output, please modify and test them to make sure they are working as intended.
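As a quick illustration (the username below is just a placeholder; substitute your own):

    qstat                # lists all jobs known to the scheduler, not just yours
    qstat -u gburdell3   # lists only the jobs belonging to user gburdell3

If your scripts previously assumed that qstat’s default output contained only your own jobs, switching them to the “-u” form is the simplest fix, keeping in mind the slightly different output format noted above.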
PACE clusters ready for research
Greetings,
Our quarterly maintenance is now complete. We have no known outstanding issues affecting general operations, but have a few straggling nodes that we will address over the next couple of days.
GPFS client
All compute, login and interactive nodes have been updated to version 3.5.0-25 of the GPFS client per recommendation of DDN. This update addresses the bugs identified in the -20 version that caused problems during our April maintenance. No user changes should be needed.
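If you would like to confirm the client version on a node you are using, a check along these lines should work (assuming the GPFS client is installed via RPM packages, as is typical):

    rpm -qa | grep -i gpfs   # package versions should report 3.5.0-25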
Software Repository
The “newrepo” software repository has been made the default. Please note that there are a significant number of changes in the available software versions relative to the old repository; jobs that reference versions that are no longer available will have difficulty running. If you had already been loading the new repository with ‘module load newrepo’ before our maintenance activities, you should not notice any difference.
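If you are unsure whether a particular version is still provided by the new repository, the standard module commands can help; the package name below is only an example:

    module avail gcc     # list the gcc versions offered by the (now default) new repository
    module list          # show which modules are loaded in your current session

Any job scripts that load a version no longer listed should be updated to one that is.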
Reset Infiniband fabric
We’ve reset our Infiniband fabric and it appears to be in good health.
New home directory and /usr/local storage
The storage devices for this project only arrived earlier today, so this item will be deferred until a future maintenance period.
New “data mover” servers
We weren’t quite ready to complete this bonus objective, so we’ll try to find a period of inactivity to do it between now and our next maintenance period. Whenever this happens, no user changes will be needed.
UNDERWAY: PACE quarterly maintenance – July ’15
REMINDER & UPDATE: PACE quarterly maintenance – July ’15
First, I’d like to remind folks of our quarterly maintenance activities NEXT WEEK starting at 6:00am Tuesday morning.
Second, we have a little more information regarding some of our high-level tasks. The storage we plan to use for home directories and /usr/local isn’t due to be delivered until Friday of this week, so we won’t have time to get it installed and tested before the maintenance window. We’ll defer this until a future maintenance period.
Our new data mover servers have been delivered, and we are beginning some tests. We’ll consider these a bonus objective at this point, pending the outcome of testing.
PACE quarterly maintenance – July ’15
Greetings!
The PACE team is again preparing for our quarterly maintenance that will occur Tuesday, July 21 and Wednesday, July 22. We’re approximately a month away, but I wanted to remind folks of our upcoming activities and give a preview of what we are planning.
- Updated GPFS client – We are currently testing version 3.5.0-25 for deployment, as recommended by DDN. Preliminary testing has shown it to have the fix for the problems encountered during our April maintenance.
- “newrepo” becomes the default software repository – We will make the new PACE software repository (currently referred to as ‘newrepo’) the default. This means you will no longer need to explicitly switch to it using ‘module load newrepo’; all modules will point to the new repository by default. The current repository will continue to be available as ‘oldrepo’ (accessible by loading its module) for as long as needed, but all new software installations, upgrades, and fixes will go into newrepo. See the example commands at the end of this post.
- Full reset of Infiniband fabric – We will reboot all of our Infiniband switches and subnet managers to ensure we have cleared out all of the gremlins from the Infiniband troubles earlier this month.
- New storage devices for home directories and /usr/local – We’ve ordered some new storage servers to upgrade the aging servers that are currently providing home directories and /usr/local. These new servers come in a high-availability configuration so as to better guard against equipment failures. As a bonus item, we may begin the migration of our virtual machine backing storage to a separate new storage device. Both of these items are contingent on the new equipment arriving in time to be installed and tested before the maintenance period.
- New “data mover” servers – Also pending arrival and testing of new equipment, we will replace the “data mover” systems known as iw-dm3 and iw-dm4. These servers are intended to be used for large data movement activities, and will come with 40-gigabit ethernet and Infiniband connectivity.
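Regarding the software repository change above, here is a rough sketch of what usage should look like after the maintenance (module names per the plan above; exact behavior may differ slightly once deployed):

    module avail           # modules now resolve against the new repository by default
    module load oldrepo    # temporarily switch a session back to the old repository, if needed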