Posts

PC1 (Cygnus) filesystem woes

We’ve continued to have issues with the server, and we’ve now identified a networking issue tied to this server as well as a corrupted OS image.

The networking issue has been rectified, and I am installing a new software image onto this machine as I type this.

Despite the nature of the failure, we have not lost any of your already saved data — the drive units which house the OS are separate from the ones storing your data.

We should have this machine back in about a half hour.

Call For Papers – XSEDE14

Greetings all,

XSEDE14 is coming up soon and has issued its call for participation. Please note that this conference is being held in Atlanta!

Authors of selected papers from all tracks will be invited to extend their manuscripts for consideration in a special issue of the journal Concurrency and Computation: Practice and Experience. Papers accepted for the “Education, Outreach, and Training” track will be invited to extend their manuscripts for publication in the Journal of Computational Science Education.

Abstracts are due March 15.  Please see https://www.xsede.org/xsede14 for further information.

Jobs Accidentally Killed

Dear users,

At least 1,600 queued and running jobs were accidentally killed last week by a member of the PACE team who was trying to clear out their own jobs. PACE team accounts have elevated rights to certain commands, and the person who deleted the jobs did not realize that the command they were using would apply to more than just their own jobs.

If you have access to the iw-shared-6 queue and had jobs running or queued earlier this week, this accident has likely impacted you.

Our deepest apologies for the unexpected early job terminations. To prevent this from happening again, we are re-evaluating our need to grant elevated permissions to our regular accounts.

Thank you,
PACE team

Emergency reboot of compute nodes due to power/cooler outage

The Rich data center cooling system experienced a power outage today (2/6/2014) at around 9:20am, when both the main and backup power systems failed, requiring an emergency shutdown of all PACE compute nodes. We have since received confirmation from the operations team that room cooling is now stable on the backup chillers while work proceeds to correct the problem. We are currently bringing the compute nodes back online as quickly as possible.

If you had queued jobs before the incident, they should start running as soon as a sufficient number of compute nodes are brought back online. However, all of the jobs that were running at the time of the failure were killed and will need to be resubmitted. You can monitor node status using the ‘pace-stat’ and ‘pace-check-queue’ commands, as shown in the example below.
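For reference, a typical check-and-resubmit sequence might look something like the following sketch; the job script name is only a placeholder for your own submission script:

  # watch node and queue status as nodes come back online
  pace-stat
  pace-check-queue

  # once enough nodes are available, resubmit any job that was killed mid-run
  qsub my_job.pbs    # replace my_job.pbs with your own job script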

We are sorry for the inconvenience this failure has caused. Please contact us if you have any concerns or questions.


Scratch Quota Policies are Changing

We would like to give you a heads-up about some upcoming adjustments to the scratch space quotas.

Current policy is a 10TB soft quota and a 20TB hard quota. Given the space problems we’ve been having with scratch, we will be adjusting this to a 5TB soft quota and a 7TB hard quota. This change should only affect a small handful of users. Given the proximity to our maintenance next week, we will be making this change at the end of January.

This is an easy first step that we can take to start addressing the recent lack of space on scratch storage. We are looking at a broad spectrum of other policy and technical changes, including changing retention times, improving our detection of “old” files, and increasing capacity. If you have any suggestions for other adjustments to scratch policy, please feel free to let us know (pace-support@oit.gatech.edu).

Please remember that the scratch space is intended for transient data – not as a long term place to keep things.
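If you would like to see what you currently have on scratch before the new limits take effect, something like the following is one way to check; the ~/scratch path is an assumption, so substitute the actual location of your scratch directory:

  # total space used under your scratch directory
  du -sh ~/scratch

  # list files untouched for more than 60 days, which are good candidates for cleanup
  find ~/scratch -type f -mtime +60 -ls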

January Maintenance is over

January maintenance is complete, and the clusters are accepting and running jobs again. We accomplished all of our primary objectives, and even found time to address a few bonus items.

Most importantly, we completed updating the resource manager and scheduler (Torque and Moab) throughout the entire PACE realm. This upgrade should bring visible improvements in speed and reliability. Please note that the job submission process will show some differences after this update, so we strongly encourage you to read the transition guide here: http://www.pace.gatech.edu/job-submissionmanagement-transition-guide-jan-2014
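As a rough illustration only, a basic Torque/Moab submission script generally has the shape sketched below; the queue name, resource requests, and program name are placeholders, and the transition guide above remains the authoritative reference for anything that changed with this upgrade:

  #PBS -N example_job           # job name (placeholder)
  #PBS -q iw-shared-6           # target queue; use whichever queue applies to your cluster
  #PBS -l nodes=1:ppn=4         # request 1 node with 4 processors
  #PBS -l walltime=2:00:00      # wall-clock time limit
  cd $PBS_O_WORKDIR             # start in the directory the job was submitted from
  ./my_program                  # placeholder for your actual executable

Such a script would be submitted with ‘qsub example_job.pbs’ and monitored with ‘qstat’.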

Also, please make sure that you check the FAQ for common problems and their solutions by running the following command on your headnode: jan2014-faq (use the spacebar to page through).

We had a hardware failure in the DDN storage system, which caused an interruption in the planned Biocluster data transfer. We expect to receive replacement parts and fix the system in a few days. This failure has not caused any data loss, and the system will remain up and running (perhaps with some performance degradation). We have learned that the repairs will require a short downtime, and we will soon get in touch with the users of the Gryphon, Biocluster and Skadi clusters (the current users of this system) to schedule this work.

Other accomplishments include:

– Optimus is now a shared cluster. All Optimus users now have access to optimusforce-6 and iw-shared-6.

– All of the Atlas nodes are upgraded to RHEL6.

– Most of the Athena nodes are upgraded to RHEL6.

– The old scheduler server (repace) has been replaced with an upgraded one (shared-sched). You may notice a difference in the generated job numbers and files.

– Some networking cable cleanup and improvements

– Gryphon has new scheduler and login servers, and the nodes used for these purposes have been put back in the computation pool.

– Deployed project file space quotas, as previously agreed with PIs, to users who did not have quotas prior to maintenance. For users already over the new limit, we adjusted the quota to allow some headroom. To check your quotas, use “quota -s” (see the example below).
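For example, checking your usage against the new quotas might look like the following; the ~/data path is only an example, so use the actual path of your project directory:

  # report current usage and limits for your account in human-readable units
  quota -s

  # see which subdirectories are consuming the most space in your project directory
  du -sh ~/data/*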

Reminder – January Maintenance

Hi folks,

Just a reminder of our upcoming maintenance activities next week. Please see my previous blog post here: http://blog.pace.gatech.edu/?p=5449 for details.

In addition to the items described in the previous post, we will also be fixing up some quotas on home and project directories for users who have no quotas applied. Per policy, all users should have a 5GB quota in their home directory. A preliminary look through our accounts indicates that only one or two users have no quota applied here and are over the 5GB limit. We will be in touch with those users shortly to address the issue.

Project directory quotas are sized at the discretion of the faculty. For those users without a quota on their project directory, we will apply a quota that is sufficiently large that all users remain under it. After the maintenance day, we will provide a report to faculty detailing the project directory usage of their users, and work with them to make any adjustments needed. Remember, the project directory quotas are simply intended to prevent an accidental consumption of space that would negatively impact the work of other users of that storage.

Related to the home & project quotas, I’d also like to give you a heads-up about some upcoming adjustments to the scratch space quotas. Current policy is a 10TB soft quota and a 20TB hard quota. Given the space problems we’ve been having with scratch, we will be adjusting this to a 5TB soft quota and a 7TB hard quota. This change should only affect a small handful of users. Given the proximity to our maintenance next week, we will be making this change at the end of January, NOT next week.

This is an easy first step that we can take to start addressing the recent lack of space on scratch storage. We are looking at a broad spectrum of other policy and technical changes, including changing retention times, improving our detection of “old” files, and increasing capacity. If you have any suggestions for other adjustments to scratch policy, please feel free to let me know. Please remember that the scratch space is intended for transient data, not as a long-term place to keep things.

Finally, we will also be completing the upgrade of the remaining RHEL5 portions of the Atlas cluster to RHEL6.  Likewise, we will continue the migration of the Athena cluster from RHEL5 to RHEL6, leaving only a few nodes as RHEL5.


–Neil Bright

PACE quarterly maintenance – 2 days; January 2014

…back to regularly scheduled events.

Our next maintenance window is fast approaching.  We will continue the 2-day downtimes, with the next one occurring Tuesday, January 14 and Wednesday, January 15.  The list of major changes is small this time around, but impactful.

The largest change, affecting all clusters, is a major update to the Moab & Torque scheduling system that is used to schedule and manage your jobs. The upgraded versions fix a number of long-standing problems and scaling issues with command timeouts, stability, and the processing of large job sets.

The testflight cluster has been updated and is available to anyone who wishes to test their submission processes against the new versions. In many cases, the processes used to submit and query your jobs will remain the same. For some, a change in the way you use the system may be required. You will still be able to accomplish the same things, but may need to use different commands to do so.

We have updated our usage documentation to include a simple transition guide here.

In addition to the guide, we have also written a FAQ, which can be viewed by running the command ‘jan2014-faq’ after logging in.

Because of the version differences between the old and new software, we will unfortunately not be able to preserve any jobs that are still in a queued state when maintenance begins. If you have any queued jobs going into maintenance, you will need to resubmit them after maintenance.
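If you would like a record of what was in the queue before maintenance starts, one simple approach is to save a listing ahead of time and resubmit from it afterwards; the file and script names below are only placeholders, while ‘qstat’ and ‘qsub’ are the standard Torque commands:

  # before maintenance: save a listing of your own queued and running jobs
  qstat -u $USER > jobs_before_maintenance.txt

  # after maintenance: resubmit each job script that was still queued
  qsub my_job.pbs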

The fixes planned for January also include the following:

Infrastructure:

  • Operating System upgrades to the server running scheduling software for the “shared” clusters.  This will bring it up to the same level as the other scheduler servers.
  • Adjustments to scalability & performance parameters on our GPFS filesystem.

Optimus cluster:

  • Optimus users will have access to a new queue, ‘optimus-force-6’, as well as to the iw-shared-6 queue.

Gryphon cluster:

  • The current (temporary) head node and scheduler server will return to their roles as compute nodes for the cluster.
  • New servers will be brought into production for the head node & scheduler servers.

BioCluster cluster:

  • Data migrations between the pb1, pb4 and DDN filesystems.  This should be transparent to users, and ease the space crunch everybody has been experiencing.

Power loss in Rich Datacenter

UPDATE: All clusters are up and ready for service.

At this time, all PACE-managed clusters are believed to be working. You should be able to log in to your clusters and submit and run jobs.

Any jobs that were running before the power outage have failed, so please resubmit them.

Please let us know immediately if anything is still broken.

PACE Team

What happened

At around 0810 Thursday morning, Rich lost its N6 feed, one of the two feeds powering the Rich building and the Rich chiller plant. This also caused multiple failures in the high-voltage vault in the Rich back alley, so Rich also lost its other feed, N5. However, the N5 feed was still up in the chiller plant. Though the chillers still had power, operators transferred cooling over to the campus loop as a precaution. Rich office space was without power, but the machine rooms failed over to the generator and UPSes.

PACE systems were powered down gracefully to prevent a hard-shutdown that would make recovery more difficult.

Original Post

This morning (December 19), the Rich datacenter suffered a power loss.
We had to perform an emergency shutdown of all nodes.

As we receive new information we will update this blog and the pace-availability email list.