Posts

PACE clusters ready for research

Our May 2018 maintenance (http://blog.pace.gatech.edu/?p=6158) is complete ahead of schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are a small number of straggling nodes we will address over the coming days.

Our next maintenance period is scheduled for Thursday, Aug 9 through Saturday, Aug 11, 2018.

Schedulers

Job-specific temporary directories (may require user action): Complete as planned. Please see the maintenance day announcement (http://blog.pace.gatech.edu/?p=6158) for details on how this impacts your jobs.

ICE (instructional cluster) scheduler migration to a different server (may require user action): Complete as planned. Users should not notice any differences.

Systems Maintenance

ASDL cluster (requires no user action): Complete as planned. The bad CMOS batteries were replaced and the fileserver received a replacement CPU. The memory problems turned out to be related to the bad CPU and were resolved without replacing any memory DIMMs.
Replace PDUs on Rich133 H37 Rack (requires no user action): Deferred at the request of the cluster owner.

LIGO cluster rack replacement (requires no user action): Complete as planned.

Storage

GPFS filesystem client updates on all of the PACE compute nodes and servers (requires no user action): Complete as planned and tested. Please report any missing storage mounts to pace-support.
Run routine system checks on GPFS filesystems (requires no user action): Complete as planned, no problems found!

Network

The IB network card firmware upgrades (requires no user action): Complete as planned.
Enable 10GbE on physical headnodes (requires no user action): Complete as planned.
Several improvements on networking infrastructure (requires no user action): Complete as planned.

 

[Resolved] Large Scale Storage Problems

Current Status (5/3 4:30pm): Storage problems are resolved, and all compute nodes are back online and accepting jobs. Please resubmit crashed jobs and contact pace-support@oit.gatech.edu if there is anything we can assist with.

update (5/3 4:15pm): We found that the storage failure was caused by a series of tasks we had been performing with guidance from the vendor in preparation for the maintenance day. These steps were considered safe and no failures were expected. We are still investigating which step(s) led to this cascading failure.

update (5/3 4:00pm): All of the compute nodes will appear offline and will not accept jobs until this issue is resolved.

 

Original Message:

We received reports of main PACE storage (GPFS) failures around 3:30pm today (5/3, Thu), impacting jobs. We found that this issue applies to all GPFS systems (pace1, pace2, menon1), with large-scale, PACE-wide impact.

We are actively working with the vendor to resolve this issue urgently and will continue to update this post as we find more about the root cause.

We are sorry for this inconvenience and thank you for your patience.


PACE quarterly maintenance – (May 10-12, 2018)

The next PACE maintenance will start on 5/10 (Thu) and may take up to 3 days to complete, as scheduled.

As usual, jobs with long walltimes will be held by the scheduler to prevent them from getting killed when we power off the systems that day. These jobs will be released as soon as the maintenance activities are complete. If a shorter walltime would still give such jobs enough time to complete successfully before 6am on 5/10, you can reduce their walltime and resubmit them.
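
For example, here is a minimal sketch of shortening a job's requested walltime (the script name and the 48-hour value are hypothetical):

# Option 1: edit the walltime request inside the job script (myjob.pbs is a placeholder)
#PBS -l walltime=48:00:00
# Option 2: override the requested walltime at submission time without editing the script
qsub -l walltime=48:00:00 myjob.pbs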

We will follow up with a more detailed announcement listing the planned maintenance tasks and their impact, if any, on users. If you miss that email, you can still find all of the maintenance day related information in this post, which will be actively updated with details and progress.

List of Planned Tasks

 

Schedulers

 

  • Job-specific temporary directories (may require user action): We have been receiving reports of nodes going offline because files left over from jobs fill up their local disks. To address this issue, we will start using a scheduler feature that creates job-specific temporary directories, which are automatically deleted after the job completes. To support this, we created a "/scratch" folder on all nodes. Please note that this is different from the scratch directory in your home (note the difference between '~/scratch' and '/scratch'). If a node has a separate (larger) HD or SSD (e.g. biocluster, dimer, etc.), we ensured that /scratch is located on it to offer more space.

Without needing any specific user action, the scheduler will create a temporary directory uniquely named after the job under /scratch. For example:

/scratch/324105.shared-sched.pace.gatech.edu

The scheduler will also set the $TMPDIR environment variable (which normally points to '/tmp') to this path.

You can creatively use $TMPDIR in your scripts. For example, if you have been manually creating temporary directories such as '/tmp/mydir123' before, please use "$TMPDIR/mydir123" from now on to ensure that the directory is deleted after the job completes (see the example job script after this list).

  • ICE (instructional cluster) scheduler migration to a different server (may require user action): We'll move the scheduler server we use for the ICE queues to a new machine that's better suited for this service. This change will be completely transparent to users, and there will be no changes in the way jobs are submitted. Jobs that are waiting in the queue will need to be resubmitted; we'll contact the affected users separately about that. If you are not a student using the ICE clusters, you will not be affected by this task in any way.
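
Related to the job-specific temporary directories described in the first item above, here is a minimal sketch of a PBS job script that uses $TMPDIR (the solver name, input file, and resource requests are hypothetical placeholders):

#!/bin/bash
#PBS -N tmpdir-example
#PBS -l nodes=1:ppn=4
#PBS -l walltime=01:00:00

# $TMPDIR is set by the scheduler to the job-specific directory under /scratch,
# e.g. /scratch/324105.shared-sched.pace.gatech.edu
mkdir -p $TMPDIR/mydir123

# Stage input to the node-local disk, run there, then copy results back
cp $PBS_O_WORKDIR/input.dat $TMPDIR/mydir123/
cd $TMPDIR/mydir123
$PBS_O_WORKDIR/my_solver input.dat > output.log   # my_solver is a placeholder binary
cp output.log $PBS_O_WORKDIR/

# No cleanup is needed: the scheduler removes $TMPDIR when the job completes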

 

Systems Maintenance

 

  • ASDL cluster (requires no user action): We'll replace some failed CMOS batteries on several compute nodes, replace a failed CPU, and add more memory on the file server.
  • Replace PDUs on Rich133 H37 Rack (requires no user action): We’ll replace PDUs on this rack, which includes nodes from a single dedicated cluster with no expected impact on other PACE users or clusters even if something goes wrong.
  • LIGO cluster rack replacement (requires no user action): We'll replace the LIGO cluster rack with a new rack and new power supplies.

 

Storage

 

  • GPFS filesystem client updates on all of the PACE compute nodes and servers (requires no user action): The new version is tested, but please contact pace-support@oit.gatech.edu if you notice any missing mounts, failing data operations, or slowness issues after the maintenance day (see the quick check after this list).
  • Run routine system checks on GPFS filesystems (requires no user action): As usual, we'll run some file integrity checks to find and fix filesystem issues, if any. Some of these checks take a long time and may continue to run after the maintenance day, with minimal impact on performance.
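
As referenced in the first item above, here is a quick way to verify your GPFS mounts after the maintenance, assuming GPFS filesystems appear with type "gpfs" (the mount point names are examples):

# List all mounted GPFS filesystems and compare against what you expect (e.g. /gpfs/pace1, /gpfs/pace2)
mount -t gpfs
# Or check that a specific project path is reachable
df -h /gpfs/pace2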

 

Network

 

  • The IB network card firmware upgrades (requires no user action): The new version is tested, but please contact pace-support@oit.gatech.edu if you notice failing data operations or crashing MPI jobs after the maintenance day.
  • Enable 10GbE on physical headnodes (requires no user action): Physical headnodes (e.g. login-s, login-d, coc-ice, etc.) will be reconfigured to use the 10GbE interface for faster networking.
  • Several improvements on networking infrastructure (requires no user action): We'll reconfigure some of the links, add additional uplinks, and replace fabric modules on different components of the network to improve its reliability and performance.


[RESOLVED] PACE Storage Problems

Update (3/29, 11:00am): We continued to see some problems overnight and this morning. It's important to mention that these back-to-back problems (power loss, network issues, GPFS storage failures, and read-only headnodes) are separate events, though some of them are probably related, with the network being the most likely culprit. We are still investigating with the help of the storage and network teams.

The read-only headnodes are an unfortunate outcome of the VM storage failures. We restored these systems and the VM storage and will start rebooting the headnodes shortly. We can't say for sure that these events will not recur; frequent headnode reboots and denied logins should be expected while we recover these systems. Please be mindful of these possibilities and save your work frequently, or refrain from using headnodes for anything but submitting jobs.

The compute nodes appear to be mostly stable, although we identified several with leftover storage issues.

Update (3/28, 11:30pm): Thanks to instant feedback from some of our users, we identified a list of headnodes that became read-only because of the storage issues. We have started rebooting them for filesystem checks. This process may take more than an hour to complete.

Update (3/28, 11:00pm): At this point, we have resolved the network issues, restored the storage systems, and brought compute nodes back online; they have started running jobs.

We believe that the cascading issues were triggered by a network problem. We will continue to monitor the systems and will keep working with the vendor tomorrow to find out more.

Update (3/28, 9:30pm): All network and storage related issues have been addressed. We have started bringing nodes back online and are running tests to make sure they are healthy and can run jobs.

Original Post:

As several of you already noticed and reported, PACE main storage systems are experiencing problems. The symptoms indicate a wide scale network event and we are working with the OIT Network Team to investigate this issue.

This issue has potential impact on jobs, so please refrain from submitting new jobs until all systems and services are stabilized again.

We don’t have an estimated time for resolution yet, but will continue to update this blog with the progress.

[RESOLVED] Major power failure at PACE datacenter, jobs are impacted

Update (3/26, 12:15pm): At this point, most nodes are back online, except for the nodes located on the P-row. To see if your cluster is on the P-row, you can run 'pace-check-queue <queue_name>' and look for nodes named either "rich133-p*" or "iw-p*" in the list. Gryphon and Uranus are two large clusters that are impacted, and there are many other smaller clusters with nodes on this row. We are actively working to bring these nodes back online ASAP.
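
For example, assuming pace-check-queue prints node hostnames in its output (gryphon is just one of the affected queues), the following one-liner highlights any P-row nodes:

pace-check-queue gryphon | grep -E 'rich133-p|iw-p'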

Update (3/24, 6:15pm): We have powered on the majority of compute nodes, which have started running jobs again. We'll continue to bring more nodes online during next week. Please contact pace-support@oit.gatech.edu if you are seeing continued job crashes or nodes that are not mounting storage.

Update (3/24, 11:22am): We have identified affected queues as follows (not a complete list):

apurimac-bg-6,aryabhata-6,ase1-debug-6,atlas-6,complexity,datamover,davenporter,
epictetus,granulous,jabberwocky-6,kennedy-lab,martini,megatron,monkeys_gpu,monkeys,
mps,njord-6,semap-6,skadi,uranus-6,breakfix,gryphon-debug,gryphon-ivy,gryphon-prio,
gryphon,gryphon-test,gryphon-tmp,roc,apurimacforce-6,b5force-6,biobot,biocluster-6,
bioforce-6,biohimem-6,ceeforce,chemprot,chemxforce,cns-6-intel,cnsforce-6,
critcelforce-6,critcel-prv,critcel,cygnusforce-6,cygnus,dimerforce-6,eceforce-6,
enveomics-6,faceoffforce-6,faceoff,flamelforce,force-6,force-gpu,habanero,hummus,
hydraforce,hygene-6,hygeneforce-6,isabella-prv,isblforce-6,iw-shared-6,joeforce,
kastellaforce-6,mathforce-6,mday-test,microcluster,micro-largedata,optimusforce-6,
optimus,prometforce-6,prometheus,rombergforce,sonarforce-6,spartacusfrc-6,spartacus,
threshold,try-6

 

Original Post:

What’s happening?

PACE’s Rich datacenter suffered a major power failure at around 8:30am this morning, impacting roughly half of the compute nodes. Storage systems are not affected and your data are safe, but all of the jobs running on affected nodes have been killed. Please see below for a list of all impacted queues.

Current Situation:
The OIT Operations team has restored power, and PACE is bringing nodes back online as quickly as possible. This is a sequential process and it may take several hours to bring all of the nodes back online.

What user action is needed?
Please check your jobs to see which ones have crashed and re-submit them as needed. We are still working on bringing nodes back online, but it’s safe to submit jobs now. Submitted jobs will wait in the queue and start running once the nodes are available again.
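
As a sketch using standard Torque commands (the script name is a placeholder), you can list the jobs you still have in the queue and resubmit anything that crashed:

# Show your remaining jobs; anything that should be running but is missing likely crashed
qstat -u $USER
# Resubmit a crashed job
qsub myjob.pbs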

Please follow our updates on pace-availability email list and blog.pace.gatech.edu.

Thank you,
PACE Team

[RESOLVED] Continued storage slowness, impacting compute nodes as well

Update (3/22, 10:00AM): The initial findings point to hardware issues, but we don't have a conclusive diagnosis yet. The vendor is collecting new logs to better understand the issue. We have fixed some of the issues we found in the network, but we are not yet sure whether those fixes have made a difference. If you have opened tickets with us, please give us an update on your current experience, whether it's better, the same, or worse.

Data is everything when it comes to computing and we certainly understand how these issues can have a big impact on your research progress. We are doing everything we can, with the support of the vendor, to resolve these issues ASAP.

Thank you for your feedback, cooperation and patience.

Update (3/21, 8:00PM): We continue to work with the vendor and found several issues to fix, but the system is not fully stabilized yet. Please keep an eye on this post for more updates.

Original Post:

The storage slowness issues that were initially reported on headnodes seem to be impacting some of the compute nodes as well. We are actively working to address this issue with some guidance from the vendor.

If your jobs are impacted, please open a ticket with pace-support@oit.gatech.edu and report the job IDs. This will allow us to identify specific nodes that could be contributing to the problem.

The intermittent nature of the problem is making troubleshooting difficult. We’d appreciate your patience while we are trying to identify the culprit.

Thank you.

 

[RESOLVED] PACE login nodes slowness

As reported by many of our users, we are experiencing storage-related slowness on the majority of login nodes. At this point, we have reason to believe that this is caused by heavy-duty data operations run on the login nodes by several users. We are currently working on pinpointing the processes contributing to the problem and the users running them.
We'd like to once again ask all of our users not to perform any data operations (e.g. SFTP connections, rsync, scp, tar, zip/unzip, etc.) on login nodes. Instead, please use the data mover machine (iw-dm-4.pace.gatech.edu). This will not only help keep the login nodes responsive, but will also provide significantly faster data transfer performance than the login nodes.
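
For example, here is a minimal sketch of transferring data through the data mover instead of a login node (the username and paths are hypothetical):

# From your local machine, push data to PACE via the data mover
rsync -av ./dataset/ gtuser123@iw-dm-4.pace.gatech.edu:~/scratch/dataset/
# Or pull results back to your local machine
scp -r gtuser123@iw-dm-4.pace.gatech.edu:~/scratch/results ./
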
This issue has been recurring for a long while, and PACE has been working on an alternative mechanism to address it permanently. We now have an experimental solution in place and are looking for a small group of volunteers to test it. If you are experiencing slowness on login nodes and would like to volunteer for some testing, please contact mehmet.belgin@oit.gatech.edu directly.
In the meantime, PACE system engineers will continue to work on this issue and eliminate the slowness as soon as possible.

 

PACE clusters ready for research

Our February 2018 maintenance (http://blog.pace.gatech.edu/?p=6158) is complete ahead of schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are some straggling nodes we will address over the coming days.

Our next maintenance period is scheduled for Thursday, May 10 through Saturday, May 12, 2018.

Storage
– Both pace1 and pace2 GPFS systems now apply a limit of 2 million files/directories per user. Please contact us if you have problems creating new files or updating existing ones, or if you see messages saying that your quota is exceeded.
– We performed several maintenance tasks for both pace1 and pace2 systems to improve reliability and performance. This included rebalancing data on the drives as recommended by the vendor.
– Temporary links pointing to storage migrated in the previous maintenance window (November 2017) are now removed, so all direct references to the old paths will fail. We strongly recommend that Math and ECE users (whose repositories were relocated as part of the storage migration) run tests; a simple check is shown after this list. Please let us know if you see 'file not found' type errors referencing the old paths starting with "/nv/…"
– Deletion of the old copies of bio-konstantinidis and bio-soojinyi is currently pending; we will start deletions sometime after the maintenance day.
– CNS users have been migrated to their new home and project directories.
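
As suggested above, a simple way to spot stale references in your own scripts is to search them for the old path prefix (the directory below is just an example):

# Search your job scripts and codes for references to the retired /nv/... locations
grep -rn '/nv/' ~/scripts/
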
Power
– We completed all power work as planned.
Rack/Node maintenance
– To rebalance power utilization, a few ASDL nodes were moved and renamed. Users of this cluster should not notice any differences other than the hostnames.
– VM servers received a memory upgrade, allowing for more capacity.
Network
– Recabling and reconfiguration of the IB network is complete.
– All planned Ethernet network improvements are complete.
As always, please contact us (pace-support@oit.gatech.edu) if you notice any problems.

 

PACE quarterly maintenance – (Feb 8-10, 2018)

PACE maintenance activities are scheduled to start at 6am this Thursday (2/8) and may continue until Saturday (2/10). As usual, jobs with long walltimes are being held by the scheduler to prevent them from getting killed when we power off the systems. These jobs will be released as soon as the maintenance activities are complete.

Some of the planned improvements, new storage quotas in particular, require user action. Please read on for more details and action items.

Storage

* (Requires user action) The "2 Million files/directories per user" limitation on the GPFS system (as initially announced at http://blog.pace.gatech.edu/?p=6103) will take effect on both the pace1 and pace2 storage systems, which constitute almost all of the project space, with the exception of the ASDL cluster. We have been sending weekly reminders to users exceeding this limit since the November maintenance. If you have been receiving these notifications and haven't reduced your usage yet, please contact pace-support urgently to prevent interruptions to your research.

* (Requires user action) As a last step to conclude the storage migration performed during the November maintenance, PACE will remove the redirection links that were left at the old storage locations as a temporary precaution. The best way to tell whether your codes/scripts will be impacted is to test them on the testflight cluster, which doesn't have these links, as described in http://blog.pace.gatech.edu/?p=6153. If you find that your codes/scripts work on testflight, they will continue to work on any other PACE cluster after the links are removed.

We have been working with the ECE and Math departments, which maintain their own software repositories, to ensure that the existing software will continue to run in the new locations. We have been strongly encouraging users of these repositories to run tests on the testflight cluster to identify potential problems. If you haven't had a chance to try your codes yet, please do so before the maintenance day and contact pace-support urgently if you notice any problems.

* (Requires user action) The two storage locations that had been migrated between two GPFS systems, namely bio-konstantinidis and bio-soojinyi, will be deleted from the old (pace1) location. If you need any data from the old location, please contact pace-support urgently to retrieve them before the maintenance day.

* (May require user action) We will complete the migration of CNS cluster users to their new home (hcns1) and project storage (phy-grigoriev). We will replace the symbolic links (e.g. ~/data) accordingly to make this migration as transparent to users as possible. If some of your codes/scripts include hardwired references to the old locations, they need to be updated with the new locations. We strongly recommend using the available symbolic links such as "~/data" rather than absolute paths such as "/gpfs/pace2/project/pf1" to ensure that your codes/scripts will not be impacted by future changes we may need to make (see the short example after this list).

* (No user action needed) We will perform some maintenance (disk striping) on the pace1 GPFS system. We are also exploring the possibility of updating some components in pace2, but the final decision is waiting on the vendor's recommendation. None of this work requires any user action.
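
For example (the project subdirectory below is a hypothetical placeholder), prefer the symbolic link over the absolute GPFS path in your job scripts:

# Portable: keeps working even if the underlying storage moves again
cd ~/data/myproject
# Fragile: hardwired to the current GPFS location
# cd /gpfs/pace2/project/pf1/myproject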

Power Work

* (No user action needed) We will install new power distribution units (PDUs) and reconfigure some connections on some racks to achieve a better power distribution and increase redundancy.

Rack/Node maintenance

* (No user action needed) We will physically move some of the ASDL nodes to a different rack. While this requires renaming of those nodes, there will be no differences in the way users are submitting jobs via the scheduler. One exception is the unlikely scenario of users explicitly requesting nodes by their hostnames in PBS scripts.

* (No user action needed) We will increase the memory capacity of the virtual machine servers, which host most of the headnodes, from 64GB to 256GB. The memory available per VM, however, will not change.

Network

* (No user action needed) We will do some recabling and reconfiguration on the InfiniBand (IB) network to achieve more efficient connectivity, which will also allow us to retire an old switch.

* (No user action needed) We will install a new Ethernet switch and replace some others to optimize the network.

Instructional Cluster

The instructional cluster (a.k.a. PACE/COC ICE) will be offlined as a part of this maintenance. This is a brand new resource that has not yet been officially made available to any classes, but we have noticed logins by some users. Please refrain from using these resources for any classes until we release them following a training session that we will schedule next week.

 

Please test your codes on Testflight if your storage had been migrated in November

As you may recall, our November 2017 maintenance included consolidation of multiple different filesystems into a single system (pace2), as announced here: http://blog.pace.gatech.edu/?p=6103. All of the files should have been successfully migrated by now, with links replaced to point to the new locations.

We created links in the old locations as a temporary measure to prevent immediate job crashes, as explained in the link above (please see the "What if I don't fix existing references to the old locations after my data are migrated?" section). Our plan is to remove these temporary links as a part of the next maintenance day (Feb 8, 2018). If your codes/scripts still reference the old locations, they will most certainly crash after that day.

We removed these temporary links on the testflight cluster (mimicking the environment you'd expect to see after the February maintenance) and strongly encourage you to try your codes/scripts there to ensure a smooth transition. Some locally compiled codes with hardcoded references to the old locations may require recompilation if they fail to run on testflight.

As always, please contact pace-support@oit.gatech.edu if you need any assistance.