Posts
[Resolved] Datacenter cooling problem with potential impact on PACE systems
Update (06/29/2018, 3:30 pm): We’re happy to report that the issues with cooling systems have largely been addressed without any visible impact on systems or running jobs. The schedulers have resumed and are allocating new jobs as they are submitted. There is more work to be done to resolve the issue fully, but it can be performed without any disruption to services. You may continue to use PACE systems as usual. If you notice any problems, please contact pace-support@oit.gatech.edu.
For a related status update from OIT, please see: https://status.gatech.edu/incidents/0ykh9wwnw50j
Original post:
The operations team notified PACE of cooling problems that started around noon today, impacting the datacenter housing the storage and virtual machine infrastructure. We immediately started monitoring the temperatures and turning off some non-critical systems as a precautionary step, and paused schedulers to prevent new jobs from running. Submitted jobs will be held until the problem is sufficiently addressed.
Depending on the course of this issue, we may need to power down critical systems such as storage and Virtual Headnodes, but all critical systems are online for now.
We will continue to provide updates as we have them, here on this blog and on the pace-available email list as needed.
Thank you!
Possible Water Service Outage May Impact PACE Clusters
Impact on PACE Clusters:
—————————————–
Original communication from Georgia Tech Office of Emergency Management:
To the campus community:
Out of an abundance of caution, Georgia Tech Emergency Management and Communications has taken steps to prepare the campus for the possibility of a water outage tonight in light of needed repairs to the City of Atlanta’s water lines.
The City of Atlanta’s Department of Watershed will repair a major water line beginning tonight between 11 p.m. and midnight. The repair is scheduled to be completed this week and should not negatively impact campus. If all goes according to plan, the campus will operate as usual.
In the event the repairs cause a significant loss of water pressure or loss of water service completely, the campus will be closed and personnel will be notified through the Georgia Tech Emergency Notifications System (GTENS).
If GTENS alerts are sent, essential personnel who are pre-identified by department leadership should report even if campus is closed. If the campus loses water, all non-essential activities will be canceled on campus.
Those with specialized research areas need to make arrangements tonight in the event there is a water failure. All lab work and experiments that can be delayed should be planned for later in the week or next week.
In the event of an outage, employees are asked to work with department leadership to work remotely. Employees who can work remotely should prepare before leaving work June 4 to work remotely for several days. Toilets won’t be operational, drinking water will not be available, and air conditioning will not be functioning in buildings on campus and throughout the city.
All who are housed on campus should fill bathtubs and other containers to have water on hand to manually flush toilets should there be a loss in pressure. Plans are underway to relocate campus residents to nearby campuses such as Emory University or Kennesaw State University in the event of a complete loss of water to the campus.
Parking and Transportation Services will continue on-campus transportation as long as the campus is open.
In the event of an outage, additional instructions and information on campus operations will be shared at gatech.edu.
Major Outage of GT network on Sunday, May 27
The OIT Operations team informed us about a service outage on Sunday (5/27, 8am). Their detailed note is copied below.
This outage should not impact running jobs. However, you will not be able to log in to headnodes or the VPN for the duration of this outage.
If you have ongoing data transfers (using SFTP, scp, rsync), they *will* be terminated. We strongly recommend waiting until this work is successfully completed before starting any large data transfers. Similarly, your active SSH connections will be interrupted, so please save your work and exit all sessions while you can.
The PACE team will be in contact with the Operations team and will provide status updates in this blog post as needed: http://blog.pace.gatech.edu/?p=6259
More details:
Storage (GPFS) slowness impacting pace1 and menon1 systems
update (5/18/2018, 4:15pm): We’ve identified a large number of jobs overloading the storage and worked with their owners to delete them. This resulted in an immediate improvement in performance. Please let us know if you see the slowness come back over the weekend.
original post: PACE is aware of GPFS (storage) slowness that impacts a large fraction of users from the pace1 and menon1 systems. We are actively working, with guidance from the vendor, to identify the root cause and resolve this issue ASAP.
This slowness is observed from all nodes mounting this storage, including headnodes, compute nodes and the datamover.
We believe that we’ve found the culprit, but more investigation is needed for verification. Please continue to report any slowness problems to us.
PACE clusters ready for research
Our May 2018 maintenance (http://blog.pace.gatech.edu/?p=6158) is complete ahead of schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are a small number of straggling nodes we will address over the coming days.
Our next maintenance period is scheduled for Thursday, Aug 9 through Saturday, Aug 11, 2018.
Schedulers
Job-specific temporary directories (may require user action): Complete as planned. Please see the maintenance day announcement (http://blog.pace.gatech.edu/?p=6158) to see how this impacts your jobs.
ICE (instructional cluster) scheduler migration to a different server (may require user action): Complete as planned. Users should not notice any differences.
Systems Maintenance
ASDL cluster (requires no user action): Complete as planned. The bad CMOS batteries have been replaced and the fileserver has a replacement CPU. The memory problems were caused by the bad CPU and were resolved without replacing any memory DIMMs.
Replace PDUs on Rich133 H37 Rack (requires no user action): Deferred per the request of cluster owner.
LIGO cluster rack replacement (requires no user action): Complete as planned.
Storage
GPFS filesystem client updates on all of the PACE compute nodes and servers (requires no user action): Complete as planned, and tested. Please report any missing storage mounts to pace-support.
Run routine system checks on GPFS filesystems (requires no user action): Complete as planned, no problems found!
Network
The IB network card firmware upgrades (requires no user action): Complete as planned.
Enable 10GbE on physical headnodes (requires no user action): Complete as planned.
Several improvements on networking infrastructure (requires no user action): Complete as planned.
[Resolved] Large Scale Storage Problems
Current Status (5/3 4:30pm): Storage problems are resolved, all compute nodes are back online, accepting jobs. Please resubmit crashed jobs and contact pace-support@oit.gatech.edu if there is anything we can assist with.
update (5/3 4:15pm): We found that the storage failure was caused by a series of tasks we have been performing with guidance from the vendor, in preparation for the maintenance day. These steps were considered safe and no failures were expected. We are still investigating to find out which step(s) led to this cascading failure.
update (5/3 4:00pm): All of the compute nodes will appear offline and will not accept jobs until this issue is resolved.
Original Message:
We received reports of failures of the main PACE storage (GPFS) around 3:30pm today (5/3, Thr), impacting jobs. We found that this issue applies to all GPFS systems (pace1, pace2, menon1), with a large-scale impact PACE-wide.
We are actively working with the vendor to resolve this issue urgently and will continue to update this post as we find more about the root cause.
We are sorry for this inconvenience and thank you for your patience.
PACE quarterly maintenance – (May 10-12, 2018)
The next PACE maintenance will start on 5/10 (Thr) and may take up to 3 days to complete, as scheduled.
As usual, jobs with long walltimes will be held by the scheduler to prevent them from getting killed when we power off the systems that day. These jobs will be released as soon as the maintenance activities are complete. You can reduce the walltime of such jobs to ensure completion before 6am on 5/10 and resubmit if this will give them enough time to complete successfully.
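As a rough way to judge whether a reduced walltime fits before the maintenance window, the check can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, walltimes are assumed to be written as HH:MM:SS, and the job is assumed to start running at the given time (in practice it may wait in the queue first, so leave some margin).

```python
from datetime import datetime, timedelta

# Maintenance start taken from the announcement: 6am on 5/10/2018.
MAINTENANCE_START = datetime(2018, 5, 10, 6, 0)


def parse_walltime(walltime: str) -> timedelta:
    """Convert an HH:MM:SS walltime string into a timedelta (assumed format)."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def finishes_before_maintenance(start_time: datetime, walltime: str) -> bool:
    """True if a job starting at start_time ends by the maintenance start."""
    return start_time + parse_walltime(walltime) <= MAINTENANCE_START


def max_walltime(start_time: datetime) -> str:
    """Largest HH:MM:SS walltime that still ends by the maintenance start."""
    remaining = MAINTENANCE_START - start_time
    total_seconds = max(int(remaining.total_seconds()), 0)
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
```

For example, `max_walltime(datetime(2018, 5, 9, 18, 0))` gives the longest walltime that a job starting at 6pm on 5/9 could request and still finish before the systems are powered off.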
We will follow up with a more detailed announcement with a list of planned maintenance tasks with their impact on users, if any. If you miss that email, you can still find all of the maintenance day related information in this post, which will be actively updated with the details and progress.
List of Planned Tasks
Schedulers
- Job-specific temporary directories (may require user action): We have been receiving reports of nodes going offline because files left over from jobs fill up their local disk. To address this issue, we will start employing a scheduler feature that creates job-specific temporary directories, which are automatically deleted after the job is complete. To that end, we created a “/scratch” folder on all nodes. Please note that this is different from the scratch directory in your home (note the difference between ‘~/scratch’ and ‘/scratch’). If a node has a separate (larger) HD or SSD (e.g. biocluster, dimer, etc), /scratch will be located on it to offer more space.
Without needing any specific user action, the scheduler will create a temporary directory uniquely named after the job under /scratch. For example:
/scratch/324105.shared-sched.pace.gatech.edu
The scheduler will then set the $TMPDIR environment variable (which is normally ‘/tmp’) to point to this path.
You can use $TMPDIR in your scripts. For example, if you have been creating temporary directories under /tmp manually, e.g. ‘/tmp/mydir123’, please use “$TMPDIR/mydir123” from now on to ensure that this directory will be deleted after the job is complete.
- ICE (instructional cluster) scheduler migration to a different server (may require user action): We’ll move the scheduler server we use for the ICE queues to a new machine that’s better suited for this service. There will be no changes in the way jobs are submitted. However, jobs that are waiting in the queue will need to be resubmitted, and we’ll contact the affected users separately. If you are not a student using ICE clusters, you will not be affected by this task in any way.
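For the job-specific temporary directories described above, a script that runs inside a job could build its working directory from $TMPDIR instead of a hard-coded path. The following Python sketch is only an illustration: `job_scratch_dir` is a hypothetical helper name, and the fallback to ‘/tmp’ is an assumption for runs outside the scheduler, where $TMPDIR may not be set.

```python
import os


def job_scratch_dir(name: str) -> str:
    """Create and return a working directory under $TMPDIR.

    Inside a job, $TMPDIR is expected to point at the job-specific
    directory under /scratch, so everything created here is removed
    when the job completes. Outside a job we fall back to /tmp
    (an assumption made for this sketch).
    """
    base = os.environ.get("TMPDIR", "/tmp")
    path = os.path.join(base, name)
    os.makedirs(path, exist_ok=True)  # no error if it already exists
    return path
```

Calling `job_scratch_dir("mydir123")` inside a job corresponds to the “$TMPDIR/mydir123” path mentioned above.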
Systems Maintenance
- ASDL cluster (requires no user action): We’ll replace some failed CMOS batteries on several compute nodes, replace a failed CPU and add more memory on the file server.
- Replace PDUs on Rich133 H37 Rack (requires no user action): We’ll replace PDUs on this rack, which includes nodes from a single dedicated cluster with no expected impact on other PACE users or clusters even if something goes wrong.
- LIGO cluster rack replacement (requires no user action): We’ll replace the LIGO cluster rack with a new one with new power supplies.
Storage
- GPFS filesystem client updates on all of the PACE compute nodes and servers (requires no user action): The new version is tested, but please contact pace-support@oit.gatech.edu if you notice any missing mounts, failing data operations or slowness issues after the maintenance day.
- Run routine system checks on GPFS filesystems (requires no user action): As usual, we’ll run some file integrity checks to find and fix filesystem issues, if any. Some of these checks take a long time and may continue to run after the maintenance day, with a minimal impact on performance.
Network
- The IB network card firmware upgrades (requires no user action): The new version is tested, but please contact pace-support@oit.gatech.edu if you notice failing data operations or crashing MPI jobs after the maintenance day.
- Enable 10GbE on physical headnodes (requires no user action): Physical headnodes (e.g. login-s, login-d, coc-ice, etc) will be reconfigured to use a 10GbE interface for faster networking.
- Several improvements on networking infrastructure (requires no user action): We’ll reconfigure some of the links, add additional uplinks and replace fabric modules on different components of the network to improve reliability and performance of our network.
[RESOLVED] PACE Storage Problems
Update (3/29, 11:00am): We continued to see some problems overnight and this morning. It’s important to mention that these back-to-back problems, namely power loss, network, GPFS storage failures and read-only headnodes, are separate events. Some of them are probably related, with the network as the most likely culprit. We are still investigating with the help of the storage and network teams.
The read-only headnodes are an unfortunate outcome of the VM storage failures. We restored these systems and the VM storage and will start rebooting the headnodes shortly. We can’t tell for sure that these events will not recur. Frequent reboots of headnodes and denied logins should be expected while we are recovering these systems. Please be mindful of these possibilities and save your work frequently, or refrain from using headnodes for anything but submitting jobs.
The compute nodes appear to be mostly stable, although we identified several with leftover storage issues.
Update (3/28, 11:30pm): Thanks to instant feedback from some of the users, we identified a list of headnodes that became read-only because of the storage issues. We started rebooting them for filesystem checks. This process may take more than an hour to complete.
Update (3/28, 11:00pm): At this point, we resolved the network issues, restored storage systems and brought back compute nodes, which started running jobs.
We believe that the cascading issues were triggered by a network problem. We will continue to monitor the systems and work with the vendor tomorrow to find out more.
Update (3/28, 9:30pm): All network and storage related issues are addressed. We have started bringing nodes back online and running tests to make sure they are healthy and can run jobs.
Original Post:
As several of you already noticed and reported, PACE main storage systems are experiencing problems. The symptoms indicate a wide scale network event and we are working with the OIT Network Team to investigate this issue.
This issue has potential impact on jobs, so please refrain from submitting new jobs until all systems and services are stabilized again.
We don’t have an estimated time for resolution yet, but will continue to update this blog with the progress.
[RESOLVED] Major power failure at PACE datacenter, jobs are impacted
Update (3/26, 12:15pm): At this point, most nodes are back online, except for the nodes located on the P-row. To see if your cluster is on the P-row, you can run ‘pace-check-queue <queue_name>’ and look for nodes named as either “rich133-p*” or “iw-p*” in the list. Gryphon and Uranus are two large clusters that are impacted, and there are many other smaller clusters with nodes on this row. We are actively working to bring these nodes back online ASAP.
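If you already have a list of node names for your queue, the P-row check described above can be sketched in Python. This is a minimal illustration, not a PACE tool: the helper name and the sample node names are hypothetical, and the node names are assumed to have been copied out of the ‘pace-check-queue’ listing by hand, since its exact output format is not described here.

```python
from fnmatch import fnmatchcase

# Name patterns for P-row nodes, as given in the update above.
P_ROW_PATTERNS = ("rich133-p*", "iw-p*")


def p_row_nodes(node_names):
    """Return the node names that match one of the P-row patterns."""
    return [
        name
        for name in node_names
        if any(fnmatchcase(name, pattern) for pattern in P_ROW_PATTERNS)
    ]


# Hypothetical node names, for illustration only.
sample_nodes = ["rich133-p32-10", "rich133-k40-17", "iw-p31-5", "iw-h29-3"]
affected = p_row_nodes(sample_nodes)
```

An empty result would suggest that none of the listed nodes sit on the P-row.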
Update (3/24, 6:15pm): We have powered on the majority of compute nodes, which have started running jobs again. We’ll continue to bring more nodes online during the next week. Please contact pace-support@oit.gatech.edu if you are seeing continued job crashes or nodes that are not mounting storage.
Update (3/24, 11:22am): We have identified affected queues as follows (not a complete list):
apurimac-bg-6,aryabhata-6,ase1-debug-6,atlas-6,complexity,datamover,davenporter, epictetus,granulous,jabberwocky-6,kennedy-lab,martini,megatron,monkeys_gpu,monkeys, mps,njord-6,semap-6,skadi,uranus-6,breakfix,gryphon-debug,gryphon-ivy,gryphon-prio, gryphon,gryphon-test,gryphon-tmp,roc,apurimacforce-6,b5force-6,biobot,biocluster-6, bioforce-6,biohimem-6,ceeforce,chemprot,chemxforce,cns-6-intel,cnsforce-6, critcelforce-6,critcel-prv,critcel,cygnusforce-6,cygnus,dimerforce-6,eceforce-6, enveomics-6,faceoffforce-6,faceoff,flamelforce,force-6,force-gpu,habanero,hummus, hydraforce,hygene-6,hygeneforce-6,isabella-prv,isblforce-6,iw-shared-6,joeforce, kastellaforce-6,mathforce-6,mday-test,microcluster,micro-largedata,optimusforce-6, optimus,prometforce-6,prometheus,rombergforce,sonarforce-6,spartacusfrc-6,spartacus, threshold,try-6
Original Post:
What’s happening?
PACE’s Rich datacenter suffered a major power failure at around 8:30am this morning, impacting roughly half of the compute nodes. Storage systems are not affected and your data are safe, but all of the jobs running on affected nodes have been killed. Please see below for a list of all impacted queues.
Current Situation:
The OIT Operations team has restored power and PACE is bringing nodes back online as soon as possible. This is a sequential process and it may take several hours to bring all of the nodes online.
What user action is needed?
Please check your jobs to see which ones have crashed and re-submit them as needed. We are still working on bringing nodes back online, but it’s safe to submit jobs now. Submitted jobs will wait in the queue and start running once the nodes are available again.
Please follow our updates on the pace-availability email list and blog.pace.gatech.edu.
Thank you,
PACE Team