
[Reminder] Policy Update to Shared Clusters’ Scheduler

This is a friendly reminder that our updated policy impacting Shared Clusters at PACE will take effect on October 29, 2019.

On October 29, 2019, we are reducing the limit on the number of queued/running jobs per user to 500.

Who is impacted? All researchers connecting to PACE resources via the login-s[X].pace.gatech.edu headnodes are impacted by this policy change (a list of impacted queues is provided below). We have identified all researchers/users impacted by these changes and have contacted them on multiple occasions. We have also worked with a number of researchers from different PI groups during our consulting sessions to help them adapt their workflows to the new per-user job limit. PACE provides and supports multiple solutions, such as job arrays, GNU parallel, and Launcher, to help users quickly adapt their workflows to this policy update (see the example sketch after the queue list below).

  • List of queues impacted are as follows: apurimac-6,apurimacforce-6,b5-6,b5-prv-6,b5force-6,bench-gpu,benchmark,biobot,biobotforce-6,biocluster-6,biocluster-gpu,bioforce-6,biohimem-6,casper,cee,ceeforce,chem-bigdata,chemprot,chemx,chemxforce,chowforce-6,cns-24c,cns-48c,cns-6-intel,cnsforce-6,coc,coc-force,critcel,critcel-burnup,critcel-manalo,critcel-prv,critcel-tmp,critcelforce-6,cygnus,cygnus-6,cygnus-hp,cygnus-hp-lrg-6,cygnus-hp-small,cygnus-xl,cygnus24-6,cygnus28,cygnus64-6,cygnusforce-6,cygnuspa-6,davenprtforce-6,dimer-6,dimerforce-6,ece,eceforce-6,enveomics-6,faceoff,faceoffforce-6,flamel,flamelforce,force-6,force-gpu,force-single-6,gaanam,gaanam-h,gaanamforce,gemini,ggate-6,gpu-eval,gpu-recent,habanero,habanero-gpu,hummus,hydraforce,hygene-6,hygeneforce-6,isabella,isabella-prv,isblforce-6,iw-shared-6,jangmse,joe,joe-6,joe-bigshort,joe-fast,joe-test,joeforce,kastella,kastellaforce-6,kokihorg,lfgroup,math-6,math-magma,mathforce-6,mayorlab_force-6,mcg-net,mcg-net-gpu,mcg-net_old,mday-test,medprintfrc-6,megatron-elite,megatronforce-6,metis,micro-largedata,microcluster,nvidia-gpu,optimus,optimusforce-6,phi-shared,prometforce-6,prometheus,pvh,pvhforce,radiance,romberg,rombergforce,sagan,sonar-6,sonarforce-6,spartacus,spartacusfrc-6,spinons,spinonsforce,threshold,threshold-6,tmlhpc,try-6,trybuy
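
To illustrate one of these solutions, below is a minimal sketch of a Torque-style job array that replaces many individual submissions with a single one. The queue, resource requests, program, and file names are hypothetical; the PACE documentation covers the recommended settings for our systems.

    #!/bin/bash
    # Minimal job-array sketch (queue, resource requests, and file names are
    # hypothetical). Submitting this one script with "qsub" enqueues a single
    # array job whose sub-jobs each process a different input file, instead
    # of hundreds of separately submitted jobs.
    #PBS -N example-array
    #PBS -l nodes=1:ppn=1
    #PBS -l walltime=1:00:00
    #PBS -q iw-shared-6
    #PBS -t 1-100

    cd $PBS_O_WORKDIR
    # Each of the 100 array tasks picks its own input via $PBS_ARRAYID,
    # e.g. input_1.dat ... input_100.dat
    ./my_program input_${PBS_ARRAYID}.dat > output_${PBS_ARRAYID}.log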

Before this policy change takes effect on October 29, we have one more consulting session scheduled:

  • October 22, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A

For details about our policy change, please visit our blog post.

Again, the changes listed above will take effect on October 29, 2019.  After October 29, users will not be able to have more than 500 jobs queued or running at a time.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Preventative Maintenance for UPS units at Rich Data Center

OIT will be performing annual preventative maintenance on the UPS units in the Rich Data Center on Saturday, October 12, from 7:00 AM to about 5:00 PM.  No outage is expected from this work; however, in the case of an outage, PACE clusters and the jobs running on them are at risk of being interrupted.  Again, it is unlikely that we will have a power outage during this maintenance period.

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Hive Cluster Status 10/3/2019

This morning we noticed that the Hive nodes were all marked offline. After a short investigation, we discovered an issue with configuration management, which we have since corrected. The impact should be negligible: jobs that were already running continued normally, while queued jobs were simply held until the nodes came back online, at which point they started as expected. Nonetheless, we want to ensure that the Hive cluster provides the performance and reliability you expect, so if you find any problems in your workflow due to this hiccup, or encounter any problems moving forward, please do not hesitate to send an email to pace-support@oit.gatech.edu.

OIT Planned Maintenance

The OIT Network Services team will be performing a software upgrade on our campus Carrier-Grade NAT (CGN) appliances this week – see OIT Status for a full description. The affected subnet is the out-of-band management network for the Hive/MRI servers, and only internet-bound connections are serviced by these appliances. As such, no failures are expected for users of the Hive/MRI servers. Nonetheless, if you encounter connectivity issues to Hive resources, please do not hesitate to contact pace-support@oit.gatech.edu for assistance.

Distributed MATLAB now available on PACE

PACE is excited to announce that distributed MATLAB use is now available on PACE resources. Georgia Tech’s new license allows for unlimited scaling of MATLAB on clusters. This change means that users can now run parallelized MATLAB code across multiple nodes. For detailed instructions, please visit our distributed MATLAB documentation at docs.pace.gatech.edu/software/matlab-distributed/.
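
For orientation, here is one hypothetical shape such a run might take, assuming a Torque-style batch script and a MATLAB Parallel Server cluster profile configured as described in the linked documentation. The queue, node counts, module name, and script name below are placeholders, not PACE's prescribed setup.

    #!/bin/bash
    # Hypothetical sketch of a batch job that runs a MATLAB script non-
    # interactively. The script my_parallel_job.m is a placeholder; it would
    # open a parallel pool against the cluster profile described in the PACE
    # distributed-MATLAB documentation (e.g. via parpool/parfor) to use the
    # requested nodes. Queue, node counts, and module name may differ.
    #PBS -N matlab-distributed
    #PBS -l nodes=2:ppn=24
    #PBS -l walltime=4:00:00
    #PBS -q force-6

    cd $PBS_O_WORKDIR
    module load matlab   # exact module name/version per PACE documentation
    matlab -nodisplay -nosplash -r "my_parallel_job; exit"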

Data center maintenance

Beginning tomorrow morning (9/18/19), there will be urgent maintenance on the cooling system in the Rich Data Center, which houses all PACE clusters except for Hive. A temporary cooling unit has been installed, but should this secondary cooling unit fail, the room will begin to heat up.  If that happens, portions of the data center will need to shut down until the temperature has returned to an acceptable level.  If the clusters are shut down, this will terminate any running jobs on the compute nodes.
Please follow updates on this maintenance and find a full list of affected services across campus at https://status.gatech.edu/pages/maintenance/5be9af0e5638b904c2030699/5d7fd6219012a0316b71ef83.
– PACE team

[Resolved] Shared Clusters Scheduler Down

[Update – 9/10/2019 3:52 PM]

The shared scheduler has been restored to full functionality. The issue stemmed from a large influx of jobs (>100,000) in less than 24 hours. As a reminder, the upcoming policy change on October 29, 2019, which limits the number of job submissions to 500 per user, is designed to mitigate this kind of issue moving forward. If you feel your workflow may be impacted, please take the opportunity to read the documentation on the parallel job solutions (job arrays, GNU parallel, and HTC Launcher) developed and put in place to help users quickly adapt their workflows; a minimal example sketch appears at the end of this update. Currently, we have the following consulting sessions scheduled, with additional ones to be provided (check back here for updates).

Upcoming Consulting sessions:

  • September 24, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 8, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 22, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A

Again, the changes listed above will take effect on October 29, 2019.  After October 29, users will not be able to have more than 500 jobs queued or running at a time.
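
As one concrete example of the parallel job solutions mentioned above, here is a minimal sketch that uses GNU parallel to pack many short serial tasks into a single submitted job, assuming a Torque-style batch script. The queue, resource requests, module name, program, and task list are hypothetical.

    #!/bin/bash
    # Minimal GNU parallel sketch (queue, resource requests, and file names
    # are hypothetical). One submitted job works through many short serial
    # tasks, so the per-user count of queued jobs stays far below the limit.
    #PBS -N gnu-parallel-pack
    #PBS -l nodes=1:ppn=24
    #PBS -l walltime=8:00:00
    #PBS -q iw-shared-6

    cd $PBS_O_WORKDIR
    module load gnuparallel   # module name may differ; see the PACE docs
    # tasks.txt lists one input file per line; run ./my_task on each,
    # 24 tasks at a time (one per core).
    cat tasks.txt | parallel -j 24 ./my_task {}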

[Original]

The shared scheduler experienced an out-of-memory issue this morning at 7:44 AM, resulting in a hold on all jobs in the queues managed by this scheduler. This issue affects all users who submit jobs to PACE via the shared clusters headnodes (login-s). Currently, job submissions will appear to hang. We ask that you refrain from submitting any new jobs until further notice while the PACE team investigates the matter and restores functionality to the scheduler.

Currently the following queues are affected by this scheduler issue: apurimac-6,apurimacforce-6,b5-6,b5-prv-6,b5force-6,bench-gpu,benchmark,biobot,biocluster-6,bioforce-6,biohimem-6,casper,cee,ceeforce,chem-bigdata,chemprot,chemx,chemxforce,cns-24c,cns-48c,cns-6-intel,cnsforce-6,critcel,critcel-prv,critcelforce-6,cygnus,cygnus28,cygnus64-6,cygnusforce-6,cygnuspa-6,dimer-6,dimerforce-6,ece,eceforce-6,enveomics-6,faceoff,faceoffforce-6,flamel,flamelforce,force-6,force-gpu,gaanam,gaanam-h,gaanamforce,gpu-eval,habanero,habanero-gpu,hummus,hydra-gpu,hydraforce,hygene-6,hygeneforce-6,isabella-prv,isblforce-6,iw-shared-6,joe,joe-6,joe-bigshort,joe-fast,joe-test,joeforce,kastella,kastellaforce-6,lfgroup,math-6,mathforce-6,mcg-net,mcg-net-gpu,mday-test,metis,micro-largedata,microcluster,optimus,optimusforce-6,prometforce-6,prometheus,pvh,pvhforce,romberg,rombergforce,sagan,sonar-6,sonarforce-6,spartacus,spartacusfrc-6,spinons,spinonsforce,threshold,try-6,trybuy

We apologize for this inconvenience, and we appreciate your attention and patience.

The Launcher Documentation Available

The Launcher (link) is a framework for running large collections of serial or multi-threaded applications as a single job on a batch-scheduled HPC system. The Launcher was developed at the Texas Advanced Computing Center (TACC) and has been deployed at multiple HPC centers throughout the world. It allows high-throughput computing users to take advantage of the benefits of scheduling larger single jobs and to fit better within the HPC environment.

To better serve our high-throughput computing users, we have adapted this software for use on the PACE systems.

Information on using Launcher on PACE is available at PACE Documentation.
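
For orientation, below is a minimal sketch based on the upstream TACC Launcher interface, where the environment module sets LAUNCHER_DIR and the paramrun driver reads a plain-text job file. The module name, queue, resource requests, and file names here are placeholders; the PACE documentation above has the exact setup for our systems.

    #!/bin/bash
    # Minimal Launcher sketch (module name, queue, resource requests, and
    # file names are hypothetical). jobfile.txt contains one command per
    # line; the Launcher spreads those commands across the cores of this
    # single batch job.
    #PBS -N launcher-example
    #PBS -l nodes=2:ppn=24
    #PBS -l walltime=2:00:00
    #PBS -q iw-shared-6

    cd $PBS_O_WORKDIR
    module load launcher   # module name may differ; see the PACE docs

    export LAUNCHER_WORKDIR=$PBS_O_WORKDIR   # where tasks run and logs land
    export LAUNCHER_JOB_FILE=jobfile.txt     # one serial command per line

    $LAUNCHER_DIR/paramrun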

COMSOL use at PACE

As you may know, the College of Engineering is changing the licensing model for COMSOL on September 16, 2019, and will restrict access for research use to named users who have purchased access through CoE. Use of COMSOL for research on PACE is licensed through CoE (regardless of your college affiliation). If you or your PI have not yet made arrangements with CoE, please contact Angelica Remolina in CoE IT (angie.remolina@coe.gatech.edu). After September 16, you will not be able to run COMSOL on PACE without permission from CoE.

[Resolved] Campus Network Down

[Update] September 5

OIT reports that the campus network is again fully functional.

[Update] September 4 4:28 PM

This is a brief update: OIT Network Services has identified the cause of the campus network issues.  One of the campus enterprise routers rebooted unexpectedly, which impacted the campus network.  Since then, the network has been stable, and OIT continues to monitor the situation for any further issues.  For the latest updates, please check the OIT status page.

As for the PACE clusters, you should be able to access them without issues.  If you continue to experience problems, please try disconnecting and reconnecting to restore your connectivity.

As always, if you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

[Original] September 4 2:30 PM

Our campus network is down.  OIT is investigating this incident, and you may check the details at the link below:

https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/5d6ff4f4daca6a0543918df2

This incident will prevent you from accessing PACE resources, but jobs currently running on PACE should not be interrupted.

Please check the status link above for up-to-date details.  If you have any questions, please send us a note at pace-support@oit.gatech.edu.  Please also note that we are impacted by the outage as well, so our responses to your emails will be delayed.

Thank you for your patience.