[COMPLETED] PACE Maintenance – February 27-29

[COMPLETED – 6:51 PM 2/28/2020]

We are pleased to announce that our February 2020 maintenance period (https://blog.pace.gatech.edu/?p=6676) has completed ahead of schedule. We have restored access to computational resources, and previously queued jobs will start as resources allow. The login nodes and storage systems are now accessible. 

As usual, there are a small number of straggling nodes that will require additional intervention.  

A summary of the changes and actions accomplished during this maintenance period is as follows: 

  • (Completed) RHEL7 clusters received critical patches
  • (Completed) Updates were made to PACE databases and configurations.
  • (Deferred) [Hive, Testflight-CODA clusters] Power down all of the research hall for Georgia Power reconnections
  • (Completed) [Hive cluster] Replace failed InfiniBand leaf on EDR switch
  • (Completed) [Hive cluster] InfiniBand subnet managers were reconfigured for better redundancy.
  • (In Progress) [Hive and Testflight-CODA clusters] Lmod, the environment module system, is being updated to a newer version.
  • (Completed) [Hive cluster] Run OSU Benchmark test on idle resources
  • (Completed) [GPFS file system] Apply latest maintenance releases and firmware updates
  • (In Progress) [Lustre file system] Apply latest maintenance releases and firmware updates

Thank you for your patience!

[UPDATE – 8:52 AM 2/27/2020]

The PACE maintenance period is underway. For the duration of maintenance, users will be unable to access PACE resources. Once the maintenance activities are complete, we will notify users of the availability of the cluster.

Also, we have been told by Georgia Power that they expect their work may take up to 72 hours to complete; as such, the maintenance outage for the CODA research hall (Hive and Testflight-CODA clusters) will extend until 6:00 AM Monday morning. We will provide updates as they are available.

[Original Message]

We are preparing for PACE’s next maintenance days on February 27-29, 2020. This maintenance period is planned for three days, starting on Thursday, February 27, and ending Saturday, February 29. However, Georgia Power will begin work to establish a Micro Grid power generation facility on Thursday, and while that work should complete within 48 hours, any delays may extend the maintenance outage for the Hive and Testflight-CODA clusters through Sunday; PACE clusters in Rich will not be impacted by any delays in Georgia Power’s work. Should any issues and resultant delays occur, users will be notified accordingly.

As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs are running when systems are powered off; a brief sketch of requesting a shorter walltime appears after the list below. These jobs will be released as soon as the maintenance activities are complete. We are still finalizing planned activities for the maintenance period. Here is the current list:

ITEM REQUIRING USER ACTION:

  • None

ITEMS NOT REQUIRING USER ACTION:

  • RHEL7 clusters will receive critical patches.
  • Updates will be made to PACE databases and configurations.
  • [Hive, Testflight-CODA clusters] Power down all of the research hall for Georgia Power reconnections
  • [Hive cluster] Replace failed InfiniBand leaf on EDR switch
  • [Hive cluster] InfiniBand subnet managers will be reconfigured for better redundancy
  • [Hive and Testflight-CODA clusters] Lmod, the environment module system, will be updated to a newer version
  • [Hive cluster] Run OSU Benchmark test on idle resources
  • [GPFS and Lustre file systems] Apply latest maintenance releases and firmware updates
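
If you would like a job to run before the outage begins, one option is to request a walltime short enough that the job finishes before the maintenance window opens; otherwise the scheduler will hold it until after maintenance. A minimal sketch, assuming a Torque-style qsub; the queue name, resource amounts, and script name are placeholders:

    # Example only: request a walltime that ends before the maintenance window
    # so the scheduler does not hold the job. The queue name, resources, and
    # script name are placeholders.
    qsub -q <QUEUENAME> -l walltime=24:00:00,nodes=1:ppn=4 myscript.pbs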

 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

New PACE utilities: pace-jupyter-notebook and pace-vnc-job now available!

Good Afternoon Researchers!

We are pleased to announce two new tools to improve interactive job experiences on the PACE clusters: pace-jupyter-notebook and pace-vnc-job!

Jupyter Notebooks are invaluable interactive programming tools that consolidate source code, visualizations, and formatted documentation into a single interface. These notebooks run in a web browser, and Jupyter supports many languages by allowing users to switch between programming kernels such as Python, MATLAB, R, Julia, C, and Fortran, to name a few. In addition to providing an interactive environment for developing and debugging code, Jupyter Notebooks are an ideal tool for teaching and demonstrating code and results, which PACE has utilized for its recent workshops.

The new utility pace-jupyter-notebook provides an easy-to-run command for launching a Jupyter notebook from the following login nodes/clusters (login-s[X], login-d[x], login7-d[x], testflight-login, zohar, gryphon, login-hive[X], pace-ice, coc-ice…), which you can then use from the browser of your choice on your workstation or laptop. To launch Jupyter, simply log in to PACE and run the command pace-jupyter-notebook -q <QUEUENAME>, where <QUEUENAME> should be replaced with the queue in which you wish to run your job. Once the job starts, follow the three-step prompt to connect to your Jupyter Notebook! Full documentation on the use of pace-jupyter-notebook, including available options such as job walltime, processors, and memory, can be found at http://docs.pace.gatech.edu/interactiveJobs/jupyterInt/. Please note that on busy queues, you may experience longer wait times to launch the notebook.
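
For example, a typical session might look like the sketch below. The login hostname and queue name are placeholders; substitute the login node and queue you normally use:

    # Example only: hostname and queue name are placeholders
    ssh someuser@login-hive1.pace.gatech.edu
    pace-jupyter-notebook -q hive
    # follow the three-step prompt printed by the utility, then open the
    # reported URL in the browser on your workstation or laptop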

In addition, we are providing a similar utility for running software with graphical user interfaces (GUIs), such as MATLAB, Paraview, ANSYS, and many more, on PACE clusters. VNC sessions offer a more robust experience than traditional X11 forwarding. With a local VNC Viewer client, you can connect to a remote desktop on a compute node and interact with the software as if it were running on your local machine. Similar to the Jupyter Notebook utility, the new utility pace-vnc-job provides an easy-to-run command for launching a VNC session on a compute node and connecting your client to the session. To launch a VNC session, log in to PACE and run the command pace-vnc-job -q <QUEUENAME>, where <QUEUENAME> should be replaced with the queue in which you wish to run your job. Once the job starts, follow the three-step prompt to connect your VNC Viewer to the remote session and start up the software you wish to run. Full documentation on the use of pace-vnc-job, including available options such as job walltime, processors, and memory, can be found at http://docs.pace.gatech.edu/interactiveJobs/setupVNC_Session/. Again, please note that on busy queues, you may experience longer wait times to launch a VNC session.
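
Similarly, a VNC session can be launched with just a couple of commands. In this sketch the login hostname and queue name are placeholders:

    # Example only: hostname and queue name are placeholders
    ssh someuser@login-s1.pace.gatech.edu
    pace-vnc-job -q <QUEUENAME>
    # follow the three-step prompt, point your local VNC Viewer at the host and
    # port the utility reports, then start your GUI application (e.g., MATLAB)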

Happy Interactive computing!

Best,
The PACE Team

Hive Cluster Status 10/3/2019

This morning we noticed that the Hive nodes were all marked offline. After a short investigation, we discovered an issue with configuration management, which we have since corrected. Ultimately, the impact from this should be negligible: jobs that were already running continued accordingly, while queued jobs were simply held, and after our correction they started as expected. Nonetheless, we want to ensure that the Hive cluster provides the performance and reliability you expect, so if you find any problems in your workflow due to this minor hiccup, or for any problems you encounter moving forward, please do not hesitate to send an email to pace-support@oit.gatech.edu.

OIT Planned Maintenance

The OIT Network Services team will be performing a software upgrade on our campus Carrier-Grade NAT (CGN) appliances this week – see OIT Status for a full description. The affected subnet is the out-of-band management network for the Hive/MRI servers, and only internet-bound connections are serviced by these appliances; as such, no failures are expected for users of the Hive/MRI servers. Nonetheless, if you encounter connectivity issues to Hive resources, please do not hesitate to contact pace-support@oit.gatech.edu for assistance.

[Resolved] Shared Clusters Scheduler Down

[Update – 9/10/2019 3:52 PM]

The shared scheduler has been restored to functionality. The issue stemmed from a large influx of jobs (>100,000) in less than 24 hours. As a reminder, the upcoming policy change on October 29, 2019, which limits the number of job submissions to 500 per user, is designed to mitigate this issue moving forward. If you feel your workflow may be impacted, please take the opportunity to read the documentation on the parallel job solutions (job arrays, GNU parallel, and HTC Launcher) developed and put in place for users to quickly adapt their workflows accordingly; a brief job-array sketch appears at the end of this update. Currently, we have the following consulting sessions scheduled, with additional ones to be provided (check back for updates here).

Upcoming Consulting sessions:

  • September 24, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 8, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 22, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A

Again, the changes listed above will take effect on October 29, 2019; after that date, no user will be able to submit more than 500 jobs.
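
As a brief illustration of one of the parallel job solutions mentioned above, a job array submits many related tasks with a single qsub call instead of hundreds of separate submissions. This is only a sketch assuming Torque-style job arrays; the script name and array range are placeholders, and the linked documentation remains the authoritative reference:

    # Example only: submit 100 related tasks as one job array rather than
    # 100 individual qsub calls. Script name and range are placeholders.
    qsub -t 0-99 run_case.pbs
    # inside run_case.pbs, the environment variable PBS_ARRAYID gives the
    # task index (0-99) for selecting the corresponding input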

[Original]

The shared scheduler experienced an out-of-memory issue this morning at 7:44 AM, resulting in a hold on all jobs in the queues managed by this scheduler. This issue affects all users who submit jobs to PACE via the shared-cluster headnodes (login-s). Currently, job submissions will hang. We ask that you refrain from submitting any new jobs until further notice while the PACE team investigates the matter and restores functionality to the scheduler.

Currently the following queues are affected by this scheduler issue: apurimac-6,apurimacforce-6,b5-6,b5-prv-6,b5force-6,bench-gpu,benchmark,biobot,biocluster-6,bioforce-6,biohimem-6,casper,cee,ceeforce,chem-bigdata,chemprot,chemx,chemxforce,cns-24c,cns-48c,cns-6-intel,cnsforce-6,critcel,critcel-prv,critcelforce-6,cygnus,cygnus28,cygnus64-6,cygnusforce-6,cygnuspa-6,dimer-6,dimerforce-6,ece,eceforce-6,enveomics-6,faceoff,faceoffforce-6,flamel,flamelforce,force-6,force-gpu,gaanam,gaanam-h,gaanamforce,gpu-eval,habanero,habanero-gpu,hummus,hydra-gpu,hydraforce,hygene-6,hygeneforce-6,isabella-prv,isblforce-6,iw-shared-6,joe,joe-6,joe-bigshort,joe-fast,joe-test,joeforce,kastella,kastellaforce-6,lfgroup,math-6,mathforce-6,mcg-net,mcg-net-gpu,mday-test,metis,micro-largedata,microcluster,optimus,optimusforce-6,prometforce-6,prometheus,pvh,pvhforce,romberg,rombergforce,sagan,sonar-6,sonarforce-6,spartacus,spartacusfrc-6,spinons,spinonsforce,threshold,try-6,trybuy

We apologize for this inconvenience, and we appreciate your attention and patience.