Posts

[Restored] GPFS Filesystem Issue

[Update 1/29/20 5:32 PM]

We are happy to report that our GPFS filesystem was restored to full functionality early this afternoon. Our CI team identified a failed switch as the source of the problems on a group of nodes. We restored the switch, and we are investigating the deployment of improved backup systems to handle such failures in the future.

We apologize for the recent issues you have faced. As always, please send an email to pace-support@oit.gatech.edu with any concerns, so we can investigate.

[Original Post 1/28/20 12:46 PM]

Since yesterday, we have been experiencing intermittent disruptions on our GPFS filesystem, especially the mounted GPFS scratch filesystem (i.e., ~/scratch). The PACE team is actively investigating the source of this issue, and we are working with our support vendor to restore the system to full functionality. A number of users have reported slow file reads, hanging commands, and jobs that run more slowly than usual or do not appear to progress. We apologize for any interruptions you may be experiencing on PACE resources at this time, and we will alert you when the issue is resolved.

Hive Cluster Scheduler Down

The Hive scheduler was restored at around 2:20 PM. The scheduler services had crashed; we restored them successfully and put measures in place to prevent a similar recurrence in the future. Some user jobs may have been impacted during this scheduler outage. Please check your jobs, and if you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.
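If you would like to verify that your jobs survived the outage, a quick check from a login node might look like the following; this is a generic sketch assuming the standard Torque/PBS client tools used with the PACE schedulers:

    qstat -u $USER    # list your jobs and their states (R = running, Q = queued, C = completed)

If a job is missing or in an unexpected state, resubmit it or contact pace-support@oit.gatech.edu.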

Thank you again for your patience, and we apologize for the inconvenience.

[Original Note — January 27, 2020, 2:16 PM] The Hive scheduler has gone down. This came to our attention at around 1:40 PM. The PACE team is investigating the issue, and we will follow up with details. During this period, you will not be able to submit jobs or monitor current jobs on Hive.

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

We apologize for this inconvenience, and appreciate your patience and attention.

Globus authentication and endpoints

We became aware this morning of an issue with Globus authentication to the “gatechpace#datamover” endpoint that many of you use to transfer files to/from PACE resources. We are working to repair this right now; in the meantime, please use the “PACE Internal” endpoint instead. This endpoint provides access to the same filesystem that you use with the datamover endpoint (plus PACE Archive storage, for those who have signed up for our archive service), and it functions in exactly the same way as gatechpace#datamover when interacting with Globus. Going forward, you may continue to use this newer endpoint instead of the older datamover one, even after datamover is functioning again. For full instructions on using Globus with PACE, visit our Globus documentation page.
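If you script your transfers with the Globus CLI rather than the web interface, switching endpoints only changes which endpoint you address. A minimal sketch, assuming the globus-cli package is installed and you are logged in; the endpoint UUIDs and paths below are placeholders:

    # Find the UUID of the "PACE Internal" endpoint
    globus endpoint search "PACE Internal"
    # Transfer a file to PACE scratch (replace both UUIDs and paths with your own)
    globus transfer SRC_ENDPOINT_UUID:/path/to/file PACE_INTERNAL_UUID:~/scratch/file --label "pace-transfer"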

Please keep in mind that Globus is the best way to transfer files to/from PACE resources. Contact us at pace-support@oit.gatech.edu if you have any questions about using Globus.

[Re-Scheduled] Advisory of Hive cluster outage 1/20/20

We are writing to inform you of an upcoming Hive cluster outage that we learned about yesterday. PACE has no control over this outage. As part of the design of the Coda data center, we are working with Southern Company (Georgia Power) on the creation and operation of a Micro Grid power generation facility. This is a set of products to enable research on local generation of up to 2 MW of off-grid power.

In order to connect this Micro Grid facility to the Coda data center power, Southern Company will need to shut down all power to the research hall in Coda. As a result, the Hive cluster will need to be shut down during this procedure, and we are placing a scheduler reservation to prevent any jobs from running during the shutdown. This is currently planned to begin at 8 AM on Monday, January 20, the Georgia Tech MLK holiday. GT checked whether this date could be rescheduled to give longer notice but was unable to change it. GT is working with Southern Company to minimize the duration of this power outage, but the final duration is not yet known; it is currently expected to be at least 24 hours.

[Update] The planned outage of the CODA data center has been re-scheduled, and so the Hive cluster will be available until the next PACE maintenance period on February 27. The reservation has been removed, so work should proceed on January 20 as usual.

If you have any questions, please contact PACE Support at pace-support@oit.gatech.edu.

[Re-Scheduled] Hive Cluster — Policy Update

Since the deployment of the Hive cluster this fall, we have been pleased with the rapid growth of our user community and the cluster’s rapidly increasing utilization. During this period, we have received user feedback that compels us to make changes that will further increase productivity for all Hive users. Hive PIs have approved the following changes, which were deployed on January 9:

  1. Hive-gpu: The maximum walltime for jobs on hive-gpu will be decreased from the current 5-day maximum to 3 days, to address the longer job wait times that users have experienced on the hive-gpu queue.
  2. Hive-gpu: To ensure that GPUs do not sit idle, jobs will not be permitted to use a CPU:GPU ratio higher than 6:1 (i.e., 6 cores per GPU). Each hive-gpu node has 24 CPUs and 4 GPUs.
  3. Hive-nvme-sas: A new queue, hive-nvme-sas, will be created that combines and shares compute nodes between the hive-nvme and hive-sas queues.
  4. Hive-nvme-sas, hive-nvme, hive-sas: The maximum walltime for jobs on the hive-nvme, hive-sas, and hive-nvme-sas queues will be increased from the current 5-day maximum to 30 days.
  5. Hive-interact: A new interactive queue, hive-interact, will be created. This queue provides access to 32 Hive compute nodes (24 cores and 192 GB of RAM each) for quick access to resources for testing and development. The walltime limit will be 1 hour (see the sketch after this list).
  6. Hive-priority: A new hive-priority queue will be created, reserved for researchers with time-sensitive research deadlines. For access to this queue, please communicate the relevant dates and upcoming deadlines to the PACE team so that we can obtain the necessary approvals to grant you access. Please note that we may not be able to provide access for requests made less than 14 days in advance of when the resource is needed, due to jobs already running at the time of the request.
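For instance, once hive-interact is available, an interactive session could be requested as shown below; this is a sketch assuming Torque-style qsub options, and the resource request is illustrative:

    qsub -I -q hive-interact -l nodes=1:ppn=24,walltime=1:00:00    # one full node, up to the 1-hour limit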

Who is impacted:

  • All Hive users who use the hive-gpu, hive-nvme, and hive-sas queues
  • The newly created queues will benefit, and thereby impact, all Hive users.

User Action:

  • Users will need to update their PBS scripts to reflect the new walltime limits and the CPU:GPU ratio requirement on the hive-gpu queue (see the sketch after this list)
  • These changes will not impact currently running jobs.
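As an illustration of the new hive-gpu limits, a job script header might begin as follows; this is a sketch assuming the Torque-style GPU resource syntax, and the job name is a placeholder:

    #PBS -N my-gpu-job              # placeholder job name
    #PBS -q hive-gpu                # GPU queue with the updated limits
    #PBS -l nodes=1:ppn=6:gpus=1    # 6 cores per GPU, the maximum 6:1 ratio
    #PBS -l walltime=72:00:00       # 3 days, the new hive-gpu maximum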

Additionally:

We would like to remind you of the upcoming Hive cluster outage due to the creation of a Micro Grid power generation facility. At 8 AM on Monday, January 20th (the Georgia Tech MLK holiday), the Hive cluster will be shut down for an anticipated 24 hours. A reservation has been put in place on all Hive nodes during this period; any submitted jobs that would overlap with this outage will receive a warning indicating this and will remain queued until the work is complete. A similar warning will be generated for jobs overlapping with the upcoming cluster maintenance on February 27.

[Update] The planned outage of the CODA data center has been re-scheduled, and so the Hive cluster will be available until the next PACE maintenance period on February 27. The reservation has been removed, so work should proceed on January 20 as usual.

Our documentation has been updated to reflect these changes and queue additions, and can be found at http://docs.pace.gatech.edu/hive/gettingStarted/. If you have any questions, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Upcoming VPN updates

We would like to let you know about upcoming upgrades to Georgia Tech’s VPNs. OIT will update the VPN software to introduce a number of bug fixes and security improvements, including support for macOS 10.15 as well as Windows 10 ARM64-based devices. After the upgrade, your local VPN client will automatically download and install an update upon your next connection attempt. Please allow the software to update, then continue with your connection in the upgraded interface.

The main campus “anyc” VPN, which is used to access PACE from off-campus locations, will be upgraded on January 28. The “pace” VPN, which is used to access our ITAR/CUI clusters from any location, will be upgraded on January 21.

If you wish to try the new client sooner, you may do so by connecting to the dev.vpn.gatech.edu VPN, which will prompt a download of the upgraded client. Due to capacity limitations, please disconnect after the update and return to using your normal VPN service.

For ongoing updates, please visit the OIT status announcements for the pace VPN or the anyc VPN.

As always, please contact us at pace-support@oit.gatech.edu with any concerns.

OIT Network Maintenance 12/18/2019-12/19/2019

To Our Valued PACE Research Community,

We are writing to inform our research community of upcoming maintenance, as follows: 

The Office of Information Technology (OIT) will be performing a series of upgrades to the networking infrastructure to improve the performance and reliability of networking operations. Some of these upcoming enhancements may impact PACE users’ ability to connect and interact with computational and storage resources. We do not expect this network maintenance to have any impact on currently running jobs.

12/18/2019 20:00-23:59 (Router Code Upgrade) An upgrade to the software on some routers is scheduled and will include an approximate 30-minute disruption to telecommunication services.  

12/18/2019 20:00 – 12/19/2019 02:00 (Data Center Router Code Upgrade & Routing Engine Upgrade) An upgrade to the software on multiple devices will impact network connectivity across the main campus of the Georgia Institute of Technology. This disruption will include the CODA Building.

OIT Technical Teams will be actively monitoring the progress of upgrades during the maintenance windows described above. These teams will provide ongoing communications to students, faculty, and staff members of the Institute. A central location for progress updates will be available at http://status.gatech.edu

Issues during the upgrade may be reported to the OIT Network Operations Center at (404) 894-4669.

Again, we do not expect any impact on running jobs, and no changes to the PACE computational and storage resources are part of this OIT network maintenance.

Thank you for your time and diligence,

PACE Outreach and Faculty Interaction Team

New PACE utilities: pace-jupyter-notebook and pace-vnc-job now available!

Good Afternoon Researchers!

We are pleased to announce two new tools to improve interactive job experiences on the PACE clusters: pace-jupyter-notebook and pace-vnc-job!

Jupyter Notebooks are invaluable interactive programming tools that consolidate source code, visualizations, and formatted documentation into a single interface. These notebooks run in a web browser, and Jupyter supports many languages by allowing users to switch between programming kernels, such as Python, MATLAB, R, Julia, C, and Fortran, to name a few. In addition to providing an interactive environment for developing and debugging code, Jupyter Notebooks are an ideal tool for teaching and demonstrating code and results, which PACE has used for its recent workshops.

The new utility pace-jupyter-notebook provides an easy-to-run command for launching a Jupyter notebook from the following login nodes/clusters (login-s[X], login-d[x], login7-d[x], testflight-login, zohar, gryphon, login-hive[X], pace-ice, coc-ice…), enabling Jupyter in the browser of your choice on your workstation or laptop. To launch Jupyter, simply log in to PACE and run the command pace-jupyter-notebook -q <QUEUENAME>, where <QUEUENAME> should be replaced with the queue in which you wish to run your job. Once the job starts, follow the three-step prompt to connect to your Jupyter Notebook! Full documentation on the use of pace-jupyter-notebook, including options such as job walltime, processors, and memory, can be found at http://docs.pace.gatech.edu/interactiveJobs/jupyterInt/. Please note that on busy queues, you may experience longer wait times to launch the notebook.
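For example, a typical session might look like the following; the username, login node, and queue name here are placeholders:

    ssh someuser@login-hive1.pace.gatech.edu    # placeholder username and login node
    pace-jupyter-notebook -q hive-interact      # placeholder queue; substitute one you can submit to

Once the job starts, the utility walks you through the three-step prompt described above to open the notebook in your browser.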

In addition, we are providing a similar utility for running software with graphical user interfaces (GUIs), such as MATLAB, Paraview, ANSYS, and many more, on PACE clusters. VNC sessions offer a more robust experience compared to traditional X11 forwarding. With a local VNC Viewer client, you can connect to a remote desktop on a compute node and interact with the software as if it were running on your local machine. Similar to the Jupyter Notebook utility, the new utility pace-vnc-job provides an easy-to-run command for launching a VNC session on a compute node and connecting your client to the session. To launch a VNC session, log in to PACE and run the command pace-vnc-job -q <QUEUENAME>, where <QUEUENAME> should be replaced with the queue in which you wish to run your job. Once the job starts, follow the three-step prompt to connect your VNC Viewer to the remote session, then start the software you wish to run. Full documentation on the use of pace-vnc-job, including options such as job walltime, processors, and memory, can be found at http://docs.pace.gatech.edu/interactiveJobs/setupVNC_Session/. Again, please note that on busy queues, you may experience longer wait times to launch a VNC session.
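For example, launching a VNC session on a hypothetical queue:

    pace-vnc-job -q hive    # placeholder queue; substitute one you can submit to

After the three-step prompt connects your VNC Viewer, start your GUI application (for example, by loading its module and launching it from the remote desktop’s terminal).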

Happy Interactive computing!

Best,
The PACE Team

[COMPLETED] PACE Quarterly Maintenance – November 7-9

[Update 11/5/19]

We would like to remind you that PACE’s maintenance period begins tomorrow. This quarterly maintenance period is planned for three days and will start on Thursday, November 7, and go through Saturday, November 9.  As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

These activities will be performed:
ITEM REQUIRING USER ACTION:
– Anaconda distributions have used a year.month versioning scheme since late last year (https://www.anaconda.com/anaconda-distribution-2018-12-released/). This scheme is easier for all PACE users to track; accordingly, PACE will adopt the same convention with the anaconda2/2019.10 and anaconda3/2019.10 modules across all PACE resources. Defaults for Anaconda will now be set to the latest YYYY.MM release. Therefore, the Anaconda module files for “latest” will be removed to avoid ambiguity. However, software installations that rely on “latest” will be retained to preserve any critical user workflows. Users currently loading an Anaconda module ending in “latest” should modify their commands to reference a specific version of Anaconda (or just load the default without a version specified, e.g., “module load anaconda3”). Please email PACE Support if you need help accessing older versions of Anaconda that are no longer available via the modules system or updating your scripts.
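As a sketch of the required change to job scripts (module names follow the announcement above; the exact versions installed may vary):

    # Before: relies on the "latest" module file, which is being removed
    module load anaconda3/latest
    # After: pin a specific version...
    module load anaconda3/2019.10
    # ...or load the default, which now tracks the latest YYYY.MM release
    module load anaconda3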

ITEMS NOT REQUIRING USER ACTION:
– (Completed) Scheduler settings will be modified to improve the scheduler’s ability to handle large numbers of job submissions rapidly. These changes, along with the new policy implemented last week (10/29/19) limiting simultaneous job submissions (http://blog.pace.gatech.edu/?p=6550), will help stabilize the shared scheduler (accessed via login-s[X] headnodes) and make it more reliable. These scheduler settings are already implemented on the Hive cluster.
– (Completed) PBSTools, which records user job submissions, will be upgraded.
– (Completed) Upgrades to routers and network connections for PACE in Rich and Hive in Coda will be made in order to improve high-speed data transfer.
– (Completed) [Hive cluster] Infiniband switch firmware will be upgraded.
– (Completed) [Hive cluster] Storage system firmware will be updated.
– (Completed) [Hive cluster] Subnet managers will be reconfigured for better redundancy.
– (Completed) [Hive cluster] Lmod, the environment module system, will be updated to a newer version.
– (Completed) The athena-6 queue will be upgraded to RHEL7.

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu. You can follow our Maintenance blog post at http://blog.pace.gatech.edu/?p=6614.

[Update 11/1/19]

We would like to remind you that we are preparing for PACE’s next quarterly maintenance days on November 7-9, 2019. This maintenance period is planned for three days and will start on Thursday, November 7, and go through Saturday, November 9.  As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEM REQUIRING USER ACTION:

– Anaconda distributions have used a year.month versioning scheme since late last year (https://www.anaconda.com/anaconda-distribution-2018-12-released/). This scheme is easier for all PACE users to track; accordingly, PACE will adopt the same convention with the anaconda2/2019.10 and anaconda3/2019.10 modules across all PACE resources. Defaults for Anaconda will now be set to the latest YYYY.MM release. Therefore, the Anaconda module files for “latest” will be removed to avoid ambiguity. However, software installations that rely on “latest” will be retained to preserve any critical user workflows. Users currently loading an Anaconda module ending in “latest” should modify their commands to reference a specific version of Anaconda (or just load the default without a version specified, e.g., “module load anaconda3”). Please email PACE Support if you need help accessing older versions of Anaconda that are no longer available via the modules system or updating your scripts.

ITEMS NOT REQUIRING USER ACTION:

– Scheduler settings will be modified to improve the scheduler’s ability to handle large numbers of job submissions rapidly. These changes, along with the new policy being implemented on Tuesday (10/29/19) limiting simultaneous job submissions (http://blog.pace.gatech.edu/?p=6550), will help stabilize the shared scheduler (accessed via login-s[X] headnodes) and make it more reliable. These scheduler settings are already implemented on the Hive cluster.

– RHEL7 clusters will receive critical patches.

– Updates will be made to PACE databases and configurations.

– PBSTools, which records user job submissions, will be upgraded.

– Upgrades to routers and network connections for PACE in Rich and Hive in Coda will be made in order to improve high-speed data transfer.

– [Hive cluster] Infiniband switch firmware will be upgraded.

– [Hive cluster] Storage system software will be updated.

– [Hive cluster] Subnet managers will be reconfigured for better redundancy.

– [Hive cluster] Lmod, the environment module system, will be updated to a newer version.

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu. You can follow our Maintenance blog post at http://blog.pace.gatech.edu/?p=6614.

[Original post]

We are preparing for PACE’s next maintenance days on November 7-9, 2019. This maintenance period is planned for three days and will start on Thursday, November 7, and go through Saturday, November 9.  As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

We are still finalizing planned activities for the maintenance period. Here is a current list:
ITEM REQUIRING USER ACTION:
– Anaconda distributions have used a year.month versioning scheme since late last year. This scheme is easier for all PACE users to track; accordingly, PACE will adopt the same convention with the anaconda2/2019.10 and anaconda3/2019.10 modules across all PACE resources. Defaults for Anaconda will now be set to the latest YYYY.MM release. Therefore, the Anaconda module files for “latest” will be removed to avoid ambiguity. However, software installations that rely on “latest” will be retained to preserve any critical user workflows. Users currently loading an Anaconda module ending in “latest” should modify their commands to reference a specific version of Anaconda (or just load the default without a version specified, e.g., “module load anaconda3”). Please email PACE Support if you need help accessing older versions of Anaconda that are no longer available via the modules system or updating your scripts.

ITEMS NOT REQUIRING USER ACTION:
– Scheduler settings will be modified to improve the scheduler’s ability to handle large numbers of job submissions rapidly. These changes, along with the new policy being implemented on Tuesday (10/29/19) limiting simultaneous job submissions, will help stabilize the shared scheduler (accessed via login-s[X] headnodes) and make it more reliable. These scheduler settings are already implemented on the Hive cluster.
– RHEL7 clusters will receive critical patches.
– Updates will be made to PACE databases and configurations.
– Firmware for DDN storage will be updated.

– Upgrades to routers and network connections for PACE in Rich and Hive in Coda will be made in order to improve high-speed data transfer.
– [Hive cluster] Infiniband switch firmware will be upgraded.
– [Hive cluster] Subnet managers will be reconfigured for better redundancy.
– [Hive cluster] Lmod, the environment module system, will be updated to a newer version.

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.