OIT NetApp upgrade

A low-risk upgrade is planned for Georgia Tech OIT’s NetApp storage appliances, beginning Saturday, July 10, at 6:00 AM. We do not expect any impact on PACE systems from this upgrade.

OIT’s NetApp appliance is in use on PACE’s Phoenix, PACE-ICE, and COC-ICE clusters. It hosts home directories as well as pace-apps, our software module repository. Should there be an unexpected disruption, users may face issues with logins, access to home directories, and loading or using PACE-supported software modules. We will provide updates in the unlikely event of a disruption this weekend.

Please contact us at pace-support@oit.gatech.edu with any questions.

Phoenix scratch storage update

We would like to remind you about the scratch storage policy on Phoenix. Scratch is designed for temporary storage and is never backed up. Each week, files not modified for more than 60 days are automatically deleted from your scratch directory. As part of Phoenix’s start-up, regular cleanup of scratch has now been implemented. Each week, users whose files are scheduled for deletion receive a warning email listing the files to be removed in the coming week, along with additional information. Those of you who used PACE prior to the migration to Phoenix or who use Hive are already familiar with this workflow.
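If you would like to see which of your files are at risk before a cleanup runs, a command along the following lines can list them. This is a minimal sketch that assumes your scratch directory is reachable at ~/scratch; adjust the path to match your own environment.

    # List files in scratch that have not been modified in more than 60 days
    # (~/scratch is an assumed location; substitute your actual scratch path)
    find ~/scratch -type f -mtime +60 -ls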

Some of you will receive such an email this week. The first deletion of old scratch files on Phoenix will occur on July 7, covering files noted in these messages. We are extending the time beyond the normal one-week notification for this first round to give you time to adjust to this weekly process again.

Phoenix project storage is the intended location for your important research data. You can find out more about Phoenix storage at http://docs.pace.gatech.edu/phoenix_cluster/storage_phnx/.
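If you have scratch data worth keeping, one way to move it into project storage is with rsync. The sketch below uses ~/scratch and a placeholder project-storage path (~/p-mypi-0); replace both with the actual locations for your group, which you can confirm in the storage documentation linked above.

    # Copy results worth keeping from scratch into project storage
    # (both paths below are placeholders; substitute your own)
    rsync -av ~/scratch/important_results/ ~/p-mypi-0/important_results/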

Please contact us at pace-support@oit.gatech.edu with any questions about how to manage your data stored on Phoenix.

[Resolved] Hive scheduler outage

The Hive scheduler experienced an outage this afternoon, as the resource and workload managers were unable to communicate. Our team traced the problem to a missing library file and corrected it, restoring functionality at approximately 5 PM today.
Jobs submitted this afternoon would not have been able to start until the repair was implemented. Already-running jobs should not have been affected.
Please contact us at pace-support@oit.gatech.edu with any questions.

[Resolved] Phoenix scratch outage

[Update 6/12/21 6:30 PM]

Phoenix Lustre scratch has been restored. We paused the scheduler at 4:40 PM to prevent additional jobs from starting and resumed scheduling at 6:20 PM. As noted, please contact us with the job number for any job that began prior to 4:40 PM and was affected by the scratch outage, in order to receive a refund.

[Original post, 6/12/21 4:30 PM]

We are experiencing an outage on Phoenix’s Lustre scratch storage. Our team is currently investigating and has confirmed that this issue is related to the scratch mount and does not affect home or project storage. Users may be unable to list, read, or write files in their scratch directories.
If your running job has failed or runs without producing output as a result of this outage, please contact us at pace-support@oit.gatech.edu with the affected job number(s), and we will refund the value of the job(s) to your charge account. Please refrain from submitting additional jobs that use your networked Lustre scratch directory until the service is repaired, in order to avoid increasing the number of failed jobs.
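If you need to identify or stop jobs that depend on scratch, commands along these lines may help. This is a sketch assuming the Torque/Moab (PBS) scheduler in use on Phoenix at the time, with 12345 standing in for an actual job ID from your own job list.

    # List your queued and running jobs to find the IDs of jobs affected by the outage
    qstat -u $USER

    # Cancel a job that relies on Lustre scratch until the outage is resolved
    # (12345 is a placeholder job ID)
    qdel 12345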

Parallel Computing with MATLAB and Scaling to HPC on PACE clusters at Georgia Tech

MathWorks and PACE are partnering to offer a two-part parallel computing virtual workshop, taught by a MathWorks engineer, to PACE users and other members of the Georgia Tech community.

During this self-paced, hands-on workshop, you will be introduced to parallel and GPU computing in MATLAB for speeding up your application and offloading computations. By working through common scenarios and workflows, you will gain an understanding of the parallel constructs in MATLAB, their capabilities, and some of the issues that may arise when using them. You will also learn how to take advantage of PACE resources, which are available to all researchers at Georgia Tech (including a free tier), to scale your MATLAB computations.

Register by noon on May 14 at https://gatech.co1.qualtrics.com/jfe/form/SV_cD7prAcGZRthKCO.

Highlights
·      Speeding up programs with parallel computing
·      Working with large data sets
·      GPU computing
·      Scaling to PACE clusters (Phoenix, Hive, ICE, or Firebird)

Agenda
This virtual workshop will be held in two parts:
Part I, Tuesday, May 18, 1-4 PM, will focus on speeding up MATLAB with Parallel Computing Toolbox.
Part II, Tuesday, May 25, 1-4 PM, will focus on running MATLAB parallel code on PACE clusters.

Who should attend?
PhD students, postdocs, and faculty at Georgia Tech who want to (Part I) use parallel and GPU computing in MATLAB and (Part II) scale their computations to take advantage of PACE resources.

Requirements
·      Basic working knowledge of MATLAB
·      Access to the Georgia Tech VPN. You do NOT need to be a PACE user, and all participants will receive access to PACE-ICE for hands-on activities.

Please contact PACE at pace-support@oit.gatech.edu with any questions.

Phoenix Project Storage Quotas Begin March 31

[Update 3/31/21 10:45 AM]

As previously announced, we have applied quotas to Phoenix project storage today based on each faculty PI’s choice. You can run the pace-quota command to check your research group’s utilization and quota at any time. Project storage quotas on Phoenix are set for a faculty member’s research group, not for individual users. Each user also has 10 GB of home storage and 15 TB of short-term scratch storage (not backed up).
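For reference, checking usage is a single command from a Phoenix login node; if you also want a per-directory breakdown of what is consuming space, du can help. The du step and the project-storage path below are our own illustration rather than part of the pace-quota tool, and the path is a placeholder.

    # Show your research group's project storage utilization and quota
    pace-quota

    # Optional: per-directory breakdown of usage in your project storage
    # (~/p-mypi-0 is a placeholder path; substitute your group's actual directory)
    du -sh ~/p-mypi-0/*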

You can learn more about Phoenix storage at http://docs.pace.gatech.edu/phoenix_cluster/storage_phnx/. April will be the first month in which storage quotas incur charges for PIs choosing quotas beyond the 1 TB funded by the Institute.

If your research group exceeds its quota, you will not be able to write to your project storage, and jobs running in your project storage may fail. We are directly contacting all users whose storage projects are over quota today.

Please contact PACE Support with any questions about Phoenix project storage quotas. Faculty may also choose to contact their PACE Research Scientist liaison.

 

[Update 3/24/21 12:30 PM]

We’d like to remind you that storage quotas on Phoenix project storage will be set one week from today, on March 31.

You can run the pace-quota command to check your research group’s utilization at any time. Your PI/faculty sponsor is choosing the quota that will be set, based on your group’s storage needs. You can learn more about Phoenix storage at http://docs.pace.gatech.edu/phoenix_cluster/storage_phnx/.

After quotas are set on March 31, we will notify all users, and you will be able to see your quota via pace-quota.

Users and faculty should contact their PACE Research Scientist liaison or PACE Support with any questions about Phoenix project storage quotas.

 

[Original Post]

As part of completing the migration to Phoenix, we will set quotas on Phoenix project storage on March 31, ending the period of unlimited project storage. Project storage quotas on Phoenix are set for a faculty member’s research group, not for individual users. Each user also has 10 GB of home storage and 15 TB of short-term scratch storage (not backed up).

PACE’s free tier offers 1 TB of Institute-funded project storage to GT faculty. Faculty members must fund additional storage beginning in April. PACE has provided faculty members holding Phoenix storage allocations (except those recently created) with information regarding their group’s storage needs. Users can contact their advisors if they have concerns about their allocation.

All users can run the “pace-quota” command on Phoenix to see their research group’s storage usage. Quotas will generally show as unlimited (zero) until March 31.

Please contact us at pace-support@oit.gatech.edu with any questions about Phoenix project storage.

Hive Scratch Storage Update

We would like to remind you about the scratch storage policy on Hive. Scratch is designed for temporary storage and is never backed up. Each week, files not modified for more than 60 days are automatically deleted from your scratch directory. As part of Hive’s start-up, regular cleanup of scratch has now been implemented. Each week, users whose files are scheduled for deletion receive a warning email listing the files to be removed in the coming week, along with additional information. Those of you who use the main PACE system are already familiar with this workflow.

Some of you received such an email yesterday. As always, if you need additional time to migrate valuable data off of scratch, please respond to the email as directed to request a delay.

Please contact us at pace-support@oit.gatech.edu with any questions about how to manage your data stored on Hive.

[Reopened] Network (Infiniband Subnet Manager) Issues in Rich

[ Update 8/14/20 7:00 PM ]

After an additional nearly-48-hour outage in the Rich datacenter due to network/InfiniBand issues, we have brought PACE resources on the affected systems back up and released user jobs. We thank you for your patience and understanding during this unprecedented outage, as we understand the significant impact it has continued to have on your research throughout this week. Please note that PACE clusters in the Coda datacenter (Hive and testflight-coda) and CUI clusters in Rich have not been impacted.

While new jobs have not begun over the past two days, already-running jobs have continued. Please check the output of any jobs that are still running. If they are failing or not producing output, please cancel them and resubmit to run again. Some running user jobs were killed in the process of repairing the network, and those should also be resubmitted to the queue.

In addition to previously reported repairs, we removed a problematic spine module from a network switch this morning and further adjusted connections. This module appeared to be causing intermittent failures when under heavy load.

Currently, our network is running at reduced capacity. We have ordered a replacement spine module for the part that was removed. Today we conducted extensive stress testing of the network and storage, going far beyond the tests run earlier in the week, and the results indicate the system is healthy. We will continue to monitor the systems for any further network abnormalities.

Again, thank you for your patience and understanding this week while we addressed one of the most significant outages in the history of PACE.

Please contact us at pace-support@oit.gatech.edu with any questions or if you observe unexpected behavior on the cluster.

[ Update 8/13/20 8:30 PM ]

We continue to work on the network issues impacting the Rich datacenter. We have partitioned the network and adjusted connections in an effort to isolate the problem. As mentioned this morning, we have ordered parts to address potentially problematic switches as we continue systematic troubleshooting of them. We continue to run tests on InfiniBand, and we are running an overnight stress test on the network to monitor for recurrence of errors. The schedulers remain paused to prevent further jobs from being launched on the cluster. We will follow up tomorrow with an update on the Rich cluster network.

Thank you for your continued patience and understanding during this outage.

[ Update 8/13/20 10:10 AM ]

Unfortunately, after the nearly-80-hour outage earlier this week, we must report another network outage. We apologize for this inconvenience, as we do understand its impact on your research. The network/InfiniBand issues in the Rich datacenter began recurring late yesterday evening, and we are aware of them. We are currently working to resolve them, and we have ordered replacements for the network switch components that appear problematic. The issue was not detected by our deterministic testing methods and occurred only after restarting user production jobs caused very heavy network utilization. We will provide further updates once more information is available. As before, you may experience slowness in accessing storage (home, project, and/or scratch) and/or issues with communication within MPI jobs.
We have paused all the schedulers for clusters in Rich datacenter that are accessed by the following headnodes/login nodes: login-s, login-d, login7-d, novazohar, gryphon, and testflight-login. This pause prevents additional jobs from starting, but already-running jobs have not been stopped. However, there is a chance they will be killed as we continue to work to resolve the network issues.
Please note that this network issue does not impact the Coda datacenter (Hive and testflight-coda) or CUI clusters in the Rich datacenter.
Thank you for your continued patience as we continue to work to resolve this issue.
Please contact us with any questions or concerns at pace-support@oit.gatech.edu.

[ Update 8/12/20 6:20 PM ]

After nearly 80 hours of outage in the Rich datacenter due to network/InfiniBand issues, we have been able to bring the PACE compute nodes back up, and user jobs have begun to run again. We thank you for your patience during this period, and we understand the significant impact of this outage on your research this week.
For any user jobs that were killed due to restarts yesterday, please resubmit the jobs to the queue at this time. Please check the output of any recent jobs and resubmit any that did not succeed.
As noted yesterday evening, we have carefully brought nodes back into production in small groups to identify issues, and we have turned off nodes that we identified as having network difficulties. Our findings point to multiple hardware problems that caused InfiniBand connectivity problems between nodes. We addressed these issues, and we are no longer observing the errors after our extensive testing. We will continue to monitor the systems, but please contact us immediately at pace-support@oit.gatech.edu if you notice your job running slowly or failing to produce output.
Please note that we will continue to work on problematic nodes that are currently offline in order to restore compute access to all PACE users, and we will contact affected users as needed.
Again, thank you for your patience and understanding this week while we addressed one of the most impactful outages in the history of PACE.
Please contact us at pace-support@oit.gatech.edu with any questions.

[ Update 8/12/20 12:30 AM ]

We continue to work to bring PACE nodes back into production. After turning off all the compute nodes and reseating the faulty network connections we identified, we have been slowly bringing nodes back up to avoid overwhelming the network fabric, which has been clean so far. We are carefully testing each group to ensure full functionality, and we continue to identify problematic nodes and repair them where possible. At this time, the schedulers remain paused while we turn on and test nodes. We will provide additional updates as more progress is made.

[ Update 8/11/20 5:15 PM]

We continue to troubleshoot the network issues in the Rich datacenter. Unfortunately, our efforts to avoid disturbing running jobs have complicated the troubleshooting, which has not led to a resolution. At this time, we need to begin systematic rebooting of many nodes, which will kill some running user jobs. We will contact users with current running jobs directly to alert you to the effect on your jobs.

Our troubleshooting today has included reseating multiple spine modules in the main datacenter switch, adjusting uplinks between the two main switches to isolate problems, and rebooting switches and some nodes.

We will continue to provide updates as more information becomes available. Thank you for your patience during this outage.

[ Update 8/10/20 11:35 PM ]

We have made several changes to create a more stable InfiniBand network, including deploying an updated subnet manager, bypassing bad switch links, and repairing GPFS filesystem errors. However, we have not yet been able to uncover all issues the network is facing, so affected schedulers remain paused for now, to ensure that new jobs do not begin when they cannot produce results.

We will provide an update on Tuesday as more information becomes available. We greatly appreciate your patience as we continue to troubleshoot.

[ Update 8/10/20 6:20 PM ]

We are continuing to troubleshoot network issues in Rich. At this time, we are working to deploy an older backup subnet manager, and we will test the network again to determine if communication has been restored after that step.

The schedulers on the affected clusters remain paused, to ensure that new jobs do not begin when they cannot produce results.

We recognize that this outage has a significant impact on your research, and we are working to restore functionality in Rich as soon as possible. We will provide an update when more information becomes available.

[ Update 8/9/20 11:55 PM]

We have restarted PACE’s Subnet Manager in Rich, but some network slowness remains. We are continuing to troubleshoot the problem. At this time, we plan to leave the Rich schedulers paused overnight in order to ensure that the issue is fully resolved before additional jobs begin, so that they will be able to run successfully.
We will provide further updates on Monday.

[ Original Post]

At approximately noon today, we began experiencing issues with our primary InfiniBand Subnet Manager in the Rich data center. PACE is investigating this issue. We will provide an update when additional information or a resolution is available. At this time, you may experience slowness in accessing storage (home, project, or scratch) or issues with communication within MPI jobs.

In order to minimize impact to jobs, we have paused all schedulers on the affected clusters (accessed via login-s, login-d, login7-d, novazohar, gryphon, and testflight-login headnodes). This will prevent additional jobs from starting, but jobs that are already running will not be stopped, although they may fail to produce results due to the network issues.

This issue does not impact the Coda data center (Hive & testflight-coda clusters) or CUI clusters in the Rich data center.

Please contact us with any questions or concerns at pace-support@oit.gatech.edu.

[Resolved] [testflight-coda] Lustre scratch outage

[ Update 8/11/20 10:15 AM]

Lustre scratch has been repaired. We identified a broken ethernet port on a switch and moved to another port, restoring access.

[ Original Post ]

There is an outage affecting our Lustre scratch, which is currently used only in testflight-coda. We are working with the vendor to restore the system. Storage on all PACE production systems is unaffected.

You may continue your testing in testflight-coda to prepare for your Coda migration by using Lustre project storage, accessed via the “data” symbolic link in your testflight-coda home directory.

We will provide an update when the Lustre scratch system is restored. Please contact us at pace-support@oit.gatech.edu with questions.