PACE’s centralized OSG service, powered with a new cluster “Buzzard”

We are happy to announce a new addition to PACE’s service portfolio to support Open Science Grid (OSG) efforts on campus and beyond. This service is kick-started by a brand new cluster, named “Buzzard”, funded by an NSF award* led by Dr. Mehmet Belgin and Semir Sarajlic of PACE, in collaboration with Drs. Laura Cadonati, Nepomuk Otte, and Ignacio Taboada of the Center for Relativistic Astrophysics (CRA). 

Open Science Grid (OSG) is a unique consortium that provides shared infrastructure and services to unify access to supercomputing sites across the nation, making a vast array of High Throughput Computing (HTC) resources available to US-based researchers. OSG has been instrumental in ground-breaking scientific advancements, including but not limited to the Nobel-winning Gravitational Waves research (LIGO).  

Did you know that all GT researchers already qualify for OSG? This means you can join today and start running jobs on this vast resource at no cost. We highly encourage you to register for PACE’s next OSG orientation class, which will get you started with the basics of running on OSG.  As an added resource, PACE offers documentation to help researchers get started quickly with OSG. 

In addition to training and documentation, PACE offers resource integration services. More specifically, GT faculty members now have the option to acquire new resources to expand Buzzard for their own OSG projects, similar to the High Performance Computing (HPC) services that PACE offered successfully from 2009 until the new cost model. As part of the NSF award, PACE has already started supporting several exceptional OSG projects, namely LIGO, IceCube and CTA/VERITAS, and we look forward to supporting more OSG projects in the future! 

If you are interested in the OSG service, please feel free to reach out to us (pace-support@oit.gatech.edu) and we’ll be happy to discuss how our new service can transform your research. 

Thank you! 

 

* This material is based upon work supported by the National Science Foundation under grant number 1925541. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. 

Announcing the PACE OSG Orientation Class

Dear PACE Researchers, 

PACE is pleased to announce the launch of the PACE Open Science Grid (OSG) Orientation class that introduces Georgia Tech’s research community to OSG and the distributed high throughput computing resources that are available via OSG Connect.   Join us for this virtual orientation to learn about OSG and how it may benefit your research needs. 

Please see below the dates for the sessions and the registration form: 

Dates and times:  October 15, 10:30am – 12:15pm 

                               November 11, 1:30pm – 3:15pm 

Registration:         https://b.gatech.edu/3Bi4Yie 

This class is based in part on the work supported by the NSF CC* award 1925541: “Integrating Georgia Tech into the Open Science Grid for Multi-Messenger Astrophysics”. With this award, PACE, in collaboration with the Center for Relativistic Astrophysics, added CPU/GPU/Storage to the existing OSG capacity, as well as the first regional StashCache service that benefits all OSG institutions in the Southeast region, not just Georgia Tech.  

This orientation is the first step in PACE’s longer-term plans to support OSG initiatives on campus. Please be on the lookout for more exciting announcements from our team in the very near future. 

We look forward to you joining us for the OSG orientation. 

Best,

The PACE Team

Hive and Phoenix Scheduler Configuration Change

Dear PACE Researchers, 

We would like to announce an upcoming change to the scheduler configuration on the Phoenix and Hive clusters at 9:00 AM on Thursday, September 23rd. This change should improve the scheduler performance given the large number of jobs executed by our users. 

What will PACE be doing: PACE will reduce the retention time for job-specific logs from 24 hours to 6 hours after job completion.  Reducing the amount of job information the scheduler needs to process regularly should provide a more stable and faster job submission environment. Additionally, the downtime associated with scheduler restarts should decrease, as job ingestion time will be reduced accordingly.  

Who does this message impact: Any user who attempts to use qstat for a job more than 6 hours after completion will be unable to do so moving forward. In addition to the scheduler job STDOUT/STDERR files, job statistics for completed jobs on Phoenix and Hive can be queried at https://pbstools-coda.pace.gatech.edu. 
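As a rough illustration of the new retention window, the sketch below checks whether a completed job should still be visible to qstat. The function name and the example timestamps are hypothetical; only the 6-hour figure comes from this announcement, and the scheduler's exact cutoff behavior is an assumption.

```python
from datetime import datetime, timedelta

# Retention window from this announcement (the previous value was 24 hours).
RETENTION = timedelta(hours=6)

def still_in_qstat(completed_at: datetime, now: datetime) -> bool:
    """Return True if a job that finished at `completed_at` is still
    inside the retention window at time `now` (assumed inclusive cutoff)."""
    return now - completed_at <= RETENTION

# Hypothetical job that finished at 9:00 AM on the day of the change.
finished = datetime(2021, 9, 23, 9, 0)
print(still_in_qstat(finished, datetime(2021, 9, 23, 14, 0)))  # 5 hours later: True
print(still_in_qstat(finished, datetime(2021, 9, 23, 16, 0)))  # 7 hours later: False
```

For jobs outside the window, the STDOUT/STDERR files and the job statistics site mentioned above remain the places to look.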

What PACE will continue to do: We will monitor the clusters for issues during and after the configuration change to assess any immediate impacts from the update. We will continue to assess the scheduler health to ensure a stable job submission environment. 

As always, please contact us at pace-support@oit.gatech.edu with any questions or concerns regarding this change. 

Best Regards, 
The PACE Team

[Complete] PACE Maintenance Period (November 3 – 5, 2021)

[Complete 11/5/21 3:15 PM]

Our scheduled maintenance has completed ahead of schedule! All PACE clusters, including Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard, are ready for research. As usual, we have released all user jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.

Our next maintenance period is tentatively scheduled to begin at 6:00AM on Wednesday, February 9, 2022, and conclude by 11:59PM on Friday, February 11, 2022. We have also tentatively scheduled the remaining maintenance periods for 2022 for May 11-13, August 10-12, and November 2-4.

The following tasks were part of this maintenance period:
ITEMS REQUIRING USER ACTION:
• [Complete] TensorFlow upgrade due to security vulnerability. PACE will retire older versions of TensorFlow, and researchers should shift to using the new module. We also request that you replace any self-installed TensorFlow packages. Additional details are available on our blog.

ITEMS NOT REQUIRING USER ACTION:
• [Complete][Datacenter] Databank will clean the water cooling tower, requiring that all PACE compute nodes be powered off.
• [Complete][System] Operating system patch installs
• [Complete][Storage/Phoenix] Lustre controller firmware and other upgrades
• [Complete][Storage/Phoenix] Lustre scratch upgrade and expansion
• [Postponed][Storage] Hive GPFS storage upgrade
• [Complete][System] System configuration management updates
• [Complete][System] Updates to NVIDIA drivers and libraries
• [Complete][System] Upgrade some PACE infrastructure nodes to RHEL 7.9
• [Complete][System] Reorder group file
• [Complete][Headnode/ICE] Configure c-group controls on COC-ICE and PACE-ICE headnodes
• [Complete][Scheduler/Hive] Separate Torque & Moab servers to improve scheduler reliability
• [Complete][Network] Update Ethernet switch firmware
• [Complete][Network] Update IP addresses of switches in BCDC

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[Update 11/1/21 2:00 PM]

C-group controls will be configured on the login nodes for both COC-ICE and PACE-ICE during the maintenance period this week. This should help mitigate overuse of the login node by students running heavy computations, which has slowed the node for others.

Please use compute nodes for all computational work and avoid resource-intensive processes on the login nodes. Students who need an interactive environment are requested to submit an interactive job. Students who are uncertain about how to use ICE schedulers to work on compute nodes should contact their course’s instructor or TA, who can help with workflows on the cluster. PACE will stop processes that overuse the login nodes, in order to restore functionality for all students.

Thank you for your efforts to ensure ICE clusters are an available resource for all students in participating courses.

[Reminder 10/26/21 4:30 PM]

Additional details and instructions for the TensorFlow upgrade are available in another blog post.

[Full announcement 10/20/21 10:30 AM]

As previously announced, our next PACE maintenance period is scheduled to begin at 6:00 AM on Wednesday, November 3, and end at 11:59 PM on Friday, November 5. As usual, jobs that request durations that would extend into the maintenance period will be held by the scheduler to run after maintenance is complete. During the maintenance window, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.
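The hold rule described above can be pictured with a small sketch: a job is held if its requested duration would carry it past the start of maintenance. This is an illustration only; the function name and example jobs are hypothetical, and the scheduler's actual logic may differ.

```python
from datetime import datetime, timedelta

# Start of the maintenance period from this announcement.
MAINTENANCE_START = datetime(2021, 11, 3, 6, 0)

def would_be_held(earliest_start: datetime, walltime: timedelta) -> bool:
    """Return True if a job starting at `earliest_start` with the requested
    `walltime` could still be running when maintenance begins."""
    return earliest_start + walltime > MAINTENANCE_START

# Hypothetical jobs that could start at 8:00 PM on November 2.
start = datetime(2021, 11, 2, 20, 0)
print(would_be_held(start, timedelta(hours=8)))   # would end 4:00 AM Nov 3: False
print(would_be_held(start, timedelta(hours=12)))  # would end 8:00 AM Nov 3: True
```

In practice, requesting a shorter duration that fits before 6:00 AM on November 3 is the way to have a job run ahead of the maintenance period.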

Please see below for a tentative list of activities:

ITEMS REQUIRING USER ACTION:

  • TensorFlow upgrade due to security vulnerability. PACE will retire older versions of TensorFlow, and researchers should shift to using the new module. We also request that you replace any self-installed TensorFlow packages. Additional details and instructions will follow in a separate message.

ITEMS NOT REQUIRING USER ACTION:

  • [Datacenter] Databank will clean the water cooling tower, requiring that all PACE compute nodes be powered off.
  • [System] Operating system patch installs
  • [Storage/Phoenix] Lustre controller firmware and other upgrades
  • [Storage/Phoenix] Lustre scratch upgrade and expansion
  • [System] System configuration management updates
  • [System] Updates to NVIDIA drivers and libraries
  • [System] Upgrade some PACE infrastructure nodes to RHEL 7.9
  • [System] Reorder group file
  • [Headnode/COC-ICE] Configure c-group controls on COC-ICE headnode
  • [Scheduler/Hive] Separate Torque & Moab servers to improve scheduler reliability
  • [Network] Update Ethernet switch firmware
  • [Network] Update IP addresses of switches in BCDC

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[Early announcement]

Dear PACE Users,

This is a friendly reminder that our next Maintenance period is tentatively scheduled to begin at 6:00AM on Wednesday, 11/03/2021, and it is tentatively scheduled to conclude by 11:59PM on Friday, 11/05/2021. As usual, jobs with resource requests that would be running during the Maintenance Period will be held until after the Maintenance Period by the scheduler. During the Maintenance Period, access to all the PACE managed computational and storage resources will be unavailable.

As we get closer to the Maintenance Period, we will communicate the list of activities to be completed and update this blog post.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

[Complete] PACE is transitioning from current ticketing system FootPrints to ServiceNow

[Update – September 3]

Dear PACE Users,

PACE has successfully transitioned to ServiceNow, and we have begun receiving user tickets as expected in ServiceNow.

As previously mentioned, you may continue to use the pace-support@oit.gatech.edu email address to reach PACE support. For your reference, the following three direct links to ServiceNow forms may be used going forward to request help, request new software for the PACE Apps software repository, and request access to the ICE cluster.

The PACE team will continue to work on the remaining support requests that are in the FootPrints system.  Thank you all for your attention and patience through this transition.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Best, 

The PACE Team 

 

[Original Message – September 1]

Dear PACE Users,  

We are reaching out to inform you that PACE is transitioning from our current ticketing system FootPrints to ServiceNow. 

What’s happening and what we are doing:   The PACE team is transitioning from our current ticketing system, FootPrints, to ServiceNow. Starting September 3, all new PACE support requests will be processed in ServiceNow.  PACE will continue to work on any existing support requests that are in FootPrints.  As part of this transition, we have created two new request forms that replace our existing Software Request Form and PACE ICE Instructional Cluster Request Form.  

How does this impact me: In most cases the transition is seamless for users; the exception is that the links to our software and ICE request forms are changing. On Friday, September 3rd, the PACE support email address, pace-support@oit.gatech.edu, will redirect users’ emails/requests to ServiceNow, and the new software and ICE request form links will be available on our website. Please use the new forms if you would like to request new software for the PACE Apps software repository or if you are a course instructor interested in using PACE-ICE for your students.  Users who previously submitted tickets directly via FootPrints may use ServiceNow at https://services.gatech.edu (navigate to “Technology” and then the “PACE” tile) and submit their requests from the available forms.   

The following direct links to ServiceNow forms will be live and available to users on September 3: 

What we will continue to do:   We will continue to work on the existing tickets that are in FootPrints, and you may check the status of this transition on this blog post.   

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Best, 

The PACE Team 

Email Relay Reconfiguration that’s Impacting PACE Utilities

Dear PACE Users,

We are reaching out to inform you that on Monday, August 30, PACE will begin reconfiguring its utilities that send messages to users. As a result, the “from” address on these messages will change to no-reply@pace.gatech.edu. These changes are required for us to comply with the Institute’s email notification requirements. We want to bring this to your attention so that you are aware of the new address from which you will receive PACE messages.

What’s happening and what we are doing: PACE will be making changes to utilities that send messages to users, which will change the email address listed in the “from” field. PACE will begin updating its utilities on Monday, August 30, and the work will continue through the coming weeks. More specifically, the following utilities will be reconfigured:

  • [Complete] Scheduler (all clusters): Emails from the scheduler with job status information will change from moabadmin@<scheduler>.pace.gatech.edu to being from no-reply@pace.gatech.edu.
  • PACE Support script (all clusters): The pace-support script is currently disabled. Once updated, the script will send information to the ticketing system from no-reply@pace.gatech.edu and embed your email address so that the ticket is attributed to you, rather than sending the message as though it came from you, as it did previously. This should be transparent to you.
  • [Complete] PI and Department CSR Monthly statements for Phoenix and Firebird clusters: These will change from having a pace-support@oit.gatech.edu from address to being from no-reply@pace.gatech.edu, with a reply-to of pace-support@oit.gatech.edu.
  • Security/system information (all clusters): Security violations and general system mail will be redirected to be from no-reply@pace.gatech.edu. This will include mail sent using the mail commands. System mail will be redirected to your email account as identified in GT systems. This may result in you receiving mail messages that were previously left on the system in an undeliverable state.
  • Head node violation messages (all clusters): The “from” address for these messages will change from pace-support@oit.gatech.edu to no-reply@pace.gatech.edu, with the reply-to set to pace-support@oit.gatech.edu.
  • Scratch storage deleter messages (Phoenix & Hive): The “from” address for these messages will change from pace-support@oit.gatech.edu to no-reply@pace.gatech.edu, with the reply-to set to pace-support@oit.gatech.edu.
  • Reconfigure PACE servers to send via GT outgoing mail servers (all clusters): This will increase the likelihood of email messages being delivered and also not being identified as spam. This should be transparent to you, but adds email headers for signatures and changes the server that will deliver the email.

How does this impact me: All messages that you receive from PACE utilities will be addressed from no-reply@pace.gatech.edu. If you have created email rules for your inbox for prior messages coming from PACE, please update them accordingly with this new address, no-reply@pace.gatech.edu.
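For those who filter mail with their own scripts rather than inbox rules, a minimal sketch of matching the new address is shown below. It uses Python's standard email package; the function name and the sample message are hypothetical and only illustrate the idea.

```python
from email import message_from_string
from email.utils import parseaddr

# New "from" address given in this announcement.
PACE_SENDER = "no-reply@pace.gatech.edu"

def is_pace_utility_mail(raw_message: str) -> bool:
    """Return True if the message's From header carries the new PACE address."""
    msg = message_from_string(raw_message)
    _name, address = parseaddr(msg.get("From", ""))
    return address.lower() == PACE_SENDER

# Hypothetical message; the headers are made up for illustration.
sample = (
    "From: PACE <no-reply@pace.gatech.edu>\n"
    "Reply-To: pace-support@oit.gatech.edu\n"
    "Subject: Example notification\n"
    "\n"
    "Body text.\n"
)
print(is_pace_utility_mail(sample))  # True
```

Matching on the parsed address rather than the raw header line avoids missing messages that include a display name.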

What we will continue to do: In the coming weeks, PACE will work on implementing the changes listed above. You may check the status of each of the changes on this blog post.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Best,

The PACE Team

[Completed – PACE Clusters Ready for Research] PACE Maintenance Period: August 11-13, 2021

[Update – 08/13/2021 – 10:00AM]

Dear PACE Users,

Our scheduled maintenance has completed ahead of schedule! All Coda datacenter clusters are ready for research. As usual, we have released all user jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.

Our next maintenance period is tentatively scheduled to begin at 6:00AM on Wednesday, 11/03/2021, and it is tentatively scheduled to conclude by 11:59PM on Friday, 11/05/2021.

Here is an update on the tasks performed during this maintenance period.

ITEMS REQUIRING USER ACTION:

  • None.

ITEMS NOT REQUIRING USER ACTION:

  • [Complete] [Datacenter] Databank will need to replace components of one of the transformers feeding the room that will require a complete power off for the research hall that includes the PACE managed clusters.
  • [Complete] [Storage] Upgrade controller for the storage appliances: SFA200NV, SFA18KE
  • [Complete] [Storage] Replace a miniSAS cable on Hive storage appliance: SFA14KXE
  • [Complete] [Storage] Replace a failed hard drive on a pre-production OSG cluster
  • [Complete] [System/Security] Operating system patch installs
  • [Complete] [System/Security] Endpoint Protection Updates
  • [Complete] [Benchmarks] Conduct IO500 and HPCG benchmarks for Hive and Phoenix clusters
  • [Complete] [System] Update NVIDIA drivers and add NVIDIA-specific libraries
  • [Complete] [System] Reboot scheduler nodes

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

 

[Original Message – 07/13/2021 – 4:15PM that was updated on August 4, 2021 with list of tasks] 

Dear PACE Users,

This is another friendly reminder that our next Maintenance Period is scheduled to begin at 6:00AM on Wednesday, 08/11/2021, and is tentatively scheduled to conclude by 11:59PM on Friday, 08/13/2021.  Please note that, as usual, jobs with resource requests that would be running during the Maintenance Period will be held until after the Maintenance Period by the scheduler. During the Maintenance Period, access to all the PACE managed computational and storage resources will be unavailable.

Please see the list of activities to be completed:

ITEMS REQUIRING USER ACTION:

  • Currently, none.

ITEMS NOT REQUIRING USER ACTION:

  • [Datacenter] Databank will need to replace components of one of the transformers feeding the room that will require a complete power off for the research hall that includes the PACE managed clusters.
  • [Storage] Upgrade controller for the storage appliances: SFA200NV, SFA18KE
  • [Storage] Replace a miniSAS cable on Hive storage appliance: SFA14KXE
  • [Storage] Replace a failed hard drive on a pre-production OSG cluster
  • [System/Security] Operating system patch installs
  • [System/Security] Endpoint Protection Updates
  • [Benchmarks] Conduct IO500 and HPCG benchmarks for Hive and Phoenix clusters
  • [System] Update NVIDIA drivers and add NVIDIA-specific libraries
  • [System] Reboot scheduler nodes

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

[Resolved] OIT’s Data Warehouse Service Outage

[Update – July 13, 2021] 

OIT restored operation of the Data Warehouse service on July 12 at 11:22AM.  Shortly after, PACE restored functionality to our database and our administrative services.   OIT has continued to monitor the Data Warehouse service.  At this time, all PACE user-facing utilities such as pace-check-queue, pace-quota, and pace-whoami are operational.

Please accept our sincere apology for any inconvenience that this temporary limitation may have caused you.  If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Original Message – July 12, 2021]

Dear PACE Users,

We are reaching out to inform you that on Saturday at about 10:00am, there was an outage of OIT’s Enterprise Data Warehouse service, which PACE relies on to host our database instance; the instance subsequently went down at 11:07am.  The impact on PACE from this service outage is mainly limited to the administrative side, with some impact on user-facing utilities such as pace-check-queue; however, there is no impact on users’ jobs or the ability to submit jobs.

What’s happening and what we are doing:  Currently, OIT is investigating the outage impacting the Data Warehouse service that occurred on Saturday, and this outage is tracked at OIT’s status page.   PACE is monitoring this development closely.

How does this impact me:  This Data Warehouse service outage leaves user-facing utilities such as pace-check-queue, pace-quota, and pace-whoami partially or fully nonfunctional.   In addition, until the Data Warehouse service is restored, PACE will be unable to create new user and PI account requests.  

What we will continue to do:  The PACE team will continue to monitor this development, and we will report as needed.   

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you.  If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Best,

The PACE Team

pace-support.sh is disabled on PACE Clusters — please email pace-support directly for inquiries

Dear PACE Users,

It has come to our attention that we are not receiving support requests generated by the pace-support.sh script, which allows submission of support tickets directly from PACE clusters. Our investigation is ongoing.

At this time, please email us at pace-support@oit.gatech.edu from a non-PACE system for all support requests, to ensure that we receive your message.

From our initial investigation, it appears that this outage began at some point in May. We apologize for any lost messages since then. If you have been trying to reach us via the pace-support script, please email us instead. You should receive an automated acknowledgement email from Service Desk when your request is successfully processed.

Please contact us at pace-support@oit.gatech.edu with questions.

The PACE Team