Posts

December 1, 2020 – PACE Users' Access to the Rich Datacenter Will Be Disabled (This does not apply to users accessing CUI resources in Rich)

Dear PACE Users,

Over the past couple of months, we have reached out to research groups regarding the required user migrations from the Rich to the CODA datacenter.  We are actively migrating users into CODA, and another migration of research groups is scheduled for December 1st.  Out of an abundance of caution: if you have not received an email about your migration to the CODA datacenter, please contact PACE at your earliest convenience.

What is happening:  On December 1, the remaining (non-CUI) PACE users in the Rich datacenter will have their access disabled as part of the final migration to the CODA datacenter, which begins that day.  Please note that this does not apply to CUI resources and their user migrations at this time.

Who does this message impact, and what should I do:  If you have NOT already migrated to CODA, are not in the process of migrating to CODA, and have not received an email from a PACE research scientist about your planned migration to CODA, then please contact pace-support@oit.gatech.edu so that we may arrange your migration and prevent interruption to your research when we disable access to the Rich datacenter.

This message is being sent out of an abundance of caution to ensure that no user is left behind in the Rich datacenter when we disable access to all non-CUI resources there on December 1, 2020.  If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Best,

The PACE Team

[Resolved] Phoenix Storage (Lustre) slowness that’s impacting data and scratch

[Update – 11/03/2020 – 11:01am]

As of late last night, the slowness experienced on Phoenix storage was resolved.   Thank you for your patience and understanding while we worked to address this issue.

What is happening and what we have done:  In response to user reports of slowness in accessing files on Phoenix’s Lustre storage, the PACE team was able to replicate the issue during our investigation, and our troubleshooting, which included Lustre Metadata Server (MDS) reboots, resolved the slowness.  The Phoenix Lustre storage is stable at this time, and no user data was lost during this incident.

What we will continue to do: PACE will continue to monitor the Phoenix storage out of an abundance of caution, and we will update as needed.

Again, this issue did not impact any of the other resources in Coda and Rich Datacenter.

Thank you for your attention to this message, and we apologize for this inconvenience.

[Original Post – 11/02/2020 – 1:03pm]

Dear PACE Users,

PACE is aware of the slowness experienced on Phoenix’s storage.  We have been able to replicate the issue, and we are investigating its root cause.

What is happening and what we have done:  We’ve received a couple of reports from users about slowness in accessing files from the ‘data’ and ‘scratch’ directories on Phoenix’s Lustre storage.  Some users are experiencing slowness in accessing their files, and running commands such as ‘ls’ or opening a file with ‘vim’ may be very slow.  During our investigation, the PACE team was able to replicate this issue, and we are investigating the root cause of the storage slowness.

What we will continue to do: This is an active situation, and we will follow up with updates as they become available.

This issue does not impact any of the other resources in Coda and Rich Datacenter.

Thank you for your attention to this message, and we apologize for this inconvenience.

The PACE Team

Hive Scratch Storage Update

We would like to remind you of the scratch storage policy on Hive. Scratch is designed for temporary storage and is never backed up. Each week, files not modified for more than 60 days are automatically deleted from your scratch directory. As part of Hive’s start-up, regular cleanup of scratch has now been implemented: each week, users with files scheduled for deletion receive a warning email listing the files to be removed in the coming week, along with additional information. Those of you who use the main PACE system are already familiar with this workflow.

Some of you received such an email yesterday. As always, if you need additional time to migrate valuable data off of scratch, please respond to the email as directed to request a delay.

Please contact us at pace-support@oit.gatech.edu with any questions about how to manage your data stored on Hive.
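If you would like to preview which of your files are at risk before the weekly sweep, a `find` one-liner can list anything untouched for more than 60 days. This is a generic sketch, not an official PACE tool; it assumes your scratch directory is at `~/scratch`, so adjust the path to match your account.

```shell
# List files under ~/scratch not modified in the last 60 days, oldest first.
# The ~/scratch path is an assumption -- substitute your actual scratch location.
find ~/scratch -type f -mtime +60 -printf '%TY-%Tm-%Td  %p\n' 2>/dev/null | sort
```

Files that appear here are candidates for the next cleanup, so copy anything valuable to project storage (or request a delay as described above) before the deletion date.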

CoE HPC Cost Model Listening Session

Over the past few months, a team from the EVPR, OIT, EVP-A&F, and GTRC has been working with Institute leadership to develop a more sustainable and flexible way to support research cyberinfrastructure. This new model is described in more detail below and will affect researchers who leverage PACE services. The model enjoys strong support, but it is not yet fully approved.  We are communicating at this stage because we wanted you to be aware of the upcoming changes and we welcome your feedback. Please submit comments to the PACE Team <pace-support@oit.gatech.edu> or to Lew Lefton <lew.lefton@gatech.edu>. This listening session is organized for the College of Engineering.

Date:           11/02/2020, 4:00pm – 5:00pm

Location:   BlueJeans (link provided via email)

Host:           EVPR/PACE

In a nutshell, PACE will transition from a service that purchases nodes with equipment funds, to a service which operates as a Cost Center. This means that major research cyberinfrastructure (including compute and storage services) will be treated like other core facilities. This new model will begin as the transition to the new equipment in the CODA data center happens. We recognize that this represents a shift in how we think about research computing. But, as shown below, the data indicates that the long-term benefits are worth the change.  When researchers only pay for actual consumption – similar to commercial cloud offerings from AWS, Azure, and GCP – there are several advantages:

  • Researchers have more flexibility to leverage new hardware releases instead of being restricted to hardware purchased at a specific point in time.
  • The PACE team can use capacity and usage planning to make compute cycles available to faculty in days or weeks, as opposed to waiting months due to procurement bottlenecks.
  • We have secured an Indirect Cost Waiver on both PACE services and commercial cloud offerings for two years to allow us to collect data on the model and see how it is working.
  • Note that a similar consumption model has been used successfully at other institutions such as Univ. Washington and UCSD, and this approach is also being developed by key sponsors (e.g. NSF’s cloudbank.org).
  • A free tier that provides any PI the equivalent of 10,000 CPU-hours on a 192GB compute node and 1 TB of project storage at no cost.

For further details on the new cost model, please visit our web page.

CoS HPC cost model listening session

Over the past few months, a team from the EVPR, OIT, EVP-A&F, and GTRC has been working with Institute leadership to develop a more sustainable and flexible way to support research cyberinfrastructure. This new model is described in more detail below and will affect researchers who leverage PACE services. The model enjoys strong support, but it is not yet fully approved.  We are communicating at this stage because we wanted you to be aware of the upcoming changes and we welcome your feedback. Please submit comments to the PACE Team <pace-support@oit.gatech.edu> or to Lew Lefton <lew.lefton@gatech.edu>. This listening session is organized for the College of Sciences.

Date:           10/13/2020, 10:00am – 11:00am

Location:   BlueJeans (link provided via email)

Host:           EVPR/PACE

In a nutshell, PACE will transition from a service that purchases nodes with equipment funds, to a service which operates as a Cost Center. This means that major research cyberinfrastructure (including compute and storage services) will be treated like other core facilities. This new model will begin as the transition to the new equipment in the CODA data center happens. We recognize that this represents a shift in how we think about research computing. But, as shown below, the data indicates that the long-term benefits are worth the change.  When researchers only pay for actual consumption – similar to commercial cloud offerings from AWS, Azure, and GCP – there are several advantages:

  • Researchers have more flexibility to leverage new hardware releases instead of being restricted to hardware purchased at a specific point in time.
  • The PACE team can use capacity and usage planning to make compute cycles available to faculty in days or weeks, as opposed to waiting months due to procurement bottlenecks.
  • We have secured an Indirect Cost Waiver on both PACE services and commercial cloud offerings for two years to allow us to collect data on the model and see how it is working.
  • Note that a similar consumption model has been used successfully at other institutions such as Univ. Washington and UCSD, and this approach is also being developed by key sponsors (e.g. NSF’s cloudbank.org).
  • A free tier that provides any PI the equivalent of 10,000 CPU-hours on a 192GB compute node and 1 TB of project storage at no cost.

For further details on the new cost model, please visit our web page.

[Resolved] Power Outage at Rich Datacenter

[Update – 10/07/2020 – 8:02]
Nearly 28 hours after the initial power outage in the Rich Datacenter, which caused further complications and failures in networks and systems, we are pleased to report that we have restored the PACE resources in the Rich Datacenter and released user jobs.  We understand the impact this has had on your research, and we are very grateful for your patience and understanding as we worked through this emergency.  During this outage, the PACE clusters in the Coda datacenter (Hive, Testflight-Coda, CoC-ICE, PACE-ICE, and Phoenix) were not impacted.
What we have done:  Since the network repairs last night, we have been closely monitoring the network fabric, and we have gradually brought the infrastructure back up.  We conducted application and fabric testing across the systems to ensure they are operational, and we addressed problematic nodes and issues with the schedulers.  The power and fabric are stable.  We have identified the users whose jobs were interrupted by yesterday’s power outage, and we will reach out to impacted users directly.  We have released the user jobs that were queued when we paused the schedulers prior to the power outage, and jobs are currently running.
What we will continue to do: The PACE team will continue to monitor the systems and report as needed.  A few straggling nodes will remain offline, and we will work to bring them back up in the coming days.
Please don’t hesitate to contact us at pace-support@oit.gatech.edu if you have any questions or if you encounter any issues on the clusters.  Thank you again for your patience.
[Update – 10/06/2020 – 11:20]

We are following up to update you on the current status of the Rich Datacenter.  After a tireless evening, the PACE team, in collaboration with OIT, successfully restored the network at approximately 11:00pm.  We replaced a failed management module on the core InfiniBand switch, and the switch is now operational.  Preliminary spot checks indicate that the fabric is stable.  Out of an abundance of caution, we will monitor the network overnight.  In the morning, we aim to conduct additional testing and bring the compute resources in the Rich Datacenter online, followed by releasing the user jobs that are currently paused.  The power remains stable after the repairs, and the UPS is back at nearly full charge.

As always, thank you for your patience and understanding during this outage as we know how critical these resources are to your research.   

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[Update – 10/06/2020 – 6:30]

This is a brief update on the current power outage.  Power has been restored in the Rich datacenter, and recovery is underway.  Some Ethernet network switches failed, and replacements and re-configurations are underway to restore services.  Our core InfiniBand switch has not yet restarted.  We will continue to update you as we have more information.  For up-to-date information, please check the status and blog pages:

Again, this emergency work does not impact any of the resources in CODA datacenter.

Thank you for your continued patience and understanding as we work through this emergency.

[Original Post – 10/06/2020 – 4:54] 
We have a power outage on a section of campus that includes the Rich datacenter’s 133 computer room.  We are urgently shutting down the schedulers and remaining servers in Rich 133.  Storage and login nodes in Rich are currently on generator power and will remain safe.
What is happening and what we have done:  At 3:45pm, the campus power distribution (not GA Power) experienced an outage, and at 4:05pm the Rich 133 UPS went out.  Power to the chillers and to 2/3 of the computer room in the Rich Datacenter is out.  Facilities is on site investigating the situation, and a high-voltage contractor is en route.  We have initiated an urgent shutdown of the schedulers and remaining servers in the Rich datacenter’s 133 computer room.  Storage and login nodes are running on generators, but most running user jobs will have been interrupted by this power outage.

What we will continue to do: This is an active situation, and we will follow up with updates as they become available, and for most up to date information, please check the status and blog pages:

This emergency work does not impact any of the resources in CODA datacenter.

Thank you for your attention to this urgent message, and we apologize for this inconvenience.

The PACE Team

[RESOLVED] URGENT – CODA datacenter research hall emergency shutdown

[Update – 10/05/2020 8:18]

Thank you for your patience as we worked through this emergency to restore cooling in the CODA datacenter’s Research Hall.  At this time, we have the Hive, COC-ICE, PACE-ICE, Testflight-CODA, and Phoenix clusters back online, and users’ previously queued jobs have started.

What has happened and what we did:  At 4:30pm today, the main chiller for research computing failed completely on the Research Hall side of the CODA datacenter.  PACE urgently shut down the compute nodes for the Hive, COC-ICE, PACE-ICE, Testflight-CODA, and Phoenix clusters.  Storage and login nodes were not impacted during this outage.  Working with DataBank, we were able to restore sufficient cooling using the economizer module, which can handle all cooling in the Research Hall.  At 6:30pm, we brought the Hive cluster back online, and since then we have continued to bring up the remaining compute nodes for the COC-ICE, PACE-ICE, Testflight-CODA, and Phoenix clusters while maintaining normal operating temperatures.  At about 7:00pm, the vendor arrived and began working on the chiller; no interruption should occur when the repaired chiller is brought online.  Our storage did not experience data loss, but users’ running jobs were interrupted by this emergency shutdown.  We encourage users to check on their jobs and resubmit any that may have been interrupted.  Previously queued user jobs are currently running on the clusters.

What we will continue to do:  The PACE team will continue to monitor the situation and report as needed.

For your reference we are including OIT’s status page link and blog post:

Status page:  https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/5f7b9062cb294e04bbe8cbda

Blog post: http://blog.pace.gatech.edu/?p=6931

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Thank you for your patience and attention to this emergency.

[Original Post – 10/05/2020 6:16]

The cooling has failed in the CODA datacenter’s Research Hall.  We have initiated and completed an emergency shutdown of all resources in the CODA Research Hall, including the Hive, COC-ICE, PACE-ICE, Testflight-CODA, and Phoenix clusters.

What is happening and what we have done:  We have completed an emergency shutdown of all the clusters in the CODA datacenter.  Research data and cluster headnodes are fine, but all running user jobs will have been interrupted by this outage.  At this time, we are using the economizer module to provide some cooling, and we are beginning to bring the Hive cluster back up while closely monitoring temperatures.

What we will continue to do: This is an active situation, and we will follow up with updates as they become available.

Also, please follow the updates on the OIT’s status page: https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/5f7b9062cb294e04bbe8cbda

Additionally, we are tracking the updates in our blog at: http://blog.pace.gatech.edu/?p=6931

This emergency work does not impact any of the resources in Rich datacenter.

Thank you for your attention to this urgent message.

[Resolved] TestFlight-Coda, COC-ICE, and PACE-ICE Schedulers Down

[Update – 10/05/2020 – 10:20am]

PACE completed testing across the resources in the Coda datacenter over the weekend.  These tests did not impact the Hive cluster or PACE resources in the Rich datacenter.  We have brought the schedulers for the coc-ice, pace-ice, and testflight-coda clusters online, and users’ queued jobs have resumed.

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Thank you again for your patience during this testing.

[Original Post – 10/03/2020 – 1:43pm]

In order to complete preparations for bringing PACE’s new Coda resources on the research cluster into production, we had to urgently take the testflight-coda, coc-ice, and pace-ice schedulers offline on Saturday at about 10:30am; the outage is in effect until 8AM on Monday. We had a job reservation in place to prevent interruptions to user jobs, and at the time the schedulers were taken offline, no users were running jobs on the system.  We apologize for this inconvenience. You can still access the login node over the weekend, but you will receive an error message if you attempt to submit a job.  Your files are all accessible via the login node.  All jobs queued prior to the offlining of the schedulers will resume on Monday.

Hive and all PACE resources in the Rich datacenter are not affected.

Again, we apologize for the late notice. Please contact us at pace-support@oit.gatech.edu with questions.

[RESOLVED] Hive Scheduler Degraded Performance

[UPDATE – 10/01/20 5:51pm]

We are following up to let you know that the Hive scheduler has been restored to operation, and users may submit new jobs.  We appreciate your patience as we conducted our investigation and resolved this matter.   We are providing a brief summary of our findings and actions taken to address this issue.

What happened and what we did: Yesterday, a user ran an aggressive script that spammed the scheduler with roughly 30,000 job submissions and extremely frequent client queries to both Moab and Torque. This set off a chain reaction in which the scheduler utilities were fully overwhelmed, producing log files hundreds of times larger and more numerous than normal.  System utilities were also stressed as they tried to keep up with backups and archival. Once PACE became aware of the issue, we terminated the user’s script and began cleaning up the scheduler environment, ultimately having to forcefully remove some of the egregious job logs associated with the user.  Other users’ jobs that were already submitted to the scheduler before the incident operated normally; we did not observe abrupt job cancellations or interruptions during this situation.  PACE has followed up with the user, and we are working with them to improve their workflow and prevent future issues like this one.
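For users with similar high-volume workloads: rather than submitting tens of thousands of individual jobs in a loop, Torque can express the whole batch as a single job array, which the scheduler tracks as one object. The sketch below is a generic illustration, not the affected user’s actual workflow; `run.pbs` is a hypothetical job script, and the `%50` slot limit (supported in recent Torque releases) caps how many array tasks run at once.

```shell
# Anti-pattern: one qsub call per task floods the scheduler with submissions:
#   for i in $(seq 1 30000); do qsub run.pbs; done
#
# Preferred: a single Torque job array.  Each task sees its own $PBS_ARRAYID,
# and the %50 suffix limits concurrent tasks.  Script name and limit are
# hypothetical examples.
if command -v qsub >/dev/null 2>&1; then
    qsub -t 1-30000%50 run.pbs
else
    echo "qsub not available on this host; submit from a cluster login node"
fi
```

Inside `run.pbs`, each array task would select its input using `$PBS_ARRAYID`, so one script covers the whole parameter sweep without hammering the scheduler.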

What we continue to do:  As we blogged this morning at 10:02AM, the scheduler is accepting jobs and running.  We have observed some residual effects in system utilities that we have been addressing and monitoring throughout the day.   If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

As always, we appreciate your patience as we worked to address this situation.

[UPDATE – 10/01/20 10:02am]

Yesterday, a user ran an aggressive script that spammed the scheduler with roughly 30,000 job submissions and extremely frequent client queries to both Moab and Torque. This set off a chain reaction in which the scheduler utilities were fully overwhelmed, producing log files hundreds of times larger and more numerous than normal, while system utilities were stressed trying to keep up with backups and archival. Once we became aware of the issue, we terminated the user’s script and began cleaning up the scheduler environment, ultimately having to forcefully remove some of the egregious job logs associated with the user. At this point, the scheduler is accepting and running jobs, although there are still some residual effects in system utilities that we are addressing. As always, we appreciate your patience as we address this situation.
[ORIGINAL POST – 09/30/20 7:21pm]

At about 4:30pm, we began experiencing degraded performance with the Hive scheduler.  Currently, the scheduler is under significant load, and some users may notice their new job submissions hanging, as a couple of users have already reported.  PACE is investigating the issue, and we will provide an update once the scheduler is restored to normal operation.

We apologize for the inconvenience this is causing.

[RESOLVED] PACE Maintenance – October 14-16, 2020

[Update – October 19 – 5:30pm]

We are following up to inform you that maintenance for the TestFlight-Coda and Phoenix clusters is complete.  At this time, all Rich and Coda datacenter clusters are ready for research. We appreciate everyone’s patience as we worked through this partially extended maintenance period to complete our activities in the Coda datacenter.

At this time, we are updating you on the status of tasks:

ITEMS REQUIRING USER ACTION:

  • [COMPLETE] [some user action may be needed] Rename primary groups for future p- and d-

ITEMS NOT REQUIRING USER ACTION:

  • [COMPLETE] [Compute] Applying a tuned profile to the Hive compute nodes
  • [COMPLETE] [Compute] Update Nvidia GPU drivers on coda to Support CUDA 11 SDK
  • [COMPLETE] [Network] Rebooting the Hive IB switch (atl1-1-01-014-3-cs7500)
  • [COMPLETE] [Network] Rebooting PACE IB switch (09-010-3-cs7520)
  • [COMPLETE] [Network] Update Phoenix subnet managers to RHEL7.8
  • [COMPLETE] [Storage] Replace DDN 7700 storage controller 1
  • [COMPLETE] [Storage] Replace DDN SFA18KE storage enclosure 8
  • [COMPLETE] [Data Management] Update globus-connect-server on globus-hive from version 4 to version 5.4.
  • [COMPLETE] [Coda Datacenter] Databank: Hi-Temp Cooling Tower reboot
  • [COMPLETE] [Emergency readiness test] Test emergency power down scripts for CODA and Hive compute nodes
  • [COMPLETE] [Storage] Lustre Client Patches
  • [COMPLETE] [Storage] Lustre filesystem controller to be replaced
    • [COMPLETE – 10/19/2020] We conducted further testing of Lustre storage in coordination with our vendor.
  • [COMPLETE] [Coda Datacenter] Top500 test run across Coda datacenter resources (excludes Hive, COC-ICE and PACE-ICE clusters).

ITEMS REQUIRING USER ACTION:

As previously mentioned, with regards to renaming the primary groups task that may require some user action, we will be adjusting the names of most users’ Linux primary groups to reflect a new standardized format as part of our preparation for the migration to Coda that’s starting in October. Most users will see the name of their primary group change from school-pisurname (e.g., chem-burdell) to p-piusername (e.g., p-gburdell3) or d-school (e.g., d-chem). This change will be reflected across all PACE systems, including Hive and CUI.  The “gid” (group id number) is not changing, so this will not affect any file permissions you have set.  Most users will not need to take action. However, if you manually change file permissions using the group name or use group names in your scripts, you may need to make an adjustment. You can always run the “id” command on yourself (“id gburdell3”) to see all of your groups. Not all primary groups will change name, so do not be concerned if yours is left unchanged.
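To see how the change looks on your account, the `id` command prints your numeric uid/gid alongside the group names; since the gid is unchanged, only the name portion differs after the rename. A quick sketch, where `gburdell3`, `chem-burdell`, and `p-gburdell3` are the example names from this announcement and `shared_dir` is a hypothetical directory:

```shell
# Print your own uid, primary gid, and all group memberships.
# (The announcement's example form is "id gburdell3".)
id "${USER:-$(whoami)}"

# Hypothetical before/after: if a script sets group ownership by the old
# name, update it to the new standardized name:
#   chgrp -R chem-burdell shared_dir    # old school-pisurname form
#   chgrp -R p-gburdell3  shared_dir    # new p-piusername form
```

Because `chgrp` (and permission checks generally) resolve names to the unchanged gid, files already owned by the group need no fixing; only scripts that hard-code the old group name do.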

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu

Best,
PACE Team

[UPDATE – October 16 – 6:31pm]

We are following up with an update on the PACE maintenance period.  As mentioned yesterday, maintenance for the Rich datacenter completed one day ahead of schedule, and we are partially complete with the CODA datacenter. All clusters in the Rich datacenter are ready for research. The Hive, COC-ICE, and PACE-ICE clusters in the Coda datacenter are ready for research and instructional learning. We have released users’ jobs on the Hive, COC-ICE, and PACE-ICE clusters and on the Rich datacenter clusters. The Phoenix cluster in CODA will remain under maintenance through Monday, October 19, as scheduled. We also need to extend maintenance for the Testflight-Coda cluster through Monday, October 19, to address a remaining pending task.

At this time, we are updating you on the status of tasks:

ITEMS REQUIRING USER ACTION:

  • [COMPLETE] [some user action may be needed] Rename primary groups for future p- and d-

ITEMS NOT REQUIRING USER ACTION:

  • [COMPLETE] [Compute] Applying a tuned profile to the Hive compute nodes
  • [COMPLETE] [Compute] Update Nvidia GPU drivers on coda to Support CUDA 11 SDK
  • [COMPLETE] [Network] Rebooting the Hive IB switch (atl1-1-01-014-3-cs7500)
  • [COMPLETE] [Network] Rebooting PACE IB switch (09-010-3-cs7520)
  • [COMPLETE] [Network] Update Phoenix subnet managers to RHEL7.8
  • [COMPLETE] [Storage] Replace DDN 7700 storage controller 1
  • [COMPLETE] [Storage] Replace DDN SFA18KE storage enclosure 8
  • [COMPLETE] [Data Management] Update globus-connect-server on globus-hive from version 4 to version 5.4.
  • [COMPLETE] [Coda Datacenter] Databank: Hi-Temp Cooling Tower reboot
  • [COMPLETE] [Emergency readiness test] Test emergency power down scripts for CODA and Hive compute nodes
  • [COMPLETE] [Storage] Lustre Client Patches
  • [COMPLETE] [Storage] Lustre filesystem controller to be replaced
  • [PENDING] [Coda Datacenter] Top500 test run across Coda datacenter resources (excludes Hive, COC-ICE and PACE-ICE clusters).

ITEMS REQUIRING USER ACTION:

As previously mentioned, with regards to renaming the primary groups task that may require some user action, we will be adjusting the names of most users’ Linux primary groups to reflect a new standardized format as part of our preparation for the migration to Coda that’s starting in October. Most users will see the name of their primary group change from school-pisurname (e.g., chem-burdell) to p-piusername (e.g., p-gburdell3) or d-school (e.g., d-chem). This change will be reflected across all PACE systems, including Hive and CUI.  The “gid” (group id number) is not changing, so this will not affect any file permissions you have set.  Most users will not need to take action. However, if you manually change file permissions using the group name or use group names in your scripts, you may need to make an adjustment. You can always run the “id” command on yourself (“id gburdell3”) to see all of your groups. Not all primary groups will change name, so do not be concerned if yours is left unchanged.

We will follow up with further updates.

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu

[UPDATE – October 15, 2020, 8:44pm]

Our maintenance period for the Rich datacenter has completed one day ahead of schedule, and we are partially complete with the CODA datacenter.  All clusters in the Rich datacenter are ready for research.  In the Coda datacenter, only the Hive cluster is ready for research.  We have released users’ jobs on the Hive cluster and on the Rich datacenter clusters.

The remaining clusters in the CODA datacenter (Phoenix, Testflight-Coda, CoC-ICE, and PACE-ICE) will remain under maintenance for the remainder of the maintenance period as we address the remaining tasks.

At this time, we are updating you on the status of tasks:

ITEMS REQUIRING USER ACTION:

  • [COMPLETE] [some user action may be needed] Rename primary groups for future p- and d-

ITEMS NOT REQUIRING USER ACTION:

  • [COMPLETE] [Compute] Applying a tuned profile to the Hive compute nodes
  • [COMPLETE] [Compute] Update Nvidia GPU drivers on coda to Support CUDA 11 SDK
  • [COMPLETE] [Network] Rebooting the Hive IB switch (atl1-1-01-014-3-cs7500)
  • [COMPLETE] [Network] Rebooting PACE IB switch (09-010-3-cs7520)
  • [COMPLETE] [Network] Update Phoenix subnet managers to RHEL7.8
  • [COMPLETE] [Storage] Replace DDN 7700 storage controller 1
  • [COMPLETE] [Storage] Replace DDN SFA18KE storage enclosure 8
  • [COMPLETE] [Data Management] Update globus-connect-server on globus-hive from version 4 to version 5.4.
  • [COMPLETE] [Coda Datacenter] Databank: Hi-Temp Cooling Tower reboot
  • [COMPLETE] [Emergency readiness test] Test emergency power down scripts for CODA and Hive compute nodes
  • [PENDING] [Coda Datacenter] Top500 test run across Coda datacenter resources (excludes Hive cluster).
  • [PENDING] [Storage] Lustre Client Patches
  • [PENDING] [Storage] Lustre filesystem controller to be replaced

ITEMS REQUIRING USER ACTION:

As previously mentioned, with regards to renaming the primary groups task that may require some user action, we will be adjusting the names of most users’ Linux primary groups to reflect a new standardized format as part of our preparation for the migration to Coda that’s starting in October. Most users will see the name of their primary group change from school-pisurname (e.g., chem-burdell) to p-piusername (e.g., p-gburdell3) or d-school (e.g., d-chem). This change will be reflected across all PACE systems, including Hive and CUI.  The “gid” (group id number) is not changing, so this will not affect any file permissions you have set.  Most users will not need to take action. However, if you manually change file permissions using the group name or use group names in your scripts, you may need to make an adjustment. You can always run the “id” command on yourself (“id gburdell3”) to see all of your groups. Not all primary groups will change name, so do not be concerned if yours is left unchanged.

We will follow up tomorrow regarding the remaining CODA datacenter tasks impacting Phoenix, CoC-ICE, PACE-ICE, and Testflight-CODA.

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu

[Update – October 12, 1:07PM]

We are following up with a reminder that our scheduled maintenance period begins at 6:00AM on October 14th, 2020 and concludes at 11:59PM on October 16th, 2020.  Please note that our blog post (http://blog.pace.gatech.edu/?p=6905) contains an updated list of tasks for this upcoming maintenance period; for your reference, the updated list is provided below:

ITEMS REQUIRING USER ACTION:

  • [some user action may be needed] Rename primary groups for future p- and d-

ITEMS NOT REQUIRING USER ACTION:

  • [Compute] Applying a tuned profile to the Hive compute nodes
  • [Compute] Update Nvidia GPU drivers on coda to Support Cuda 11 SDK
  • [Network] Rebooting the Hive IB switch (atl1-1-01-014-3-cs7500)
  • [Network] Rebooting PACE IB switch (09-010-3-cs7520)
  • [Network] Update Phoenix subnet managers to RHEL7.8
  • [Storage] Replace DDN 7700 storage controller 1
  • [Storage] Replace DDN SFA18KE storage enclosure 8
  • [Data Management] Update globus-connect-server on globus-hive from version 4 to version 5.4.
  • [Coda Datacenter] Databank: Hi-Temp Cooling Tower reboot
  • [Coda Datacenter] Top500 test run across Coda datacenter resources (excludes Hive cluster).
  • [Emergency readiness test] Test emergency power down scripts for CODA and Hive compute nodes

As previously mentioned, regarding the primary-group renaming task that may require some user action: we will be adjusting the names of most users’ Linux primary groups to reflect a new standardized format as part of our preparation for the migration to Coda starting in October. Most users will see the name of their primary group change from school-pisurname (e.g., chem-burdell) to p-piusername (e.g., p-gburdell3) or d-school (e.g., d-chem). This change will be reflected across all PACE systems, including Hive and CUI. The gid (group ID number) is not changing, so this will not affect any file permissions you have set. Most users will not need to take action. However, if you manually change file permissions using the group name or use group names in your scripts, you may need to make an adjustment. You can always run the “id” command on yourself (“id gburdell3”) to see all of your groups. Not all primary groups will be renamed, so do not be concerned if yours is left unchanged.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

 

[Original – September 30, 4:42PM]

We are preparing for our next PACE maintenance period, which will begin at 6:00 AM on October 14th, 2020 and conclude at 11:59 PM on October 16th, 2020. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs are running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete. Please note that during the maintenance period, users will not have access to Rich and Coda datacenter resources.
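To illustrate the hold behavior, a job is held when its requested walltime would run past the start of maintenance. This is a simplified sketch of that check, not the scheduler’s actual logic; the submission time and walltime below are hypothetical examples:

```shell
# Maintenance begins at 6:00 AM on October 14th (epoch seconds, GNU date)
maint_start=$(date -d "2020-10-14 06:00" +%s)

# Hypothetical job: submitted October 13th at 10:00 PM with a 12-hour walltime
submit_time=$(date -d "2020-10-13 22:00" +%s)
walltime_s=$((12 * 3600))

# If the job could still be running at maintenance start, it is held
if [ $((submit_time + walltime_s)) -gt "$maint_start" ]; then
  echo "job held until maintenance completes"
else
  echo "job eligible to start now"
fi
```

In this example the job would finish at 10:00 AM on the 14th, past the 6:00 AM cutoff, so it is held; requesting a walltime of 7 hours or less would let it start before maintenance.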

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEMS REQUIRING USER ACTION:

  •  [some user action may be needed] Rename primary groups for future p- and d-

ITEMS NOT REQUIRING USER ACTION:

  • [Compute] Applying a tuned profile to the Hive compute nodes
  • [Network] Rebooting the Hive IB switch (atl1-1-01-014-3-cs7500)
  • [Network] Rebooting PACE IB switch (09-010-3-cs7520)
  • [Storage] Replace DDN 7700 storage controller 1
  • [Storage] Replace DDN SFA18KE storage enclosure 8
  • [Coda Datacenter] Databank: Hi-Temp Cooling Tower reboot
  • [Emergency readiness test] Test emergency power down scripts for CODA and Hive compute nodes

Regarding the primary-group renaming task that may require some user action: we will be adjusting the names of most users’ Linux primary groups to reflect a new standardized format as part of our preparation for the migration to Coda starting in October. Most users will see the name of their primary group change from school-pisurname (e.g., chem-burdell) to p-piusername (e.g., p-gburdell3) or d-school (e.g., d-chem). This change will be reflected across all PACE systems, including Hive and CUI. The gid (group ID number) is not changing, so this will not affect any file permissions you have set. Most users will not need to take action. However, if you manually change file permissions using the group name or use group names in your scripts, you may need to make an adjustment. You can always run the “id” command on yourself (“id gburdell3”) to see all of your groups. Not all primary groups will be renamed, so do not be concerned if yours is left unchanged.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.