jcoulter8 – Partnership for an Advanced Computing Environment

PACE Maintenance Period – 01/12/26 to 01/16/26

WHEN IS IT HAPPENING?

PACE’s next Maintenance Period starts at 6:00AM on Monday, January 12th, and is scheduled to end no later than 11:59PM on Thursday, January 15th; ICE will open to Spring 2026 courses on Friday, January 16th. The additional day is needed to install a second cooling pump at the data center to provide redundancy for PACE clusters. PACE will release each cluster (Phoenix, Firebird, ICE, and Buzzard) as soon as maintenance work and testing are completed.

WHAT DO YOU NEED TO DO?

As usual, jobs with resource requests that would be running during the Maintenance Period will be held until after the maintenance by the scheduler. During this Maintenance Period, access to all the PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. CEDAR storage will not be affected.
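To help with planning, the hold rule described above can be sketched as a small calculation. This is only an illustration, not the scheduler's actual logic: the function names are hypothetical, the job is assumed to start at the given time rather than waiting in the queue, and the 6:00AM January 12, 2026 start is treated as naive local time.

```python
from datetime import datetime, timedelta

# Start of the Maintenance Period as stated above: 6:00AM, Monday January 12, 2026.
# Naive local time is used here; the time zone is an assumption.
MAINTENANCE_START = datetime(2026, 1, 12, 6, 0)


def overlaps_maintenance(start_time: datetime, walltime: timedelta) -> bool:
    """Return True if a job starting at start_time with the requested
    walltime could still be running when the Maintenance Period begins."""
    return start_time + walltime > MAINTENANCE_START


def longest_fitting_walltime(start_time: datetime) -> timedelta:
    """Return the longest walltime request that ends before maintenance starts
    (zero if start_time is already at or past the start of the window)."""
    return max(MAINTENANCE_START - start_time, timedelta(0))


if __name__ == "__main__":
    start = datetime(2026, 1, 10, 18, 0)  # hypothetical job start time
    print(overlaps_maintenance(start, timedelta(hours=48)))
    print(longest_fitting_walltime(start))
```

A job whose requested walltime fits inside the remaining time can run before the maintenance; one that does not would be expected to wait until the clusters are released.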

WHAT IS HAPPENING?

ITEMS REQUIRING USER ACTION:

None

ITEMS NOT REQUIRING USER ACTION:

[all] DataBank will install a second cooling pump into the research hall cooling loop, providing redundancy.

[all] Apply maintenance updates to all compute nodes

[Phoenix, ICE, Firebird] Upgrade clusters to Slurm 25.05.5

[Storage] Enable Write Back on all VAST storage for performance improvements

[all] Replace some PDU and IB network switches with new equipment

[Storage] Apply maintenance upgrades to Lustre file system appliances

WHY IS IT HAPPENING?

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?

All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

[Maintenance] Reminder – May 5th-May 9th 2025

[Update] May 9, 2025 at 5:34 pm

Dear PACE Community,

While all PACE clusters are up, have passed tests, and are accepting jobs, you may encounter errors due to the packages installed across our systems. We are aware of minor inconsistencies in the list of packages installed on compute nodes of the same type and are working on addressing this as quickly as possible.

Please let us know via email to pace-support@oit.gatech.edu if you encounter any unusual job errors.

We will continue working to resolve the situation and provide updates as we learn more.

The PACE Team

[Update] May 9, 2025 at 5:16 pm

Dear Firebird users,

The Firebird cluster is back in production and has resumed running jobs.

As previously mentioned, this cluster is now only running the RHEL9 operating system. Please reference our prior emails about SSH keys on Firebird if you experience any trouble logging in!

1 RTX6000 GPU node is currently unavailable, but all other GPU types (A100 and H200) are available – we will work to repair this node next week.

Thank you for your patience as we continue to work on the Firebird cluster.

Best,

The PACE Team

[Update] May 9, 2025 at 12:15 pm

Dear PACE users,

Maintenance on the Hive, Buzzard, ICE and Phoenix clusters is complete. These clusters are back in production, and all jobs held by the scheduler have been released.

The Firebird cluster is still under maintenance; these users will be notified separately once work is complete.

We are happy to share that all PACE clusters are now running the RHEL9 operating system and that other important security updates are complete.

The update to IDEaS storage is ongoing – the storage is currently accessible, but it is still necessary to use the `newgrp` command to set the order of your group membership just as before maintenance.

If you are building or running MPI applications on Phoenix’s H100/H200 nodes, please be aware that the MVAPICH2 and OpenMPI modules are no longer compatible with system updates to the H100/H200 nodes. We highly recommend using HPC-X for MPI, as it provides numerous benefits for MPI + GPU workloads. To use it, load the nvhpc/24.5 and hpcx/2.19-cuda modules. This will not affect the vast majority of single-node Python workflows, which typically do not use MPI.

Another goal for this maintenance period was the replacement of the problematic cooling system pump. While the new pump was rigorously tested and calibrated prior to installation, it did not pass inspection upon installation, and the DataBank datacenter staff were required to remove it and reinstall the original pump. We share your frustration in this matter. However, operating a safe and reliable datacenter is of the utmost priority, and we will continue doing our best to keep PACE resources stable until DataBank is able to successfully replace the cooling pump. We are continuing to work with Georgia Tech leadership on long-term solutions to improve overall reliability and meet the expectations of our users.

At this time, we have extended the next maintenance period, August 5-8, 2025, to allow for installing a new cooling pump. We will share additional information as it becomes available.

Thank you,

The PACE Team

[Maintenance] April 28, 2025 at 9:42am

WHEN IS IT HAPPENING?

PACE’s next Maintenance Period starts at 6:00AM on Monday May 5th, 05/05/2025, and is tentatively scheduled to conclude by 11:59PM on Friday May 9th, 05/09/2025. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work and testing are completed.

WHAT DO YOU NEED TO DO?

As usual, jobs with resource requests that would be running during the Maintenance Period will be held until after the maintenance by the scheduler. During this Maintenance Period, access to all the PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. CEDAR storage will not be affected.

WHAT IS HAPPENING?

ITEMS REQUIRING USER ACTION:

[Firebird] The Firebird system will completely migrate to the RHEL9 Operating system

ITEMS NOT REQUIRING USER ACTION:

Change IDEaS storage user authentication from AD to LDAP
Run filesystem checks on all Lustre filesystems
Upgrade IDEaS storage
Upgrade Phoenix Project storage servers and controllers
Upgrade Phoenix scratch storage servers and controllers
Upgrade ICE scratch storage servers and controllers
Move ice-shared from NetApp to VAST storage
Rebuild ondemand-ice on physical hardware to handle increased usage
Move ICE pace-apps to separate storage volume
Firebird storage and scheduler improvements
Upgrade DDN Insight (for monitoring storage system performance)
Databank: replace cooling pump assembly
Databank: Cooling tower cleanup

WHY IS IT HAPPENING?

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system. This particular instance is allowing for the complete replacement of a problematic cooling system pump in the datacenter.

WHO IS AFFECTED?

All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you,

-The PACE Team

[Maintenance] Maintenance window EXTENDED – May 5th-9th

As part of the work needed to mitigate cooling issues in the Coda datacenter, there will be a full replacement of the cooling system water pump in the research hall of the datacenter. While we previously hoped to handle the maintenance from May 6-9th, we are now planning to start one day earlier, on May 5th at 6am ET, due to the volume of physical work that must be carried out in the datacenter.

Due to this being the final day for instructors to submit grades, we will ensure that the ICE system remains available to instructors. Reservations on all clusters have been set to prevent jobs from running into the maintenance window as usual.

We will follow up with a full list of the planned activities during this maintenance window in our two-week reminder.

[storage] Phoenix Project storage degraded performance

We are currently experiencing degraded performance on Phoenix Project storage. We are investigating with the vendor and will provide updates as we learn more.

Summary: Performance of Phoenix project storage is currently degraded.

Details: Two of our MDS (MetaData Servers) rebooted early Monday morning, March 31, and load averages are unusually high on one of them.

Impact: Researchers may experience significant slowness in read & write performance on Phoenix project storage until we are able to mitigate the issue. Conda environments located in project storage may be very slow to load (even if the python script to run is located elsewhere) or fail to activate, while attempts to view project storage files via the OnDemand web portal may time out.

[Advance Notice] Planned Spring-Break (March 17-21st) Downtime

[Update 3/6/25]

Summary: All PACE compute nodes will be unavailable from 4:00 PM on Friday, March 14, through Tuesday, March 18, to repair a water leak in the Coda Datacenter Research Hall cooling system. Access to login nodes and data will remain available.

Details: Due to a water leak discovered last month, a seal in the cooling system will be replaced at the start of Spring Break. A full replacement of the pump is planned for the May maintenance period, which will be extended one day and is now planned for May 6-9, 2025, once all parts are available. Non-compute nodes in the Enterprise Hall will not be impacted by the Spring Break repair.

Impact: During the outage, it will not be possible to run any compute jobs on any PACE cluster (Phoenix, Hive, ICE, Firebird, Buzzard). Login nodes and storage systems will remain available. A reservation has been placed on all schedulers to prevent any jobs from starting if their walltime request extends past 4:00 PM on March 14; these jobs will be held until maintenance is complete.

Thank you for your patience as we work to restore full functionality of the cooling system. You can read this message on our blog.

Best,

-The PACE Team

[Original Post 2/19/25]

Summary: A water leak has occurred in the CODA Datacenter Research Hall cooling system due to the failure of a pump seal – as a result, we are planning a two or three-day outage (pending confirmation from Databank and the mechanical contractor) during the week of March 17th-21st, which we hope will have less impact due to Spring Break. Access to login nodes and data will remain available, as these live in a different part of the datacenter. No compute services (Phoenix, Hive, ICE, Firebird, or Buzzard compute nodes) will be available. We will follow up once the exact days of the outage are finalized.

Details: A pump seal in the CODA research hall cooling system failed on Feb 16th. The leak is not currently impacting operation of any PACE resources. Databank is working on a full pump replacement (“flange-to-flange”) plan to address the issue. Databank is actively sourcing the pump and associated parts and coordinating with a new mechanical contractor. We currently target the pump replacement to occur during Spring Break (March 17 – 21). However, this target date could change based on supply chain constraints. The mechanical work is estimated to take one to two days (depending on whether additional damage or issues are identified during the pump replacement). Upon completion of the work, the PACE team will need one business day to conduct all necessary testing on the ~2,000 systems and release the five clusters currently hosted in the Research Hall (Phoenix, Hive, ICE/AI Makerspace, Firebird, and OSG Buzzard).

Being able to perform the work during Spring Break represents a best-case scenario. Databank is actively monitoring the leak and the overall health of the cooling system. Should the situation deteriorate quickly or a catastrophic failure occur, Databank will coordinate emergency repair work to replace the pump seal itself using available on-site spare parts. Under this scenario, a complete pump replacement would be coordinated during the planned PACE Maintenance period in May.

We are striving to keep the shutdown as short as possible. A reservation has been placed on the cluster to prevent any jobs being cancelled by the shutdown – which will cause some jobs to hold until the outage is over.

Thank you for your patience as we work to recover from this situation.

Best,

-The PACE Team

[Notice] Phoenix Scheduler Account Issue

Following the Slurm upgrade during the January 2025 maintenance window, the monthly usage reset did not execute as scheduled on February 1. Consequently, reported balances were lower than expected, as the reported usage still included January utilization. Having identified the issue, the PACE team manually reset usage across all accounts at 12:00 PM on February 4. Additionally, the vendor has been notified of the bug to provide a patch before the next monthly cycle.

Any jobs that ran from the beginning of February until now will not count towards February usage, and any overages of pre-set limits will be refunded.

We sincerely apologize for the inconvenience.

Thank you and have a great day!

PACE team

[Complete] PACE Maintenance Period – January 13th-16th 2025

WHEN IS IT HAPPENING?

PACE’s next Maintenance Period starts at 6:00AM on Monday January 13th, 01/13/2025, and is tentatively scheduled to conclude by 11:59PM on Thursday January 16th, 01/16/2025. The additional day is needed to accommodate additional testing needed due to the presence of both RHEL7 and RHEL9 versions of our systems as we migrate to the new Operating System. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work and testing are completed. We will prioritize ICE to support Spring courses as soon as possible, and for the others, plan to focus on the largest portion of each system first (for Phoenix and Firebird where both OSs are present), to restore access to data and compute capabilities.

WHAT DO YOU NEED TO DO?

As usual, jobs with resource requests that would be running during the Maintenance Period will be held until after the maintenance by the scheduler. During this Maintenance Period, access to all the PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. CEDAR storage will not be affected.

WHAT IS HAPPENING?

ITEMS REQUIRING USER ACTION:

[Phoenix] Continue migrating nodes to the RHEL 9 operating system, which will complete post-MD – after this, Phoenix will be 75% on the RHEL9 OS.

[Hive] COMPLETE migrating nodes to the RHEL 9 operating system.

[Phoenix and Hive] Default login behavior will change so that login-phoenix and login-hive will point to RHEL 9 login nodes rather than RHEL 7 nodes, which WILL trigger SSH warnings. For more information on SSH at PACE, see our documentation.

ITEMS NOT REQUIRING USER ACTION:

[Phoenix, Hive, Firebird, ICE] Upgrade Slurm to 24.11.10

[all] DataBank will perform cooling tower cleaning requiring all machines in the research hall to be powered off

[all] Upgrade border firewall hardware

[Phoenix,ICE] Upgrade IB (InfiniBand) switch firmware

[Phoenix,Hive] Move Globus endpoints to new network to improve performance

[ICE] Enable self-service container builds

[Phoenix] Upgrade all storage servers to latest version to support performance improvements, covering scratch and project (coda1) storage

[Firebird] Upgrades to underlying storage servers to improve functionality

WHY IS IT HAPPENING?

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?

All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you,

-The PACE Team

Message about Storage Performance, Reliability, and Future Plans for Phoenix

Executive summary

PACE recognizes that the increasing frequency of performance issues on the storage system is causing disruptions to your research on the Phoenix cluster. We are striving to mitigate the impact of these events while taking proactive measures to improve the reliability of our systems for the future. To this end, we are introducing new storage technology, and prioritizing migration of data from the existing project storage to the new system over the next year. We are currently working towards finalizing a seamless migration plan. Once the plan is ready, around late spring, we will follow up with detailed information regarding the timeline and any potential workflow impacts. Our goal is to minimize disruption and ensure that everyone is aligned on key milestones. There will be no changes to the unit price until the new system is fully implemented, data migration is complete, the existing system is reconfigured, and we have sufficient data usage metrics to determine any necessary price adjustments. We estimate this will be no earlier than the end of 2025. We believe this to be the fastest path towards a stable and effective storage solution that can cater to the varied storage needs of our user community. Please find more details below.

The Phoenix cluster at PACE hosts two major storage systems: scratch and project. Scratch is the temporary file system for storing files used during job execution; it is cleaned up once a month, with files older than 60 days marked for deletion. Project is a long-term file system; it holds 2.3 PB of data and about 1.8B files. Besides environmental factors such as chilled water outages, these two storage systems have been significant contributors to downtime and degraded performance of the Phoenix cluster. Based on our analysis, storage failure or severe degradation accounts for 47% of unplanned downtime so far in calendar year 2024. Addressing this concern has been our primary focus over the past six months. This message aims to share our progress and plans for the next 12 months in this regard.
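For users who want to see which of their scratch files would fall inside the 60-day window mentioned above, a minimal sketch follows. It is an illustration only: the function name is hypothetical, and it assumes file age is judged by modification time, which this message does not specify. The actual cleanup may use a different timestamp or additional rules.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def files_older_than(root, days=60, now=None):
    """Yield regular files under root whose modification time is more
    than `days` days before `now` (defaults to the current time).

    Using modification time is an assumption; the real cleanup policy
    may rely on a different timestamp."""
    now = time.time() if now is None else now
    cutoff = now - days * SECONDS_PER_DAY
    for path in Path(root).rglob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            yield path
```

Calling `list(files_older_than("/path/to/your/scratch"))` (with your own scratch path substituted) returns the candidate files. On a directory tree with many files the walk itself can take a while and adds metadata load to the file system, so it is best run sparingly.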

Scratch Space

The Scratch space is supported by a DDN 400NVX2 unit, which includes a mix of flash drives (NVMe) and spinning disks. The system was taken offline during the August maintenance for a major software upgrade performed by the vendor to improve its stability. Furthermore, software issues and instability on this unit have required us to turn off the use of flash drives for hot pools, which has impacted performance. To address this, a second software upgrade is planned for January 2025 during our scheduled maintenance period, at which point we will configure the flash space to use Progressive File Layout (PFL), keeping small files in flash and progressively increasing the number of stripes across all devices for improved performance.

Project Space

Project space is supported by a DDN18K system purchased in 2020. This unit primarily uses spinning disks and is nearing its end of life. Over the past two years, we have been experiencing an increase in issues related to software defects and disk failures. While the system has built-in redundancies to support multiple disk failures, the disk rebuild process, coupled with an increasing number of disk-intensive research jobs on Phoenix, is negatively impacting performance.

We have adopted a two-pronged strategy to address the project storage:

Perform software and limited hardware upgrades. During the August maintenance period, the hardware subsystem supporting the metadata functionality of the Lustre filesystem was replaced with a new dedicated unit. Due to its complexity, this operation required an additional day of work. The objectives of this upgrade were to a) improve the performance of the metadata functionality and b) provide an upgrade path for the future software releases. During the January 2025 maintenance period, the vendor will be able to perform a major software upgrade, to standardize Lustre 12.4 on all the storage appliances. This will increase the stability and performance of the project storage while simplifying its management.

Consult with other research computing sites (including the Texas Advanced Computing Center) to invest in a new storage system to replace or complement our DDN 18K unit.

Based on feedback from other research computing sites, we made an initial investment in an all-flash storage system from VAST Data. Because it is an all-flash system with a disaggregated architecture, we expect significantly better uptime and performance from it. In particular, the VAST system does not require downtime to perform major code upgrades. The vendor has installed the system, and we are in the process of bringing the unit into production to host data, initially in support of improvements to the DDN18K.

Our plan over the next 12 months includes:

Migrate all data from the current DDN18K project space to the new VAST storage space to support these improvements. Due to the storage outages and performance issues affecting our user community, we want to clarify the following during the migration process over the next 12 months, or as long as necessary to complete the DDN18K improvements:

All storage credit balances, including the free tier, will be equivalent to the existing system.

The unit rate of $5.67 per TB per month for the paid tier will remain the same for both the current project space and the new VAST Storage system.

Work with our vendor to reconfigure the DDN18K appliance to efficiently use all available SSD / NVMe drive space and recreate storage pools to remove performance bottlenecks during disk failure. This will require a complete reformatting of the storage space.

Gather metrics on storage efficiencies (e.g., capacity reduction after data de-duplication and compression) and operational efficiencies so we can more accurately calculate the rate for VAST Storage.

Leverage the VAST Data analytics to help users archive older data to lower-cost storage options such as CEDAR.

Publish comparative analysis about functionality, performance, and resiliency between the different storage services provided by PACE to help users make decisions on what storage service(s) to use based on their type of research and data.

In the long term, we expect the new VAST Data storage system to be offered as a separate service with its own storage rate. This rate will likely be higher than the current $5.67 per TB per month for the DDN18K system. However, we cannot determine the final rate until we better understand how our data usage and efficiencies are managed by the new system.

Storage credits that have already been purchased, or are purchased during this transition, will retain their value on the DDN18K system (in terabyte months). Alternatively, they can be converted to storage credits on the VAST system at a ratio to be determined once the transition is complete. At that point, users will have the option to stay on VAST or move back to the reconfigured DDN18K, or choose a different, potentially cheaper storage option.
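As a worked example of the figures above, the sketch below shows how the quoted $5.67 per TB per month rate and credits measured in terabyte months translate into cost and duration. The function names are hypothetical, the free tier is ignored, and the rate applies only to the current paid tier; the future VAST rate and the credit conversion ratio are not yet known and are not modeled.

```python
# Current paid-tier rate quoted in this message (USD per TB per month).
RATE_PER_TB_MONTH = 5.67


def monthly_cost(paid_tb: float, rate: float = RATE_PER_TB_MONTH) -> float:
    """Cost in USD of holding paid_tb terabytes for one month.
    Free-tier allocation is not subtracted here (an assumption)."""
    return round(paid_tb * rate, 2)


def months_covered(credit_tb_months: float, usage_tb: float) -> float:
    """Number of months a credit balance (in terabyte months) lasts
    at a constant usage of usage_tb terabytes."""
    if usage_tb <= 0:
        raise ValueError("usage_tb must be positive")
    return credit_tb_months / usage_tb


if __name__ == "__main__":
    print(monthly_cost(10))         # 10 TB held for one month
    print(months_covered(120, 10))  # 120 TB-months of credit at 10 TB
```

For instance, 10 TB at the current rate comes to $56.70 per month, and a credit of 120 terabyte months covers 12 months at that usage.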

Our goal is to enhance the reliability of the storage system in PACE while introducing new technologies to meet the diverse needs and budgets of the Georgia Tech Research community. Our aim is to develop migration strategies that minimize disruptions to your workflows and offer more storage options tailored to your requirements. To this end, we are cross-evaluating bulk versus individual group migrations, and we will be engaging with different research groups as necessary. We are committed to providing regular updates during this process.

Thank you for your understanding and support during this project. If you have any questions or concerns, please feel free to reach out to us at pace-support@oit.gatech.edu.

Phoenix Project storage Slowness

WHAT’S HAPPENING?

Multiple hard disks failed in a single RAID pool making up the filesystem underlying Phoenix Project storage. As the arrays are being rebuilt to ensure continued resilience against disk failures, read/write performance on the device may be somewhat slower.

In addition to this, as part of a mitigation for a previous storage issue on 9/30, we have temporarily re-configured our storage to rely fully on spinning disk rather than caching parts of files on solid-state drives, which will cause a general decrease in access speeds until we are able to transition back to the prior configuration.

WHEN IS IT HAPPENING?

The failed drives were replaced on Oct 23rd, and the pool rebuild will continue automatically. We will update when the process is complete.

WHY IS IT HAPPENING?

Hard disk failures are a regular part of life. The devices we support are capable of weathering these without data loss; however, it is necessary to re-write striped data onto replacement disks, leading to a slight performance slowdown. In this case, 4 of 64 disks failed in one of the several pools making up the coda1 filesystem. We have configured the system to avoid writing new files to that pool in the meantime. These particular disks were in service for over 5 years before failing.

We also had to disable our use of the Lustre Progressive File Layout (PFL) option on this device, which splits files between solid-state and spinning disk to provide faster access, due to the fact that the solid state drive pool became completely full on 9/30, causing a temporary outage. We are working to migrate data from the solid-state pool to spinning disk, but this process takes time and depends on the underlying drive pools being fully rebuilt, among other things.

WHO IS AFFECTED?

Phoenix users may experience slower performance of Phoenix Project storage during the rebuild, and additionally until we are able to re-enable PFL.

WHAT DO YOU NEED TO DO?

Please bear with us and keep an eye out for updates.

WHO SHOULD YOU CONTACT FOR QUESTIONS?

For any questions, please contact PACE at pace-support@oit.gatech.edu.

Message concerning September-October 2024 Datacenter Outages

Dear PACE Community,

Due to a highly unusual series of data center related outages this Fall, we would like to share details about the sequence of events and causes which have impacted the availability of PACE resources, and the PACE team’s continued work to provide a stable research computing environment to the GT community. We fully understand the significant negative impact these outages have on the research community, including the inability to submit research papers and meet deadlines, as well as the loss of research time. While this does not make up for the full impact of these outages, we always work to ensure that no paid accounts are charged for computational jobs that fail due to outages, and have temporarily doubled free-tier account credits for October 2024 in a small effort to alleviate the pain of lost time.

While many of the details here were communicated in the moment, a unified picture may help clear up certain misconceptions, and unfortunately prompt communication was required before a full understanding of the situation could be gained.

Background: The CODA datacenter is the sole hosting facility for PACE resources. The datacenter is owned and operated by Databank. PACE resources are spread across two datacenter areas:

The Enterprise Hall (500kW power provisioned) which has N+1 redundant cooling, networking, and power (battery-based UPS + Generator), where PACE and OIT host critical infrastructure and storage systems. This enables us to maintain access to login nodes and storage during most system and service outages impacting the datacenter.
The Research Hall (2MW), which was designed without redundant cooling, and relies on a combination of flywheel UPS (<1 minute runtime) + Georgia Power Microgrid in the case of an electrical utility outage (https://research.gatech.edu/georgia-tech-celebrates-opening-new-energy-project-midtown-atlanta). This design choice allowed for significantly more research compute capacity, performance, and greatly reduced facilities and operational costs. The design and operational model included elements to minimize single points of failure and to support faster recovery times.

For the calendar year 2024, the following power and cooling datacenter outages have impacted PACE services:

9/3/2024: On August 27th, Databank identified a failed chilled water flow sensor on the High-Temp Chiller loop providing cooling to the research hall. Databank requested downtime before the next PACE maintenance period (January 2025) for emergency replacement.
9/8/2024: On September 8th, the High-Temp Chiller system providing cooling to the research hall failed due to the condenser pump variable-frequency drive (VFD) failing. Due to supply chain constraints, a unit was not available as part of the on-site inventory and different brand/model VFD had to be sourced and installed. During the repair, Databank identified that the VFD failure had damaged the condenser pump internal bearing. The condenser pump was replaced with the on-site spare.
10/1/2024: On October 1st, the data center experienced a short loss of utility power, which impacted the High-Temperature Chiller system providing cooling to the research hall. The new condenser pump variable frequency drive was unable to properly auto-reset because of a previously unknown parameter. Note: during this incident, PACE only shut off idle nodes and prevented new jobs from being launched. No running jobs were impacted.
10/2/2024: On October 2nd, at approximately 11:33am, the datacenter experienced a rapid sequence of utility power loss (8 events in less than two minutes). The Research Hall electrical load was transferred to the UPS/Flywheels for backup power. However, the load was unable to be transferred back to the microgrid as intended due to a network breaker that tripped in the electrical vault during the October 1st event. Only Georgia Power can reset this breaker. As a result, power was lost entirely to the Research Hall once the flywheels were depleted.

What are we doing to prevent these failures in the future?

OIT, in partnership with the GT Real Estate Office, has engaged Databank to review outages over the past few years. Specifically, we are:

Evaluating the 2017-2018 datacenter design requirements for the research hall, and how these requirements align with the needs for a reliable research computing infrastructure.
Reviewing, evaluating, and improving operational procedures between DataBank, Georgia Power, and Georgia Tech.
Reviewing and evaluating the list of critical spare parts maintained on-site by DataBank.
Engaging stakeholders to review reliability and resilience requirements for research computing.
Exploring potential options to improve the cooling and power redundancy of the research hall.
Analyzing the feasibility and pros and cons of hosting a small portion of the PACE computational capabilities in the high-availability enterprise side or leveraging cloud resources during outages.

Long term, we plan to explore the use of additional datacenter locations to host research computing resources.

Please feel free to reach out with any questions or concerns,

Didier Contis
Executive Director of Academic Technology, Innovation, Research Computing for the Office of Information Technology

Partnership for an Advanced Computing Environment

Author: jcoulter8

Georgia Institute of Technology