Partnership for an Advanced Computing Environment
In 2009, Georgia Tech created a technology model for central hosting of research cyberinfrastructure (CI) resources and expert support personnel to support multiple scientific disciplines within the campus research community. This project is called the “Partnership for an Advanced Computing Environment” (PACE). Via the PACE program, the executive leadership of Georgia Tech invests in an extensive portfolio of research CI, which includes datacenter capacity, high-performance computing and storage, and connections to various regional, national, and international research networks. Since its inception, PACE has grown to include approximately 1,800 compute nodes comprising over 50,000 CPU cores and over 9 petabytes of storage, including 535 terabytes of high-performance scratch storage, and has supported over 320 faculty and 3,500 users. By participating in PACE, faculty benefit from the efficient acquisition, careful deployment, proper maintenance, and thoughtful management of research CI resources, which are critical factors for long-term sustainability.
PACE provides a three-tier storage architecture comprising home directory, project space, and high-transfer-rate scratch space across the entire condominium. On top of the storage, compute capabilities are provided either as exclusive resources for a PI or research group, or as resources shared with the campus community. Through innovative workflow scheduling, shared resources may be utilized by their contributors at higher priority. Additionally, the Institute has invested in a resource called the "FoRCE Research Computing Environment", commonly known as the "FoRCE". Through a lightweight proposal-based allocation process, faculty may request access to the FoRCE for specific projects at no cost. This model has proven to be a best-of-breed design that has been adopted by others. Total research CI funding comes from a mix of central funding (30%) and faculty funding (70%) that has proven sustainable and is expected to continue with increased growth into the future. Due to this rapid growth, more hosting capability is being planned. (See CODA below.)
Over time, the FoRCE has grown to become a diverse, heterogeneous resource. Nodes in the FoRCE conform to a baseline configuration that specifies minimum processor/memory/networking ratios to provide some predictability. Under the auspices of a faculty governance committee, faculty may request a zero-cost allocation on the FoRCE through a lightweight proposal process.
When purchasing nodes, faculty have the option to share the unused time of their resources in exchange for access to the idle time on other shared resources, including the FoRCE. The contributing faculty member, and users authorized by the same, enjoy a high scheduling priority on their contributions that far exceeds that of users from other research groups. In this manner, faculty sharing their resources have the ability to run jobs that are larger than their own investment.
As of September 2018, the shared nodes in the FoRCE cluster include 2,512 cores:
- 2 nodes, each with two 16-core AMD Opteron 6274 processors with 128 GB RAM
- 7 nodes, each with two 16-core AMD Opteron 6276 processors with 128 GB RAM
- 2 nodes, each with two 16-core AMD Opteron 6378 processors with 128 GB RAM
- 7 nodes, each with two 16-core AMD Opteron 6378 processors with 256 GB RAM
- 2 nodes, each with two 16-core AMD Opteron 6274 processors with 128 GB RAM
- 19 nodes, each with two 8-core Intel E5-2670 processors with 32 GB RAM
- 27 nodes, each with two 8-core Intel E5-2670 processors with 64 GB RAM
- 8 nodes, each with two 4-core Intel E5-2623v4 processors with 128 GB RAM and one nVidia P100 GPU
- 1 node, with two 4-core Intel E5-2623v4 processors with 128 GB RAM and one nVidia V100 GPU
- 2 nodes, each with two 12-core Intel E5-2680v3 processors with 64 GB RAM
- 7 nodes, each with two 12-core Intel E5-2680v3 processors with 128 GB RAM
- 10 nodes, each with two 14-core Intel E5-2680v4 processors with 128 GB RAM
- 2 nodes, each with two 14-core Intel E5-2680v4 processors with 256 GB RAM
Adjacent to the FoRCE is a set of resources, known as the TestFlight cluster, which serves as a development sandbox for debugging codes before execution at larger scale. Use of the TestFlight cluster is available to all PACE participants. As of September 2018, the TestFlight cluster includes 96 cores:
- 1 node, with two 8-core Intel E5-2670 processors with 32 GB RAM
- 4 nodes, each with two 10-core Intel E5-2650v3 processors with 64 GB RAM
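The 96-core figure can be checked directly from the two entries above. The short sketch below tallies cores as nodes × sockets × cores per socket; the tuple layout and the names `testflight_nodes` and `total_cores` are illustrative choices, not part of any PACE tooling.

```python
# Each entry: (node_count, sockets_per_node, cores_per_socket),
# transcribed from the TestFlight list above.
testflight_nodes = [
    (1, 2, 8),   # Intel E5-2670 nodes, 32 GB RAM
    (4, 2, 10),  # Intel E5-2650v3 nodes, 64 GB RAM
]


def total_cores(entries):
    """Sum nodes * sockets * cores-per-socket over all entries."""
    return sum(nodes * sockets * cores for nodes, sockets, cores in entries)


testflight_total = total_cores(testflight_nodes)
print(testflight_total)  # 1*2*8 + 4*2*10 = 96
```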
Big Data, Accelerators, OSG, and LIGO
PACE supports research CI outside of the conventional HPC resources, providing a Hadoop big data platform and several nodes with a mix of nVidia Tesla accelerators. The LIGO cluster is fully integrated into the Open Science Grid (OSG). Jobs submitted to OSG’s glide-in factory via HTCondor are received locally and executed on the cluster.
Distributed and Federated Academic Cloud (VAPOR)
In 2014, Georgia Tech created a campus strategy in support of cloud computing resources for the campus research community. The Virtual Applications and Platforms for Operations and Research (VAPOR) facility was created as a joint project between the College of Engineering, College of Science, College of Computing, the PACE program, and the Office of Information Technology (OIT). The facility has several operating Red Hat RDO and Red Hat Enterprise OpenStack environments. It currently includes more than 10 compute nodes and is growing. These individual cloud environments are federated together using the Red Hat CloudForms open hybrid cloud-management framework. All of the VAPOR cloud services utilize campus identity management to facilitate federation and sharing of resources between campus units.
Georgia Tech has also implemented a distributed and federated Swift storage subsystem utilizing the SwiftStack product. The 10 storage nodes are managed by the College of Engineering, the College of Science, and the PACE program, and are made available to the campus community through SwiftStack proxy and SwiftStack gateway systems. Currently, the system includes over 200 TB of storage capacity.
PACE operates a ScienceDMZ, enabling Globus and GridFTP file transfers utilizing a 40 Gb/s connection to the campus network border. Via the Southern Crossroads (SoX) regional research network, GT connects to Internet2 via a 100 Gb/s link.
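To put these link speeds in perspective, the sketch below computes a best-case transfer time at full line rate. The 1 TB dataset size is a hypothetical example, and the calculation ignores protocol overhead, storage throughput, and contention, so real Globus or GridFTP transfers would take longer.

```python
def ideal_transfer_seconds(size_bytes, link_gbps):
    """Lower bound on transfer time: payload bits divided by line rate."""
    bits = size_bytes * 8
    return bits / (link_gbps * 1e9)


ONE_TB = 10**12  # hypothetical dataset size (decimal terabyte)

border_seconds = ideal_transfer_seconds(ONE_TB, 40)      # campus border link
internet2_seconds = ideal_transfer_seconds(ONE_TB, 100)  # link to Internet2

print(f"{border_seconds:.0f} s at 40 Gb/s, {internet2_seconds:.0f} s at 100 Gb/s")
```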
All existing and future Georgia Tech facilities are designed to meet the demanding requirements of modern HPC, data, and network systems, maximizing their availability with power, cooling, and storage redundancy measures. We have three relevant data centers. CODA, the newest facility, is currently under construction with an estimated completion date in the first quarter of CY2019. The Rich Computing Center (Rich) and the Business Continuity Data Center (BCDC) are legacy facilities.
Georgia Tech and PACE are anchor tenants in CODA, a multi-block, 21-story, technology-focused building offering ~650,000 sq. ft. of office space and 80,000 sq. ft. of data center space to foster innovations in data sciences and HPC. CODA is approximately half occupied by Georgia Tech and half by industry partners; Georgia Tech’s portion includes various academic schools and research centers, OIT, and a number of interdisciplinary research neighborhoods.
CODA supports the economic development of Atlanta and the State of Georgia through job creation, new tax revenues, and a technology cluster. It drives anticipatory innovations in research CI by serving a diverse research community and by converging industry, research, and educational leadership in a dynamic, world-class environment. It also provides HPC and data center space to commercial companies to become the ‘de facto center for excellence’ for HPC in Atlanta. It is expected to create a new ecosystem based around a unique facility modeling high-end computational/network/data-intensive hosting, defining the future in trans-disciplinary research, eco-friendly practices, and public/private partnerships.
Power capacity allocated for Georgia Tech use in CODA comprises two separate operating environments: 500 kW for enterprise workloads and 1,500 kW for research CI. While the enterprise space is designed with a traditional 2N configuration for both power and cooling, the research space is configured with only Uninterruptible Power Supply (UPS) coverage and no generator backup power. The data center is designed for expansion up to a total of 10 MW, installed in phases on demand.
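As a quick check on scale, the sketch below adds the two Georgia Tech allocations and compares them with the data center's design maximum, read here as 10 megawatts. The constant names are illustrative only.

```python
ENTERPRISE_KW = 500      # enterprise workloads (2N power and cooling)
RESEARCH_KW = 1_500      # research CI (UPS only, no generator backup)
DESIGN_MAX_KW = 10_000   # full build-out of the data center, 10 MW

allocated_kw = ENTERPRISE_KW + RESEARCH_KW
share_of_design_max = allocated_kw / DESIGN_MAX_KW

print(f"{allocated_kw} kW allocated, {share_of_design_max:.0%} of design maximum")
```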
Each floor of the CODA data center is served by a dedicated chilled water loop that utilizes heat exchangers, heat recovery units, and both water-side and air-side economization to deliver cooling at the most efficient PUE. Air is distributed within the suites either by 55-ton CRAHs for densities under 15 kW per rack, or by rear door heat exchangers that tie into the chilled water loop via close-coupled cooling for densities up to 45 kW per rack.
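The density thresholds above amount to a simple selection rule, sketched below. The function name and the handling of densities above 45 kW (raising an error) are assumptions for illustration; the text does not say how such racks would be cooled.

```python
def cooling_method(rack_kw):
    """Pick the air-distribution method for a rack from its power density."""
    if rack_kw < 0:
        raise ValueError("rack density cannot be negative")
    if rack_kw < 15:
        return "CRAH"  # 55-ton computer room air handlers
    if rack_kw <= 45:
        return "rear door heat exchanger"  # close-coupled to the chilled water loop
    # Assumption: densities above 45 kW per rack are outside the described design.
    raise ValueError("density exceeds the 45 kW per rack described for CODA")


print(cooling_method(8))
print(cooling_method(30))
```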
Georgia Power is delivering electrical power at 19.8 kV via dedicated, underground, concrete-encased duct banks that terminate in two below-grade transformer vaults, each containing up to five 3,000 kVA transformers. From there, power is stepped down to 4,160 V for distribution to secondary transformer vaults located close to each of the raised floor suites above, and then transmitted to data suite PDUs at 480 V.
Dedicated security is located in the CODA building lobby and manned 24/7. Entry from outside requires a security badge or visual confirmation from the security network operation center. Once inside the lobby, all visitors must pass through a minimum of three security checkpoints before accessing computer racks. All ingress/egress points, hallways, loading areas and outside pathways are monitored by closed circuit video devices.
CODA sits between West Peachtree Street and Spring Street, two of Atlanta’s primary fiber pathways, which provide direct access to Level 3, Zayo, Fiberlight, AT&T, and Comcast fiber.
Legacy data centers
Legacy HPC facilities at Georgia Tech include two 5,000 sq. ft. data centers in Rich, and a 4,000 sq. ft. data center in BCDC. The Rich facility will be decommissioned once the ongoing migration to CODA is complete. BCDC will remain operational to provide geographic diversity.
Rich has a total of 1.2 MW power capacity and BCDC has 2N redundant 270 kW capacity. Both provide a high (> 0.97) power factor. Rich is backed up with five Uninterruptible Power Supplies (UPS), and BCDC is supported by two UPS systems with 2N redundancy. One of the rooms in Rich has a low-density generator with 285 kW capacity that serves all of the critical storage and server units. The compute nodes are not on generators, but are connected to the UPS systems, which allow for a graceful shutdown. BCDC has a generator to provide power for all of the systems in the facility. HPC storage systems are physically distributed between these two facilities to prevent data loss in a catastrophic event, each holding the backups of the data hosted by the other.
Cooling is achieved by a 3N redundant 450-ton chilled-water system in Rich. BCDC is also equipped with an N+1 redundant 200-ton chilled-water system. Both facilities feature raised floors that allow full coverage of cooling systems to all racks, and are equipped with chilled water leak detection systems. All data center space is monitored by a 24x7 network operations team for cooling and power issues.
The Georgia Tech police department (GTPD), a division of the Georgia State Patrol, provides the general security on the campus. GTPD performs campus patrols, mobile camera monitoring and includes a SWAT response team for emergency preparedness and crime prevention. Georgia Tech data centers have badge level access and camera coverage including the building vicinity. Motion sensor alarms are configured to alert GTPD. All systems are monitored 24/7 by an operations team, located in Rich, which responds to emergencies and potential hazards, such as rising temperatures or chilled water leaks. Both data centers link through GTPD to the Atlanta Fire Department, which is located approximately 2.5 miles from campus.
Network and Connectivity
The Georgia Tech campus network consists of over 125,000 network ports across 200 buildings linked by 1,800 miles of fiber optic cabling. There are roughly 2,000 network switches from a diverse set of vendors. The campus provides a centrally managed WiFi service with over 3,600 access points. The campus network infrastructure is managed by the Network Engineering team, which is composed of network engineers employed by OIT. This team is charged with providing 24x7 support of the infrastructure: responding to problems reported by campus users, provisioning new equipment and connections, and troubleshooting service and performance issues. This team is responsible for everything “from the wall jack to the Internet,” including the campus WiFi network. The Network Engineering team works closely with Georgia Tech CyberSecurity to define and implement the policies and procedures for securing the campus IT infrastructure.
Research Network Connectivity
Georgia Tech has installed two Cisco Nexus 9500 data center switches to supplement the current campus network. These switches are interconnected with multiple 40 Gb/s links using Cisco’s vPC redundancy, with one located in Rich and one in BCDC. Individual research labs are connected to this research network using Cisco Nexus 93000 series switches with multiple 40 Gb/s connections. All fiber connections between these sites follow diverse paths, providing redundancy of operation within our campus. The Cisco 9500 switches are currently connected directly to the Southern Crossroads (SoX) network with a 100 Gb/s link.
External Network Connectivity
Georgia Tech led the creation of the Southern Crossroads (SoX) fiber ring that connects various universities and select research institutions in the Southeast, and intends to locate network access points for that ring in CODA. This combination of commercial and proprietary fiber gives CODA a unique on-ramp onto local and global fiber networks. SoX is a collaboration of 13 universities in the South joining forces to connect to the Internet2 network (vBNS). This collaboration, hosted at Georgia Tech and serving many of the top universities and state networks in the Southeast, has provided quality high-performance network access to many scientific resources around the world. Over the past 15 years, SoX has upgraded capabilities to the current 10 Gb/s network service (Internet2, NLR, Google, ESnet, and others), and has completed an upgrade to 100 Gb/s connectivity to the new Internet2 AL2S service. OIT manages and operates SoX. This strategic position of Georgia Tech allows for high-bandwidth connections to leading universities and national labs, with a 10 Gb/s link to Oak Ridge National Lab (ORNL) in particular. In addition, Georgia Tech is the regional GigaPOP for I2, and Southern Light Rail (SLR), the regional aggregation.
IPv6 has been available for services on campus since 2008, with the DNS and other necessary IPv6 infrastructure in place throughout that period. Additionally, firewalls are available for all campus IPv6 networks, including a self-service capability to rapidly reconfigure access controls as needed. IPv6 networks are treated as first-order entities, on par with support for IPv4 networks.
Georgia Tech uses perfSONAR to instrument its network for latency and bandwidth. The current infrastructure is a mesh, which includes 8 nodes at 10G and 1 node at 40G across the Georgia Tech network infrastructure. These nodes are distributed at key monitoring points including the main data centers and HPC areas across the campus. Also, the Georgia Tech perfSONAR mesh participates in a disjoint mesh with perfSONAR nodes in our regional network, the Southern Crossroads (SoX).