Sample NSF application boilerplate describing PACE HPC

Computing Resources at the Campus Level

The Georgia Institute of Technology maintains a strategic investment in a comprehensive HPC environment called the Partnership for an Advanced Computing Environment (PACE) - a partnership between faculty and the Institute itself. Via the PACE program, the executive leadership of Georgia Tech invests in data center infrastructure, technical services, systems administration, and procurement assistance. In addition, PACE maintains an extensive HPC support infrastructure that includes high-performance scratch storage, networking, file backups, and software licenses for common tools.

Faculty-directed research can benefit from this investment at several levels. First, researchers have immediate access to a shared pool of existing capacity, including GPUs, so they can begin working without the delays associated with acquiring equipment. Faculty are also encouraged to augment the shared pool with equipment from their own research funding, which will be prioritized for their use. Faculty contributions may also be run as exclusively dedicated resources that retain many of the benefits of the shared PACE infrastructure. By participating in PACE, faculty benefit from the efficient acquisition, careful deployment, proper maintenance, and thoughtful management of HPC resources, all of which are critical factors for successful utilization.

Furthermore, the Institute has invested in an HPC resource called the "FoRCE Research Computing Environment" - commonly known as the "FoRCE". The FoRCE began with an initial Institute investment of approximately 1,600 CPU cores, including some NVIDIA Tesla-based GPU nodes, and has since grown into a diverse and heterogeneous resource. The FoRCE also includes a small subset of nodes that serve as a development sandbox for debugging codes before execution on the full cluster. Nodes in the FoRCE conform to a baseline configuration that specifies minimum processor/memory/networking ratios, allowing for some amount of predictability in a heterogeneous environment. Use of the test environment is open to all PACE participants, while use of the FoRCE is determined by a faculty governance committee. Faculty can request access to the FoRCE for specific projects and courses via a lightweight proposal process.

When purchasing nodes, faculty have the option to share the unused computation time of their equipment in exchange for access to idle time on other shared resources, including the FoRCE. The contributing faculty member, and the users they authorize, enjoy a scheduling priority on their contributed nodes that far exceeds that of users from other research groups. In this manner, faculty who share their resources can run jobs that are larger than their own investment alone would support.
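As an illustration of this priority arrangement, the following minimal Python sketch ranks jobs on a contributed set of nodes. The group names, weights, and data types are hypothetical assumptions chosen for exposition; they do not describe PACE's actual scheduler configuration.

    from dataclasses import dataclass
    from typing import Optional

    # Hypothetical weights chosen only for illustration.
    OWNER_BOOST = 100_000   # large boost for the group that purchased the nodes
    SHARED_BASE = 1_000     # baseline priority for jobs using idle, shared cycles

    @dataclass
    class Job:
        user: str
        group: str                 # research group submitting the job
        queue_wait_hours: float    # how long the job has been waiting

    @dataclass
    class NodeSet:
        name: str
        owner_group: Optional[str]  # None for Institute-funded shared nodes

    def priority(job: Job, nodes: NodeSet) -> float:
        """Rank jobs on a given set of nodes.

        Jobs from the group that contributed the nodes are placed far ahead
        of all others; remaining jobs compete for idle cycles, ordered by a
        simple aging term.
        """
        score = SHARED_BASE + job.queue_wait_hours
        if nodes.owner_group is not None and job.group == nodes.owner_group:
            score += OWNER_BOOST
        return score

    if __name__ == "__main__":
        contributed = NodeSet("chem-nodes", owner_group="chem-lab")
        jobs = [
            Job("alice", "chem-lab", queue_wait_hours=0.5),
            Job("bob", "bio-lab", queue_wait_hours=12.0),
        ]
        # The contributing group's job sorts first despite its short wait.
        for job in sorted(jobs, key=lambda j: priority(j, contributed), reverse=True):
            print(f"{job.user:8s} priority={priority(job, contributed):,.1f}")

In practice, logic of this kind lives entirely in the cluster's batch scheduler; users simply submit jobs and the policy is applied transparently.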

Any written description of current PACE hardware is likely to be somewhat out of date; however, a recent snapshot of the shared nodes in the FoRCE cluster includes 2,332 cores:

  • 4 nodes, each with 2 6-core Intel Xeon E5-2630 2.3 GHz processors with 64 GB RAM
  • 4 nodes, each with 2 8-core Intel Xeon E5-2670 2.6 GHz processors with 64 GB RAM
  • 10 nodes, each with 4 12-core AMD Opteron 6172 2.1 GHz processors with 128 GB RAM
  • 1 node, with 4 16-core AMD Opteron 6274 2.2 GHz processors with 256 GB RAM
  • 9 nodes, each with 4 16-core AMD Opteron 6276 2.3 GHz processors with 256 GB RAM
  • 1 node, with 4 16-core AMD Opteron 6276 2.4 GHz processors with 128 GB RAM
  • 1 node, with 4 16-core AMD Opteron 6378 2.4 GHz processors with 256 GB RAM
  • 39 nodes, each with 4 6-core AMD Opteron 8431 2.4 GHz processors with 64 GB RAM

The development sandbox, known as the TestFlight cluster, includes 272 cores:

  • 1 node, with 4 4-core Intel Xeon E5520 2.266 GHz processors with 12 GB RAM
  • 1 node, with 2 2-core AMD Opteron 2214 2.2 GHz processors with 4 GB RAM
  • 1 node, with 4 12-core AMD Opteron 6172 2.1 GHz processors with 128 GB RAM
  • 1 node, with 1 16-core AMD Opteron 6274 2.2 GHz processor with 16 GB RAM
  • 1 node, with 4 16-core AMD Opteron 6274 2.2 GHz processors with 128 GB RAM
  • 1 node, with 1 6-core AMD Opteron 8431 2.4 GHz processor with 16 GB RAM
  • 1 node, with 1 6-core AMD Opteron 8431 2.4 GHz processor with 32 GB RAM
  • 5 nodes, each with 4 6-core AMD Opteron 8431 2.4 GHz processors with 64 GB RAM

In addition to the shared 2,332 CPU cores in the FoRCE cluster, faculty have purchased over 6,000 CPU cores to contribute to the shared pool. All told, PACE manages approximately 1,200 nodes comprising nearly 30,000 CPU cores, 90 terabytes of memory, 2 petabytes of online commodity storage, and 215 terabytes of high-performance scratch storage.

Other computing resources are also available on campus beyond PACE. The Colleges of Computing, Engineering, and Sciences partner to support the School of Computational Science and Engineering (CSE). This academic unit supports students in interdisciplinary research and education in computer science, applied mathematics, engineering simulation, scientific computing, large-scale data, and other related fields. In particular, this means that researchers have access to a very talented pool of students who are often willing to work on computational projects.

Georgia Tech HPC Facilities

The High Performance Computing (HPC) datacenter facilities at Georgia Tech (GT) include two 5,000 square foot computer rooms in the Rich Computer Center and a 4,000 square foot computer room in our Business Continuity Data Center (BCDC). All three datacenters are owned and managed by GT and located on campus. These facilities provide general IT as well as HPC services to the entire campus.

The GT facilities are designed to meet the demanding requirements of modern HPC systems, maximizing availability through redundant power, cooling, and storage.

Power and Cooling. The Rich Computer Center has a total power capacity of 1.2 MW. The BCDC has a 2N-redundant 270 kW capacity. Both centers provide a high (> 0.97) power factor.

These facilities have suffered only two unplanned outages in the past 15 years. The Rich Computer Center is backed by five Uninterruptible Power Supply (UPS) systems, and the BCDC is supported by two UPS systems with 2N redundancy. One of the rooms in the Rich Center has a low-density generator with 285 kW capacity that serves all of the critical storage and server units. In the HPC area, the compute nodes are not on generator power but are connected to the UPS systems, which allows for a graceful shutdown. The BCDC has a generator that provides power for all of the systems in the facility. The HPC storage systems and server infrastructure are physically distributed between these two facilities to prevent data loss in a catastrophic event, with each facility holding backups of the data hosted by the other.
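To illustrate the graceful-shutdown arrangement, the short Python sketch below drains a set of compute nodes when utility power is lost. The node names, commands, and polling logic are hypothetical assumptions for exposition, not a description of PACE's actual operational tooling.

    import subprocess
    import time

    # Hypothetical values; real node names, commands, and UPS interfaces will differ.
    COMPUTE_NODES = ["node-001", "node-002"]  # illustrative node names
    POLL_SECONDS = 30
    DRY_RUN = True  # print the actions instead of executing them

    def on_utility_power() -> bool:
        """Placeholder for a real UPS status check (e.g., SNMP or vendor tooling)."""
        return False  # pretend utility power was just lost, to exercise the sketch

    def run(cmd: list) -> None:
        """Execute a command, or just print it when dry-running the sketch."""
        if DRY_RUN:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=False)

    def graceful_shutdown() -> None:
        """Give running jobs a short warning, then power nodes off cleanly on UPS."""
        for node in COMPUTE_NODES:
            run(["ssh", node, "shutdown", "-h", "+5"])  # 5-minute shutdown warning

    def main() -> None:
        # Poll the UPS; once utility power drops, shut the compute nodes down
        # while the batteries still have capacity.
        while on_utility_power():
            time.sleep(POLL_SECONDS)
        graceful_shutdown()

    if __name__ == "__main__":
        main()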

Cooling is provided by a 3N-redundant 450-ton chilled-water system in the Rich Center; the BCDC is equipped with an N+1-redundant 200-ton chilled-water system. Both facilities feature raised floors that allow the cooling systems to reach all racks, and both are equipped with chilled-water leak detection systems.

Security

The GT police department (GTPD), a division of the Georgia State Patrol, provides general security on campus. GTPD performs campus patrols and mobile camera monitoring and maintains a SWAT response team for emergency preparedness and crime prevention. GT datacenters have badge-controlled access and camera coverage, including the building vicinity. Motion sensor alarms are configured to alert GTPD. All systems are monitored 24/7 by an operations team, located in the Rich Computer Center, which responds to emergencies and potential hazards such as rising temperatures or chilled-water leaks. Both datacenters link through GTPD to the Atlanta Fire Department, which is located approximately 2.5 miles from campus.

Network and Connectivity

GT has a unique advantage in connectivity as a founding member of Internet2 (I2) and National Lambda Rail (NLR). GT's Office of Information Technology (OIT) manages and operates Southern Crossroads (SoX), the regional GigaPOP for I2, and Southern Light Rail (SLR), the regional aggregation point. This strategic position allows the proposed systems to be connected at multiple gigabit-per-second (Gb/s) speeds to leading universities and national labs, including a 100 Gb/s link to Oak Ridge National Laboratory (ORNL). Planned and funded initiatives will bring the GT network link to SoX up to 100 Gb/s.

The facilities provide 1 Gb/s, 10 Gb/s, and 40 Gb/s connections to all servers and HPC systems. The Rich Building is equipped with a QDR QLogic/Intel InfiniBand switch with uplinks that connect the two computer rooms.