PACE Advisory Committee Meeting - September 7, 2022

PAC Meeting Summary, September 2022

Meeting Agenda:

  1. Announcements (PACE presentation) (7 min)
  2. Progress and FY22 usage in review (PACE presentation) (10 min)
  3. Slurm updates and planned Phoenix migration to Slurm (PACE presentation) (5 min)
  4. Publication of PAC report (PACE presentation)
  5. PAC committee membership (Lew) (10 min)
  6. Nvidia COE possibility (3 min)
  7. Prioritizing topics from the PAC to address for the coming year (5 min)
  8. Open discussion (10 min)

Participants:

Lew Lefton, Srinivas Aluru, Annalisa Bracco, Steven Liang, Joseph Oefelein, David Sherrill, Annalise Paaby, Didier Contis, Pam Buffington, Semir Sarajlic, Laura Cadonati, Fang Liu

Summary:

Pam Buffington gave the PACE presentation (slides).

  1. Announcements
    • Pam introduced Didier Contis as the new Executive Director of Academic Technology, Innovation, and Research Computing, and Fang “Cherry” Liu as the interim team lead for the Research Computing Facilitation team. Recent hires into the Research Computing Facilitation team include Marian Zvada, Deepa Phanish, and Jeff Valdez. PACE is also hiring aggressively for several vacant positions, including:
      • AI Solutions Architect (RSII, ASE) (new, AI4Opt)
      • Data Solutions Architect (new)
      • Research Computing Facilitator (RSII, RCF)
      • System Support Engineering Lead
      • Systems IT Architect Lead (x2) (1 new FY23)
      • Systems Support Eng Lead (CI)
      • Research Software Engineer (ASE)
  2. Progress and FY22 usage in review
    • >15% resource utilization increase (averaging >35% in recent months)
    • 60% and 40% increases in per-group concurrent proc use and max job size
    • 250-800% reduction of average queue wait times
    • 2500% decrease in faculty resource availability time
    • FY22 refresh usage: $1,038,274.50
    • Paid usage: $55,477.05
    • Total compute pre-pay: $153,780.68
    • Free-tier usage: $29,782.17
    • Average wait time: 1.14 hours
    • CPUh consumed: 157,628,470.68 (Phoenix only!)
  3. Slurm updates and planned Phoenix migration to Slurm (PACE presentation) (5 min)
    • Reasons to Migrate to Slurm
      1. Increased stability: we have had problems with Torque and MOAB from a scheduling, billing, and outage perspective.
      2. Slurm will be more efficient because it can start jobs faster. Torque currently takes ~10 minutes per job to start regardless of the wait time, and these delays add up significantly overall.
      3. Slurm is the industry standard, is modern, and is better supported than Torque.
    • PACE is proposing staggered migration for Phoenix following Hive migration
      1. Minimizes disruption risks to users
      2. Provides greatest user flexibility
      3. Allows all users to access Phoenix-Slurm beginning October 10
      4. Users will receive a free-tier account on Phoenix-Slurm that’s equivalent to 10,000 CPU*hours on base hardware and usable on all architectures; additionally, users will receive access to the Embers (free backfill) queue
      5. Users will be charged for their jobs in Slurm.
      6. To facilitate this, PACE will generate new accounts on Phoenix-Slurm and move 50% of the credits from the existing MAM account
      7. Monthly statements will continue to provide reports
      8. Initial wait times on Phoenix-Slurm should be low, giving users an incentive to migrate.
      9. Extensive training will be made available.
    • 6 Phoenix migration phases proposed:
      1. October 10 – 500 nodes
      2. November 2 (PACE maintenance) – 300 nodes
      3. November 28 – 200 nodes
      4. January 3 – 100 nodes
      5. January 17 – 100 nodes
      6. January 31 (PACE maintenance) – 119 nodes
    • PAC discussed the Hive transition, which David Sherrill confirmed went smoothly. Overall, PAC had no objections to the proposed Phoenix migration plan or timeline.
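
The six proposed migration phases listed under item 3 can be tallied to see how quickly Phoenix capacity moves to Slurm. The following is a minimal sketch, not a PACE tool: the node counts and dates come from the list above, the years (2022 for October and November, 2023 for January) are inferred from the meeting date and are an assumption, and the names `PHASES` and `cumulative_migration` are hypothetical.

```python
from datetime import date

# Phase dates and node counts as listed in the minutes.
# The years are an assumption inferred from the September 2022 meeting date.
PHASES = [
    (date(2022, 10, 10), 500),
    (date(2022, 11, 2), 300),   # PACE maintenance
    (date(2022, 11, 28), 200),
    (date(2023, 1, 3), 100),
    (date(2023, 1, 17), 100),
    (date(2023, 1, 31), 119),   # PACE maintenance
]


def cumulative_migration(phases):
    """Return (date, phase_nodes, cumulative_nodes, cumulative_fraction) rows."""
    total = sum(nodes for _, nodes in phases)
    rows = []
    running = 0
    for when, nodes in phases:
        running += nodes
        rows.append((when, nodes, running, running / total))
    return rows


if __name__ == "__main__":
    for when, nodes, running, frac in cumulative_migration(PHASES):
        print(f"{when.isoformat()}: +{nodes:>3} nodes, {running:>4} total ({frac:.0%})")
```

Under these figures the six phases total 1,319 nodes, with 1,000 of them on Slurm by the end of the November 28 phase.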

  4. Publication of PAC report (PACE presentation)
  5. PAC committee membership (Lew) (10 min)
    • Lew led the discussion on PAC committee membership. Of the 17 members, 6 remain. Pam and Semir will replace Neil and Memo. The usual membership term is 1-2 years.
    • Tony Pan and Mehmet Belgin have left GT
    • Shared a summary of survey results
      • PACE-PAC-bylaws.docx
      • Pam and Semir will provide input
        • Anna Erickson
        • Vishal Acharya
        • Brad Robertson
        • Lynn Kamerlin
  6. Nvidia COE possibility (3 min)
  7. Prioritizing topics from the PAC to address for the coming year (5 min)
    • Will discuss at our next meeting on 11/2/2022.
  8. Open discussion (10 min)
    1. Annalise Paaby
      1. Request to get a list of top users per college/school

ACTION:

  • Create a PAC email list
  • Provide reports to PAC of top users by college and schools

Rate study

  • The Indirect Cost Waiver Review is coming up; Jim Fortner will provide it.