HPC PowerStack: Community-Driven Collaboration on Power-Aware System Stack

Authors: Martin Schulz (Leibniz Supercomputing Centre, Technical University Munich), Siddhartha Jana (Intel Corporation, Energy Efficient HPC Working Group), Stephanie Brink (Lawrence Livermore National Laboratory), Ryuichi Sakamoto (University of Tokyo)

Abstract: This interactive BoF will bring together vendors, labs, and academia to discuss an ongoing community-wide effort to incorporate power awareness into the system stacks of upcoming exascale machines. HPC PowerStack is the first and only community-driven, vendor-neutral effort to identify which power-optimization software actors are critical within the modern-day stack, discuss their interoperability, and work towards gluing together existing open-source projects to engineer cost-effective yet cohesive and portable implementations.

This highly interactive BoF will disseminate key insights acquired in the project, provide prototyping status updates, invite attendee feedback on current directions, brainstorm solutions to open questions, and solicit participation in addressing the imminent power challenge.

Long Description:

## Motivation and relevance

Community interest in tackling exascale power challenges is growing. While several standalone efforts attempt to address these challenges, the majority of the implemented techniques have been designed to meet site-specific needs or optimization goals. Specifications like Redfish provide high-level power management interfaces for accessing power knobs. However, these stop short of defining which software components should actually be involved and how they should interoperate in a cohesive and coordinated stack. We believe coordination is critical for avoiding underutilization of system Watts and FLOPS.

This realization led to the formation of the HPC PowerStack Community in 2016. The charter of this community includes (A) identifying the key software actors needed in a system power stack; (B) reaching a consensus on their roles and responsibilities; (C) designing communication protocols for bidirectional control and feedback signals among them to enable scalable coordination at multiple granularities; (D) establishing a unified hierarchical communication model for accessing power monitoring and control knobs in hardware and software; and (E) leveraging existing R&D prototypes and building a community that actively participates in development and engineering efforts in this domain.

## Pre-BoF Activities

In June 2018 and June 2019, a group of 40+ senior researchers, developers, and leaders from vendors, labs, and academia around the globe convened in Germany for a face-to-face seminar. The community, with representatives of all software stack layers, arrived at a consensus that (1) job/application awareness is going to be critical for boosting system-wide optimization, which implies the need to drive interoperation between a job-level runtime and the job scheduler; (2) hierarchical control systems provide a good model for scalable global optimization across the system, so the power stack should be a hierarchical system with bidirectional control and feedback signals flowing between the actors; and (3) today’s systems break this hierarchical model instead of providing layered access to privileged hardware knobs, an inefficiency that we as a community need to work towards fixing. These conclusions were consistent with the feedback from the SC18 PowerStack BoF attendees.

## BoF Goals

While the seminars above made good progress towards aligning the community around a common power stack, there are still open questions in the stack’s design. Some of them will be best answered through prototyping and through experience gained from the development of current state-of-the-art products. Also, since designing an entire stack from the ground up is a gargantuan effort, it is extremely important that the entire global HPC community is made aware of this effort and is willing to contribute to it. Hence, this BoF.

The goals are: (1) make attendees aware of the emerging community effort to design a common power stack and discuss the lessons learned during the past seminars; (2) provide updates on ongoing and planned prototyping efforts; and (3) align efforts across the community so that the SC19 BoF attendees reach a consensus on sharing R&D resources, avoid duplicating effort, agree on common interfaces, and reap the rewards together as a community.


