Authors: Michael Ott (Leibniz Supercomputing Centre), Melissa Romanus Abdelbaky (Lawrence Berkeley National Laboratory), Keiji Yamamoto (RIKEN), Luca Bortot (ENI, Italy), Michael Mason (Los Alamos National Laboratory)
Abstract: For several years now, supercomputing sites around the globe have had development and implementation projects underway for expanded monitoring frameworks that collect operational parameters of HPC systems and facility support infrastructure into a single unified database, providing a new and more comprehensive overview of all operations. These early adopter sites have already deployed such systems into production and are building a valuable repository of performance data. What to do with this wealth of data, how to process and analyze it, and how to feed it back into improved operations will be the topic of this BoF session.
Long Description: The Energy Efficient HPC Working Group (EE HPC WG) and others have long argued that comprehensive and fine-grained instrumentation and monitoring of HPC systems and their infrastructures are required to better understand, control, and optimize HPC operations. Many supercomputing sites have followed this path and deployed intensive instrumentation in their data centers. Leveraging scalable NoSQL databases and other big data technologies, some of them have been able to develop and deploy sophisticated monitoring frameworks that can collect, store, and retrieve telemetry data from thousands of devices at high resolution. These frameworks now allow for the collection of data at all levels of HPC operations: from site-level power provisioning and cooling infrastructures down to individual compute nodes and their sub-components, and in some cases even telemetry data from the HPC operating system and applications. The next big challenge is making use of this trove of data.
There are plenty of use cases for exploiting all this monitoring data. The most obvious ones are data center infrastructure performance optimization and Fault Detection & Diagnostics (FDD). More sophisticated scenarios foresee feeding this data back to facility control systems or batch schedulers to optimize the energy performance or utilization of the infrastructure and HPC systems.
As leading-edge sites have already deployed high-resolution monitoring, they are now developing tools to explore and analyze the vast amounts of data they are collecting and to leverage it for their operations. Other sites are just beginning their own deployment of expanded monitoring tools and are about to face the same issues and challenges the pioneering sites faced.
The EE HPC WG has created a team for Operational Data Analytics (ODA) to provide a forum for sites engaged in these activities to share and discuss their ideas, implementations, use cases, and outcomes. The ODA team is organizing this BoF session to bring together sites that have gained experience with ODA processes and those that are in the midst of development or planning to deploy such systems. There are plenty of lessons to be learned and ideas to be shared that could help all parties in their endeavors and potentially reduce duplicated effort. Likewise, attendees with a background in data analytics and machine learning could provide valuable feedback and share their experiences.