SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

HPC System Testing: Procedures, Acceptance, Regression Testing, and Automation


Authors: Verónica Melesse Vergara (Oak Ridge National Laboratory), Bilel Hadri (King Abdullah University of Science and Technology (KAUST))

Abstract: This BoF will briefly highlight acceptance and regression testing procedures at several large-scale HPC centers. The goal of this BoF is to bring together testing efforts from multiple leadership-class supercomputer centers and ideas from the HPC community to discuss the different strategies used and to document lessons learned. BoF attendees will have the opportunity to share their experiences conducting acceptance and regression testing at their institutions and exchange best practices with other HPC centers.

Long Description: Supercomputers are becoming increasingly complex, tightly integrated systems consisting of many different hardware components, tens of thousands of processors and memory chips, kilometers of networking cables, thousands of storage disks, and hundreds of applications and libraries. To increase scientific productivity and ensure that applications efficiently and effectively exploit a system’s full potential, all of these components must deliver reliable, stable, and performant service.

The increasing complexity of high performance computing (HPC) architectures requires a larger number of tests in order to thoroughly evaluate a new system or a new software stack before it is transitioned to production. Large-scale systems, in particular, test the boundaries of new technologies because vendors often do not have an internal system of the same scale on which to test before shipping to the customer site. For that reason, HPC centers in many cases run hundreds of tests to verify the functionality, performance, and stability of both the hardware and the software stack.

For many years, regression testing has been an essential step of any software development or integration cycle. For HPC systems, however, regression testing is typically performed in a more ad hoc fashion and focuses on the basic functionality of the various hardware components, with the aim of releasing the system back to users as soon as possible after maintenance. The performance of each component is usually monitored and measured independently; these measurements, however, do not capture the overall behavior of the HPC system that users and their parallel applications experience under realistic workloads.

This Birds of a Feather (BoF) session will give a brief overview of acceptance and regression testing procedures at several large-scale HPC centers. The goal of this BoF is to bring together testing efforts from multiple leadership-class supercomputer centers and ideas from the HPC community to discuss the different strategies used and to document lessons learned. BoF attendees will have the opportunity to share their experiences conducting acceptance and regression testing at their institutions and exchange best practices with other HPC centers.

This BoF will bring together those with experience and interest in regression testing and those who want to explore this topic more deeply. The target audience includes computational scientists, user support specialists, and system administrators, along with the research community involved in benchmarking and monitoring. Technical managers may benefit from gaining insight into these activities in order to build support for them at their respective centers.

Topics covered will include, but are not limited to, testing procedures, test selection, test plan development, tools used to support execution of acceptance and regression tests, and analysis of test results.


URL: https://olcf.github.io/system-test-wg/events/sc19bof.html

