Predicting Faults in High Performance Computing Systems: An In-Depth Survey of the State-of-the-Practice

Authors: David Jauk (Technical University Munich), Dai Yang (Technical University Munich), Martin Schulz (Technical University Munich)

Abstract: As we near exascale, resilience remains a major technical hurdle. Any technique that aims to achieve resilience suffers from having to be reactive, as failures can occur at any time. A wide body of research therefore aims at predicting failures, i.e., forecasting them so that evasive actions can be taken while the system is still fully functional and its global state can still be reasoned about.

This research area has grown very diverse, with a large number of approaches, yet it is currently poorly classified, making it hard to understand the impact of existing work. In this paper, we perform an extensive survey of the existing literature on failure prediction by analyzing and comparing more than 30 different failure prediction approaches. We develop a taxonomy that aids in categorizing these approaches, and show how it can help to understand the state-of-the-practice of this field and to identify opportunities, gaps, and future work.

