Poster 36: Modeling Non-Determinism in HPC Applications
Time: Tuesday, 19 November 2019, 8:30am - 5pm
Description: As HPC applications migrate from the petascale systems of today to the exascale systems of tomorrow, the increasing need to embrace asynchronous, irregular, and dynamic communication patterns will lead to a corresponding decrease in application-level determinism. Two critical challenges emerge from this trend. First, unchecked non-determinism coupled with the non-associativity of floating-point arithmetic undermines the numerical reproducibility of scientific applications. Second, the prevalence of non-determinism amplifies the cost of debugging, in terms of both computing resources and human effort. In this thesis, we present a modeling methodology to quantify and characterize communication non-determinism in parallel applications. Our methodology consists of three core components. First, we build graph-structured models of relevant communication events from execution traces. Second, we apply similarity metrics based on graph kernels to quantify run-to-run variability and thus identify the regions of an execution where non-determinism manifests most prominently. Third, we leverage our notion of execution similarity to characterize applications via clustering, anomaly detection, and extraction of representative patterns of non-deterministic communication, which we dub "non-determinism motifs". Our work will amplify the effectiveness of software tools that target mitigation or control of application-level non-determinism (e.g., record-and-replay tools) by providing them with a common metric for quantifying communication non-determinism in parallel applications and a common language for describing it.
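To make the second component concrete, the following is a minimal sketch of how a graph kernel can score run-to-run similarity between two event graphs. It is an illustration under stated assumptions, not the poster's implementation: the abstract does not name a specific kernel, so a Weisfeiler-Lehman subtree kernel is swapped in here, and the graph encoding (adjacency dictionaries, labels such as `send@1` and `recv#0`) and the function names `wl_features`, `kernel`, and `similarity` are hypothetical. The two toy runs model one rank posting two receives that are matched to senders on ranks 1 and 2 in opposite orders.

```python
from collections import Counter


def wl_features(adj, labels, iterations=2):
    """Count Weisfeiler-Lehman subtree labels for one event graph.

    adj    : dict mapping each vertex to a list of neighbouring vertices
    labels : dict mapping each vertex to an initial string label
    """
    feats = Counter(labels.values())
    current = dict(labels)
    for _ in range(iterations):
        refined = {}
        for v in adj:
            # New label = own label plus the sorted multiset of neighbour labels.
            neigh = sorted(current[u] for u in adj[v])
            refined[v] = f"{current[v]}({','.join(neigh)})"
        current = refined
        feats.update(current.values())
    return feats


def kernel(fa, fb):
    """Dot product of two label-count feature vectors."""
    return sum(fa[k] * fb[k] for k in fa.keys() & fb.keys())


def similarity(g1, g2, iterations=2):
    """Normalised kernel value in [0, 1]; 1.0 means indistinguishable runs."""
    fa = wl_features(g1[0], g1[1], iterations)
    fb = wl_features(g2[0], g2[1], iterations)
    return kernel(fa, fb) / (kernel(fa, fa) * kernel(fb, fb)) ** 0.5


# Hypothetical event graphs: s1 and s2 are sends from ranks 1 and 2;
# r0 and r1 are two receives posted in program order on rank 0.
labels = {"s1": "send@1", "s2": "send@2", "r0": "recv#0", "r1": "recv#1"}

# Run A: the first receive matches rank 1, the second matches rank 2.
run_a = ({"s1": ["r0"], "s2": ["r1"], "r0": ["s1", "r1"], "r1": ["s2", "r0"]}, labels)
# Run B: the message arrival order is swapped.
run_b = ({"s1": ["r1"], "s2": ["r0"], "r0": ["s2", "r1"], "r1": ["s1", "r0"]}, labels)

same = similarity(run_a, run_a)
swapped = similarity(run_a, run_b)
print(f"A vs A: {same:.3f}, A vs B: {swapped:.3f}")
```

In this sketch, a run compared with itself scores 1.0, while the run with swapped message matching scores lower because the refined labels around each receive differ. Pairwise scores of this kind are the sort of input that clustering or anomaly detection over many executions could consume.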