SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Poster 133: Portable Resilience with Kokkos

Authors: Jeffery Miles (Sandia National Laboratories), Nicolas Morales (Sandia National Laboratories), Carson Mould (Sandia National Laboratories), Keita Teranishi (Sandia National Laboratories)

Abstract: The Kokkos ecosystem is a programming environment that provides performance and portability to many scientific applications that run on DOE supercomputers as well as on smaller-scale systems. By building on the software abstractions already present in Kokkos, resilience in end-user code is made portable, while the most efficient resilience algorithms are implemented internally. This addition enables an application to manage hardware failures, reducing the cost of interruption without drastically increasing the software maintenance cost. Two main resilience methodologies have been added to the Kokkos ecosystem to validate the resilience abstractions: (1) checkpointing, which includes an automatic mode that supports other checkpointing libraries and a manual mode that leverages the data abstraction and memory space concepts; and (2) a redundant execution model, which anticipates failures by replicating data and execution paths. The design and implementation of these additions are illustrated, and examples are included to demonstrate the simplicity of use.
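
The redundant execution idea in the abstract can be pictured with a small sketch in plain C++. It does not use Kokkos or its resilience interfaces, which are not described on this page. The function name `run_triplicated`, the choice of three replicas, the element-wise majority vote, and the replica index passed to the kernel are all assumptions made for illustration, not the authors' design.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical helper (not a Kokkos API): run `kernel` on three independent
// copies of `input` and keep, element by element, the value that at least two
// replicas agree on. Returns std::nullopt if some element has no majority.
std::optional<std::vector<double>> run_triplicated(
    const std::vector<double>& input,
    const std::function<void(std::vector<double>&, int)>& kernel) {
  // Replicate the data so that each execution path owns its own copy.
  std::array<std::vector<double>, 3> replicas = {input, input, input};
  for (int r = 0; r < 3; ++r) {
    // The replica index is passed only so a caller can inject a fault.
    kernel(replicas[r], r);
  }

  std::vector<double> voted(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    const double a = replicas[0][i];
    const double b = replicas[1][i];
    const double c = replicas[2][i];
    if (a == b || a == c) {
      voted[i] = a;  // replica 0 agrees with at least one other replica
    } else if (b == c) {
      voted[i] = b;  // replica 0 is the outlier
    } else {
      return std::nullopt;  // all three disagree: voting cannot recover
    }
  }
  return voted;
}
```

A caller would pass the data and a kernel, for example a lambda that doubles every element, and receive the voted result. A single corrupted replica is outvoted, while a three-way disagreement is reported as a failure. In this sketch the price of that protection is three times the memory and compute.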

Best Poster Finalist (BP): no
