Supervisor: Rong Ge (Clemson University)
Abstract: High-performance computing (HPC) systems must address resilience and power simultaneously. In heterogeneous systems, the trade-offs between resilience and energy efficiency are more complex for applications that use both CPUs and GPUs. Addressing both in such systems requires a deep understanding of the interplay among energy efficiency, resilience, and performance.
In this work, we present a new framework for resilient and energy-efficient computing in GPU-accelerated systems. The framework supports partial or full redundancy and checkpointing for resilience, and provides users with flexible hardware resource selection, adjustable precision, and power management to improve performance and energy efficiency. We further use CUDA-aware MPI to reduce resilience overhead, mainly in message communication between GPUs. Using CG as an example, we show that our framework provides about 40% time savings and 45% energy savings compared with a simple extension of RedMPI, a redundancy-based resilience framework for homogeneous CPU systems.
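The partial or full redundancy mentioned in the abstract can be pictured with a small sketch. It is not the framework's code: the names `replica_layout`, `digest`, and `copies_agree`, the `degree` parameter, and the choice to replicate the lowest-numbered ranks first are all assumptions made for illustration. The sketch maps each virtual rank to one or two physical ranks and then checks whether the message copies produced by the replicas are bit-identical, which is the general idea behind redundancy-based error detection.

```python
import hashlib
import struct


def replica_layout(n_virtual, degree):
    """Map each virtual rank to the physical ranks that run it.

    degree is the average number of copies per virtual rank (hypothetical
    parameter): 1.0 means no redundancy, 2.0 full dual redundancy, and
    values in between replicate only some of the ranks.
    """
    if not 1.0 <= degree <= 2.0:
        raise ValueError("degree must be between 1.0 and 2.0")
    n_replicated = round(n_virtual * (degree - 1.0))
    layout = {}
    next_physical = n_virtual  # replicas are numbered after the primaries
    for v in range(n_virtual):
        ranks = [v]
        if v < n_replicated:  # assumption: replicate the lowest ranks first
            ranks.append(next_physical)
            next_physical += 1
        layout[v] = ranks
    return layout


def digest(values):
    """Hash a list of floats bit-exactly (as packed doubles)."""
    packed = struct.pack(f"{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()


def copies_agree(copies):
    """Return True if all replica copies of a message are bit-identical."""
    return len({digest(c) for c in copies}) == 1


layout = replica_layout(4, 1.5)
print(layout)
```

With four virtual ranks and a degree of 1.5, two ranks get a replica and two run unreplicated. Comparing digests instead of full buffers is one possible design choice for keeping the comparison traffic small; whether the framework does this is not stated in the abstract.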
ACM-SRC Semi-Finalist: no