Abstract: As new architecture designs continue to boost system performance through higher circuit density, shrinking process technology, and near-threshold voltage operation, hardware is projected to become more vulnerable to transient faults. Although relatively infrequent, crashes due to transient faults are highly disruptive and unpredictable, necessitating frequent check-pointing, which incurs huge overhead.
In this paper, we present CARE, a lightweight, compiler-assisted technique that lets applications continue executing after crash-causing errors. CARE repairs corrupted states by recomputing the data for the crashed architectural states on the fly. We evaluated CARE with 5 scientific workloads on up to 3072 cores. During normal execution of applications, CARE incurs near-zero overhead, and it recovers on average 83.5% of crash-causing errors within tens of milliseconds. Moreover, with such an effective error-recovery mechanism, frequent check-pointing can be relaxed to relatively infrequent check-pointing, greatly reducing the overhead.