Workshop: Node-Failure-Resistant Preconditioned Conjugate Gradient Method without Replacement Nodes
Abstract: As HPC systems grow in scale to meet increased computational demands, the incidence of faults in a given window of time is expected to grow. This issue is addressed by the scientific community with research on solutions in every computational layer.
In this paper, we explore strategies for fault tolerance at the algorithmic level. We propose a node-failure-tolerant preconditioned conjugate gradient method that recovers efficiently from node failures without spare nodes, i.e., without any overhead in terms of available hardware. For load balancing, we redistribute the surviving and reconstructed solver data. The objective is to reconstruct the system either as it was before the node failure or as an equivalent, permuted version, and then to continue the execution of the solver on the surviving nodes only.
In our experimental evaluations, the recovery stage of the solver typically takes around 10% or less of the solver runtime, including the time to retrieve the problem-defining static data from the hard disk. With a suitable preconditioner, the average solver runtime overhead is 3.5% relative to a resilient solver that uses a replacement node. We investigate the influence of the preconditioner on the trade-off between load balancing and communication cost in the recovery phase. The obtained solutions are correct, and our method is thus a feasible way to recover from a node failure and continue the execution of the solver on the surviving nodes only.
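To make the idea of continuing on the surviving nodes concrete, the following minimal sketch recomputes a balanced contiguous block-row layout once one node is lost. It is an illustration under assumptions, not the redistribution scheme of the paper: the contiguous block-row layout and the helper names block_partition and repartition_after_failure are hypothetical and do not come from the abstract.

```python
def block_partition(n_rows, n_nodes):
    """Split rows 0..n_rows-1 into n_nodes contiguous, near-equal blocks.

    Returns a list of half-open (start, stop) row ranges, one per node.
    """
    base, extra = divmod(n_rows, n_nodes)
    bounds = []
    start = 0
    for rank in range(n_nodes):
        # The first `extra` nodes take one additional row each.
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def repartition_after_failure(n_rows, n_nodes, failed_rank):
    """Assumed scheme: rebalance all rows over the surviving ranks only.

    Returns {surviving_rank: (start, stop)}; no replacement node is used.
    """
    survivors = [r for r in range(n_nodes) if r != failed_rank]
    new_bounds = block_partition(n_rows, len(survivors))
    return dict(zip(survivors, new_bounds))


if __name__ == "__main__":
    print(block_partition(10, 4))
    print(repartition_after_failure(10, 4, failed_rank=1))
```

A fully balanced layout like this one shifts rows between most surviving nodes, whereas handing the lost block to a single neighbor would move less data but leave that neighbor with a larger share. This is one simple way to picture the trade-off between load balancing and communication cost in the recovery phase.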