BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20200129T163556Z
LOCATION:301-302-303
DTSTART;TZID=America/Denver:20191122T112000
DTEND;TZID=America/Denver:20191122T114500
UID:submissions.supercomputing.org_SC19_sess131_ws_ftxs108@linklings.com
SUMMARY:Node-Failure-Resistant Preconditioned Conjugate Gradient Method wi
 thout Replacement Nodes
DESCRIPTION:Workshop\n\nNode-Failure-Resistant Preconditioned Conjugate Gr
 adient Method without Replacement Nodes\n\nPachajoa, Pacher, Gansterer\n\n
 As HPC systems grow in scale to meet increased computational demands, the 
 incidence of faults in a given window of time is expected to grow. This is
 sue is addressed by the scientific community with research on solutions in
  every computational layer.\n\nIn this paper, we explore strategies for fa
 ult tolerance at the algorithmic level. We propose a node-failure-tolerant
  preconditioned conjugate gradient method, which is able to efficiently re
 cover from node failures without the use of extra spare nodes, i.e., witho
 ut any overhead in terms of available hardware. For purposes of load balan
 cing, we redistribute the surviving and reconstructed solver data. The obj
 ective is to reconstruct the system either as it was before the node failu
 re, or an equivalent, permuted version, and then continue the execution of
  the solver only on the surviving nodes.\n\nIn our experimental evaluation
 s, the recovery stage of the solver typically takes around 10% or less of 
 the solver runtime, including the time to retrieve the problem-defining st
 atic data from the hard disk. When using a suitable preconditioner, the a
 verage solver runtime overhead is 3.5% over that of a resilient solver th
 at uses a replacement node. We investigate the influence of the precondit
 ioner on a trade-off between load-balancing and communication cost in the
  recovery phase. The obtained solutions are correct, and our method is thu
 s a feasible way to recover from a node failure and continue the execution
  of the solver only on the surviving nodes.\n\nTag: Workshop Reg Pass, Ext
 reme Scale Computing, Fault Tolerance, Reliability, Resiliency\n\nRegistra
 tion Category: Workshop Reg Pass, Extreme Scale Computing, Fault Tolerance
 , Reliability, Resiliency
URL:https://sc19.supercomputing.org/presentation/?id=ws_ftxs108&sess=sess1
 31
END:VEVENT
END:VCALENDAR

