BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20200129T163556Z
LOCATION:301-302-303
DTSTART;TZID=America/Denver:20191122T084500
DTEND;TZID=America/Denver:20191122T091000
UID:submissions.supercomputing.org_SC19_sess131_ws_ftxs103@linklings.com
SUMMARY:Asynchronous Receiver-Driven Replay for Local Rollback of MPI Appl
 ications
DESCRIPTION:Workshop\n\nAsynchronous Receiver-Driven Replay for Local Roll
 back of MPI Applications\n\nLosada, Bouteiller, Bosilca\n\nWith the increa
 se in scale and architectural complexity of supercomputers, the management
  of failures has become integral to successfully executing a long-running 
 high-performance computing application. In many instances, failures have 
 a localized scope, usually impacting a subset of the resources being used,
  yet widely used failure recovery strategies (like checkpoint/restart) fai
 l to take advantage and rely on global, synchronous recovery actions. Even
  with local rollback recovery, in which only the fault-impacted processes 
 are restarted from a checkpoint, the consistency of further progress in th
 e execution is achieved through the replay of communication from a message
  log. This theoretically sound approach encounters some practical limitati
 ons: the presence of collective operations forces a synchronous recovery t
 hat prevents survivor processes from continuing their execution, removing 
 any possibility for overlapping further computation with the recovery; and
  the amount of resources required at recovering peers can be untenable. 
 In this work, we solved both problems by implementing an asynchronous, rec
 eiver-driven replay of point-to-point and collective communications, and b
 y exploiting remote-memory access capabilities to access the message logs.
  This new protocol is evaluated in an implementation of local rollback ove
 r the User Level Failure Mitigation fault-tolerant Message Passing Interfa
 ce (MPI). It reduces the recovery times of the failed processes by an aver
 age of 59%, while the time spent in the recovery by the survivor processes
  is reduced by 95% when compared to an equivalent global rollback protocol
 , thus living up to the promise of a truly localized impact of recovery ac
 tions.\n\nTag: Workshop Reg Pass, Extreme Scale Computing, Fault Tolerance
 , R
 eliability, Resiliency\n\nRegistration Category: Workshop Reg Pass, Extrem
 e Scale Computing, Fault Tolerance, Reliability, Resiliency
URL:https://sc19.supercomputing.org/presentation/?id=ws_ftxs103&sess=sess1
 31
END:VEVENT
END:VCALENDAR

