BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20200129T163556Z
LOCATION:301-302-303
DTSTART;TZID=America/Denver:20191122T093500
DTEND;TZID=America/Denver:20191122T100000
UID:submissions.supercomputing.org_SC19_sess131_ws_ftxs109@linklings.com
SUMMARY:FaultSight: A Fault Analysis Tool for HPC Researchers
DESCRIPTION:Workshop\n\nFaultSight: A Fault Analysis Tool for HPC Research
 ers\n\nHorn, Fulp, Calhoun, Olson\n\nSystem reliability is expected to be 
 a significant challenge for future extreme-scale systems. Poor reliability
  results in a higher frequency of interruptions in high-performance comput
 er (HPC) applications due to system/application crashes or data corruption
  due to soft errors. In response, application-level error detection and re
 covery schemes are devised to mitigate the impact of these interruptions. 
 Evaluating these schemes and the reliability of an application requires 
 the analysis of thousands of fault injection trials, resulting in a 
 tedious and time-consuming process. Furthermore, there is no one data 
 analysis too
 l that can work with all of the fault injection frameworks currently in us
 e. In this paper, we present FaultSight, a fault injection analysis tool c
 apable of efficiently assisting in the analysis of HPC application reliabi
 lity as well as the effectiveness of resiliency schemes. FaultSight is des
 igned to be flexible and work with data coming from a variety of fault inj
 ection frameworks. The effectiveness of FaultSight is demonstrated by expl
 oring the reliability of different versions of the Matrix-Matrix Multipl
 ication kernel using two different fault injection tools. In addition, the
  detection and recovery schemes are highlighted for the HPCCG mini-app.\n\
 nTag: Workshop Reg Pass, Extreme Scale Computing, Fault Tolerance, Reliabi
 lity, Resiliency\n\nRegistration Category: Workshop Reg Pass, Extreme Scal
 e Computing, Fault Tolerance, Reliability, Resiliency
URL:https://sc19.supercomputing.org/presentation/?id=ws_ftxs109&sess=sess1
 31
END:VEVENT
END:VCALENDAR

