BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20200129T163557Z
LOCATION:405-406-407
DTSTART;TZID=America/Denver:20191122T104400
DTEND;TZID=America/Denver:20191122T105900
UID:submissions.supercomputing.org_SC19_sess129_ws_hpcsysp111@linklings.co
 m
SUMMARY:Monitoring HPC Services with CheckMK
DESCRIPTION:Workshop\n\nMonitoring HPC Services with CheckMK\n\nLea
 ch\, Cass\, Manzi\n\nAdministrative monitoring of a range of HPC sy
 stems can be time-consuming and inefficient\, with many HPC syste
 ms being provided with their own integrated monitoring solutions an
 d an expectation that system managers will monitor each system sepa
 rately. To save staff time and effort\, and to improve the potenti
 al for rapid and effective response to emerging problems where sys
 tems interact\, a "single pane of glass" approach is considered op
 timal. However\, HPC systems typically utilise relatively bouti
 que technology\, which is commonly not monitored by existing out-of
 -the-box monitoring solutions. In this presentation we detail the a
 pplication of CheckMK\, a general use monitoring system\, to HPC sy
 stems using non-commodity hardware and software. We focus on the de
 velopment and use of "check" scripts\, which at EPCC have enabled t
 he System Administrators team to simply and reliably monitor all re
 levant and service-critical aspects of a variety of HPC systems thr
 ough a "single pane of glass" approach.\n\nTag: Workshop Reg Pa
 ss\, Datacenter\, SIGHPC Workshop\, State of the Practice\, Syst
 em Administration\, System Maintenance\, System Reliability\n\nRegi
 stration Category: Workshop Reg Pass\, Datacenter\, SIGHPC Worksho
 p\, State of the Practice\, System Administration\, System Mainten
 ance\, System Reliability
URL:https://sc19.supercomputing.org/presentation/?id=ws_hpcsysp111&sess=se
 ss129
END:VEVENT
END:VCALENDAR

