BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20200129T163557Z
LOCATION:405-406-407
DTSTART;TZID=America/Denver:20191122T093000
DTEND;TZID=America/Denver:20191122T094500
UID:submissions.supercomputing.org_SC19_sess129_ws_hpcsysp102@linklings.co
 m
SUMMARY:Decoupling OpenHPC Critical Services
DESCRIPTION:Workshop\n\nDecoupling OpenHPC Critical Services\n\nChappell, 
 Chitre, Gazula, Pike, Griffioen\n\nHigh-Performance Computing (HPC) cluste
 r-management software often consolidates cluster-management functiona
 lity into a centralized management node, using it to provision the compute
  nodes, manage users, and schedule jobs. A consequence of this design
  is that the management node must typically be up and operating 
 correctly for the cluster to schedule and continue executing jobs.
  This dependency discourages administrators from upgrading the
  management node because the entire cluster may need to be taken down
  during the upgrade. Administrators may even avoid performing minor u
 pdates to the management node for fear that an update error could bri
 ng the cluster down.\n\nTo address this problem, we redesigned the
  structure of management nodes, specifically OpenHPC’s System Manage
 ment Server (SMS), breaking it into components that allow portions of
  the SMS to be taken down and upgraded without interrupting the rest 
 of the cluster. Our approach separates the time-critical SMS tasks fr
 om tasks that can be delayed, allowing us to keep a relatively small
  number of time-critical tasks running while bringing down critical
  portions of the SMS for long periods of time to apply OpenHPC upgr
 ades, update applications, and perform acceptance tests on the new sy
 stem.\n\nWe implemented and deployed our solution on the Universit
 y of Kentucky’s HPC cluster, and it has already helped avoid do
 wntime from an SMS failure. It also allows us to reduce or completel
 y eliminate our regularly scheduled maintenance windows.\n\nTag: Work
 shop Reg Pass, Datacenter, SIGHPC Workshop, State of the Practice, System 
 Administration, System Maintenance, System Reliability\n\nRegistration Cat
 egory: Workshop Reg Pass, Datacenter, SIGHPC Workshop, State of the Practi
 ce, System Administration, System Maintenance, System Reliability
URL:https://sc19.supercomputing.org/presentation/?id=ws_hpcsysp102&sess=se
 ss129
END:VEVENT
END:VCALENDAR

