BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20200129T163559Z
LOCATION:507
DTSTART;TZID=America/Denver:20191117T115000
DTEND;TZID=America/Denver:20191117T121000
UID:submissions.supercomputing.org_SC19_sess108_ws_pawatm110@linklings.com
SUMMARY:Evaluation of Programming Models to Address Load Imbalance on Dist
 ributed Multi-Core CPUs: A Case Study with Block Low-Rank Factorization
DESCRIPTION:Workshop\n\nEvaluation of Programming Models to Address Load I
 mbalance on Distributed Multi-Core CPUs: A Case Study with Block Low-Rank 
 Factorization\n\nPei, Bosilca, Yamazaki, Ida, Dongarra\n\nTo minimize data
  movement, many parallel applications statically distribute computational 
 tasks among the processes.  However, modern simulations often encounter i
 rregular computational tasks.  As a result, load imbalance among the proce
 sses must be dealt with at the programming level.\n\nOne critical applicat
 ion for many domains is the LU factorization of a large dense matrix store
 d in the Block Low-Rank (BLR) format.  Using the low-rank format can signi
 ficantly reduce the cost of factorization in many scientific applications,
  including the boundary element analysis of electrostatic fields.  However
 , the partitioning of the matrix based on the underlying geometry leads to
  matrix blocks of different sizes, and thus to load imbalance among the
  processes at each step of factorization.\n\nWe use BLR LU factorization
  as a test case to s
 tudy the programmability and performance of five different programming app
 roaches: (1) flat MPI, (2) Adaptive MPI (Charm++), (3) MPI + OpenMP, (4) p
 arameterized task graph (PTG), and (5) dynamic task discovery (DTD).  The 
 last two versions use a task-based paradigm to express the algorithm; we r
 ely on the PaRSEC runtime system to execute the tasks.  We first point out
  programming features needed to efficiently solve this category of problem
 s, hinting at possible alternatives to the MPI+X programming paradigm.  We
  then evaluate the programmability of the different approaches.  Finally, 
 we show the performance results on the Intel Haswell-based Bridges system
  and analyze the effectiveness of the implementations in addressing the
  load im
 balance.\n\nTag: Workshop Reg Pass, MPI, Parallel Application Frameworks, 
 Parallel Programming Languages, Libraries, and Models, Scalable Computing\
 n\nRegistration Category: Workshop Reg Pass, MPI, Parallel Application Fra
 meworks, Parallel Programming Languages, Libraries, and Models, Scalable C
 omputing
URL:https://sc19.supercomputing.org/presentation/?id=ws_pawatm110&sess=ses
 s108
END:VEVENT
END:VCALENDAR

