BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20200129T163557Z
LOCATION:702
DTSTART;TZID=America/Denver:20191118T120000
DTEND;TZID=America/Denver:20191118T123000
UID:submissions.supercomputing.org_SC19_sess127_ws_waccpd107@linklings.com
SUMMARY:Evaluation of Directive-Based GPU Programming Models on a Block Ei
 gensolver with Consideration of Large Sparse Matrices
DESCRIPTION:Workshop\n\nEvaluation of Directive-Based GPU Programming Mode
 ls on a Block Eigensolver with Consideration of Large Sparse Matrices\n\nR
 abbi, Daley, Aktulga, Wright\n\nAchieving high performance and performance
  portability for large-scale scientific applications is a major challenge 
 on heterogeneous computing systems such as many-core CPUs and accelerators
  like GPUs. In this work, we implement a widely used block eigensolver, Lo
 cally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two 
 popular directive-based programming models (OpenMP and OpenACC) for GPU-ac
 celerated systems. Our work differs from existing work in that it adopts a
  holistic approach that optimizes the full solver performance rather than 
 narrowing the problem into small kernels (e.g., SpMM, SpMV). Our LOBPCG GP
 U implementation achieves a 2.8x - 4.3x speedup over an optimized CPU impl
 ementation when tested with four different input matrices. The evaluated c
 onfiguration compared one Skylake CPU to one Skylake CPU and one NVIDIA V1
 00 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations gave nearly iden
 tical performance. We also consider how to create an efficient LOBPCG solv
 er that can solve problems larger than GPU memory capacity. To this end, w
 e create microbenchmarks representing the two dominant kernels (inner prod
 uct and SpMM kernel) in LOBPCG and then evaluate performance when using tw
 o different programming approaches: tiling the kernels, and using Unified 
 Memory with the original kernels. Our tiled SpMM implementation achieves a
  2.9x and 48.2x speedup over the Unified Memory implementation on supercom
 puters with PCIe Gen3 and NVLink 2.0 CPU-to-GPU interconnects, respectivel
 y.\n\nTag: Workshop Reg Pass, Accelerators, Parallel Application Framework
 s, Parallel Programming Languages, Libraries, and Models, Scientific Compu
 ting, Software Engineering\n\nRegistration Category: Workshop Reg Pass, Ac
 celerators, Parallel Application Frameworks, Parallel Programming Language
 s, Libraries, and Models, Scientific Computing, Software Engineering
URL:https://sc19.supercomputing.org/presentation/?id=ws_waccpd107&sess=ses
 s127
END:VEVENT
END:VCALENDAR