BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20200129T163557Z
LOCATION:702
DTSTART;TZID=America/Denver:20191118T143000
DTEND;TZID=America/Denver:20191118T150000
UID:submissions.supercomputing.org_SC19_sess127_ws_waccpd102@linklings.com
SUMMARY:Performance of the RI-MP2 Fortran Kernel of GAMESS on GPUs via Dir
 ective-Based Offloading with Math Libraries
DESCRIPTION:Workshop\n\nPerformance of the RI-MP2 Fortran Kernel of GAMESS
  on GPUs via Directive-Based Offloading with Math Libraries\n\nKwack, Bert
 oni, Pham, Larkin\n\nThe US Department of Energy (DOE) started operating t
 wo GPU-based pre-exascale supercomputers in 2018 and plans to deploy anoth
 er pre-exascale system in 2020 and three exascale supercomputers in 2021/2
 022. A
 ll of the systems are GPU-enabled systems, and they plan to provide optimi
 zed vendor-promoted programming models for their GPUs such as CUDA, HIP an
 d SYCL. However, due to their limited functional portability, it is challe
 nging for HPC application developers to maintain their applications in an 
 efficient and effective way with good productivity across all US DOE pre-e
 xascale/exascale systems. Directive-based programming models for accelerat
 ors can be one of the solutions for HPC applications on the DOE supercompu
 ters. In this study, we employ OpenMP and OpenACC offloading models to por
 t and re-implement the RI-MP2 Fortran kernel of the GAMESS application on 
 a pre-exascale GPU system, Summit. We compare and evaluate the performance
  of the re-structured offloading kernels with the original OpenMP threadin
 g kernel. We also evaluate the performance of multiple math libraries on t
 he Nvidia V100 GPU in the RI-MP2 kernel. Using the optimized directive-bas
 ed offloading implementations, the RI-MP2 kernel on a single V100 GPU beco
 mes more than 7 times faster than on dual-socket Power9 processors, which 
 is near the theoretical speed-up based on peak performance ratios. MPI + d
 irective-based offloading implementations of the RI-MP2 kernel perform mor
 e than 40 times faster than an MPI + OpenMP threading implementation on th
 e same number of Summit nodes. This study demonstrates how directive-base
 d offloading implementations can perform near what we expect based on mac
 hin
 e peak ratios.\n\nTag: Workshop Reg Pass, Accelerators, Parallel Applicati
 on Frameworks, Parallel Programming Languages, Libraries, and Models, Scie
 ntific Computing, Software Engineering\n\nRegistration Category: Workshop 
 Reg Pass, Accelerators, Parallel Application Frameworks, Parallel Programm
 ing Languages, Libraries, and Models, Scientific Computing, Software Engin
 eering
URL:https://sc19.supercomputing.org/presentation/?id=ws_waccpd102&sess=ses
 s127
END:VEVENT
END:VCALENDAR