BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20200129T163556Z
LOCATION:503-504
DTSTART;TZID=America/Denver:20191122T094300
DTEND;TZID=America/Denver:20191122T100000
UID:submissions.supercomputing.org_SC19_sess130_ws_ipdrm108@linklings.com
SUMMARY:Leveraging Network-Level Parallelism with Multiple Process-Endpoin
 ts for MPI Broadcast
DESCRIPTION:Workshop\n\nLeveraging Network-Level Parallelism with Multiple
  Process-Endpoints for MPI Broadcast\n\nRuhela, Ramesh, Chakraborty, Subra
 moni, Hashmi...\n\nThe Message Passing Interface has been the dominant p
 rogramming model for developing scalable and high-performance parallel app
 lications. Collective operations empower group communication operations in
  a portable and efficient manner and are used by a large number of applic
 ations across different domains. Optimization of collective operations is 
 the key to achieving good performance speed-ups and portability. Broadcas
 t o
 r One-to-all communication is one of the most commonly used collectives in
  MPI applications. However, the existing algorithms for broadcast do not e
 ffectively utilize the high degree of parallelism and increased message ra
 te capabilities offered by modern architectures. In this paper, we address
  these challenges and propose a Scalable Multi-Endpoint broadcast algorith
 m that combines hierarchical communication with multiple endpoints per nod
 e for high performance and scalability. We evaluate the proposed algorithm
  against state-of-the-art designs in other MPI libraries, including MVAPIC
 H2, Intel MPI, and Spectrum MPI. We demonstrate the benefits of the propos
 ed algorithm at benchmark and application level at scale on four different
  hardware architectures, including Intel Cascade Lake, Intel Skylake, AMD 
 EPYC, and IBM POWER9, and with InfiniBand and Omni-Path interconnects. Com
 pared to other state-of-the-art designs, our proposed design shows up to 2
 .5 times performance improvements at a microbenchmark level with 128 nodes
 . We also observe up to 37% improvement in broadcast communication latency
  for the SPECMPI scientific applications.\n\nTag: Workshop Reg Pass, Comp
 il
 er Analysis and Optimization, Middleware, Parallel Programming Languages, 
 Libraries, and Models, Runtime Systems\n\nRegistration Category: Workshop 
 Reg Pass, Compiler Analysis and Optimization, Middleware, Parallel Program
 ming Languages, Libraries, and Models, Runtime Systems
URL:https://sc19.supercomputing.org/presentation/?id=ws_ipdrm108&sess=sess
 130
END:VEVENT
END:VCALENDAR

