Presentation

Poster 43: Efficient and Scalable Communication Middleware for Emerging Dense-GPU Clusters

Session

Doctoral Showcase Posters Display

Author

Ching-Hsiang Chu

Advisor

Dhabaleswar K. Panda

Event Type

Doctoral Showcase

Posters

Time

Thursday, 21 November 2019, 8:30am - 5pm

Location

E Concourse

Description

In the post-Moore's-law era, traditional CPUs can no longer keep pace with the computing power demanded by modern compute-intensive, highly parallelizable applications. In this context, accelerator architectures such as general-purpose graphics processing units (GPUs), equipped with high-bandwidth memory (HBM) and massively parallel streaming multiprocessors, have been widely adopted in high-performance computing (HPC) and cloud systems to significantly accelerate numerous scientific and emerging machine/deep learning applications. The Message Passing Interface (MPI), the standard programming model for parallel applications, has been widely used for GPU communication. However, state-of-the-art MPI libraries optimize GPU communication only by leveraging advanced technologies such as Remote Direct Memory Access (RDMA) and do not fully utilize the computational power of GPUs. In this work, we propose GPU-enabled communication schemes that harness GPU computational resources and cutting-edge interconnects such as NVIDIA NVLink for communication operations on emerging heterogeneous systems. Three primary MPI operations are addressed. First, intelligent communication scheduling, efficient packing/unpacking, and packing-free schemes are proposed to accelerate non-contiguous data transfer in scientific HPC applications. Second, scalable broadcast operations are presented that leverage the low-level hardware multicast feature to speed up GPU communication at scale. Finally, we design topology-aware, link-efficient, and cooperative GPU kernels to significantly accelerate the All-reduce operation, which is the primary performance bottleneck in deep learning applications. The proposed designs demonstrate significant performance improvements over state-of-the-art communication schemes for various HPC and deep learning applications.
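To make the packing/unpacking step for non-contiguous data more concrete, the following is a minimal CPU-side sketch in Python. It is not the poster's GPU implementation. It only illustrates what packing means for a strided layout of the kind described by an MPI vector datatype (`count` blocks of `blocklen` elements, spaced `stride` elements apart). The function names `pack_strided` and `unpack_strided` and the `offset` parameter are hypothetical, introduced here for illustration.

```python
def pack_strided(buf, offset, count, blocklen, stride):
    """Gather `count` blocks of `blocklen` elements, spaced `stride`
    elements apart and starting at `offset`, into a contiguous list.
    (Hypothetical helper; on a GPU the block copies are independent
    and could be performed by many threads in parallel.)"""
    packed = []
    for i in range(count):
        start = offset + i * stride
        packed.extend(buf[start:start + blocklen])
    return packed


def unpack_strided(packed, buf, offset, count, blocklen, stride):
    """Scatter a contiguous `packed` list back into the strided
    positions of `buf` (the inverse of pack_strided)."""
    for i in range(count):
        start = offset + i * stride
        buf[start:start + blocklen] = packed[i * blocklen:(i + 1) * blocklen]


# Example: send column 1 of a 4x4 row-major matrix.
matrix = list(range(16))                    # rows: [0..3], [4..7], [8..11], [12..15]
column = pack_strided(matrix, offset=1, count=4, blocklen=1, stride=4)
# column == [1, 5, 9, 13]

# The receiver unpacks the contiguous message into its own strided layout.
recv = [0] * 16
unpack_strided(column, recv, offset=1, count=4, blocklen=1, stride=4)
```

The packed buffer can be sent as one contiguous message, at the cost of an extra copy on each side. A packing-free scheme, as named in the description, would avoid that intermediate copy.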