Workshop: Generic Matrix Multiplication for Multi-GPU Accelerated Distributed-Memory Platforms Over PaRSEC
Abstract: This paper introduces a generic and flexible matrix-matrix multiplication algorithm $C = A \times B$ for state-of-the-art computing platforms. Typically, these platforms are distributed-memory machines whose nodes are equipped with several accelerators (e.g., 6 GPUs per node for Summit). To the best of our knowledge, SLATE is the only library that provides a publicly available implementation on such platforms, and it is currently limited to problem instances where the $C$ matrix fits entirely in the memory of the GPU accelerators. Our algorithm relies on the classical tile-based outer-product algorithm, but enhances it with several control dependences to increase data re-use and to optimize the communication flow to and from the accelerators within each node. The algorithm is written within the PaRSEC runtime system, which allows for a fast and generic implementation, while achieving close-to-peak performance for a large variety of situations.
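
For readers unfamiliar with the classical tile-based outer-product ordering that the abstract builds on, the following is a minimal single-process sketch in pure Python. It only illustrates the loop structure: the outermost loop walks the shared dimension, and each step applies one block update to every tile of $C$. It is not the PaRSEC implementation; the function name `tiled_outer_product_gemm` and the `tile` parameter are hypothetical, and distribution across nodes, GPU offload, and the control dependences mentioned above are deliberately left out.

```python
def tiled_outer_product_gemm(A, B, tile):
    """Compute C = A x B using a tile-based outer-product loop order.

    A is an m-by-k list of rows, B is a k-by-n list of rows.
    `tile` is a hypothetical tile edge length chosen for illustration.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]

    # Outer loop over the shared dimension: each kk step is one
    # block outer product (a rank-`tile` update) applied to all of C.
    for kk in range(0, k, tile):
        for ii in range(0, m, tile):
            for jj in range(0, n, tile):
                # Tile-level update: C[ii, jj] += A[ii, kk] * B[kk, jj].
                # min(...) handles edge tiles when sizes are not multiples of `tile`.
                for i in range(ii, min(ii + tile, m)):
                    for j in range(jj, min(jj + tile, n)):
                        acc = C[i][j]
                        for p in range(kk, min(kk + tile, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(tiled_outer_product_gemm(A, B, tile=2))
```

In this ordering every tile of $C$ is touched once per step of the shared dimension, which is why keeping $C$ resident close to the compute units, and controlling when tiles of $A$ and $B$ move, matters for data re-use.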