Poster 40: Performance, Portability, and Productivity for Data-Parallel Computations on Multi- and Many-Core Architectures
Time: Thursday, 21 November 2019, 1:50pm - 2:10pm
Description: This thesis presents an approach to performance, portability, and productivity for data-parallel computations on multi- and many-core architectures, e.g., Intel CPUs and NVIDIA GPUs. We introduce the algebraic formalism of Multi-Dimensional Homomorphisms (MDHs), a class of functions that covers important data-parallel computations, e.g., linear algebra routines (BLAS) and stencil computations. For MDHs, we propose a Domain-Specific Language (DSL), based on patterns of parallelism (a.k.a. algorithmic skeletons), in which MDH functions can be expressed conveniently. We introduce a code generation approach for our DSL that automatically generates optimized program code for MDHs, targeting multi- and many-core architectures. Our code generation approach relies on OpenCL, an emerging de-facto standard for uniformly programming parallel architectures such as CPUs and GPUs. A major feature of our generated code is that it targets OpenCL’s abstract device models (rather than a particular architecture): it is parameterized in the performance-critical parameters of these abstract models (e.g., the number of threads and the size of tiles). Our code generation approach thereby enables both high performance and performance portability: for any given combination of MDH function, architecture, and input size, we fully automatically optimize the generated code by choosing (auto-tuning) optimized values of its performance-critical parameters with our own Auto-Tuning Framework (ATF). Our experimental results on CPU and GPU demonstrate competitive and often significantly better performance of our MDH+ATF approach compared to the currently best-performing competitors, e.g., Intel MKL/MKL-DNN, NVIDIA cuBLAS/cuDNN, and Facebook’s Tensor Comprehensions framework.
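To make the homomorphism idea concrete, the following is a minimal Python sketch, not the DSL or the generated OpenCL code described above. It assumes matrix-vector multiplication as the example and uses hypothetical helper names (`matvec`, `matvec_split_rows`, `matvec_split_cols`). It illustrates the property that makes such computations decomposable: the input can be split along either dimension, the parts computed independently, and the partial results recombined with a per-dimension operator (concatenation for rows, point-wise addition for the reduction dimension). The `cut` argument plays the role of a tile-size-like parameter that a tuner could vary without changing the result.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Reference matrix-vector product: out[i] = sum_k m[i][k] * v[k]."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matvec_split_rows(m: Matrix, v: List[float], cut: int) -> List[float]:
    """Split along the row dimension.

    The two partial results are combined by concatenation.
    """
    return matvec(m[:cut], v) + matvec(m[cut:], v)


def matvec_split_cols(m: Matrix, v: List[float], cut: int) -> List[float]:
    """Split along the column (reduction) dimension.

    The two partial results are combined by point-wise addition.
    """
    left = matvec([row[:cut] for row in m], v[:cut])
    right = matvec([row[cut:] for row in m], v[cut:])
    return [x + y for x, y in zip(left, right)]


# Illustrative input (made up for this sketch).
m = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
]
v = [1.0, 0.0, 2.0, 1.0]

reference = matvec(m, v)

# Every split position along either dimension yields the same result,
# so the split position is free to be chosen for performance.
rows_ok = all(matvec_split_rows(m, v, c) == reference for c in range(len(m) + 1))
cols_ok = all(matvec_split_cols(m, v, c) == reference for c in range(len(v) + 1))
print(reference, rows_ok, cols_ok)
```

Because every split position produces the same result, parameters of this kind can be selected purely on performance grounds, which is the kind of choice the abstract attributes to auto-tuning.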