Abstract: Efficient manipulation of sparse matrices is critical to a wide range of HPC applications. We study one common operation, Sparse Matrix Multi-Vector Multiplication (SpMM), and evaluate the impact of sparsity, the distribution of non-zero elements, and tile-traversal strategy on GPU implementations. Using these insights, we determine that operating on sparse matrices in the tiled-DCSR format is well-suited to the parallel, warp-synchronous execution model of GPUs.
Preprocessing or storing a sparse matrix in the tiled-DCSR format, however, often requires significantly more storage than conventional CSR or CSC formats. Because SpMM kernels are often bottlenecked on DRAM bandwidth, the resulting increase in DRAM traffic can cause a slowdown for many matrices.
This work enhances a GPU's last-level cache/memory controller unit to act as a dynamic translator between the compute-optimized representation of data (tiled-DCSR) and its corresponding storage/bandwidth-optimized format (CSC). Our approach achieves 2.26x better performance on average compared to cuSPARSE.
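For readers unfamiliar with the formats the abstract contrasts, the following sketch illustrates standard CSR storage and the SpMM operation itself (a sparse matrix multiplied by a dense multi-vector). This is background material in plain Python, not code from the paper; the function names are hypothetical.

```python
# Illustrative sketch (not from the paper): CSR storage and a naive SpMM,
# i.e. sparse matrix x dense multi-vector, in pure Python.

def dense_to_csr(A):
    """Convert a dense row-major matrix (list of lists) to CSR arrays:
    non-zero values, their column indices, and per-row offsets."""
    values, col_idx, row_ptr = [], [], [0]
    for row in A:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row's non-zeros
    return values, col_idx, row_ptr

def spmm(values, col_idx, row_ptr, B):
    """Multiply a CSR matrix by a dense multi-vector B (list of rows)."""
    n_rows = len(row_ptr) - 1
    n_cols = len(B[0])
    C = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # Iterate only over row i's non-zeros.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            v, j = values[k], col_idx[k]
            for c in range(n_cols):
                C[i][c] += v * B[j][c]
    return C

A = [[1, 0, 2],
     [0, 0, 3],
     [4, 5, 0]]
B = [[1, 1],
     [2, 0],
     [0, 1]]
print(spmm(*dense_to_csr(A), B))  # [[1, 3], [0, 3], [14, 4]]
```

DCSR additionally compresses the row-pointer array by storing offsets only for non-empty rows, and the tiled variant applies this per tile, which is what drives the storage overhead the abstract describes.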