Workshop: Scalable Machine Learning with OpenSHMEM
Abstract: Deep convolutional neural networks (DNNs) have had a significant and lasting impact across the computing industry. Training these large networks is computationally intensive, so it is often parallelized to shorten training times that could otherwise range from days to weeks. The Message Passing Interface (MPI) communication model is commonly used for the data exchange and synchronization that parallel DNN training requires. We observe that OpenSHMEM supports many of the same communication operations as MPI, in particular the all-reduce operation needed for data parallelism, and that OpenSHMEM may further offer a unique solution for fine-grained model-parallel computation. In this work, we present an initial evaluation of OpenSHMEM’s suitability for DNN training and compare its performance with that of MPI. Results indicate that OpenSHMEM's data-parallel performance is comparable to that of MPI. The use of OpenSHMEM to support model parallelism will be explored in future work.
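To illustrate the role all-reduce plays in data parallelism, the following is a minimal single-process sketch in plain Python. It makes no OpenSHMEM or MPI calls and is not the implementation evaluated in this work; the function names `all_reduce_sum` and `data_parallel_step` are hypothetical, and the "workers" are simply list entries. In a real run, the reduction would be a collective provided by the communication library.

```python
from typing import List


def all_reduce_sum(per_worker: List[List[float]]) -> List[List[float]]:
    """Simulate a sum all-reduce: every worker receives the element-wise sum."""
    total = [sum(values) for values in zip(*per_worker)]
    # Each worker gets its own copy of the same reduced vector.
    return [list(total) for _ in per_worker]


def data_parallel_step(
    weights: List[float], local_grads: List[List[float]], lr: float
) -> List[List[float]]:
    """One SGD step in which each worker holds a replica of the weights.

    Each worker computes a gradient on its own data shard (local_grads),
    the gradients are summed with an all-reduce, and every worker applies
    the same averaged update, so the replicas stay identical.
    """
    n_workers = len(local_grads)
    reduced = all_reduce_sum(local_grads)
    return [
        [w - lr * g / n_workers for w, g in zip(weights, grads)]
        for grads in reduced
    ]


if __name__ == "__main__":
    # Two simulated workers with different local gradients (made-up values).
    replicas = data_parallel_step([1.0, 2.0], [[1.0, 2.0], [3.0, 4.0]], lr=0.5)
    print(replicas)
```

Because every worker applies the same reduced gradient, the weight replicas remain in sync after each step, which is why the cost of the all-reduce matters so much for data-parallel training.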