SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Poster 8: Mitigating Communication Bottlenecks in MPI-Based Distributed Learning

Student: Abdullah B. Nauman (Ward Melville High School, Brookhaven National Laboratory)
Supervisor: Abid M. Malik (Brookhaven National Laboratory)

Abstract: Current commercial and scientific facilities generate and maintain vast amounts of complex data. While machine learning (ML) techniques can provide crucial insight, developing these models is often impractical on a single process. Distributed learning techniques mitigate this problem; however, current models contain significant performance bottlenecks. Here, we conduct a detailed performance analysis of MPI_Learn, a widespread distributed ML framework for high-energy physics (HEP) applications, on the Summit supercomputer, by training a network to classify simulated collision events from high-energy particle detectors at the CERN Large Hadron Collider (LHC).

We conclude that these bottlenecks occur as a result of increasing communication time between the different processes, and to mitigate the bottlenecks we propose the implementation of a new distributed algorithm for stochastic gradient descent (SGD). We provide a proof of concept by demonstrating better scalability with results on 250 GPUs, and with hyperparameter optimization, show a ten-fold decrease in training time.

ACM-SRC Semi-Finalist: no

Poster: PDF
Poster Summary: PDF

Back to Poster Archive Listing