Supervisor: Abid M. Malik (Brookhaven National Laboratory)
Abstract: Current commercial and scientific facilities generate and maintain vast amounts of complex data. While machine learning (ML) techniques can provide crucial insight, training these models is often impractical on a single process. Distributed learning techniques mitigate this problem; however, current implementations contain significant performance bottlenecks. Here, we conduct a detailed performance analysis of MPI_Learn, a widely used distributed ML framework for high-energy physics (HEP) applications, on the Summit supercomputer, by training a network to classify simulated collision events from high-energy particle detectors at the CERN Large Hadron Collider (LHC).
We conclude that these bottlenecks arise from increasing communication time among processes, and to mitigate them we propose a new distributed algorithm for stochastic gradient descent (SGD). We provide a proof of concept by demonstrating improved scalability on 250 GPUs and, with hyperparameter optimization, a ten-fold decrease in training time.
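The communication bottleneck described above can be illustrated with a minimal sketch of synchronous data-parallel SGD: each worker computes a gradient on its own data shard, and the gradients are averaged in an allreduce-style step before the shared weights are updated. This is a generic illustration, not the specific algorithm proposed in the poster; the least-squares loss, worker count, and learning rate here are illustrative assumptions, and real MPI workers are simulated with a simple loop.

```python
import numpy as np

def sgd_allreduce_step(w, shards, lr=0.1):
    """One synchronous data-parallel SGD step over simulated workers."""
    # Each "worker" computes the local gradient of a least-squares loss
    # on its own data shard (X_i, y_i).
    grads = [X.T @ (X @ w - y) / len(y) for X, y in shards]
    # Allreduce-style averaging: on a real cluster this is one collective
    # (e.g. MPI_Allreduce), whose cost grows with the process count --
    # the source of the communication bottleneck at scale.
    g = sum(grads) / len(grads)
    return w - lr * g

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])          # illustrative target weights
shards = []
for _ in range(4):                      # four simulated workers
    X = rng.normal(size=(64, 2))
    shards.append((X, X @ w_true))      # noiseless linear targets

w = np.zeros(2)
for _ in range(200):
    w = sgd_allreduce_step(w, shards)
```

In this synchronous scheme every update waits for the collective, so per-step communication time rises with the number of processes; asynchronous or gossip-based SGD variants trade some update consistency to relax exactly this synchronization point.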
ACM-SRC Semi-Finalist: no