Presentation

DescriptionDistributed deep learning using a large mini-batch is a key technology to accelerate training in deep learning. However, it is difficult to achieve a high scalability and maintain validation accuracy in distributed learning on large clusters. We introduce two optimizations, reducing the computation time and overlapping the communication with the computation. By applying the techniques and using 2,048 GPUs, we achieved the world's fastest ResNet-50 training in MLPerf, which is a de facto standard DNN benchmark (as of July 2019).