Scaling TensorFlow, PyTorch, and MXNet Using MVAPICH2 for High-Performance Deep Learning on Frontera
Time: Sunday, 17 November 2019, 4:30pm - 5pm
Description: Frontera is the largest NSF-funded cluster in the US and comprises 8,008 nodes equipped with the latest Intel Xeon processors (Cascade Lake). In this paper, we explore the potential of Frontera for training state-of-the-art Deep Learning (DL) models at scale. Most DL studies present performance data from large-scale clusters equipped with NVIDIA GPUs. However, our earlier performance characterization studies have helped us achieve comparable performance on CPU-only clusters as well. Based on this, we configure three important DL frameworks: 1) TensorFlow, 2) PyTorch, and 3) MXNet, using Horovod and two Message Passing Interface (MPI) libraries on Frontera: 1) MVAPICH2 and 2) Intel MPI. We provide a systematic performance comparison for TensorFlow using MVAPICH2 and Intel MPI on 2,048 Frontera nodes. Using a four-processes-per-node configuration, we observe near-linear scaling for ResNet-50 training with TensorFlow up to 8,192 MPI processes (on 2,048 nodes), offering a sustained performance of 250,000 images/second. In addition, we provide insights into processes-per-node and batch-size configurations for TensorFlow as well as for PyTorch and MXNet. Based on single-node performance behavior, we scale all three DL frameworks up to 1,024 processes (256 nodes) for various models such as ResNet-50/101/152 and Inception-v3/v4.
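As a rough illustration of the setup described above, a Horovod-based training script is typically launched with `mpirun` from the chosen MPI library, pinning a fixed number of processes per node. The sketch below is an assumption, not taken from the paper: the script name `tf_resnet50_benchmark.py` and its flags are hypothetical placeholders, and the exact MVAPICH2 tuning variables used on Frontera are not given in the abstract.

```shell
# Hypothetical launch of a Horovod/TensorFlow ResNet-50 benchmark on 2,048 nodes
# with 4 processes per node (8,192 MPI processes total), using an MVAPICH2 mpirun.
# Script name and flags are illustrative, not from the paper.
mpirun -np 8192 -ppn 4 \
    python tf_resnet50_benchmark.py \
        --model resnet50 \
        --batch-size 64        # per-process batch size; a tuning knob the paper studies
```

With Intel MPI the same script would be launched analogously (e.g. `mpirun -np 8192 -perhost 4 ...`), which is what makes the MVAPICH2-vs-Intel-MPI comparison a drop-in swap of the launcher and library rather than a change to the training code.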