SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Breaking the Old Rule that HPL Goes with the Pace of the Slowest Node


Authors: Dmitry Nemirov (Intel Corporation), Andrey Naraikin (Intel Corporation), Kazushige Goto (Intel Corporation)

Abstract: Modern HPC clusters are becoming increasingly heterogeneous, both explicitly and implicitly. The Top500 race is becoming more challenging, and heterogeneous tests are a new trend. We will present examples of using the Intel distribution for HPL on various real-life Top500-class machines, for example, a system with two significantly different types of compute nodes, each subset using a different interconnect. We will also show simpler cases where only the compute nodes differ in configuration. In all of these cases it is possible to achieve a combined HPL score. This session has showcase and interactive parts and will be of interest to anyone who runs HPL frequently.

Long Description: Modern HPC clusters are becoming increasingly heterogeneous, both explicitly and implicitly. First, heterogeneity comes from differing HW/SW configurations of nodes and even of the fabric, arising, for example, from a phased system delivery approach or from a desire to evaluate, or use in production, different configurations for varying workloads. Second, heterogeneity is increasing even on systems that at first glance appear completely homogeneous: for example, the sustained core frequency of same-SKU Intel Xeon nodes may vary by up to 20% under heavy AVX* load.

Regardless of the nature of the heterogeneity, there are scenarios where application performance needs to be maximized across different parts of the machine working together. For high-end machines, one popular example is running High Performance Linpack (e.g., to get onto the Top500 list or to take a higher position on it). The Intel distribution for the HPL benchmark has heterogeneous support. We will present examples of its use on various real-life Top500-class machines, for example, a system with two significantly different types of compute nodes, each subset using a different communication fabric. We will also show simpler cases where only the compute nodes differ in configuration. In all of these cases it is possible to achieve a combined HPL score close to the sum of the scores of the individual homogeneous components. For “homogeneous” systems with high per-node performance variability, we present cases in which the overall cluster result was improved by up to 11% with heterogeneous HPL, breaking the old rule that Linpack goes at the pace of the slowest node. Finally, we will share some thoughts on how a similar approach may be used to improve the performance of real-life applications on such systems.
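As a rough illustration of why heterogeneous work splitting helps, the following minimal sketch (in Python, with hypothetical per-node rates; the numbers and the model are assumptions for illustration, not measured data or Intel's implementation) compares the classic equal split, where the run finishes at the pace of the slowest node, against a speed-proportional split whose combined score approaches the sum of the per-node rates:

    # Hypothetical per-node HPL rates in TFLOP/s (assumed values, not measurements).
    node_rates = [2.0, 1.95, 1.9, 1.85, 1.7]

    # Classic homogeneous HPL: every node gets an equal share of the matrix,
    # so the whole run finishes at the pace of the slowest node.
    homogeneous = len(node_rates) * min(node_rates)

    # Heterogeneous HPL: work is divided in proportion to per-node speed,
    # so the combined score approaches the sum of the individual rates.
    heterogeneous = sum(node_rates)

    print(f"equal split:        {homogeneous:.2f} TFLOP/s")
    print(f"proportional split: {heterogeneous:.2f} TFLOP/s")
    print(f"gain:               {heterogeneous / homogeneous - 1:.1%}")

With these hypothetical numbers the gain is about 10.6%, on the order of the up-to-11% improvements described above; the actual splitting inside the benchmark is more involved, since HPL's per-process workload changes as the factorization proceeds.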
