SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Spread-n-Share: Improving Application Performance and Cluster Throughput with Resource-Aware Job Placement

Authors: Xiongchao Tang (Tsinghua University, China; Sangfor Technologies Inc.), Haojie Wang (Tsinghua University, China), Xiaosong Ma (Qatar Computing Research Institute), Nosayba El-Sayed (Emory University), Jidong Zhai (Tsinghua University, China), Wenguang Chen (Tsinghua University, China), Ashraf Aboulnaga (Qatar Computing Research Institute)

Abstract: Traditional batch job schedulers adopt the Compact-and-Exclusive (CE) strategy, packing processes of a parallel job into as few compute nodes as possible. While CE minimizes inter-node network communication, it often leads to self-contention among tasks of a resource-intensive application. Recent studies have used virtual containers to balance CPU utilization and memory capacity across physical nodes, but the unbalanced use of memory bandwidth and the shared last-level cache is still under-investigated.

In this work, we propose Spread-n-Share (SNS), a batch scheduling strategy that automatically scales resource-bound applications out onto more nodes to alleviate their performance bottleneck and co-locates jobs in a resource-compatible manner. We implement Uberun, a prototype scheduler, to validate SNS, considering memory bandwidth and LLC capacity as two types of performance-critical shared resources. Experimental results show that SNS improves the overall system throughput by 19.8% on average over CE, while achieving an average individual job speedup of 1.8%.
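To make the idea concrete, the following is a minimal illustrative sketch of resource-aware placement, not the Uberun implementation. It assumes each job declares a per-process memory-bandwidth and LLC demand, and that a node can host processes only while its remaining cores, bandwidth, and cache suffice. The names `Node`, `Job`, `procs_per_node_limit`, and `place`, and the greedy first-fit policy, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    cores: int      # free cores
    mem_bw: float   # free memory bandwidth (GB/s), assumed known
    llc: float      # free last-level cache (MB), assumed partitionable


@dataclass
class Job:
    procs: int
    bw_per_proc: float   # assumed per-process bandwidth demand (GB/s)
    llc_per_proc: float  # assumed per-process LLC demand (MB)


def procs_per_node_limit(job: Job, node: Node) -> int:
    """Largest number of this job's processes the node can still host."""
    limits = [node.cores]
    if job.bw_per_proc > 0:
        limits.append(int(node.mem_bw // job.bw_per_proc))
    if job.llc_per_proc > 0:
        limits.append(int(node.llc // job.llc_per_proc))
    return min(limits)


def place(job: Job, nodes: list[Node]):
    """Greedy first-fit: spread a job across nodes until every process fits.

    Returns the per-node process counts, or None if the job cannot be
    placed (in which case no node is modified).
    """
    remaining = job.procs
    plan = [0] * len(nodes)
    for i, node in enumerate(nodes):
        if remaining == 0:
            break
        k = min(remaining, procs_per_node_limit(job, node))
        plan[i] = k
        remaining -= k
    if remaining:
        return None
    # Commit the plan by reserving resources on each node.
    for i, k in enumerate(plan):
        nodes[i].cores -= k
        nodes[i].mem_bw -= k * job.bw_per_proc
        nodes[i].llc -= k * job.llc_per_proc
    return plan
```

In this toy model, a bandwidth-bound job is spread over more nodes than a purely core-based packing would use, and a later compute-bound job can fill the cores left idle on those same nodes.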
