Supervisor: Dhabaleswar Panda (Ohio State University)
Abstract: The lack of low-overhead and scalable monitoring tools has prevented a comprehensive study of the efficiency and utilization of emerging NVLink-enabled GPU clusters. We address this by proposing and designing an in-depth, real-time analysis, profiling, and visualization tool for high-performance NVLink-enabled GPU clusters, built on top of OSU INAM. The proposed tool presents a unified and holistic view of MPI-level and fabric-level information for these clusters. It also provides insights into the efficiency and utilization of the underlying interconnects for different communication patterns. We also designed low-overhead, scalable modules that discover the fabric topology and gather fabric metrics by using different levels of threading, bulk insertions and deletions for storage, and parallel components for fabric discovery and port metric inquiry.
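The collection strategy named in the abstract (parallel port metric inquiry combined with bulk insertions and deletions) can be illustrated with a minimal Python sketch. This is not the OSU INAM implementation: the `query_port` function, the `port_metrics` table, and the synthetic counter values are hypothetical stand-ins chosen only to show the general pattern.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def query_port(port_id):
    # Hypothetical stand-in for a fabric counter query; returns synthetic
    # (port, transmitted bytes, received bytes) values, not real metrics.
    return (port_id, port_id * 100, port_id * 50)


def collect_metrics(port_ids, workers=4):
    # Query ports in parallel; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(query_port, port_ids))


def store_bulk(conn, rows):
    # Insert all rows with one executemany call inside a single transaction,
    # rather than committing once per row.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS port_metrics "
        "(port INTEGER, xmit INTEGER, rcv INTEGER)"
    )
    with conn:
        conn.executemany("INSERT INTO port_metrics VALUES (?, ?, ?)", rows)


def purge_bulk(conn, port_ids):
    # Delete the rows for many ports in one transaction.
    with conn:
        conn.executemany(
            "DELETE FROM port_metrics WHERE port = ?",
            [(p,) for p in port_ids],
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_bulk(conn, collect_metrics(range(8)))
    purge_bulk(conn, [0, 1])
    print(conn.execute("SELECT COUNT(*) FROM port_metrics").fetchone()[0])
```

Batching the writes into one transaction avoids a commit per row, which is the usual reason bulk operations reduce storage overhead; the thread pool plays the role of the parallel inquiry components under the assumptions stated above.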
ACM-SRC Semi-Finalist: no
Poster Summary: PDF