Workshop: Estimation of RTT and Loss Rate of Wide-Area Connections Using MPI Measurements
Abstract: Scientific computations are expected to be increasingly distributed across wide-area networks, and the Message Passing Interface (MPI) has been shown to scale to support their communications over long distances. The execution times of basic MPI operations over long-distance connections reflect the connection length and losses, which applications should account for, for example, by rolling back to a single site under high network loss conditions. We utilize execution time measurements of MPI_Sendrecv operations collected over emulated 10 Gbps connections with 0-366 ms round-trip times (RTTs), wherein the longest connection spans the globe, under periodic losses of up to 20%. We describe five machine learning methods to estimate the connection RTT and loss rate from these MPI execution times. These methods provide disparate estimators of RTT and loss rate: linear and non-linear, smooth and non-smooth. Our results show that accurate estimates can be generated at low loss rates but become inaccurate at loss rates of 10% and higher. Overall, these results constitute a case study of the strengths and limitations of machine learning methods in inferring network-level parameters using application-level measurements.
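As a rough illustration of the simplest kind of estimator mentioned above (a linear, smooth one), the following minimal sketch fits an ordinary least-squares line that maps an MPI_Sendrecv execution time to an RTT estimate. It is not one of the five methods of this work as such. The training pairs are synthetic and generated from a toy model (a fixed overhead plus a fixed number of round trips), and the helper names `fit_line` and `estimate_rtt` are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y ~ intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# ASSUMPTION: synthetic calibration data, not measurements from this work.
# Toy model: execution time = 2 ms overhead + 3 round trips, no losses.
known_rtts_ms = [0.0, 50.0, 100.0, 200.0, 366.0]
exec_times_ms = [2.0 + 3.0 * rtt for rtt in known_rtts_ms]

# Regress RTT on execution time, so the fitted line is the estimator itself.
intercept, slope = fit_line(exec_times_ms, known_rtts_ms)


def estimate_rtt(exec_time_ms):
    """Linear RTT estimate (ms) from one execution time (ms)."""
    return intercept + slope * exec_time_ms


# An execution time of 452 ms corresponds to a 150 ms RTT under the toy model.
print(round(estimate_rtt(452.0), 3))
```

Under losses the relationship between execution time and RTT would no longer be this clean, which is one reason non-linear and non-smooth estimators are also of interest.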