SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Abstract: The massive scale of high performance computing machines necessitates the use of automatic statistical methods to assist human operators in monitoring day-to-day behavior. The largest high performance computing machines today produce on the order of terabytes of monitoring information each day. We specifically address the problem of identifying problematic compute jobs running on these massive machines by modeling the computer-generated text logs known as system logs, which record all activities on the machine in near-natural language form. We apply techniques from relational learning and human language technology, along with systems domain knowledge, to extract features from system logs produced by approximately 10,000 high performance computing jobs. We then evaluate the usefulness of these features by training a random forest model to predict job outcome (completion, failure, timeout, or node failure) in real time. We compare our models to a baseline that mimics state-of-the-art human operator behavior, and find that the best-performing feature set is one that combines domain knowledge with simple aggregate numerical content and temporal metrics. We find that in the average case, our method can predict job outcomes with an F1 score approaching 0.9 after a job has been running for only 30 minutes, giving an average lead time of 3 hours before failure during which a human operator could take mitigating action. This work is a proof of concept suggesting that a production tool could be developed to raise early alerts based on job outcome predictions.
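The pipeline the abstract describes — aggregate numerical and temporal features extracted from system logs, fed to a random forest that predicts one of four job outcomes — can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the feature names, synthetic data, and scikit-learn usage are assumptions standing in for the paper's actual feature extraction and training setup.

```python
# Illustrative sketch (hypothetical features, synthetic labels): train a
# random forest to predict HPC job outcomes from simple aggregate log
# features, as described in the abstract.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_jobs = 1000

# Hypothetical per-job aggregates over the first 30 minutes of system logs:
# [message count, error-keyword count, mean inter-message gap, node count]
X = rng.random((n_jobs, 4))

# The four outcomes named in the abstract. Synthetic labels are loosely
# tied to the error-keyword feature so the model has some signal to learn.
outcomes = ["failed", "timeout", "node_failure"]
y = np.where(X[:, 1] > 0.7, rng.choice(outcomes, n_jobs), "completed")

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)

pred = clf.predict(X_te)
score = f1_score(y_te, pred, average="weighted")
print(f"weighted F1: {score:.2f}")
```

In a real deployment, the features would be recomputed as the job runs, so the classifier can be queried at any point (e.g., the 30-minute mark) to raise an early alert well before the job actually fails.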

Presented at: The 3rd Industry/University Joint International Workshop on Data-Center Automation, Analytics, and Control (DAAC)
