CMU-CS-19-107
Computer Science Department
School of Computer Science, Carnegie Mellon University



Distribution-based cluster scheduling

Jun Woo Park

Ph.D. Thesis

May 2019

CMU-CS-19-107.pdf


Keywords: Planning under uncertainty, Cluster Scheduling, Cloud Computing

Modern computing clusters support a mixture of diverse activities, ranging from customer-facing internet services and software development and testing to scientific research and exploratory data analytics. Many schedulers exploit knowledge of pending jobs' runtimes and resource usage as a powerful building block, but suffer significant performance penalties when that knowledge is imperfect. This dissertation demonstrates that schedulers that rely on information about job runtimes and resource usage can address imperfect predictions more robustly by considering the likelihoods of possible outcomes rather than single-point expected outcomes.

This dissertation presents a workload analysis and two case studies of scheduling systems: 3Sigma and DistSched. Characterization of real workloads reveals inherent variability in job runtimes and resource usage that cannot be captured by single-point estimates. An evaluation of a history-based runtime predictor on four different traces demonstrates that it is not trivial to obtain perfect runtime predictions in real workloads, especially if the predictor is provided with insufficient information. 3Sigma is a scheduler that leverages distributions of the relevant runtime histories rather than just a point estimate derived from them. By leveraging distributions and mis-estimate mitigation mechanisms, 3Sigma makes more robust scheduling decisions and outperforms state-of-the-art scheduling systems that rely on only limited or no runtime knowledge. DistSched is a scheduler that leverages distributions of resource usage (CPU, memory, and CPU time) and accounts for the risk of contention to make robust scheduling decisions. The evaluation of DistSched demonstrates that leveraging full history and mitigation mechanisms allows the scheduler to address imperfect predictions more robustly and perform almost as well as a hypothetical system equipped with perfect knowledge of runtime and resource usage.
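
As a rough illustration of the difference between a single-point estimate and a distribution-based view, the following minimal Python sketch compares the two on a small runtime history. It is not the algorithm used by 3Sigma or DistSched; the function names, the history values, and the time budget are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def point_estimate(history):
    """Single-point runtime estimate: the mean of past runtimes."""
    return mean(history)


def prob_finish_within(history, budget):
    """Empirical probability that a job finishes within `budget` seconds,
    treating the runtime history as the distribution of possible outcomes."""
    return sum(1 for runtime in history if runtime <= budget) / len(history)


# Hypothetical runtimes (seconds) of similar past jobs:
# mostly short, with occasional long runs.
history = [10, 12, 11, 9, 10, 60, 11, 10, 58, 9]
budget = 25  # hypothetical time remaining before the job's deadline

estimate = point_estimate(history)
meets_by_point = estimate <= budget            # the point estimate says "fits"
p_on_time = prob_finish_within(history, budget)  # the distribution shows the risk

print(f"point estimate: {estimate} s, fits budget: {meets_by_point}")
print(f"empirical probability of finishing in time: {p_on_time:.2f}")
```

In this made-up history the mean runtime fits within the budget, yet two of the ten past runs would have exceeded it. A scheduler that sees only the mean cannot tell this job apart from one that always takes the mean runtime, whereas the empirical distribution exposes the risk of a miss.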

108 pages

Thesis Committee:
Gregory R. Ganger (Chair)
Phillip B. Gibbons
George Amvrosiadis
Michael Kozuch (Intel Labs)

Srinivasan Seshan, Head, Computer Science Department
Tom M. Mitchell, Interim Dean, School of Computer Science

