CMU-CS-22-138 Computer Science Department School of Computer Science, Carnegie Mellon University
A Principled Approach to Parallel Job Scheduling Benjamin Berg Ph.D. Thesis August 2022
A wide range of modern computer systems process workloads composed of parallelizable jobs. Data centers, supercomputers, machine learning clusters, distributed computing frameworks, and databases all process jobs which are designed to be parallelized across multiple servers or cores. A job will receive some speedup from being parallelized across additional servers or cores, allowing the job to complete more quickly. However, jobs generally cannot be perfectly parallelized, and receive diminishing returns from being allocated additional cores. Hence, given a fixed number of cores, it is not obvious how to allocate cores across a set of jobs in order to reduce the response times of the jobs – the times from when each job arrives to the system until it is completed. While this question has been considered by the worst-case scheduling community, existing theoretical results are hampered by strong lower bounds on worst-case performance and tend to suggest policies which do not work well in practice. Meanwhile, state-of-the-art systems employ simple heuristic-based policies that leave significant room for improvement. The goal of this thesis is to develop and analyze policies for scheduling parallelizable jobs which perform well both in theory and in practice. Our approach in developing new scheduling policies for parallelizable jobs is threefold. First, we develop new stochastic models of parallelizable jobs running in a multicore system. Second, we analyze these new models using the tools of stochastic performance modeling, showing that a stochastic style of analysis emits scheduling policies which provably outperform the policies suggested by the worst-case literature. Finally, we validate our theoretical models through simulation to show that the scheduling policies we derive work well in practice. We consider the case where job sizes are completely unknown to the system, the case where job sizes are perfectly known to the system, and the case where some jobs are known to be larger than others on average. Similarly, we consider the case where all jobs are assumed to have the same level of parallelizability, the case where some jobs are more parallelizable than others, and the case where a job's parallelizability can change over time.
164 pages
Thesis Committee:
Srinivasan Seshan, Head, Computer Science Department
| |
Return to:
SCS Technical Report Collection This page maintained by reports@cs.cmu.edu |