Computer Science Department
School of Computer Science, Carnegie Mellon University
Space-efficient Scheduling for Parallel, Multithreaded Computations
This dissertation presents two asynchronous scheduling algorithms that provide worst-case upper bounds on the space and time requirements of high-level, nested-parallel programs on shared memory machines. In particular, for a program with D depth and a serial space requirement of S1, both algorithms guarantee a space bound of S1 + O(K . p . D) on p processors. Here, K is a user-controllable runtime parameter, which specifies the amount of memory a thread may allocate before being preempted by the scheduler. Typically, in practice, K is fixed to be a few thousand bytes. Most parallel programs have a small depth D. For such programs, the above space bound is lower than the space bound provided by any previously implemented system.
The first of the two scheduling algorithms presented in this dissertation, algorithm AsyncDF, prioritizes threads at runtime by their serial, depth-first execution order, and preempts threads before they allocate too much memory. This ensures that the parallel execution does not require too much more memory than the serial execution. The second scheduling algorithm, DFDeques, enhances algorithm AsyncDF by adding ideas from work stealing. It replaces the single, flat priority queue of AsyncDF with ordered, per-processor queues of ready threads, and allows some deviation from the depth-first priorities while scheduling threads. Consequently, it results in lower scheduling overheads and better locality than AsyncDF for finer-grained threads, at the cost of a slight increase in memory requirement. To ensure scalability with the number of processors, I also describe and analyze fully parallelized versions of the schedulers for both the algorithms.
To verify that the theoretically-efficient scheduling algorithms are also practical, I have implemented them in the context of two multithreaded runtime systems, including a commercial Pthreads package. Parallel benchmarks used to evaluate the schedulers include numerical codes, physical simulations and a data classifier. Experimental results indicate that my algorithms effectively reduce memory usage compared to previous techniques, without compromising time performance. In particular, my schedulers allow simple, high-level, fine-grained benchmarks to run as efficiently as their more complex, hand-partitioned, coarse-grained counterparts. As expected, DFDeques achieves better speedups than AsyncDF for finer-grained threads. It requires more memory than AsyncDF, but less memory compared to previous schedulers. The scheduling algorithms were extended to support the full Pthreads interface, making them available to a large class of applications with irregular and dynamic parallelism. Both the AsyncDF and DFDeques algorithms provide a user-adjustable trade-off between running time and memory requirement, which I analyze and experimentally demonstrate.