CMU-CS-13-124
Computer Science Department, School of Computer Science, Carnegie Mellon University

Program-Centric Cost Models
Harsha Vardhan Simhadri
September 2013
Ph.D. Thesis
This thesis addresses the problem of portable analysis by presenting program-centric metrics for measuring the locality and parallelism of nested-parallel programs written for shared-memory machines: metrics based solely on the program structure, without reference to machine parameters such as processors, caches, and connections. The metrics we present for this purpose are the parallel cache complexity (Q*) for quantifying locality, and the effective cache complexity (Q̂α) for quantifying both locality and the cost of load balancing, for a parameter α ≥ 0. Together, these two program-centric metrics constitute the parallel cache complexity (PCC) framework.

We show that the PCC framework is relevant by constructing a class of provably good schedulers that map programs to the parallel memory hierarchy (PMH) model, represented by a tree of caches. We prove optimal bounds on the communication costs of our schedulers in terms of Q* and on their running times in terms of Q̂α. For many algorithmic problems, it is possible to design algorithms for which Q̂α = O(Q*) for a range of values of α > 0. The least upper bound on the values of α for which this asymptotic relation holds yields a new notion of the parallelizability of an algorithm, one that subsumes previous notions such as the ratio of work to depth. For such algorithms, our schedulers are asymptotically optimal at load-balancing cache misses across a hierarchy with parallelism β < α, which results in asymptotically optimal running times. We also prove bounds on the space requirements of dynamically allocated computations in terms of Q̂α. Since the PMH machine model is a good approximation for a broad class of machines, our experimental results demonstrate that program-centric metrics can capture the cost of parallel computations on shared-memory machines.

To show that the PCC framework is useful, we present optimal parallel algorithms for common building blocks such as Prefix Sums, Sorting, and List Ranking. Our designs use the observation that algorithms with low depth (preferably polylogarithmic depth, as in NC algorithms), reasonable "regularity", and optimal sequential cache complexity on the ideal-cache model are also optimal according to the effective cache complexity metric. We present results indicating that these algorithms also scale well in practice.

Emphasizing the separation between algorithms and schedulers, we built a framework for the empirical study of schedulers and engineered a practical variant of our schedulers for the PMH model. A comparison of our scheduler against the widely used work-stealing scheduler on a 32-core machine with one level of shared cache shows 25-50% fewer misses on the shared cache for most benchmarks. We present measurements showing the extent to which this translates into improved running times for memory-intensive benchmarks at different bandwidths.
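The abstract's design recipe (low depth, regularity, and optimal sequential cache complexity together yield optimal effective cache complexity) can be made concrete with prefix sums. The following C++ sketch is illustrative only, not the thesis's implementation: it is the standard contraction-based exclusive scan, with the two marked loops written sequentially where a nested-parallel runtime would use parallel-for constructs.

    #include <cstdio>
    #include <vector>

    // A minimal sketch (not the thesis's code) of contraction-based
    // exclusive prefix sums. Each level makes two passes over
    // contiguous arrays and recurses on an input of half the size,
    // so the sequential cache complexity is roughly O(n/B). Both
    // loops below are independent across iterations; as parallel-for
    // loops they give the recursion polylogarithmic depth.
    std::vector<long> exclusive_scan(const std::vector<long>& a) {
        size_t n = a.size();
        std::vector<long> out(n);
        if (n <= 1) { if (n == 1) out[0] = 0; return out; }

        // Contraction: sum adjacent pairs (a parallel-for in the model).
        size_t m = n / 2;
        std::vector<long> pairs(m);
        for (size_t i = 0; i < m; i++) pairs[i] = a[2*i] + a[2*i + 1];

        // One recursive call on an input of half the size.
        std::vector<long> ps = exclusive_scan(pairs);

        // Expansion: interleave the recursive result (also a parallel-for).
        for (size_t i = 0; i < 2*m; i++)
            out[i] = (i % 2 == 0) ? ps[i/2] : ps[i/2] + a[i-1];
        if (n % 2 == 1)                    // stray last element when n is odd
            out[n-1] = ps[m-1] + pairs[m-1];
        return out;
    }

    int main() {
        std::vector<long> a = {3, 1, 4, 1, 5, 9, 2, 6};
        std::vector<long> s = exclusive_scan(a);
        for (long x : s) std::printf("%ld ", x);   // 0 3 4 8 9 14 23 25
        std::printf("\n");
    }

Intuitively, because the two recursive subproblems and all loop iterations have equal size, the computation is balanced in the sense the Q̂α metric rewards, which is why scans of this shape can achieve Q̂α = O(Q*) for constant α > 0.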
166 pages