CMU-CS-13-124
Computer Science Department
School of Computer Science, Carnegie Mellon University



Program-Centric Cost Models
for Locality and Parallelism

Harsha Vardhan Simhadri

September 2013

Ph.D. Thesis

CMU-CS-13-124.pdf


Keywords: Parallelism, Locality, Cost Models, Metrics, Schedulers, Parallel Algorithms, Shared Memory, Cache Hierarchies, Performance of Parallel Programs, Programming Interfaces, Run-time Systems, Parallel Cache Complexity, Effective Cache Complexity, PCC Framework, Parallelizability


Good locality is critical for the scalability of parallel computations. Many cost models that quantify the locality and parallelism of a computation with respect to specific machine models have been proposed. A significant drawback of these machine-centric cost models is their lack of portability. Since designing and analyzing good algorithms in most machine-centric cost models is a non-trivial task, this lack of portability can waste significant design effort. A machine-independent, portable cost model for locality and parallelism that is relevant to a broad class of machines would therefore be a valuable guide both for designing portable, scalable algorithms and for understanding the complexity of problems.

This thesis addresses the problem of portable analysis by presenting program-centric metrics for measuring the locality and parallelism of nested-parallel programs written for shared memory machines: metrics based solely on the program structure, without reference to machine parameters such as processors, caches, and interconnects. The metrics we present for this purpose are the parallel cache complexity (Q*) for quantifying locality, and the effective cache complexity (Q̂_α) for quantifying both locality and the cost of load balancing, for a parameter α ≥ 0. Together, these two program-centric metrics constitute the parallel cache complexity (PCC) framework.
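
To convey the shape of the second metric, the following is a simplified paraphrase of how Q̂_α charges a parallel block b = t_1 || ... || t_k for load imbalance, where w(·) denotes work; it is a sketch of the recurrence's structure, not the thesis's exact definition, which also threads cache parameters through the recursion:

    \hat{Q}_{\alpha}(b) \;=\; \max\Big\{ \sum_{i=1}^{k} \hat{Q}_{\alpha}(t_i),\;\; w(b)^{\alpha} \cdot \max_{i} \frac{\hat{Q}_{\alpha}(t_i)}{w(t_i)^{\alpha}} \Big\}

When every subtask's cost is proportional to its work, the first term dominates and Q̂_α behaves like Q*; when a small subtask carries a disproportionate share of the cost, the second term scales that cost density up to the work of the entire block, penalizing imbalance more heavily as α grows.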

We show that the PCC framework is relevant by constructing a class of provably good schedulers that map programs onto the parallel memory hierarchy (PMH) model, a machine model represented by a tree of caches. We prove optimal bounds on the communication costs of our schedulers in terms of Q* and on their running times in terms of Q̂_α. For many algorithmic problems, it is possible to design algorithms for which Q̂_α = O(Q*) over a range of values of α > 0. The least upper bound on the values of α for which this asymptotic relation holds yields a new notion of the parallelizability of an algorithm that subsumes previous notions such as the ratio of work to depth. For such algorithms, our schedulers are asymptotically optimal at load-balancing cache misses across a hierarchy with parallelism β < α, which results in asymptotically optimal running times. We also prove bounds on the space requirements of dynamically allocated computations in terms of Q̂_α. Because the PMH model is a good approximation of a broad class of machines, our experimental results demonstrate that program-centric metrics can capture the cost of parallel computations on shared memory machines.
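
In this notation, the parallelizability of an algorithm on inputs of size n can be stated directly (α* is our shorthand for the least upper bound described above):

    \alpha^{*} \;=\; \sup\,\{\, \alpha \ge 0 \;:\; \hat{Q}_{\alpha}(n) = O(Q^{*}(n)) \,\}

and the schedulers' optimality guarantees apply on hierarchies whose parallelism β satisfies β < α*.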

To show that the PCC framework is useful, we present optimal parallel algorithms for common building blocks such as Prefix Sums, Sorting, and List Ranking. Our designs exploit the observation that algorithms with low depth (preferably polylogarithmic depth, as in NC algorithms), reasonable "regularity", and optimal sequential cache complexity in the ideal cache model are also optimal according to the effective cache complexity metric; an illustration follows below. We present results indicating that these algorithms also scale well in practice.
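
To illustrate this design recipe, the sketch below is our own minimal example rather than code from the thesis; the name prefix_sums is hypothetical, and OpenMP loops stand in for the nested-parallel primitives the thesis assumes. It computes exclusive prefix sums in the contract/expand style, with O(n) work, O(log n) depth given parallel loops, and O(n/B) sequential cache complexity in the ideal cache model:

    #include <cstddef>
    #include <vector>

    // Illustrative sketch (not thesis code).
    // Exclusive prefix sums: out[i] = a[0] + ... + a[i-1].
    void prefix_sums(const std::vector<long>& a, std::vector<long>& out) {
        std::size_t n = a.size();
        out.assign(n, 0);
        if (n <= 1) return;

        // Contract: pair up adjacent elements (one parallel round).
        std::vector<long> sums(n / 2);
        #pragma omp parallel for
        for (std::size_t i = 0; i < n / 2; ++i)
            sums[i] = a[2 * i] + a[2 * i + 1];

        // Recurse on an array of half the size: O(log n) levels in all.
        std::vector<long> rec;
        prefix_sums(sums, rec);

        // Expand: even positions come straight from the recursion;
        // odd positions add back one original element.
        #pragma omp parallel for
        for (std::size_t i = 0; i < n / 2; ++i) {
            out[2 * i] = rec[i];
            out[2 * i + 1] = rec[i] + a[2 * i];
        }
        if (n % 2 == 1)  // odd-length tail
            out[n - 1] = out[n - 2] + a[n - 2];
    }

The contract/expand structure touches each element a constant number of times in scan order while halving the problem at each level, which is exactly the low-depth, cache-efficient combination the paragraph identifies.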

Emphasizing the separation between algorithms and schedulers, we built a framework for empirically studying schedulers and engineered a practical variant of our schedulers for the PMH model. A comparison of our scheduler against the widely used work-stealing scheduler on a 32-core machine with one level of shared cache shows a 25-50% reduction in shared-cache misses for most benchmarks. We present measurements showing the extent to which this translates into running-time improvements for memory-intensive benchmarks at different bandwidths.

166 pages


