Computer Science Department
School of Computer Science, Carnegie Mellon University
Scheduling Ireegular Parallel Computations on Hierarchical Caches
Guy E. Blelloch, Phillip B. Gibbons*,
Making efficient use of cache hierarchies is essential for achieving good performance on multicore and other shared-memory parallel machines. Unfortunately, designing algorithms for complicated cache hierarchies can be difficult and tedious. To address this, recent work has developed high-level models that expose locality in a manner that is oblivious to particular cache or processor organizations, placing the burden of making effective use of a parallel machine on a runtime task scheduler rather than the algorithm designer/programmer.
This paper continues this line of work by (i) developing a new model for parallel cache cost, (ii) developing a task scheduler for irregular tasks on cache hierarchies, and (iii) proving that the scheduler assigns tasks to processors in a work-efficient manner (including cache costs) relative to the model. As with many previous models, our model allows algorithms to be analyzed using a single level of cache with parameters M (cache size) and B (cache-line size), and algorithms can be written cache obliviously (with no choices made based on machine parameters). Unlike previous models, our cost Qα(n; M, B), for problem size n, captures costs due to work-space imbalance among tasks, and we prove a lower bound that shows that some sort of penalty is needed to achieve work efficiency. Nevertheless, for many algorithms, Qα() is asymptotically equal to the standard sequential cache cost Q().
Our task scheduler is a specific "space-bounded scheduler," which assigns subtasks to caches based on their space usage. Our scheduler extends prior work by efficiently scheduling "irregular" computations with arbitrary work imbalance among parallel subtasks, reflected by the Qα cost. Moreover, in addition to proving bounds on cache complexity, we also provide a bound on the total running time of an program execution using our scheduler. Specifically, our scheduler executes a program on a homogeneous h-level parallel memory hierarchy having p processors in time O( (vh /p)Σhi=o Qα(n; Mi, B) · Ci ), where Mi is the size of the level-i cache, B is the cache-line size, Ci is the cost of level-i cache miss, and vh is an overhead defined in the paper. Ignoring the overhead vh, which may be small (or even constant) when certain side conditions hold, this bound is optimal whenever Qα matches the sequential cache complexity–at every level of the hierarchy it is dividing the total cache cost by the total number of processors.
*Intel Labs Pittsburgh