Computer Science Department
School of Computer Science, Carnegie Mellon University
Chip Multiprocessors for Server Workloads
We stand on the cusp of the giga-scale era of chip integration. Technological advancements in semiconductor fabrication yield ever-smaller and faster devices, enabling billion-transistor chips with multi-gigahertz clock frequencies. To utilize the abundant on-chip transistors, modern processors pack an exponentially increasing number of cores, multi-megabyte caches, and large interconnects on chip to facilitate intra-chip data transfers. However, the growing on-chip resources do not directly translate into a commensurate increase in performance. Rather, they come at the cost of increased on-chip data access latency, while thermal considerations and pin constraints limit the parallelism that a multicore chip can support.
To mitigate the increasing on-chip data access latency, cache blocks should be placed close to the cores that use them. We observe that cache access patterns can be classified at run time into distinct classes with different on-chip block placement requirements. Based on this observation, we propose Reactive NUCA (R-NUCA), a distributed cache design that reacts to the class of each access and places blocks close to the requesting cores. We then explore the design space of physically constrained multicore processors, and find that future multicores should use low-operational-power transistors even for time-critical components (e.g., cores) to ease the power wall, should employ novel on-chip block placement techniques to utilize large caches efficiently, and can rely on techniques like 3D-stacked memory to mitigate the off-chip bandwidth constraint even for peak-performance designs. Moving forward, we find that heterogeneous multicores hold great promise for improving designs even further.
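To illustrate the idea of reacting to access classes, the following is a minimal sketch, assuming three run-time classes (instructions, private data, and shared data) distinguished by access type and by how many cores have touched a block, with each class mapped to a different placement policy. The class names, data structures, and policy details here are illustrative assumptions, not the actual R-NUCA hardware mechanism.

```python
INSTRUCTION, PRIVATE_DATA, SHARED_DATA = "instruction", "private", "shared"

class AccessClassifier:
    """Classifies cache accesses at run time (illustrative sketch)."""

    def __init__(self):
        # Track, per block, which cores have accessed it so far.
        self.accessors = {}

    def classify(self, block, core, is_instruction):
        """Return the access class, updating the per-block accessor set."""
        if is_instruction:
            # Instruction fetches: a class that benefits from replication
            # near groups of cores.
            return INSTRUCTION
        seen = self.accessors.setdefault(block, set())
        seen.add(core)
        if len(seen) == 1:
            # Only one core has touched this block: treat as private data,
            # placed at the requesting core's local cache slice.
            return PRIVATE_DATA
        # Multiple cores share the block: spread it across the chip.
        return SHARED_DATA

def placement_slice(access_class, block, core, n_slices):
    """Pick a cache slice according to the class's placement policy."""
    if access_class == PRIVATE_DATA:
        return core % n_slices          # requester's local slice
    if access_class == SHARED_DATA:
        return hash(block) % n_slices   # address-interleave across all slices
    return core % n_slices              # instructions: a nearby slice
```

In this sketch, a block first seen by one core is placed locally for low latency; once a second core accesses it, the classifier reacts and the block is treated as shared, interleaved across slices to avoid coherence and capacity hot spots.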