CMU-CS-22-154
Computer Science Department
School of Computer Science, Carnegie Mellon University




Building a More Efficient Cache Hierarchy by
Taking Advantage of Related Instances of Objects

Ziqi Wang

Ph.D. Thesis

January 2023

CMU-CS-22-154.pdf


Keywords: Computer Architecture, Cache Hierarchy, Transactional Memory, NVM, Cache Compression, Memory Management

As the capacity of the cache hierarchy keeps scaling to match the increasing number of cores and growing working set size, the design of the cache hierarchy has remained relatively static, which has become an obstacle in adapting to new hardware devices and software paradigms. Specifically, we identify two major issues with today's cache hierarchy design. First, multiversioning support is lacking, which prevents efficient implementations of newer hardware paradigms, such as Hardware Transactional Memory (HTM), and blocks efficient usage of Byte-Addressable Non-Volatile Memory (NVM). Second, the hierarchy only provides a rigid load-store interface without the capability of leveraging runtime application-level information. As a result, the hierarchy is unable to react to runtime dynamic software behavior and misses opportunities for optimization based on such information.

In this dissertation, we demonstrate that by taking advantage of multiple instances of related objects, we can address both limitations of the existing hierarchy and improve the performance and usability of the system. To validate this statement, we detail four case studies. In the first two case studies, we present OverlayTM and NVOverlay. Both designs implement a multiversioned cache hierarchy based on Page Overlays. They support a special "Overlay-on-Write" operation that resembles conventional Copy-on-Write, but creates new cache blocks, which we call "versions", directly in the private cache. OverlayTM is a Hardware Transactional Memory design that enables efficient multi-thread synchronization by allowing concurrent writers to the same address to create their own private versions without interfering with each other. Furthermore, concurrent readers and writers also execute conflict-free, because readers are directed to the version that constitutes a consistent read snapshot among the several possible versions in the hierarchy. The resulting design greatly reduces transaction abort rates and execution cycles compared with a single-version HTM, showing a 30%–90% reduction in both.
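The snapshot-read idea behind multiversioning can be illustrated with a small software analogy. This is only a sketch under stated assumptions, not the OverlayTM hardware design: the class `VersionedStore`, its methods, and the integer timestamps are hypothetical names introduced here for illustration. Each write creates a new version instead of overwriting the old one, and a reader sees the newest version that is not newer than its snapshot.

```python
from bisect import bisect_right


class VersionedStore:
    """Toy multiversion store (hypothetical, software-only analogy)."""

    def __init__(self):
        # address -> (sorted list of timestamps, values in the same order)
        self._versions = {}

    def write(self, addr, value, ts):
        """Create a new version of addr at timestamp ts; old versions are kept."""
        stamps, values = self._versions.setdefault(addr, ([], []))
        i = bisect_right(stamps, ts)
        stamps.insert(i, ts)
        values.insert(i, value)

    def read(self, addr, snapshot_ts):
        """Return the newest version with timestamp <= snapshot_ts, or None."""
        stamps, values = self._versions.get(addr, ([], []))
        i = bisect_right(stamps, snapshot_ts)
        return values[i - 1] if i else None


store = VersionedStore()
store.write(0x40, "old", ts=1)
store.write(0x40, "new", ts=5)
```

In this toy model, a reader whose snapshot was taken at timestamp 3 still reads "old" after the later write, so the reader and the writer never conflict on the address.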

NVOverlay further demonstrates the benefits of multiversioning by extending the multiversioning domain to persistent data on Byte-Addressable Non-Volatile Memory (NVM). NVOverlay implements a memory snapshotting design that captures incremental memory modifications within a time interval. NVOverlay generates incremental memory snapshot data with Overlay-on-Write and gradually writes back snapshot data to the NVM for persistence. Snapshot data is managed on the NVM with a series of mapping tables and can be accessed conveniently for failure recovery or other purposes. Our evaluation shows that NVOverlay minimizes the latency overhead of snapshotting by overlapping most of the operations with execution, and that its multiversioning design considerably reduces NVM write traffic compared with logging.
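Incremental snapshotting can be sketched in software as follows. This is a simplified illustration, not NVOverlay's mechanism: the class `IncrementalSnapshots`, the notion of an integer epoch, and the use of Python dictionaries in place of NVM mapping tables are all assumptions made for the example. Each epoch records only the addresses modified during that interval, and recovery rebuilds a memory image by applying the recorded deltas in order.

```python
class IncrementalSnapshots:
    """Toy incremental snapshot log (hypothetical, software-only analogy)."""

    def __init__(self):
        self.epoch = 0
        self.current = {}    # addr -> value written during the current epoch
        self.persisted = []  # list of (epoch, delta) in increasing epoch order

    def write(self, addr, value):
        """Record a modification in the current epoch's delta."""
        self.current[addr] = value

    def end_epoch(self):
        """Persist only what changed in this epoch, then start a new one."""
        self.persisted.append((self.epoch, dict(self.current)))
        self.current.clear()
        self.epoch += 1

    def recover(self, epoch):
        """Rebuild the memory image as of the end of the given epoch."""
        image = {}
        for e, delta in self.persisted:
            if e > epoch:
                break
            image.update(delta)
        return image


snaps = IncrementalSnapshots()
snaps.write(1, "a")
snaps.write(2, "b")
snaps.end_epoch()   # epoch 0 holds {1: "a", 2: "b"}
snaps.write(1, "c")
snaps.end_epoch()   # epoch 1 holds only {1: "c"}
```

The second epoch stores one entry rather than a full copy of memory, which is the sense in which the snapshot data is incremental.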

In the third and fourth case studies, we present MBC and Memento. Both designs leverage application-level information on instances of related objects to optimize performance. MBC performs inter-block cache compression on the last-level cache (LLC). MBC leverages the insight that blocks of similar contents often exhibit a "stepped" (spatially strided) pattern on a page and compresses these blocks together for a higher compression ratio. To identify the per-page stepped pattern, MBC enables application software to pass a "step size" attribute via virtual memory system calls. The attribute propagates through the cache hierarchy, where the cache controller eventually uses it to group blocks for compression. Compared with similar works on inter-block cache compression, MBC greatly simplifies the compression hardware while achieving a comparable or even better compression ratio.
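One way to picture the stepped pattern is to group the block indices of a page by their position within each stride. The sketch below is an illustration only and does not reproduce MBC's hardware: the function `group_by_step`, the modulo-based grouping rule, and the default of 64 blocks per page (a 4 KiB page with 64-byte blocks) are assumptions introduced for the example.

```python
BLOCKS_PER_PAGE = 64  # assumption: 4 KiB page divided into 64-byte blocks


def group_by_step(step, blocks_per_page=BLOCKS_PER_PAGE):
    """Group block indices so that blocks 'step' apart land in the same group.

    Hypothetical helper: blocks i, i + step, i + 2 * step, ... are assumed to
    hold similar contents and are therefore candidates for joint compression.
    """
    if step <= 0:
        raise ValueError("step must be a positive number of blocks")
    groups = [[] for _ in range(min(step, blocks_per_page))]
    for idx in range(blocks_per_page):
        groups[idx % step].append(idx)
    return groups


# Small page of 8 blocks with a step size of 4 blocks.
example = group_by_step(4, blocks_per_page=8)
```

With a step size of 4 on an 8-block page, blocks 0 and 4 fall into one group, blocks 1 and 5 into the next, and so on.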

Memento optimizes object memory allocation in Serverless Computing by offloading high-level allocation primitives to hardware. Memento leverages the semantics of user-space and OS kernel memory management functions and implements in hardware those that can be executed in parallel with the application, thus overlapping the latency of memory management with execution. On average, Memento reduces main memory traffic by 22% and speeds up function execution by 14%.

174 pages

Thesis Committee:
Todd C. Mowry (Co-Chair)
Dimitrios Skarlatos (Co-Chair)
Nathan Beckmann
Mike Kozuch (CMU/Intel Labs)
Gennady Pekhimenko (University of Toronto)

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

