CMU-CS-16-106
Computer Science Department
School of Computer Science, Carnegie Mellon University



CMU-CS-16-106

Simple DRAM and Virtual Memory Abstractions
to Enable Highly EXcient Memory Subsystems

Vivek Seshadri

April 2016

Ph.D. Thesis

CMU-CS-16-106.pdf


Keywords: Efficiency, Memory Subsystem, Virtual Memory, DRAM, Data Movement, Processing in Memory, Fine-grained Memory Management

In most modern systems, the memory subsystem is managed and accessed at multiple different granularities at various resources. The software stack typically accesses data at a word granularity (typically 4 or 8 bytes). The on-chip caches store data at a cache line granularity (typically 64 bytes). The commodity off-chip memory interface is optimized to fetch data from main memory at a cache line granularity. The main memory capacity itself is managed at a page granularity using virtual memory (typically 4KB pages with support for larger super pages). The off-chip commodity DRAM architecture internally operates at a row granularity (typically 8KB). In this thesis, we observe that this curse of multiple granularities results in signifficant inefficiency in the memory subsystem.

We identify three specific problems. First, page-granularity virtual memory unnecessarily triggers large memory operations. For instance, with the widely-used copy-on-write technique, even a single byte update to a virtual page results in a full 4KB copy operation. Second, with existing off-chip memory interfaces, to perform any operation, the processor must first read the source data into the on-chip caches and write the result back to main memory. For bulk data operations, this model results in a large amount of data transfer back and forth on the main memory channel. Existing systems are particularly inefficient for bulk operations that do not require any computation (e.g., data copy or initialization). Third, for operations that do not exhibit good spatial locality, e.g., non-unit strided access patterns, existing cache-line-optimized memory subsystems unnecessarily fetch values that are not required by the application over the memory channel and store them in the on-chip cache. All these problems result in high latency, and high (and often unnecessary) memory bandwidth and energy consumption.

To address these problems, we present a series of techniques in this thesis. First, to address the inefficiency of existing page-granularity virtual memory systems, we propose a new framework called page overlays. At a high level, our framework augments the existing virtual memory framework with the ability to track a new version of a subset of cache lines within each virtual page. We show that this simple extension is very powerful by demonstrating its benefits on a number of different applications.

Second, we show that the analog operation of DRAM can perform more complex operations than just store data. When combined with the row granularity operation of commodity DRAM, we can perform these complex operations efficiently in bulk. Specifically, we propose RowClone, a mechanism to perform bulk data copy and initialization operations completely inside DRAM, and Buddy RAM, a mechanism to perform bulk bitwise logical operations using DRAM. Both these techniques achieve an order-of-magnitude improvement in performance and energy-efficiency of the respective operations.

Third, to improve the performance of non-unit strided access patterns, we propose Gather-Scatter DRAM (GS-DRAM), a technique that exploits the module organization of commodity DRAM to effectively gather or scatter values with any power-of-2 strided access pattern. For these access patterns, GS-DRAM achieves near-ideal bandwidth and cache utilization, without increasing the latency of fetching data from memory. Finally, to improve the performance of the protocol to maintain the coherence of dirty cache blocks, we propose the Dirty-Block Index (DBI), a new way of tracking dirty blocks in the on-chip caches. In addition to improving the efficiency of bulk data coherence, DBI has several applications, including high-performance memory scheduling, efficient cache lookup bypassing, and enabling heterogeneous ECC for on-chip caches.

141 pages

Thesis Committee:
Todd C. Mowry (Co-Chair)
Onur Mutlu (Co-Chair)
David Andersen
Phillip B. Gibbons
Rajeev Balusubramonian (University of Utah)

Frank Pfenning, Head, Computer Science Department
Andrew W. Moore, Dean, School of Computer Science



Return to: SCS Technical Report Collection
School of Computer Science

This page maintained by reports@cs.cmu.edu