Computer Science Department
School of Computer Science, Carnegie Mellon University
Algorithmic Engineering Towards
Distributed key-value systems have been widely used as elemental
components of many Internet-scale services at sites such as Amazon,
Facebook and Twitter. This thesis examines a system design approach
to scale existing key-value systems, both horizontally and vertically, by
carefully engineering and integrating techniques that are grounded in re-
cent theory but also informed by underlying architectures and expected
workloads in practice. As a case study, we re-design FAWN-KV–a distributed key-value cluster consisting of "wimpy" key-value nodes—to use
less memory but achieve higher throughput even in the worst case.
First, to improve the worst-case throughput of a FAWN-KV system, we propose a randomized load balancing scheme that can fully utilize all the nodes regardless of their query distribution. We analytically prove and empirically demonstrate that deploying a very small but ex- tremely fast load balancer at FAWN-KV can effectively prevent uneven or dynamic workloads creating hotspots on individual nodes. Moreover, our analysis provides service designers a mathematically tractable ap- proach to estimate the worst-case throughput and also avoid drastic over- provisioning in similar distributed key-value systems.
Second, to implement the high-speed load balancer and also to improve the space efficiency of individual key-value nodes, we propose novel data structures and algorithms, including the cuckoo filter, a Bloom filter replacement that is high-speed, highly compact and delete-supporting, and optimistic cuckoo hashing, a fast and space-efficient hashing scheme that scales on multiple CPUs. Both algorithms are built upon conventional cuckoo hashing but are optimized for our target architectures and workloads. Using them as building blocks, we design and implement MemC3 to serve transient data from DRAM with high throughput and low-latency retrievals, and SILT to provide cost-effective access to persis- tent data on flash storage with extremely small memory footprint (e.g., 0.7 bytes per entry).