CMU-CS-21-143
Computer Science Department
School of Computer Science, Carnegie Mellon University




Towards Elastic and Resilient In-Network Computing

Daehyeok Kim

Ph.D. Thesis

November 2021

CMU-CS-21-143.pdf


Keywords: In-network computing, Programmable networks, Programmable networking hardware, Programmable data planes, Resource elasticity, Fault resilience

Recent advances in programmable networking hardware, such as programmable switches and smart network interface cards, have created a new computing paradigm called in-network computing. This paradigm allows functionality that has traditionally been provided by servers or proprietary hardware devices, ranging from network middleboxes to components of distributed systems, to be performed in the network. The demand for higher performance and the commercial availability of programmable hardware have driven the popularity of in-network computing.

While many recent efforts have demonstrated the performance benefit of in-network computing, we observe a significant gap between what it offers today and evolving application demands. In particular, we argue that in-network computing lacks resource elasticity and fault resiliency, which are essential building blocks for practical computing platforms, and that this limits its potential. Elasticity can address the shortcoming that today's in-network computing only supports a simple deployment model where a single application runs on a single device equipped with fixed and limited resources. Similarly, fault resiliency is critical for managing prevalent device failures while preserving the correctness and performance of applications, but it has gained little attention. Although resource elasticity and fault resiliency have been extensively studied for traditional CPU server-based computing, we find that enabling them on programmable networking devices is challenging, especially due to their low-level abstractions, hardware constraints, heterogeneity, and workload characteristics.

In this thesis, we argue that by designing high-level abstractions and runtime environments that help leverage compute and memory resources available outside of one type of device, we can make in-network computing more elastic and resilient without any hardware modifications. This concept, which we call device resource augmentation, is a key enabler for resource elasticity and fault resiliency for stateful in-network applications written for programmable switches. In particular, we design three systems, named TEA, ExoPlane, and RedPlane, that use this concept to support elastic memory, elastic compute/memory, and fault resiliency, respectively. Each of these systems consists of a key abstraction, programming APIs, and a runtime environment. We demonstrate their feasibility and effectiveness with prototype implementations and evaluations using various in-network applications. Putting all the pieces together, developers can easily enable resource elasticity and fault resiliency for their applications without worrying about underlying complexities.

150 pages

Thesis Committee:
Srinivasan Seshan (Co-Chair)
Vyas Sekar (Co-Chair)
Justine Sherry
Jennifer Rexford (Princeton University)
Jitendra Padhye (Microsoft)

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

