CMU-S3D-24-104
Software and Societal Systems Department
School of Computer Science, Carnegie Mellon University




Resilient Microservice Applications, by Design, and without the Chaos

Christopher S. Meiklejohn

May 2024

Ph.D. Thesis
Software Engineering

CMU-S3D-24-104.pdf


Keywords: Fault prevention, fault injection, fault tolerance, microservice architectures, microservices, testing, circuit breakers, remote procedure call, RPC

Fault injection testing is vital for assessing the resilience of distributed microservice applications against infrastructure and downstream service failures. Typically performed in production, where it may adversely affect customers, it often fails to identify application bugs, particularly infrequent ones or those that affect only a subset of customers. While academics recognize the problem of detecting resilience bugs in development, prior to deployment of application code to production, their research has been limited by a lack of access to industrial applications, which has resulted in solutions that may not be fully aligned with the industry's needs.

This dissertation demonstrates that these types of resilience bugs can be identified during development, before deployment of application code to production, through the use of a developer-centric fault injection technique and a principled approach to microservice application testing. It then demonstrates that this can be done in a manner that aligns with industrial practitioners' needs by co-evolving the fault injection technique and principled approach with an industrial partner, one of the largest food delivery services in the United States, which results in the discovery of deep, previously undiscovered resilience bugs in their application.

This dissertation begins by constructing a microservice application corpus and introducing a novel tracing technique that captures all inter-service communication in a microservice application. Combined with the corpus, this tracing technique enables the development of an exhaustive fault injection testing technique designed specifically for microservice environments. This technique is then refined by implementing a novel test case reduction strategy that minimizes the exploration of redundant fault injection scenarios, thereby increasing the performance and usability of the technique. The practicality of these techniques is then validated using a case study taken from an industrial microservice application. While this case study confirms the fault injection technique's effectiveness, it also highlights deficiencies in the application of the technique and identifies emergent behavior that is inherent to industrial microservice applications and their piecemeal approach to application resilience. These observations inform the design of a new principled approach for testing microservice applications for resilience, which extends the fault injection technique's usability by ensuring that developers write tests for their applications that are sufficient for bug identification.

With this principled approach, it is shown that deep, previously undiscovered resilience bugs can be identified in large-scale, industrial microservice applications during development, before code ships to production.

213 pages

Thesis Committee:
Heather Miller (Chair)
Claire Le Goues
Rohan Padhye
Peter Alvaro (University of California, Santa Cruz)

James D. Herbsleb, Head, Software and Societal Systems Department
Martial Hebert, Dean, School of Computer Science

