SOFTWARE AND SOCIETAL SYSTEMS DEPARTMENT TECHNICAL REPORT ABSTRACTS

CMU-S3D-25-113
Software and Societal Systems Department
School of Computer Science, Carnegie Mellon University

Automated API Refactoring for Evolving Codebases

Daniel Rosa Ramos

August 2025

Ph.D. Thesis
Software Engineering

CMU-S3D-25-113.pdf

Keywords: API refactoring, code migration, large language models, program synthesis

Modern software development depends heavily on third-party libraries and frameworks, which expose their functionality through APIs and bring substantial productivity gains. However, as libraries evolve to meet new technical or market demands, clients must often adapt their code to accommodate breaking changes or even migrate to newer libraries. This form of software maintenance, known as API refactoring, is time-consuming and error-prone, which has led to significant interest in automating it. A common approach to automating API refactoring is to mine historical data from client repositories to extract match-replace rules. However, these approaches are limited by the availability of high-quality examples: many clients do not refactor in public, and those that do leave insufficient traces to learn from.

This thesis presents a set of alternative methods for learning API migration rules without requiring large-scale mining of client code. Instead, we explore three complementary sources of information: documentation, the API development process, and natural language. First, we use API documentation to infer mappings between old and new APIs, which guide the synthesis of migration scripts. Second, we extract migration knowledge from the evolution of the library itself, especially from pull requests that introduce breaking changes and update internal tests. Finally, we show that large language models trained on natural language artifacts can be used to generate migration examples, which are then validated and generalized into reusable scripts. We operationalize these ideas in four refactoring tools, each targeting a different aspect of the problem. These tools combine program synthesis with machine learning to synthesize and apply migrations automatically. We evaluate our techniques on real-world Python libraries and synthetic benchmarks, showing that it is possible to automate migration effectively using only indirect sources of information, without requiring curated datasets or repository mining.
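
To make the notion of a match-replace rule concrete, the following is a minimal illustrative sketch and not one of the thesis's tools. It assumes a hypothetical breaking change in which `oldlib.fetch(url, t)` is replaced by `newlib.get(url, timeout=t)`; the library names, the function names, and the `migrate` helper are all invented for illustration. The rule is applied with Python's standard `ast` module (Python 3.9 or later for `ast.unparse`).

```python
import ast


class FetchToGet(ast.NodeTransformer):
    """Hypothetical rule: oldlib.fetch(url, t) -> newlib.get(url, timeout=t)."""

    def visit_Call(self, node):
        self.generic_visit(node)  # rewrite nested calls first
        func = node.func
        # Match: a call to the attribute `fetch` on the name `oldlib`
        # with exactly two positional arguments and no keywords.
        is_match = (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "oldlib"
            and func.attr == "fetch"
            and len(node.args) == 2
            and not node.keywords
        )
        if not is_match:
            return node
        url, timeout = node.args
        # Replace: build the new call, turning the second positional
        # argument into the keyword argument `timeout`.
        new_call = ast.Call(
            func=ast.Attribute(
                value=ast.Name(id="newlib", ctx=ast.Load()),
                attr="get",
                ctx=ast.Load(),
            ),
            args=[url],
            keywords=[ast.keyword(arg="timeout", value=timeout)],
        )
        return ast.copy_location(new_call, node)


def migrate(source: str) -> str:
    """Apply the rule to a source string and return the rewritten code."""
    tree = FetchToGet().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


print(migrate("data = oldlib.fetch(u, 30)"))
```

This sketch only rewrites call sites; import statements and aliased imports are left untouched, and a hand-written rule like this is exactly what the approaches described above aim to synthesize automatically.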

135 pages

Thesis Committee:
Claire Le Goues (Chair)
Ruben Martins
Joshua Sunshine
Nuno Lopes (Instituto Superior Técnico)
Vasco Manquinho (Instituto Superior Técnico)
Işil Dillig (University of Texas at Austin)

Nicolas Christin, Head, Software and Societal Systems Department
Martial Hebert, Dean, School of Computer Science

Creative Commons License: CC-BY-NC (Attribution-NonCommercial)
