CMU-S3D-23-103
Software and Societal Systems Department
School of Computer Science, Carnegie Mellon University




Automatically Annotating Decompiled Code
with Meaningful Names and Types

Jeremy Lacomis

May 2023

Ph.D. Thesis
Software Engineering



Keywords: Software reverse engineering, decompilation, decompiled code readability, decompiled code names, decompiled code types

Software reverse engineering is the problem of understanding the behavior of a program without access to its source code. Since there is no source code, analysts must use the binary directly. A primary tool used by reverse engineers is the decompiler, which attempts to reverse the process of compilation. Although decompilers generate abstractions that improve code readability, the act of compilation irreversibly destroys information contained in the source code, including comments, control flow abstractions, user-defined types, and identifier names, none of which can be recovered exactly from the binary alone.

However, software is natural: programmers tend to write the same code to perform the same tasks. While it is technically impossible to generate the original code, it is possible to train a model to automatically generate more meaningful identifier names and types. Treating code augmentation as an instance of translation allows the application of tools and techniques originally developed for natural language translation to the problem of identifier renaming and retyping.

The goal of the work presented in this thesis is to automatically augment the output of decompilers with more meaningful names and user-defined types, under the hypothesis that this will decrease the cognitive burden of reasoning about decompiler-generated code. We hope that this will have two advantages. First, we believe it will save reverse engineers valuable time that could instead be spent reasoning about the higher-level functionality of the code. Second, we believe it will flatten the learning curve, allowing more novices to enter the field.

My core thesis statement is: Exploiting structure inherent in code, together with its naturalness, enables the application of machine translation techniques to useful transformations of decompiled code. These techniques can be used to meaningfully rename and retype variables in decompiled code.

To support this thesis, I present two automated techniques for renaming and retyping variables in decompiled code. I demonstrate that these techniques make decompiled code more approachable, both through metrics developed as a proxy for human understanding and through a user study designed to measure the performance of the techniques in real-world applications.

114 pages

Thesis Committee:
Claire Le Goues (Co-Chair)
Bogdan Vasilescu (Co-Chair)
Graham Neubig
Edward J. Schwartz (CMU Software Engineering Institute)

James D. Herbsleb, Head, Software and Societal Systems Department
Martial Hebert, Dean, School of Computer Science

