CMU-S3D-23-103
Software and Societal Systems Department
School of Computer Science, Carnegie Mellon University
Automatically Annotating Decompiled Code
Jeremy Lacomis
May 2023
Ph.D. Thesis
Software reverse engineering is the problem of understanding the behavior of a program without access to its source code. Since there is no source code, analysts must work with the binary directly. A primary tool used by reverse engineers is the decompiler, which attempts to reverse the process of compilation. Although decompilers generate abstractions that improve code readability, compilation irreversibly destroys information contained in the source code, including comments, control flow abstractions, user-defined types, and identifier names, all of which are provably impossible to reconstruct.
However, software is natural: programmers tend to write the same code to perform the same tasks. While it is technically impossible to recover the original code, it is possible to train a model to automatically generate more meaningful identifier names and types. Treating code augmentation as an instance of translation allows tools and techniques originally developed for natural language translation to be applied to the problem of identifier renaming and retyping.
The goal of the work presented in this thesis is to automatically augment the output of decompilers with more meaningful names and user-defined types, under the hypothesis that this will decrease the cognitive burden of reasoning about the generated code. We hope that this will have two advantages. First, we believe that it will save reverse engineers valuable time that could be spent reasoning about the higher-level functionality of the code. Second, we believe that it will flatten the learning curve, allowing more novices to enter the field.
My core thesis statement is: Exploiting structure inherent in code, together with its naturalness, enables the application of machine translation techniques to useful transformations of decompiled code. These techniques can be used to meaningfully rename and retype variables in decompiled code.
To support this thesis, I present two automated techniques for renaming and retyping variables in decompiled code. I demonstrate that these techniques make decompiled code more approachable, both through metrics developed as a proxy for human understanding and through a user study designed to measure the performance of the techniques in real-world applications.
114 pages
James D. Herbsleb, Head, Software and Societal Systems Department