CMU-HCII-24-101
Human-Computer Interaction Institute
School of Computer Science, Carnegie Mellon University



Behavior-Driven AI Development

Ángel Alexander Cabrera

April 2024

Ph.D. Thesis

CMU-HCII-24-101.pdf


Keywords: Machine learning evaluation, AI evaluation, failure analysis, behavioral analysis, sensemaking, human-AI collaboration, visualization, crowdsourcing, machine learning, artificial intelligence


AI systems are being deployed in many real-world applications, from self-driving cars to customer service chatbots. When a person interacts with an AI system, they develop a complex mental model of how the system behaves, which informs how they interact with it. Should I override the AI's prediction? Should I collect more training data? Traditionally, aggregate metrics such as accuracy are calculated on a held-out test set to measure a model's overall performance. A single aggregate metric, however, is often insufficient for developing mental models that capture important AI behaviors, such as potential biases or safety concerns.

This thesis proposes behavior-driven AI development (BDAI), a philosophy that centers AI development on identifying, quantifying, and communicating the numerous behaviors a model can show. By focusing on a model's behaviors instead of aggregate metrics, developers can create responsible AI systems that best fulfill end-user needs. BDAI is central both to creating AI systems, informing how a model should be updated, and to deploying them, informing how people should interact with a model. In this thesis, I describe empirical and system-building work that formally defines BDAI and shows how it can be applied to improve real-world AI systems.

In the first half of the thesis, I present a series of interviews, a theoretical framework, and a user study that together describe the core principles of BDAI. First, I summarize a qualitative interview study with 27 practitioners investigating how they understand and improve the behaviors of complex AI systems. Next, I describe a theoretical framework that defines this process as a form of sensemaking and show how the framework can be used to create AI evaluation tools. I further show how insights into model behavior can improve human-AI collaboration by calibrating end-users' reliance on model outputs.

In the second half of the thesis, I implement two systems that together span the full sensemaking process and BDAI workflow. I first introduce Zeno, an interactive platform that lets practitioners discover and validate behaviors across any AI system. I then describe Zeno Reports, a no-code tool built on Zeno for authoring interactive evaluation reports. Through case studies and real-world deployments with more than 500 users, I show how AI analysis tools covering the sensemaking process can empower practitioners to develop more performant and equitable AI systems.

141 pages

Thesis Committee:
Adam Perer (Co-Chair)
Jason I. Hong (Co-Chair)
Kenneth Holstein
Ameet Talwalkar (CMU, Machine Learning Department)
Aditya Parameswaran (University of California, Berkeley)

Brad A. Myers, Head, Human-Computer Interaction Institute
Martial Hebert, Dean, School of Computer Science
