CMU-S3D-26-110
Software and Societal Systems Department
School of Computer Science, Carnegie Mellon University




Bootstrapping Contextual Understanding in Privacy-Preserving Smart Environments

Prasoon Patidar

June 2026

Ph.D. Thesis
Societal Computing

CMU-S3D-26-110.pdf


Keywords: Human Activity Recognition, Privacy-Preserving Sensing, Context-Aware Computing, Smart Environments, Ubiquitous Computing

Physical spaces instrumented with sensors that infer the activities of their occupants and offer them insights, automation, and assistance have been a longstanding vision of both the research community and industry. Today, the technology that makes such understanding possible at scale typically relies on machine learning models built for high-fidelity sensors like cameras and microphones, because ample training data is available for them. This reliance on high-fidelity sensors creates a tradeoff: to make a smart space understand you, you have to give up some privacy. In recent years, the research community has worked to shift this tradeoff by turning to "privacy-preserving" sensors like mmWave radars, thermal arrays, and motion sensors. These sensors observe enough of human behavior to be useful while revealing far less about the identity, appearance, or conversations of those being observed.

The machine learning models behind cameras and microphones scale because they are global: trained once on large labeled datasets, they transfer across environments with little adaptation. Privacy-preserving sensors do not yet have this property. Their signals are shaped by the physical environment: for example, the sequence in which a collection of motion sensors activates depends on furniture placement, sensor position, and room layout. The same person, performing the same routine, produces a different signal pattern in a different home. Without global models, the understanding that these systems need must be manually reconstructed in each new environment. Deploying a privacy-preserving sensing system today burdens users with hands-on involvement at every stage of the pipeline: labeling training data, defining which activities to detect, authoring rules that synthesize higher-level context, and navigating purpose-built interfaces to access insights. It is this per-environment human effort that prevents privacy-preserving sensing from moving beyond controlled research settings.

The thesis behind this dissertation is that the user burden required to set up and use context-aware sensing systems, from providing training labels to navigating purpose-built interfaces, can be lowered by opportunistically engaging a more capable resource with appropriate scaffolding. At each stage where a conventional pipeline demands human involvement, a more capable resource, whether a temporarily deployed camera, a vision-language model, a formal ontology, or a large language model, supplies the missing understanding, and scaffolding constrains how that resource operates, keeping its contributions predictable and bounded. In some cases the resource is temporary, assisting during an initial setup phase and then departing. In others it remains but in a bounded role, reasoning over processed outputs rather than raw sensor data. This shifts the user's role from constructing the system to confirming or correcting what it produces.

I demonstrate this principle at four successive levels of abstraction in the sensing pipeline. At the raw sensor signal level, deploying privacy-preserving sensors conventionally requires a user to perform and label activities so that supervised ML models can learn to recognize them from the sensor data. VAX [154] reduces this burden through temporary privileged observation: a camera and microphone are paired with the privacy-preserving sensors during an initial setup phase, and their output, combined with off-the-shelf audio and video ML models, generates training labels automatically; once the sensor models are trained, the privileged sensors are removed. At the vocabulary level, any sensing system deployment requires someone to specify which activities the system should detect, but the right vocabulary for a given space depends on what the occupant does and what activities the sensors can actually distinguish. Defining this vocabulary upfront is difficult because occupants may not anticipate which of their routines matter, and developers cannot predict what a particular sensor configuration will be able to capture. OrganicHAR [155] addresses this by discovering activities directly from privacy-preserving sensor data: it identifies recurring patterns in the sensor streams and uses a vision-language model (VLM) during brief key moments to understand what those patterns represent. At the context level, the burden shifts from detecting activities to interpreting what they mean together: someone typing, talking, and sitting could be in a meeting, on a phone call, or doing focused work, and distinguishing these conventionally requires hand-authored rules specific to each environment. TAO [36] replaces these rules with a combination of a formal ontology that models how activities compose into contexts and an unsupervised deep temporal clustering pipeline that discovers context patterns directly from activity streams. The ontology provides structured, transferable domain knowledge that does not need to be rewritten for each environment, and the temporal pipeline learns recurring patterns without supervision. At the interaction level, a system that understands sensor data, activities, and context is only useful if occupants can access what it knows. BuildingChat replaces purpose-built interfaces and bespoke apps with a conversational layer in which a large language model assembles analytical pipelines from pre-verified operators on the fly. Here the language model operates in a bounded role, orchestrating verified operators rather than operating on raw sensor data directly.

Individually, these four contributions each lower the user burden at one level of the sensing pipeline. Together, they sketch a path toward an end-to-end system that bootstraps much of its own understanding when deployed in a new environment, rather than requiring an occupant to reconstruct that understanding by hand at every stage. Combining them into a single self-configuring deployment is the next step this work makes possible. What the four systems establish is that the per-environment effort at each stage, from labeling training data to navigating purpose-built interfaces, is not inherent to privacy-preserving sensing but can be carried by a more capable resource under appropriate scaffolding. This is a step toward the longstanding vision of physical spaces that understand the people in them without asking those people to give up their privacy.

213 pages

Thesis Committee:
Yuvraj Agarwal (Chair)
Mayank Goel
Andrew Begel
Ranveer Chandra (Microsoft)

Nicolas Christin, Head, Software and Societal Systems Department
Martial Hebert, Dean, School of Computer Science

