CMU-CS-25-155
Computer Science Department
School of Computer Science, Carnegie Mellon University



CMU-CS-25-155

Algorithms for Massive Data: Optimal Bounds,
Adversarial Robustness, and Data-Driven Insights

Honghao Lin

Ph.D. Thesis

December 2025

CMU-CS-25-155.pdf


Keywords: Sketching, Streaming, Distributed Algorithms, Adversarial Robustness, Learning- Based Algorithms

With the rapid growth of massive datasets in areas such as machine learning and numerical linear algebra, classical algorithms are often no longer feasible. In this thesis, we develop provably efficient algorithms for various problems in these settings, such as the streaming and distributed model. Our contributions span three directions:

(1) Optimal Bounds. We establish nearly optimal upper and lower bounds for a broad class of problems in the classical streaming and distributed model. We introduce a general technique for lifting dimension lower bounds for real-valued linear sketches to polynomially bounded integer inputs. This leads to the first optimal sketching lower bounds for discrete data streams in fundamental problems such as frequency moment approximation, operator norm estimation, and compressed sensing. Beyond this, we also establish nearly-optimal bounds for a variety of streaming and sketching tasks, including p subspace sketches for constant dimension d, p regression in the arbitrary-partition distributed model, and graph problems such as approximating the minimum cut in a stream and constructing cut sparsifiers in balanced directed graphs.

(2) Adversarial Robustness. While most streaming algorithms are studied in static worst-case models, many practical scenarios involve adaptive adversaries who generate inputs based on previous outputs. We present the first adaptive attack against linear sketches for p -estimation over turnstile integer streams. Specifically, we show that any linear streaming algorithm with sketching matrix 𝐀 ∊ ℤrxn can be broken using only poly(r log n) queries, with high constant probability. This result highlights fundamental limits of robustness in adaptive streaming.

(3) Learning-Based Algorithms. Classical algorithms guarantee correctness in the worst case but often ignore structure in real-world data, while machine learning methods leverage structure but typically lack guarantees. We design learning-based algorithms that incorporate machine learning predictions to adapt to input distributions, achieving faster runtimes, reduced space, or improved accuracy. Crucially, these algorithms retain rigorous worst-case guarantees even when the predictions are imperfect, bridging the gap between theory and data-driven practice.

303 pages

Thesis Committee:
David P. Woodruff (Chair)
Yang P. Liu
Richard Peng
Jelani Nelson (University of California, Berkeley)

Jignesh Patel, Interim Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science


Return to: SCS Technical Report Collection
School of Computer Science

This page maintained by reports@cs.cmu.edu