CMU-CS-25-155
Computer Science Department, School of Computer Science, Carnegie Mellon University

Algorithms for Massive Data: Optimal Bounds
Honghao Lin
Ph.D. Thesis, December 2025
With the rapid growth of massive datasets in areas such as machine learning and numerical linear algebra, classical algorithms are often no longer feasible. In this thesis, we develop provably efficient algorithms for a variety of problems in such settings, including the streaming and distributed models. Our contributions span three directions.

(1) Optimal Bounds. We establish nearly optimal upper and lower bounds for a broad class of problems in the classical streaming and distributed models. We introduce a general technique for lifting dimension lower bounds for real-valued linear sketches to polynomially bounded integer inputs. This yields the first optimal sketching lower bounds for discrete data streams in fundamental problems such as frequency moment approximation, operator norm estimation, and compressed sensing. Beyond this, we establish nearly optimal bounds for a variety of streaming and sketching tasks, including ℓp subspace sketches for constant dimension d, ℓp regression in the arbitrary-partition distributed model, and graph problems such as approximating the minimum cut in a stream and constructing cut sparsifiers for balanced directed graphs.

(2) Adversarial Robustness. While most streaming algorithms are studied in static worst-case models, many practical scenarios involve adaptive adversaries who generate inputs based on previous outputs. We present the first adaptive attack against linear sketches for ℓp estimation over turnstile integer streams. Specifically, we show that any linear streaming algorithm with sketching matrix A ∈ ℤ^{r×n} can be broken using only poly(r log n) queries, with high constant probability. This result highlights fundamental limits on robustness in adaptive streaming.

(3) Learning-Based Algorithms. Classical algorithms guarantee correctness in the worst case but often ignore structure in real-world data, while machine learning methods leverage structure but typically lack guarantees. We design learning-based algorithms that incorporate machine-learned predictions to adapt to input distributions, achieving faster runtimes, reduced space, or improved accuracy. Crucially, these algorithms retain rigorous worst-case guarantees even when the predictions are imperfect, bridging the gap between theory and data-driven practice.

303 pages
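To make the objects above concrete: a linear sketch maintains only y = Ax for a random matrix A ∈ {−1,+1}^{r×n}, so turnstile updates (i, Δ) translate to adding Δ·A[:, i] to the sketch state. The toy sketch below estimates the ℓ2 norm in the spirit of the classical AMS sketch; it is a minimal illustration, not the constructions or the attack studied in the thesis, and the dense sign matrix is for clarity only.

```python
import math
import random

class L2Sketch:
    """Toy linear sketch for l2 (F2) estimation over a turnstile stream."""

    def __init__(self, n, r, seed=0):
        rng = random.Random(seed)
        # r x n random sign matrix A; the state y = A @ x is linear in x.
        self.A = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(r)]
        self.y = [0.0] * r
        self.r = r

    def update(self, i, delta):
        # Turnstile update x[i] += delta (delta may be negative).
        for row in range(self.r):
            self.y[row] += delta * self.A[row][i]

    def estimate(self):
        # E[y_j^2] = ||x||_2^2 for each row, so average the squares.
        return math.sqrt(sum(v * v for v in self.y) / self.r)

sk = L2Sketch(n=1000, r=400, seed=1)
sk.update(3, 5)      # x[3] += 5
sk.update(7, -2)     # x[7] -= 2
sk.update(3, -5)     # cancels exactly, by linearity
est = sk.estimate()  # true l2 norm of x is 2
```

Linearity is exactly what the adaptive attack in direction (2) exploits: because the output depends on x only through Ax, carefully chosen query streams can learn enough about A to fool the sketch.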
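The learning-based paradigm in direction (3) can be sketched with a frequency-estimation example: a predictor flags items believed to be heavy hitters, which are counted exactly, while everything else goes into a CountMin-style sketch. This is a generic illustration of the "predictions with worst-case fallback" pattern, with a hypothetical `predict_heavy` oracle standing in for a learned model; it is not a specific algorithm from the thesis. If the predictor errs, queries simply fall back to the sketch's usual never-underestimate guarantee.

```python
import random

class LearnedCountMin:
    """Frequency estimation with a (possibly wrong) heavy-hitter predictor."""

    def __init__(self, width, depth, predict_heavy, seed=0):
        rng = random.Random(seed)
        self.salts = [rng.randrange(1 << 30) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]
        self.width = width
        self.exact = {}                  # exact counters for predicted-heavy keys
        self.predict_heavy = predict_heavy

    def _cell(self, row, key):
        return hash((self.salts[row], key)) % self.width

    def add(self, key, count=1):
        if self.predict_heavy(key):
            self.exact[key] = self.exact.get(key, 0) + count
        else:
            for row in range(len(self.table)):
                self.table[row][self._cell(row, key)] += count

    def query(self, key):
        if key in self.exact:
            return self.exact[key]       # exact answer when prediction was used
        # CountMin never underestimates: take the min over rows.
        return min(t[self._cell(r, key)] for r, t in enumerate(self.table))

pred = lambda k: k == "hot"              # toy stand-in for a learned predictor
cm = LearnedCountMin(width=64, depth=4, predict_heavy=pred)
for _ in range(100):
    cm.add("hot")
for k in range(50):
    cm.add(f"cold{k}")
```

Here `cm.query("hot")` is exact because the predictor routed it to an exact counter, while cold items keep the standard sketch guarantee regardless of prediction quality, mirroring the worst-case fallback emphasized in the abstract.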
Thesis Committee:
Jignesh Patel, Interim Head, Computer Science Department