Machine Learning Department
School of Computer Science, Carnegie Mellon University


Christoph Dann

September 2019

Ph.D. Thesis


Keywords: Reinforcement Learning, Sequential Decision Making Under Uncertainty, Learning Theory, Exploration, Markov Decision Process, Accountability

Reinforcement learning (RL) focuses on an essential aspect of intelligent behavior: how an agent can learn to make good decisions given experience and rewards in a stochastic world. Yet popular RL algorithms that have enabled exciting successes in domains with good simulators (Go, Atari, etc.) still often fail to learn in other domains because they rely on simple heuristics for exploration. This provides additional empirical justification for essential questions around RL, specifically around algorithms that learn in a provably efficient manner through strategic exploration in any considered domain. This thesis provides new algorithms and theory that enable good performance with respect to existing theoretical frameworks for evaluating RL algorithms (specifically, the probably approximately correct framework) and introduces new, stronger evaluation criteria that may be of particular interest as RL is applied to more real-world problems.

For the first line of work on probably approximately correct (PAC) RL algorithms, we introduce a series of algorithms for episodic tabular domains with substantially better PAC sample complexity bounds, culminating in a new algorithm with close to minimax-optimal PAC and regret bounds. Lookup tables are required by most sample-efficient and computationally tractable algorithms, but cannot represent many practical domains. We therefore also present a new RL algorithm that can learn a good policy in environments with high-dimensional observations and hidden deterministic states; unlike its predecessors, this algorithm provably explores in a manner that is not only statistically but also computationally efficient, assuming access to function classes with efficient optimization oracles.

To make progress it is critical to have the right measures of success. While empirical demonstrations are quite clear, we find that for theoretical properties, two of the most commonly used learning frameworks, PAC guarantees and regret guarantees, each allow undesirable algorithm behavior (e.g. ignoring new observations that could improve the policy). We present a new stronger learning framework called Uniform-PAC that unifies the existing frameworks and prevents undesirable algorithm properties.
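One way to make the contrast between PAC and Uniform-PAC concrete is sketched below. The notation is an assumption introduced here and does not appear in the abstract: \Delta_k denotes the optimality gap of the policy followed in episode k, N_\varepsilon counts the episodes that are more than \varepsilon-suboptimal, and F stands for a bound that is polynomial in its arguments (and in the problem parameters, omitted for brevity).

```latex
% Number of episodes whose policy is more than epsilon-suboptimal
% (\mathbb requires the amssymb or amsfonts package)
\[
  N_\varepsilon \;=\; \sum_{k=1}^{\infty} \mathbb{1}\{\Delta_k > \varepsilon\}
\]

% (epsilon, delta)-PAC: the accuracy level epsilon is fixed in advance
\[
  \mathbb{P}\bigl( N_\varepsilon > F(1/\varepsilon, \log(1/\delta)) \bigr) \;\le\; \delta
\]

% Uniform-PAC: the same bound must hold for all epsilon simultaneously
\[
  \mathbb{P}\bigl( \exists\, \varepsilon > 0 :\; N_\varepsilon > F(1/\varepsilon, \log(1/\delta)) \bigr) \;\le\; \delta
\]
```

Under this reading, an algorithm that stops improving once it reaches a pre-specified accuracy can satisfy the second statement but not the third, since the third also constrains the number of mistakes at every finer accuracy level.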

One caveat of all existing learning frameworks is that for any particular episode, we do not know how well the algorithm will perform. To address this, we introduce the IPOC framework, which requires algorithms to provide a certificate before each episode bounding how suboptimal the current policy can be. Such certificates may be of substantial interest in high-stakes scenarios, where an organization may wish to track or even pause an online RL system should the potential expected performance bound drop below a required expected outcome.
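The monitoring use case described above can be illustrated with a minimal Python sketch. It is not an implementation of any algorithm from the thesis: the names `EpisodeRecord`, `monitor`, and `max_suboptimality` are hypothetical, and the certificates are assumed to be supplied by some RL algorithm as non-negative upper bounds on the suboptimality of the policy it is about to run.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class EpisodeRecord:
    """Outcome of the (hypothetical) monitoring check for one episode."""
    episode: int
    certificate: float  # claimed upper bound on the policy's suboptimality
    deployed: bool      # True if the episode is allowed to run


def monitor(certificates: Iterable[float],
            max_suboptimality: float) -> List[EpisodeRecord]:
    """Flag each episode as deployable or paused based on its certificate.

    `certificates` stands in for the bounds an algorithm would report before
    each episode; `max_suboptimality` is the tolerance chosen by the operator.
    Both are assumptions made for this illustration.
    """
    records = []
    for episode, certificate in enumerate(certificates, start=1):
        if certificate < 0:
            raise ValueError("a suboptimality bound cannot be negative")
        # Run the episode only if the guaranteed gap is within tolerance.
        deployed = certificate <= max_suboptimality
        records.append(EpisodeRecord(episode, certificate, deployed))
    return records


if __name__ == "__main__":
    # Made-up certificates that shrink as the algorithm gathers experience.
    for record in monitor([0.9, 0.4, 0.1], max_suboptimality=0.5):
        status = "run" if record.deployed else "pause"
        print(f"episode {record.episode}: bound {record.certificate} -> {status}")
```

The point of the sketch is that the decision to run or pause is made before the episode starts, using only the reported bound, which is what distinguishes per-episode certificates from guarantees that only hold in aggregate.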

158 pages

Thesis Committee:
Emma Brunskill (Chair) (Stanford University)
Barnabas Poczos
Benjamin Recht (University of California, Berkeley)
Benjamin Van Roy (Stanford University)
Rémi Munos (DeepMind)

Roni Rosenfeld, Head, Machine Learning Department
Martial Hebert, Dean, School of Computer Science

SCS Technical Report Collection
School of Computer Science