CMU-S3D-23-108
Software and Societal Systems Department
School of Computer Science, Carnegie Mellon University




Ensuring the Safety of Reinforcement
Learning Algorithms at Training and Deployment

Melrose Roderick

October 2023

Ph.D. Thesis
Societal Computing



Keywords: Reinforcement Learning, Environmental Sustainability, Robustness, Safe Exploration, Offline Reinforcement Learning

Reinforcement learning (RL) has the potential to significantly improve the efficiency of many real-world control problems, including tokamak control for nuclear fusion and power grid optimization. However, practitioners in these areas remain wary of applying RL to their systems. From my experience working with practitioners in tokamak control, power grid optimization, and autonomous manufacturing, the primary concern is safety: how do we maintain safety throughout training and during deployment of RL algorithms? In this thesis, we discuss work we have done to address the challenges of ensuring the safety of RL algorithms at training and deployment time.

We start with the problem of ensuring safety and robustness during deployment. In real-world systems, external disturbances or other small changes to the system dynamics are inevitable, and thus there is a need for controllers that are robust to these disturbances. When designing controllers for safety-critical systems, practitioners often face a challenging tradeoff between robustness and performance. While robust control methods provide rigorous guarantees on system stability under certain worst-case disturbances, they often yield simple controllers that perform poorly in the average (non-worst) case. In contrast, nonlinear control methods trained using deep learning have achieved state-of-the-art performance on many control tasks, but often lack robustness guarantees. We introduce a novel method that provides robustness guarantees for any deep neural-network-based controller trained using RL. We demonstrate empirically that our technique improves average-case performance over other robust controllers while maintaining robustness to even worst-case disturbances.
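
To make the flavor of this concrete, the following is a minimal, hypothetical sketch (not the construction used in the thesis) of one common way to attach a robustness guarantee to a learned controller: the action proposed by a deep RL policy is projected onto a bounded neighborhood of the action of a certified robust baseline controller, so the closed loop inherits the baseline's guarantee up to that bound. The linear gain K, the radius eps, and the toy system below are illustrative assumptions.

    import numpy as np

    def robust_projected_action(x, learned_policy, K, eps):
        """Project the learned action onto an eps-ball around the robust baseline action K @ x."""
        u_learned = learned_policy(x)          # action proposed by the deep RL policy
        u_robust = K @ x                       # action from a certified robust baseline controller
        delta = u_learned - u_robust
        norm = np.linalg.norm(delta)
        if norm <= eps:                        # learned action already within the certified set
            return u_learned
        return u_robust + eps * delta / norm   # otherwise, project onto the boundary of the set

    # Toy usage on a 2-state, 1-input system (gain and policy are made up for illustration).
    K = np.array([[-1.0, -0.5]])
    policy = lambda x: np.array([np.tanh(x @ np.array([0.3, -0.7]))])   # stand-in for a trained network
    x = np.array([0.8, -0.2])
    print(robust_projected_action(x, policy, K, eps=0.1))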

Next, we discuss ensuring safety at training time. Traditionally, online RL algorithms require significant exploration to construct high-performing policies, and these exploration strategies often do not take safety into account. Although a growing line of work in reinforcement learning has investigated this area of "safe exploration," most existing techniques either 1) do not guarantee safety during the actual exploration process or 2) limit the problem to a priori known and/or deterministic transition dynamics with strong smoothness assumptions. Addressing this gap, we propose Analogous Safe-state Exploration (ASE), an algorithm for provably safe exploration in Markov Decision Processes (MDPs) with unknown, stochastic dynamics. Our method exploits analogies between state-action pairs to safely learn a near-optimal policy in a PAC-MDP (Probably Approximately Correct MDP) sense. ASE also guides exploration towards the most task-relevant states, which empirically yields significant improvements in sample efficiency compared to existing methods.
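
The following is a minimal, hypothetical sketch of the safe-set expansion idea underlying provably safe exploration: a state-action pair is only declared safe when, under a pessimistic model, its successor is non-hazardous and the agent can return to states already known to be safe. The deterministic corridor and hazard labels are illustrative assumptions; ASE itself handles unknown, stochastic dynamics via confidence bounds and analogies between state-action pairs.

    def expand_safe_set(states, actions, step, is_hazard, initial_safe):
        """Iteratively grow the set of (state, action) pairs that can be proven safe."""
        safe = set(initial_safe)                    # pairs assumed safe a priori
        changed = True
        while changed:
            changed = False
            for s in states:
                if is_hazard(s):
                    continue                        # never plan from inside a hazard
                for a in actions:
                    if (s, a) in safe:
                        continue
                    s_next = step(s, a)             # pessimistic successor prediction
                    if is_hazard(s_next):
                        continue                    # taking this action could be unsafe
                    # require a known-safe action out of s_next (returnability to the safe set)
                    if any((s_next, b) in safe for b in actions):
                        safe.add((s, a))
                        changed = True
        return safe

    # Toy 1-D corridor: states 0..4, state 4 is a hazard; actions move left or right.
    states = range(5)
    actions = [-1, +1]
    step = lambda s, a: max(0, min(4, s + a))
    is_hazard = lambda s: s == 4
    print(sorted(expand_safe_set(states, actions, step, is_hazard, {(0, +1), (1, -1)})))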

Alternatively, RL can be applied to offline datasets to safely learn control policies without ever taking a potentially dangerous action on the real system. A key problem in offline RL is the mismatch, or distribution shift, between the dataset and the distribution over states and actions visited by the learned policy. The main approach to correcting this shift has been importance sampling, which leads to high-variance gradients. Other approaches, such as conservatism or behavior regularization, regularize the policy at the cost of performance. We propose a new approach to stable off-policy Q-learning that builds on a theoretical result by Kolter [64]. Our method, Projected Off-Policy Q-Learning (POP-QL), is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error. In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods on tasks where the data-collection policy is significantly suboptimal.
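
The snippet below is a simplified, hypothetical PyTorch illustration of the two ingredients named above: a per-sample reweighting of the off-policy TD loss and a penalty that keeps the learned policy close to the data-collection policy. The softmax weight network, the squared-distance penalty, and all hyperparameters are stand-ins chosen for brevity; they are not the POP-QL projection itself.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    obs_dim, act_dim, batch = 4, 2, 32

    # Stand-in networks; the real method uses deeper function approximators and target networks.
    q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    weight_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    policy_mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))

    # Fake offline batch; in practice these are transitions drawn from a fixed dataset.
    s, a = torch.randn(batch, obs_dim), torch.randn(batch, act_dim)
    r, s_next = torch.randn(batch, 1), torch.randn(batch, obs_dim)

    with torch.no_grad():
        a_next = torch.tanh(policy_mean(s_next))                        # next action from the current policy
        target = r + 0.99 * q_net(torch.cat([s_next, a_next], dim=-1))  # bootstrapped TD target
        w = torch.softmax(weight_net(torch.cat([s, a], dim=-1)).squeeze(-1), dim=0)  # sample reweighting

    td_error = q_net(torch.cat([s, a], dim=-1)).squeeze(-1) - target.squeeze(-1)
    critic_loss = (w * td_error ** 2).sum()                             # reweighted Bellman error

    a_pi = torch.tanh(policy_mean(s))
    behavior_penalty = ((a_pi - a) ** 2).mean()                         # crude stand-in for a policy constraint
    actor_loss = -q_net(torch.cat([s, a_pi], dim=-1)).mean() + 1.0 * behavior_penalty
    print(float(critic_loss), float(actor_loss))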

Another approach to offline RL is to use model-based methods. A unique challenge for model-based methods in the offline setting is the need to predict epistemic uncertainty, that is, uncertainty deriving from a lack of data samples. Standard deep learning methods, however, are unable to capture this type of uncertainty. We propose a new method, Generative Posterior Networks (GPNs), that uses unlabeled data to estimate epistemic uncertainty in high-dimensional problems. A GPN is a generative model that, given a prior distribution over functions, approximates the posterior distribution directly by regularizing the network towards samples from the prior. We prove theoretically that our method indeed approximates the Bayesian posterior and show empirically that it improves epistemic uncertainty estimation and scalability over competing methods.
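
As a rough illustration of estimating epistemic uncertainty by regularizing learned functions toward samples from a prior, the sketch below trains a small ensemble in which each member is anchored to a frozen "prior" network on unlabeled inputs; disagreement among members then grows away from the labeled data. This is in the spirit of prior-anchored ensembles rather than the generative architecture of GPNs, and all networks, data, and hyperparameters are illustrative assumptions.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    def mlp():
        return nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

    # Labeled data covers a narrow interval; unlabeled data covers a much wider range.
    x_lab = torch.linspace(-1, 1, 32).unsqueeze(-1)
    y_lab = torch.sin(3 * x_lab)
    x_unlab = torch.linspace(-4, 4, 128).unsqueeze(-1)

    members = [mlp() for _ in range(5)]
    priors = [mlp() for _ in range(5)]          # frozen samples from the prior over functions
    for prior in priors:
        for p in prior.parameters():
            p.requires_grad_(False)

    for net, prior in zip(members, priors):
        opt = torch.optim.Adam(net.parameters(), lr=1e-2)
        for _ in range(500):
            opt.zero_grad()
            fit = ((net(x_lab) - y_lab) ** 2).mean()             # fit the labeled data
            reg = ((net(x_unlab) - prior(x_unlab)) ** 2).mean()  # stay near the prior sample on unlabeled data
            (fit + 0.1 * reg).backward()
            opt.step()

    with torch.no_grad():
        preds = torch.stack([net(x_unlab) for net in members])
        epistemic_std = preds.std(dim=0)        # member disagreement, large far from the labeled region
    print(float(epistemic_std[:8].mean()), float(epistemic_std[60:68].mean()))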

160 pages

Thesis Committee:
Zico Kolter (Chair)
Jeff Schneider
Ruslan Salakhutdinov
Felix Berkenkamp (Bosch Center for AI)

James D. Herbsleb, Head, Software and Societal Systems Department
Martial Hebert, Dean, School of Computer Science

