CMU-CS-99-143
Computer Science Department, School of Computer Science, Carnegie Mellon University

Geoffrey J. Gordon, June 1999. Ph.D. Thesis.
CMU-CS-99-143.ps
Keywords: Machine learning, reinforcement learning, dynamic
programming, Markov decision processes (MDPs), linear programming,
convex programming, function approximation, worst-case learning, regret
bounds, statistics, fitted value iteration, convergence of numerical
methods
In order for a machine learner to act reasonably in an uncertain environment, it must solve difficult planning problems quickly and reliably. Unfortunately, the world is often so complicated that it is difficult or impossible to find the optimal sequence of actions to achieve a given goal. So, to scale our learners up to real-world problems, we usually must settle for approximate solutions.

One representation for a learner's environment and goals is a Markov decision process, or MDP. MDPs allow us to represent actions that have probabilistic outcomes, and to plan for complicated, temporally-extended goals. An MDP consists of a set of states that the environment can be in, together with rules for how the environment changes state and for what the learner is supposed to do.

One way to approach a large MDP is to compute an approximation to its optimal state evaluation function: the function which tells us how much reward the learner can expect to achieve when the world is in a particular state. If the approximation is good enough, a shallow search suffices to find a good action from most states. Researchers have tried many different ways to approximate evaluation functions. This thesis aims for a middle ground between algorithms that don't scale well, because they use an impoverished representation for the evaluation function, and algorithms that we can't analyze, because they use too complicated a representation.

152 pages
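To make the ideas in the abstract concrete, here is a minimal sketch (not taken from the thesis) of exact value iteration on a tiny, invented 2-state, 2-action MDP, followed by the one-step-lookahead ("shallow search") policy extracted from the resulting evaluation function. The thesis concerns what to do when the state space is far too large to enumerate like this and V must be approximated instead.

```python
# Sketch: value iteration on a toy MDP. The transition matrices P, rewards R,
# and discount factor gamma below are invented for illustration only.
import numpy as np

# P[a][s, s'] = probability of moving from state s to s' under action a
P = {
    0: np.array([[0.9, 0.1],
                 [0.2, 0.8]]),
    1: np.array([[0.5, 0.5],
                 [0.7, 0.3]]),
}
# R[a][s] = expected immediate reward for taking action a in state s
R = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 2.0])}
gamma = 0.9  # discount factor

V = np.zeros(2)  # the state evaluation function, initialized to zero
for _ in range(1000):
    # Bellman optimality backup: V(s) <- max_a [ R(a,s) + gamma * E[V(s')] ]
    V_new = np.max([R[a] + gamma * P[a] @ V for a in P], axis=0)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

# "Shallow search": the greedy one-step-lookahead policy induced by V
policy = np.argmax([R[a] + gamma * P[a] @ V for a in P], axis=0)
```

With a good approximation to V in hand, only this cheap one-step lookahead is needed at decision time; the hard part, and the subject of the thesis, is computing such an approximation when V cannot be stored state-by-state.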
