CMU-CS-99-132Computer Science Department School of Computer Science, Carnegie Mellon University
CMU-CS-99-132
Leemon C. Baird, III May 1999 Ph.D. Thesis
CMU-CS-99-132.ps
Keywords: Reinforcement learning, machine learning, gradient
descent, convergence, backpropagation, backprop, function approximators,
neural networks, Q-learning, TD(lambda), temporal difference learning,
value function approximation, evaluation functions, dynamic programming,
advantage learning, residual algorithms, VAPS
A series of new families of algorithms are derived based on stochastic gradient descent. Since they are derived from first principles with function approximators in mind, they have guaranteed convergence to local minima, even on general nonlinear function approximators. For both residual algorithms and VAPS algorithms, it is possible to take any of the standard algorithms in the field, such as Q-learning or SARSA or value iteration, and rederive a new form of it with provable convergence. In addition to better convergence properties, it is shown how gradient descent allows an inelegant, inconvenient algorithm like Advantage updating to be converted into a much simpler and more easily analyzed algorithm like Advantage learning. In this case that is very useful, since Advantages can be learned thousands of times faster than Q values for continuous-time problems. In this case, there are significant practical benefits of using gradient-descent-based techniques. In addition to improving both the theory and practice of existing types of algorithms, the gradient-descent approach makes it possible to create entirely new classes of reinforcement-learning algorithms. VAPS algorithms can be derived that ignore values altogether, and simply learn good policies directly. One hallmark of gradient descent is the ease with which different algorithms can be combined, and this is a prime example. A single VAPS algorithm can both learn to make its value function satisfy the Bellman equation, and also learn to improve the implied policy directly. Two entirely different approaches to reinforcement learning can be combined into a single algorithm, with a single function approximator with a single output. Simulations are presented that show that for certain problems, there are significant advantages for Advantage learning over Q-learning, residual algorithms over direct, and combinations of values and policy search over either alone. It appears that gradient descent is a powerful unifying concept for the field of reinforcement learning, with substantial theoretical and practical value. 80 pages
| |

Return to:
SCS Technical Report Collection This page maintained by reports@cs.cmu.edu |