Machine Learning Department
School of Computer Science, Carnegie Mellon University
Gradient Descent for Non-convex Problems in
Modern Machine Learning
Simon Shaolei Du
Machine learning, nonconvex optimization, gradient descent, neural network,
saddle point, matrix factorization
Machine learning has become an important tool set for artificial intelligence and data science across many fields. A modern machine learning method can be often
reduced to a mathematical optimization problem. Among algorithms to solve the
optimization problem, gradient descent and its variants like stochastic gradient descent and momentum methods are the most popular ones. The optimization problem induced from classical machine learning methods is often a convex and smooth one, for which gradient descent is guaranteed to solve it efficiently. On the other hand, modern machine learning methods, like deep neural networks, often require solving a non-smooth and non-convex problem. Theoretically, non-convex mathematical optimization problems cannot be solved efficiently. However, in practice, gradient descent and its variants can find a global optimum efficiently. These competing facts show that often there are special structures in the optimization problems that can make gradient descent succeed in practice.
This thesis presents technical contributions to fill the gap between theory and
practice on the gradient descent algorithm. The outline of the thesis is as follows.
While our focus is theoretical, whenever possible, we also present experiments that illustrate our theoretical findings.
- In the first part, we consider applying gradient descent to minimize the empirical risk of a neural network. We will show if a multi-layer neural network
with smooth activation function is sufficiently wide, then randomly initialized
gradient descent can efficiently find a global minimum of the empirical risk.
We will also show the same result for the two-layer neural network with Rectified Linear Unit (ReLU) activation function. It is quite surprising that although
the objective function of neural networks is non-convex, gradient descent can
still find their global minimum. Lastly, we will study structural property of the trajectory induced by the gradient descent algorithm.
- In the second part, we assume the label is generated from a two-layer teacher convolutional neural network and we consider using gradient descent to recover
the teacher convolutional neural network. We will show that if the input distribution is Gaussian, then gradient descent can recovered a one-hidden-layer
convolutional neural network in which both the convolutional weights and the
output wights are unknown parameters to be recovered. We will also show that
the Gaussian input assumption can be relaxed to a general structural assumption
if we only need to recover a single convolutional filter.
- In the third part, we study conditions under which gradient descent fails. We will show gradient descent can take exponential time to optimize a smooth
function with the strict saddle point property for which the noise-injected gradient can optimize in polynomial time.
Barnabás Póczos (Co-Chair)
Aarti Singh (Co-Chair)
Michael I. Jordan (University of California at Berkeley)
Roni Rosenfeld, Head, Machine Learning Department
Tom M. Mitchell, Interim Dean, School of Computer Science