Machine Learning Department
School of Computer Science, Carnegie Mellon University


Effective and Efficient Learning at Scale

Adams Wei Yu

August 2019

Ph.D. Thesis


Keywords: Large Scale Optmization, Deep Neural Networks, Machine Reading Comprehension, Natural Language Processing

How to enable efficient and effective machine learning at scale has been a longstanding problem in modern artificial intelligence, which also motivates this thesis research. In particular, we aim at solving the following problems:

  1. How to efficiently train a machine learning model?
  2. How to speed up inference after the model is trained?
  3. How to make the model generalize better?
We approach those problems from two perspectives: models and algorithms.
On one hand, we design novel models that are intrinsically fast to train and/or test. On the other, we develop new algorithms with rapid convergence guarantee. Not surprisingly, the above three problem are not mutually exclusive and thus solving one of them might also benefit others. For example, 1) a new model that can enable parallel computation helps accelerate both training and inference; 2) a fast algorithm can save time for hyper-parameter tuning and/or make it affordable for training with more data, which in return boosts the generalization performance.

This thesis consists of two parts. The first part presents new machine learning models with a focus on sequential data such as natural language processing and question answering. Firstly, we propose a model, LSTM-Jump, that can skip unimportant information in text, mimicking the skimming behavior of human reading. Trained with an efficient reinforcement learning algorithm, this model can be several times faster than a vanilla LSTM in inference time. Then we introduce a text encoding model that totally discards recurrent networks, which thus fully supports parallel training and inference. Based on this technique, a new question-answering model, QANet, is proposed. Combined with data augmentation approach via backtranslation, this model stays at the No.1 place in the competitive Stanford Question and Answer Dataset (SQuAD) from March to Aug 2018, while being times faster than the prevalent models. It was also the deepest neural network model for NLP when invented.

The second part proposes large scale learning algorithms with provable convergence guarantee. To enable fast training of neural networks, we propose a general gradient normalization algorithm for efficient deep networks training. This method can not only alleviate the gradient vanishing problem, but also regularize the model to achieve better generalization. When the amount of training data becomes huge, we need appeal to distributed computation. We are particularly interested in the ubiquitous problem, empirical risk minimization, and propose a few algorithms to attack the challenges posed by the distributed environment. We first show that a randomized primal-dual coordinate method DSPDC can achieve a linear convergence rate in the single-machine setting. Then by further leveraging the structural information of the problem, we propose an asynchronous algorithm DSCOVR, which enjoys a linear convergence rate as well on the distributed parameter server environment, by executing periodical variance reduction.

187 pages

Thesis Committee:
Jaime Carbonell (Co-Chair)
Alexander Smola (Co-Chair) (Amazon)
Ruslan Salakhutdinov
Quoc Le (Google)
Christopher Manning (Stanford University)

Roni Rosenfeld, Head, Machine Learning Department
Martial Hebert, Dean, School of Computer Science

SCS Technical Report Collection
School of Computer Science