CMU-ML-13-104
Machine Learning Department
School of Computer Science, Carnegie Mellon University




GraphLab: A Distributed Abstraction
for Large Scale Machine Learning

Yucheng Low

June 2013

Ph.D. Thesis





Machine Learning methods have found increasing applicability and relevance to the real world, with applications in a broad range of fields including robotics, data mining, physics, and biology. However, with the growth of the World Wide Web and improvements in data collection technology, real-world datasets have been growing rapidly in size and complexity, necessitating comparable scaling of Machine Learning algorithms.

Designing and implementing efficient parallel Machine Learning algorithms is challenging, however. Existing high-level parallel abstractions like MapReduce are insufficiently expressive, while low-level tools such as MPI are difficult to use and leave Machine Learning experts repeatedly solving the same design challenges.

In this thesis, we trace the development of a framework called GraphLab, which aims to provide an expressive and efficient high-level abstraction to satisfy the needs of a broad range of Machine Learning algorithms. We discuss the initial GraphLab design, including details of shared-memory and distributed-memory implementations. Next, we discuss the scaling limitations of GraphLab on real-world power-law graphs and how they informed the design of PowerGraph. By placing restrictions on the abstraction, we are able to improve scalability, demonstrating state-of-the-art performance on a variety of benchmarks. Finally, we present the WarpGraph abstraction, which improves the usability of PowerGraph by combining features of GraphLab and PowerGraph to achieve an abstraction that is easy to use, scalable, and fast.
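To give a flavor of the kind of restriction PowerGraph places on the abstraction, the sketch below shows a toy, single-process gather-apply-scatter (GAS) style computation, using PageRank as the example. All names are illustrative, not the real GraphLab or PowerGraph API; the key point is that the gather step is a commutative, associative sum, which is what allows the work of a high-degree vertex to be split across machines.

```python
# Toy single-process sketch of a gather-apply-scatter (GAS) style
# vertex program, in the spirit of PowerGraph. Illustrative only;
# not the actual GraphLab/PowerGraph API. Assumes no dangling
# vertices (every vertex has at least one out-edge).

def pagerank_gas(vertices, edges, damping=0.85, iterations=20):
    """vertices: iterable of vertex ids; edges: list of (src, dst)."""
    out_deg = {v: 0 for v in vertices}
    in_nbrs = {v: [] for v in vertices}
    for src, dst in edges:
        out_deg[src] += 1
        in_nbrs[dst].append(src)

    rank = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        new_rank = {}
        for v in vertices:
            # Gather: a commutative, associative sum over in-neighbors.
            # This algebraic structure is what lets a high-degree
            # vertex's gather be partitioned across machines.
            acc = sum(rank[u] / out_deg[u] for u in in_nbrs[v])
            # Apply: update the vertex value from the gathered result.
            new_rank[v] = (1.0 - damping) + damping * acc
            # Scatter would signal neighbors to rerun; omitted in this
            # synchronous toy version.
        rank = new_rank
    return rank
```

On a symmetric graph such as a 3-cycle, every vertex converges to the same rank; the interesting behavior, and the motivation for PowerGraph's restrictions, appears on skewed power-law graphs where a few vertices accumulate most of the edges.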

We demonstrate in this thesis that by designing a domain-specific abstraction for Machine Learning, we are able to provide a system that is both easy to use and delivers exceptional scaling and performance on real-world datasets.

126 pages


SCS Technical Report Collection
School of Computer Science