Machine Learning Department
School of Computer Science, Carnegie Mellon University
GraphLab: A Distributed Abstraction
Machine Learning methods have found increasing real-world applicability and relevance, with applications in a broad range of fields including robotics, data mining, physics, and biology. However, with the growth of the World Wide Web and improvements in data collection technology, real-world datasets have been growing rapidly in size and complexity, necessitating comparable scaling of Machine Learning algorithms.
Designing and implementing efficient parallel Machine Learning algorithms is challenging, however. Existing high-level parallel abstractions such as MapReduce are insufficiently expressive, while low-level tools such as MPI are difficult to use and leave Machine Learning experts repeatedly solving the same design challenges.
In this thesis, we trace the development of a framework called GraphLab, which aims to provide an expressive and efficient high-level abstraction that satisfies the needs of a broad range of Machine Learning algorithms. We discuss the initial GraphLab design, including details of shared-memory and distributed-memory implementations. Next, we discuss the scaling limitations of GraphLab on real-world power-law graphs and how they informed the design of PowerGraph: by placing restrictions on the abstraction, we are able to improve scalability, demonstrating state-of-the-art performance on a variety of benchmarks. Finally, we present the WarpGraph abstraction, which improves the usability of PowerGraph by combining features of GraphLab and PowerGraph to achieve an abstraction that is easy to use, scalable, and fast.
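To make the idea of a vertex-centric abstraction concrete, the following is a minimal, hypothetical sketch (not GraphLab's actual API; all names are illustrative) of the style of computation such abstractions express: each vertex repeatedly runs an update function that reads only its local neighborhood, shown here for a PageRank-like recurrence.

```python
def pagerank_update(v, in_nbrs, out_degree, ranks, damping=0.85):
    """Recompute one vertex's rank from its in-neighbors' current ranks."""
    # Gather contributions along in-edges, then apply the new value.
    total = sum(ranks[u] / out_degree[u] for u in in_nbrs)
    return (1 - damping) + damping * total

# Toy directed graph: an edge (u, v) means u links to v.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
n = 3
in_nbrs = {v: [u for u, w in edges if w == v] for v in range(n)}
out_degree = {u: sum(1 for a, _ in edges if a == u) for u in range(n)}

# Repeatedly apply the update function at every vertex until
# (approximately) converged; a real system schedules these updates
# in parallel across machines.
ranks = {v: 1.0 for v in range(n)}
for _ in range(50):
    ranks = {v: pagerank_update(v, in_nbrs[v], out_degree, ranks)
             for v in range(n)}
```

The point of the abstraction is that the user writes only the per-vertex update; scheduling, parallelism, and distribution are left to the runtime.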
We demonstrate in this thesis that, by designing a domain-specific abstraction for Machine Learning, we are able to provide a system that is both easy to use and delivers exceptional scaling and performance on real-world datasets.
SCS Technical Report Collection