Machine Learning Department
School of Computer Science, Carnegie Mellon University
GraphLab: A Distributed Abstraction
Machine Learning methods have found increasing real-world applicability and relevance, with applications in a broad range of fields including robotics, data mining, physics, and biology. However, with the growth of the World Wide Web and improvements in data collection technology, real-world datasets have been growing rapidly in size and complexity, necessitating comparable scaling of Machine Learning algorithms.
Designing and implementing efficient parallel Machine Learning algorithms is challenging, however. Existing high-level parallel abstractions such as MapReduce are insufficiently expressive, while low-level tools such as MPI are difficult to use and leave Machine Learning experts repeatedly solving the same design challenges.
In this thesis, we trace the development of a framework called GraphLab, which aims to provide an expressive and efficient high-level abstraction that satisfies the needs of a broad range of Machine Learning algorithms. We discuss the initial GraphLab design, including details of shared-memory and distributed-memory implementations. Next, we discuss the scaling limitations of GraphLab on real-world power-law graphs and how they informed the design of PowerGraph: by placing restrictions on the abstraction, we are able to improve scalability, demonstrating state-of-the-art performance on a variety of benchmarks. Finally, we present the WarpGraph abstraction, which improves the usability of PowerGraph by combining features of GraphLab and PowerGraph to achieve an abstraction that is easy to use, scalable, and fast.
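To make the idea of a vertex-centric abstraction concrete, the following is a minimal, hypothetical sketch (not GraphLab's actual API; all names are illustrative) of the style of computation such abstractions express: each vertex repeatedly runs an update function that reads only its local neighborhood, shown here for a PageRank-like recurrence.

```python
def pagerank_update(v, in_nbrs, out_degree, ranks, damping=0.85):
    """Recompute one vertex's rank from its in-neighbors' current ranks."""
    # Gather contributions along in-edges, then apply the new value.
    total = sum(ranks[u] / out_degree[u] for u in in_nbrs)
    return (1 - damping) + damping * total

# Toy directed graph: an edge (u, v) means u links to v.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
n = 3
in_nbrs = {v: [u for u, w in edges if w == v] for v in range(n)}
out_degree = {u: sum(1 for a, _ in edges if a == u) for u in range(n)}

# Repeatedly apply the update function at every vertex until
# (approximately) converged; a real system schedules these updates
# in parallel across machines.
ranks = {v: 1.0 for v in range(n)}
for _ in range(50):
    ranks = {v: pagerank_update(v, in_nbrs[v], out_degree, ranks)
             for v in range(n)}
```

The point of the abstraction is that the user writes only the per-vertex update; scheduling, parallelism, and distribution are left to the runtime.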
We demonstrate in this thesis that, by designing a domain-specific abstraction for Machine Learning, we are able to provide a system that is both easy to use and delivers exceptional scaling and performance on real-world datasets.
SCS Technical Report Collection