CMU-CS-20-140
Computer Science Department
School of Computer Science, Carnegie Mellon University




Checkpoint-Free Fault Tolerance for
Recommendation System Training via Erasure Coding

Kaige Liu

M.S. Thesis

December 2020

CMU-CS-20-140.pdf


Keywords: Recommendation systems, erasure coding, machine learning, fault tolerance

Deep-learning-based recommendation models (DLRMs) are widely deployed to serve personalized content to users. Because of their embedding tables, DLRMs are large, and they are trained by distributing the model across the memory of tens or hundreds of servers. Checkpointing is the predominant approach to fault tolerance in these systems, but it incurs significant training-time overhead both during normal operation and when recovering from failures. Because these overheads grow with DLRM size, checkpointing is expected to become even more costly for future DLRMs.

In this thesis, we present ECRM, a DLRM training system that achieves efficient fault tolerance using erasure coding. ECRM chooses which DLRM parameters to encode and where to place them in a training cluster, correctly and efficiently updates parities during normal operation, and recovers from failures while maintaining consistency of the recovered parameters. Its design enables training to proceed without pauses both during normal operation and during recovery. We implement ECRM atop XDL, an open-source, industrial-scale DLRM training system. Compared to checkpointing, ECRM reduces training-time overhead by up to 88%, recovers from failures significantly faster, and allows training to proceed during recovery. These results show the promise of erasure coding for providing efficient fault tolerance when training current and future DLRMs.
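
To illustrate the general idea of keeping a parity up to date and recovering from it, the following is a minimal sketch, not ECRM's implementation. The abstract does not specify the code construction, so the sketch assumes a simple sum parity over k parameter shards held in one process; the names encode, apply_update, and recover are hypothetical and do not come from ECRM or XDL.

```python
from typing import List

Vector = List[float]

def encode(shards: List[Vector]) -> Vector:
    """Assumed code: the parity is the elementwise sum of all shards."""
    return [sum(column) for column in zip(*shards)]

def apply_update(shards: List[Vector], parity: Vector,
                 idx: int, delta: Vector) -> None:
    """Add delta to shard idx and add the same delta to the parity,
    so the parity stays equal to the sum without re-encoding."""
    for j, d in enumerate(delta):
        shards[idx][j] += d
        parity[j] += d

def recover(shards: List[Vector], parity: Vector, lost: int) -> Vector:
    """Rebuild shard `lost` as the parity minus the surviving shards."""
    survivors = [s for i, s in enumerate(shards) if i != lost]
    return [p - sum(column) for p, column in zip(parity, zip(*survivors))]

# Three toy shards standing in for partitions of the model parameters.
shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
parity = encode(shards)

# One training-style update to shard 1, mirrored into the parity.
apply_update(shards, parity, idx=1, delta=[0.5, -1.0])

# Pretend shard 1 was lost and rebuild it from the parity and the others.
rebuilt = recover(shards, parity, lost=1)
```

In this toy setting the parity is updated with only the delta, never by re-reading all shards. Where parities are placed in a cluster and how recovery stays consistent while training continues are the questions the thesis addresses, and this sketch does not model them.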

50 pages

Thesis Committee:
Rashmi K. Vinayak (Chair)
Phillip Gibbons

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

