CMU-CS-99-148
Computer Science Department
School of Computer Science, Carnegie Mellon University



A Survey of Rollback-Recovery Protocols
in Message-Passing Systems

Mootaz Elnozahy*, Lorenzo Alvisi**, Yi-Min Wang***, David B. Johnson****

June 1999

This report is a revision of CMU-CS-96-181.



Keywords: Distributed systems, fault tolerance, high availability, checkpointing, message logging, rollback, recovery


This survey covers rollback-recovery techniques that do not require special language constructs. In the first part of the survey, we classify rollback-recovery protocols into two categories: checkpoint-based and log-based. Checkpoint-based protocols rely solely on checkpointing to restore system state. Checkpointing can be coordinated, uncoordinated, or communication-induced. Log-based protocols combine checkpointing with logging of nondeterministic events, which are encoded in tuples called determinants. Depending on how determinants are logged, log-based protocols can be pessimistic, optimistic, or causal. Throughout the survey, we highlight the research issues at the core of rollback recovery and present the solutions that currently address them. We also compare the performance of different rollback-recovery protocols with respect to a series of desirable properties and discuss the issues that arise in practical implementations of these protocols.
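
The idea of combining a checkpoint with logged determinants can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the report: the `Determinant` field layout, the `Process` class, the in-memory list standing in for stable storage, and the choice to log each determinant before the message is applied (roughly the pessimistic style). The state update is deliberately order-sensitive so that replaying messages in the logged order matters.

```python
from collections import namedtuple

# Hypothetical determinant layout: identifies a delivered message and its delivery order.
Determinant = namedtuple("Determinant", "sender send_seq receiver recv_seq")


class Process:
    def __init__(self, name):
        self.name = name
        self.state = 0            # application state (illustrative)
        self.recv_seq = 0         # number of messages delivered so far
        self.checkpoint = (0, 0)  # saved (state, recv_seq)
        self.log = []             # stands in for stable storage

    def take_checkpoint(self):
        # Save the current state; determinants logged earlier are no longer needed.
        self.checkpoint = (self.state, self.recv_seq)
        self.log.clear()

    def deliver(self, sender, send_seq, payload):
        # Log the determinant (and payload) before applying the message.
        self.recv_seq += 1
        det = Determinant(sender, send_seq, self.name, self.recv_seq)
        self.log.append((det, payload))
        self._apply(payload)

    def _apply(self, payload):
        # Order-sensitive update: a different delivery order gives a different state.
        self.state = self.state * 2 + payload

    def recover(self):
        # Restore the checkpoint, then replay logged messages in delivery order.
        self.state, self.recv_seq = self.checkpoint
        for det, payload in sorted(self.log, key=lambda e: e[0].recv_seq):
            self.recv_seq = det.recv_seq
            self._apply(payload)
```

In this sketch, a process that checkpoints, delivers further messages, and then loses its volatile state can call `recover()` to reach the same state it had before the failure, because the log fixes the order in which the messages are replayed.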

44 pages

*IBM Austin Research Lab, mootaz@us.ibm.com
**Department of Computer Sciences, University of Texas at Austin, lorenzo@cs.utexas.edu
***Microsoft Research, ymwang@microsoft.com
****Computer Science Department, Carnegie Mellon University, dbj@cs.cmu.edu

