CMU-CS-99-148
Computer Science Department
 School of Computer Science, Carnegie Mellon University
 
    
     
 
A Survey of Rollback-Recovery Protocols in Message-Passing Systems
 
Mootaz Elnozahy*, Lorenzo Alvisi**, Yi-Min Wang***, David B. Johnson****
June 1999  
This report is a revision of CMU-CS-96-181. 
CMU-CS-99-148.ps
CMU-CS-99-148.pdf
 Keywords: Distributed systems, fault tolerance, high availability,
checkpointing, message logging, rollback, recovery
 This survey covers rollback-recovery techniques that do not require
special language constructs.  In the first part of the survey, we
classify rollback-recovery protocols into checkpoint-based and
log-based.  Checkpoint-based protocols rely solely on checkpointing
for system state restoration.  Checkpointing can be coordinated,
uncoordinated, or communication-induced.  Log-based protocols combine
checkpointing with logging of nondeterministic events, encoded in
tuples called determinants.  Depending on how determinants are logged,
log-based protocols can be pessimistic, optimistic, or causal.
Throughout the survey, we highlight the research issues that are at
the core of rollback recovery and present the solutions that currently
address them.  We also compare the performance of different
rollback-recovery protocols with respect to a series of desirable
properties and discuss the issues that arise in the practical
implementations of these protocols.
 
44 pages 
*IBM Austin Research Lab, mootaz@us.ibm.com
 **Department of Computer Sciences, University of Texas at Austin, lorenzo@cs.utexas.edu
 ***Microsoft Research, ymwang@microsoft.com
 ****Computer Science Department, Carnegie Mellon University, dbj@cs.cmu.edu