@device(postscript)
@libraryfile(Mathematics10)
@libraryfile(Accents)
@style(fontfamily=timesroman,fontscale=11)
@pagefooting(immediate, left "@c", center "@c", right "@c")
@heading(A Survey of Rollback-Recovery Protocols in Message-Passing Systems)
@heading(CMU-CS-96-181)
@center(@b(Elmootazbellah N. Elnozahy, David B. Johnson, Y.M. Wang@foot))
@center(September 1996@foot< A version of this paper has been submitted for publication in @i[ACM Surveys].>)
@center(FTP: CMU-CS-96-181.ps)
@blankspace(1)
@begin(text)
The problem of rollback-recovery in message-passing systems has been studied extensively. In this survey, we review rollback-recovery techniques that do not require special language constructs, and classify them into two primary categories. @i(Checkpoint-based rollback-recovery) relies solely on checkpointed states to restore the system state. Depending on when checkpoints are taken, existing approaches can be divided into uncoordinated checkpointing, coordinated checkpointing, and communication-induced checkpointing. @i(Log-based rollback-recovery) combines checkpointing with message logging. The logs enable the recovery protocol to reconstruct states that were not checkpointed. The three log-based approaches are pessimistic logging, optimistic logging, and causal logging. We identify a set of desirable properties of rollback-recovery protocols, and compare the different approaches with respect to these properties. Log-based rollback-recovery protocols generally rely on the assumption of piecewise determinism and incur additional overhead in exchange for faster output commits and more localized recovery. For each approach, we present research issues and review existing solutions that address them. We also discuss implementation issues in checkpointing and message logging.
@blankspace(2line)
@begin(transparent,size=10)
@b(Keywords:@ )@c
@end(transparent)
@blankspace(1line)
@end(text)
@flushright(@b[(50 pages)])