CMU-ISR-09-112
Institute for Software Research
School of Computer Science, Carnegie Mellon University



CMU-ISR-09-112

Efficient Survivability for Highly Replicated Services

Michael G. Merideth

April 2009

PhD. Thesis
Software Engineering

CMU-ISR-09-112.pdf


Keywords: Distributed systems, Byzantine fault tolerance, probabilistic quorum systems


Networked services like distributed file systems can suffer a wide range of problems such as machine crashes and security intrusions that may cause downtime, incorrect behavior, and other undesirable issues. Byzantine-fault-tolerant protocols that replicate the services represent catchall solutions for such problems by providing survivability–the ability of a service to operate correctly despite instances of such security or reliability "faults." These protocols use Byzantine quorum systems as building blocks in order to ensure that services operate correctly and are available to clients even in the presence of faults. However, there are a variety of Byzantine quorum systems, which differ in three primary measures: fault tolerance–a measure of the number of faults that can be tolerated; load–a measure of efficiency; and availability–a measure of how well the service remains available for use. Unfortunately, no quorum system excels in all categories.

In this dissertation, we show that, compared with previous quorum systems, probabilistic quorum systems can provide better fault tolerance and load albeit at the cost of admitting a bounded probability of failing to mask faults. We present a probabilistic opaque quorum system that can tolerate up to 37% more faults than can traditional opaque quorum systems. Then, we present a technique called write markers for probabilistic masking and opaque quorum systems that can tolerate 50% and 48% more faults, respectively, compared with traditional quorum systems. Moreover, these probabilistic quorum systems with write markers have asymptotically better load than the bounds proven for previous masking and opaque quorum systems, achieving an optimal O(1/√n) in certain cases, and presenting the possibility of smaller quorums that yield efficient access for clients.

Additional contributions of this dissertation include the following. First, we present a framework for comparing Byzantine-fault-tolerant protocols on the basis of how they use quorum systems. This framework highlights the similarity between protocols that are explicitly quorum based, such as Q/U, and others like BFT and Zyzzyva that are not. Second, we introduce a way of improving the availability of probabilistic quorum systems by providing some leeway in how quorums are chosen. Instead of assuming that all quorums are chosen uniformly at random, we allow faulty clients to choose quorums by any strategy from access sets that are chosen uniformly at random. Third, we present the first analysis of a probabilistic quorum system that accounts for the behavior of Byzantine-faulty clients. We anticipate that a faulty client may choose quorums with the goal of maximizing the error probability, and show the effects that this may have. Fourth, we present a framework based on the McDiarmid inequality in order to prove that probabilistic quorum systems in general can meet specific load and fault tolerance targets. This framework allows us to prove that probabilistic masking quorum systems can tolerate up to 13% more faults than shown using Chernoff bounds previously. Fifth, we present a protocol by which probabilistic quorum systems can tolerate Byzantine-faulty clients. Such clients are otherwise problematic in that they may seek to cause the system to fail to mask faults. Finally, we analyze the cost of changing quorums routinely, as may be required by probabilistic quorum systems. This analysis is in the context of wide area networks with the Q/U protocol, which can require state to be transferred between servers as a result of quorum changes.

103 pages


Return to: SCS Technical Report Collection
School of Computer Science homepage

This page maintained by reports@cs.cmu.edu