Institute for Software Research
School of Computer Science, Carnegie Mellon University
Efficient Survivability for Highly Replicated Services
Michael G. Merideth
In this dissertation, we show that, compared with previous quorum systems, probabilistic quorum systems can provide better fault tolerance and load albeit at the cost of admitting a bounded probability of failing to mask faults. We present a probabilistic opaque quorum system that can tolerate up to 37% more faults than can traditional opaque quorum systems. Then, we present a technique called write markers for probabilistic masking and opaque quorum systems that can tolerate 50% and 48% more faults, respectively, compared with traditional quorum systems. Moreover, these probabilistic quorum systems with write markers have asymptotically better load than the bounds proven for previous masking and opaque quorum systems, achieving an optimal O(1/√n) in certain cases, and presenting the possibility of smaller quorums that yield efficient access for clients.
Additional contributions of this dissertation include the following. First, we present a framework for comparing Byzantine-fault-tolerant protocols on the basis of how they use quorum systems. This framework highlights the similarity between protocols that are explicitly quorum based, such as Q/U, and others like BFT and Zyzzyva that are not. Second, we introduce a way of improving the availability of probabilistic quorum systems by providing some leeway in how quorums are chosen. Instead of assuming that all quorums are chosen uniformly at random, we allow faulty clients to choose quorums by any strategy from access sets that are chosen uniformly at random. Third, we present the first analysis of a probabilistic quorum system that accounts for the behavior of Byzantine-faulty clients. We anticipate that a faulty client may choose quorums with the goal of maximizing the error probability, and show the effects that this may have. Fourth, we present a framework based on the McDiarmid inequality in order to prove that probabilistic quorum systems in general can meet specific load and fault tolerance targets. This framework allows us to prove that probabilistic masking quorum systems can tolerate up to 13% more faults than shown using Chernoff bounds previously. Fifth, we present a protocol by which probabilistic quorum systems can tolerate Byzantine-faulty clients. Such clients are otherwise problematic in that they may seek to cause the system to fail to mask faults. Finally, we analyze the cost of changing quorums routinely, as may be required by probabilistic quorum systems. This analysis is in the context of wide area networks with the Q/U protocol, which can require state to be transferred between servers as a result of quorum changes.