This book covers the most essential techniques for designing and building dependable distributed systems. Instead of surveying a broad range of research works for each dependability strategy, the book includes only a selected few (usually the most seminal works, the most practical approaches, or the first publication of each approach) and explains them in depth, usually with a comprehensive set of examples. The goal is to dissect each technique thoroughly so that readers who are not familiar with dependable distributed computing can actually grasp the technique after studying the book.
The book contains eight chapters. The first chapter introduces the basic concepts and terminologies of dependable distributed computing, and also provides an overview of the primary means for achieving dependability. The second chapter describes in detail the checkpointing and logging mechanisms, which are the most commonly used means to achieve a limited degree of fault tolerance. Such mechanisms also serve as the foundation for more sophisticated dependability solutions. Chapter three covers the works on recovery-oriented computing, which focus on practical techniques that reduce the fault detection and recovery times for Internet-based applications. Chapter four outlines the replication techniques for data and service fault tolerance. This chapter also pays particular attention to optimistic replication and the CAP theorem. Chapter five explains a few seminal works on group communication systems. Chapter six introduces the distributed consensus problem and covers a number of Paxos family algorithms in depth. Chapter seven introduces the Byzantine generals problem and its latest solutions, including the seminal Practical Byzantine Fault Tolerance (PBFT) algorithm and a number of its derivatives. The final chapter covers the latest research results on application-aware Byzantine fault tolerance, which is an important step towards the practical use of Byzantine fault tolerance techniques.
About the Author
Wenbing Zhao received his PhD in electrical and computer engineering from the University of California, Santa Barbara, in 2002. Currently, he is an Associate Professor in the Department of Electrical and Computer Engineering at Cleveland State University. Dr. Zhao has more than 80 academic publications to his credit, and three of his recent research papers in the area of dependable distributed computing have won best paper awards. Dr. Zhao also has a U.S. patent on consistent time service for fault-tolerant distributed systems.
Table of Contents
List of Figures
List of Tables
Acknowledgements
Preface
References
1 Introduction to Dependable Distributed Computing
1.1 Basic Concepts and Terminologies
1.2 Means to Achieve Dependability
References
2 Logging and Checkpointing
2.1 System Model
2.2 Checkpoint-Based Protocols
2.3 Log Based Protocols
References
3 Recovery-Oriented Computing
3.1 System Model
3.2 Fault Detection and Localization
3.3 Microreboot
3.4 Overcoming Operator Errors
References
4 Data and Service Replication
4.1 Service Replication
4.2 Data Replication
4.3 Optimistic Replication
4.4 CAP Theorem
References
5 Group Communication Systems
5.1 System Model
5.2 Sequencer Based Group Communication System
5.3 Sender Based Group Communication System
5.4 Vector Clock Based Group Communication System
References
6 Consensus and the Paxos Algorithms
6.1 The Consensus Problem
6.2 The Paxos Algorithm
6.3 Multi-Paxos
6.4 Dynamic Paxos
6.5 Fast Paxos
6.6 Implementations of the Paxos Family Algorithms
References
7 Byzantine Fault Tolerance
7.1 The Byzantine Generals Problem
7.2 Practical Byzantine Fault Tolerance
7.3 Fast Byzantine Agreement
7.4 Speculative Byzantine Fault Tolerance
References