Design for Reliability: Information and Computer-Based Systems / Edition 1

by Eric Bauer

ISBN-10: 0470604654

ISBN-13: 9780470604656

Pub. Date: 11/16/2010

Publisher: Wiley

Overview

Today's customer expects valid service requests or transactions to be reliably executed with acceptable quality. Design for Reliability brings together the analysis, design, and system implementation principles necessary to build highly available, reliable systems. It fills the knowledge gap in this area, explaining techniques for framing verifiable availability/reliability requirements and methodically designing, analyzing, and testing systems to meet those requirements.

This book takes a very pragmatic approach of framing reliability and robustness as concrete, functional attributes of a system, rather than abstract, non-functional notions. It is divided into three sections:

Reliability Basics: frames the elements of a typical system; defines eight broad categories of errors that can produce critical system failures; and explains the failure recovery process

Reliability Concepts: covers concepts for failure containment and recovery; reviews techniques that complement failure containment and redundancy to improve system reliability; outlines error detection and failure recovery mechanisms; provides design basics for reliable procedures; and offers information to help enterprises deploy robust operational policies to maximize highly available system operation

Design for Reliability: reviews reliability requirements and analysis techniques; demonstrates downtime budgeting and modeling to assess the feasibility of meeting a system's service availability requirement; covers strategy and planning of robustness and stability testing; shows how field outage events can be analyzed to drive reliability improvements; and explains how to construct a reliability road map to methodically drive a system to achieve the ultimate service availability on a desired schedule

A case study of design-for-reliability diligence on a networked system then illustrates appropriate considerations for developing a high-availability, high-reliability system. System architects, engineers, developers, testers, and project and product managers will rely on Design for Reliability to understand how all the key elements fit into the overall system design lifecycle in order to produce robust systems that meet customers' expectations for service reliability and service availability. Quality professionals for products with high-availability expectations will also find this book useful in understanding what it takes to design and deploy robust systems.

Product Details

ISBN-13: 9780470604656

Publisher: Wiley

Publication date: 11/16/2010

Pages: 325

Product dimensions: 6.40(w) x 9.40(h) x 1.00(d)

Table of Contents

Figures xiii

Tables xv

Preface xvii

Acknowledgments xxi

Part 1 Reliability Basics

1 Reliability and Availability Concepts 3

1.1 Reliability and Availability 3

1.2 Faults, Errors, and Failures 5

1.3 Error Severity 6

1.4 Failure Recovery 7

1.5 Highly Available Systems 9

1.6 Quantifying Availability 12

1.7 Outage Attributability 14

1.8 Hardware Reliability 16

1.9 Software Reliability 22

1.10 Problems 28

1.11 For Further Study 29

2 System Basics 31

2.1 Hardware and Software 31

2.2 External Entities 35

2.3 System Management 37

2.4 System Outages 43

2.5 Service Quality 47

2.6 Total Cost of Ownership 49

2.7 Problems 56

3 What Can Go Wrong 57

3.1 Failures in the Real World 57

3.2 Eight-Ingredient Framework 59

3.3 Mapping Ingredients to Error Categories 63

3.4 Applying Error Categories 66

3.5 Error Category: Field-Replaceable Unit (FRU) Hardware 68

3.6 Error Category: Programming Errors 70

3.7 Error Category: Data Error 71

3.8 Error Category: Redundancy 73

3.9 Error Category: System Power 74

3.10 Error Category: Network 75

3.11 Error Category: Application Protocol 76

3.12 Error Category: Procedures 77

3.13 Summary 79

3.14 Problems 80

3.15 For Further Study 80

Part 2 Reliability Concepts

4 Failure Containment and Redundancy 85

4.1 Units of Design 85

4.2 Failure Recovery Groups 91

4.3 Redundancy 92

4.4 Summary 96

4.5 Problems 97

4.6 For Further Study 97

5 Robust Design Principles 99

5.1 Robust Design Principles 99

5.2 Robust Protocols 101

5.3 Robust Concurrency Controls 103

5.4 Overload Control 103

5.5 Process, Resource, and Throughput Monitoring 108

5.6 Data Auditing 109

5.7 Fault Correlation 110

5.8 Failed Error Detection, Isolation, or Recovery 111

5.9 Geographic Redundancy 112

5.10 Security, Availability, and System Robustness 114

5.11 Procedural Considerations 119

5.12 Problems 130

5.13 For Further Study 130

6 Error Detection 131

6.1 Detecting Field-Replaceable Unit (FRU) Hardware Faults 131

6.2 Detecting Programming and Data Faults 132

6.3 Detecting Redundancy Failures 134

6.4 Detecting Power Failures 139

6.5 Detecting Networking Failures 141

6.6 Detecting Application Protocol Failures 142

6.7 Detecting Procedural Failures 144

6.8 Problems 144

6.9 For Further Study 144

7 Analyzing and Modeling Reliability and Robustness 145

7.1 Reliability Block Diagrams 145

7.2 Qualitative Model of Redundancy 147

7.3 Failure Mode and Effects Analysis 149

7.4 Availability Modeling 151

7.5 Planned Downtime 165

7.6 Problems 168

7.7 For Further Study 168

Part 3 Design for Reliability

8 Reliability Requirements 171

8.1 Background 171

8.2 Defining Service Outages 172

8.3 Service Availability Requirements 175

8.4 Detailed Service Availability Requirements 177

8.5 Service Reliability Requirements 180

8.6 Triangulating Reliability Requirements 181

8.7 Problems 182

9 Reliability Analysis 185

9.1 Step 1: Enumerate Recoverable Modules 186

9.2 Step 2: Construct Reliability Block Diagrams 191

9.3 Step 3: Characterize Impact of Recovery 193

9.4 Step 4: Characterize Impact of Procedures 198

9.5 Step 5: Audit Adequacy of Automatic Failure Detection and Recovery 200

9.6 Step 6: Consider Failures of Robustness Mechanisms 201

9.7 Step 7: Prioritizing Gaps 202

9.8 Reliability of Sourced Modules and Components 202

9.9 Problems 206

10 Reliability Budgeting and Modeling 207

10.1 Downtime Categories 208

10.2 Service Downtime Budget 209

10.3 Availability Modeling 212

10.4 Update Downtime Budget 213

10.5 Robustness Latency Budgets 215

10.6 Problems 218

11 Robustness and Stability Testing 219

11.1 Robustness Testing 219

11.2 Context of Robustness Testing 220

11.3 Factoring Robustness Testing 221

11.4 Robustness Testing in the Development Process 222

11.5 Robustness Testing Techniques 223

11.6 Selecting Robustness Test Cases 232

11.7 Analyzing Robustness Test Results 233

11.8 Stability Testing 234

11.9 Release Criteria 240

11.10 Problems 243

12 Closing the Loop 245

12.1 Analyzing Field Outage Events 245

12.2 Reliability Roadmapping 255

12.3 Problems 260

13 Design for Reliability Case Study 263

13.1 System Context 263

13.2 System Reliability Requirements 268

13.3 Reliability Analysis 270

13.4 Downtime Budgeting 283

13.5 Availability Modeling 284

13.6 Reliability Roadmap 286

13.7 Robustness Testing 287

13.8 Stability Testing 289

13.9 Reliability Review 290

13.10 Reliability Report 291

13.11 Release Criteria 292

13.12 Field Data Analysis 293

14 Conclusion 295

14.1 Overview of Design for Reliability 295

14.2 Concluding Remarks 299

14.3 Problems 300

15 Appendix: Assessing Design for Reliability Diligence 301

15.1 Assessment Methodology 302

15.2 Reliability Requirements 304

15.3 Reliability Analysis 306

15.4 Reliability Modeling and Budgeting 307

15.5 Robustness Testing 308

15.6 Stability Testing 310

15.7 Release Criteria 311

15.8 Field Availability 312

15.9 Reliability Roadmap 313

15.10 Hardware Reliability 313

Abbreviations 315

References 317

Photo Credits 319

About the Author 321

Index 323
