Design for Reliability: Information and Computer-Based Systems

Techniques for developing reliable, robust networked systems that meet customers' expectations

Today's customer expects valid service requests or transactions to be reliably executed with acceptable quality. Design for Reliability brings together the analysis, design, and system implementation principles necessary to build highly available, reliable systems. It fills the knowledge gap in this area, explaining techniques for framing verifiable availability/reliability requirements and methodically designing, analyzing, and testing systems to meet those requirements.

This book takes a very pragmatic approach of framing reliability and robustness as concrete, functional attributes of a system, rather than abstract, non-functional notions. It is divided into three sections:

  • Reliability Basics frames the elements of a typical system; defines eight broad categories of errors that can produce critical system failures; and explains the failure recovery process

  • Reliability Concepts covers concepts for failure containment and recovery; reviews techniques that complement failure containment and redundancy to improve system reliability; outlines error detection and failure recovery mechanisms; provides design basics for reliable procedures; and offers information to help enterprises deploy robust operational policies to maximize highly available system operation

  • Design for Reliability reviews reliability requirements and analysis techniques; demonstrates downtime budgeting and modeling to assess the feasibility of meeting a system's service availability requirement; covers strategy and planning of robustness and stability testing; shows how field outage events can be analyzed to drive reliability improvements; and explains how to construct a reliability road map to methodically drive a system to achieve the ultimate service availability on a desired schedule

A case study of design for reliability diligence of a networked system is then presented to illustrate appropriate considerations for developing a high-availability, high-reliability system. System architects, engineers, developers, testers, and project and product managers will rely on Design for Reliability to understand how all the key elements fit into the overall system design lifecycle in order to produce robust systems that achieve customers' expectations for service reliability and service availability. Quality professionals for products with high-availability expectations will also find this book useful in understanding what it takes to design and deploy robust systems.

1 Reliability and Availability Concepts.

1.1 Reliability and Availability.

1.2 Faults, Errors and Failures.

1.3 Error Severity.

1.4 Failure Recovery.

1.5 Highly Available Systems.

1.6 Quantifying Availability.

1.7 Outage Attributability.

1.8 Hardware Reliability.

1.9 Software Reliability.

1.10 Problems.

1.11 For Further Study.

2 System Basics.

2.1 Hardware and Software.

2.2 External Entities.

2.3 System Management.

2.4 System Outages.

2.5 Service Quality.

2.6 Total Cost of Ownership.

2.7 Problems.

3 What Can Go Wrong.

3.1 Failures in the Real World.

3.2 Eight-Ingredient Framework.

3.3 Mapping Ingredients to Error Categories.

3.4 Applying Error Categories.

3.5 Error Category: Field Replaceable Unit (FRU) Hardware.

3.6 Error Category: Programming Errors.

3.7 Error Category: Data Error.

3.8 Error Category: Redundancy.

3.9 Error Category: System Power.

3.10 Error Category: Network.

3.11 Error Category: Application Protocol.

3.12 Error Category: Procedures.

3.13 Summary.

3.14 Problems.

3.15 For Further Study.


4 Failure Containment and Redundancy.

4.1 Units of Design.

4.2 Failure Recovery Groups.

4.3 Redundancy.

4.4 Summary.

4.5 Problems.

4.6 For Further Study.

5 Robust Design Principles.

5.1 Robust Design Principles.

5.2 Robust Protocols.

5.3 Robust Concurrency Controls.

5.4 Overload Control.

5.5 Process, Resource and Throughput Monitoring.

5.6 Data Auditing.

5.7 Fault Correlation.

5.8 Failed Error Detection, Isolation or Recovery.

5.9 Geographic Redundancy.

5.10 Security, Availability and System Robustness.

5.11 Procedural Considerations.

5.12 Problems.

5.13 For Further Study.

6 Error Detection.

6.1 Detecting Field Replaceable Unit (FRU) Hardware Faults.

6.2 Detecting Programming and Data Faults.

6.3 Detecting Redundancy Failures.

6.4 Detecting Power Failures.

6.5 Detecting Networking Failures.

6.6 Detecting Application Protocol Failures.

6.7 Detecting Procedural Failures.

6.8 Problems.

6.9 For Further Study.

7 Analyzing and Modeling Reliability and Robustness.

7.1 Reliability Block Diagrams.

7.2 Qualitative Model of Redundancy.

7.3 Failure Mode and Effects Analysis.

7.4 Availability Modeling.

7.5 Planned Downtime.

7.6 Problems.

7.7 For Further Study.


8 Reliability Requirements.

8.1 Background.

8.2 Defining Service Outages.

8.3 Service Availability Requirements.

8.4 Detailed Service Availability Requirements.

8.5 Service Reliability Requirements.

8.6 Triangulating Reliability Requirements.

8.7 Problems.

9 Reliability Analysis.

9.1 Step 1: Enumerate Recoverable Modules.

9.2 Step 2: Construct Reliability Block Diagrams.

9.3 Step 3: Characterize Impact of Recovery.

9.4 Step 4: Characterize Impact of Procedures.

9.5 Step 5: Audit Adequacy of Automatic Failure Detection and Recovery.

9.6 Step 6: Consider Failures of Robustness Mechanisms.

9.7 Step 7: Prioritizing Gaps.

9.8 Reliability of Sourced Modules and Components.

9.9 Problems.

10 Reliability Budgeting and Modeling.

10.1 Downtime Categories.

10.2 Service Downtime Budget.

10.3 Availability Modeling.

10.4 Update Downtime Budget.

10.5 Robustness Latency Budgets.

10.6 Problems.

11 Robustness and Stability Testing.

11.1 Robustness Testing.

11.2 Context of Robustness Testing.

11.3 Factoring Robustness Testing.

11.4 Robustness Testing in the Development Process.

11.5 Robustness Testing Techniques.

11.6 Selecting Robustness Test Cases.

11.7 Analyzing Robustness Test Results.

11.8 Stability Testing.

11.9 Release Criteria.

11.10 Problems.

12 Closing the Loop.

12.1 Analyzing Field Outage Events.

12.2 Reliability Roadmapping.

12.3 Problems.

13 Design for Reliability Case Study.

13.1 System Context.

13.2 System Reliability Requirements.

13.3 Reliability Analysis.

13.4 Downtime Budgeting.

13.5 Availability Modeling.

13.6 Reliability Roadmap.

13.7 Robustness Testing.

13.8 Stability Testing.

13.9 Reliability Review.

13.10 Reliability Report.

13.11 Release Criteria.

13.12 Field Data Analysis.

14 Conclusion.

14.1 Overview of Design for Reliability.

14.2 Concluding Remarks.

14.3 Problems.

15 Appendix: Assessing Design for Reliability Diligence.

15.1 Assessment Methodology.

15.2 Reliability Requirements.

15.3 Reliability Analysis.

15.4 Reliability Modeling and Budgeting.

15.5 Robustness Testing.

15.6 Stability Testing.

15.7 Release Criteria.

15.8 Field Availability.

15.9 Reliability Roadmap.

15.10 Hardware Reliability.



"Thus, I highly recommend this book to undergraduate students and junior researchers entering the reliability studies field. Though experts may not find the book to be very interesting, they will likely find it useful as a basis for lecturing, and as a good source of insightful, fundamental ideas." (Computing Reviews, 16 May 2011)

"The book takes a very pragmatic approach of framing reliability and robustness as a functional aspect of a system so that architects, designers, developers and testers can address it as a concrete, functional attribute of a system, rather than an abstract, non-functional notion." (Forums Digital Media Net, 16 March 2011)

