Blueprints for High Availability / Edition 2


The success of the first edition of Blueprints for High Availability set the stage for the development of a cluster of storage books to meet the demand of the emerging market for the topic. Originally positioned and priced as a professional discount book, this new edition, priced and discounted for the trade market, should benefit from more exposure on brick-and-mortar shelves. Since the first edition, there has been a huge increase in the complexity and design requirements needed to make networks reliable and always available 24x7. Reliability concerns have increased dramatically since 9/11, and the emergence of web services has brought the need for a whole new model of business continuity strategy. With nearly 30% new material and 15% revised, this book gives detailed patterns and rules to follow to meet these needs.

Product Details

  • ISBN-13: 9780471430261
  • Publisher: Wiley
  • Publication date: 9/19/2003
  • Edition description: Subsequent
  • Edition number: 2
  • Pages: 624
  • Sales rank: 1,183,701
  • Product dimensions: 9.25 (w) x 7.50 (h) x 1.29 (d)

Read an Excerpt

Blueprints for High Availability

By Evan Marcus and Hal Stern

John Wiley & Sons

Copyright © 2003

Evan Marcus, Hal Stern
All rights reserved.

ISBN: 0-471-43026-9

Chapter Two

What to Measure

What would you think about, if it were you, sitting there in an oversized suit,
strapped to an intricate and complex network of components, wires, circuits
and engines, all procured by the government, from the lowest bidder?

-John Glenn, on his thoughts before his first spaceflight

We have become obsessed with data. If we can measure it, in theory, we can
control it and improve it. Data collected and analyzed becomes the basis for
resource allocations and prioritization decisions, ranging from whether or not
to buy an additional 10 terabytes (TB) of disk space for marketing, to which
server pairs should be turned into clusters for improved availability. Our goal
for this chapter is to present some common measurements of availability, and
to provide a framework for interpreting those measurements. In Chapter 3,
"The Value of Availability," we'll ascribe pricing and business benefit to this availability.

In this chapter we discuss the following topics:

_ How we measure availability

_ The failure modes, or typical things that can and do go wrong

_ Situations in which measurements may not be valid

Throughout this book we use resiliency in terms of overall system availability.
We see resiliency as a general term similar to high availability, but without
all the baggage that HA carries along with it. High availability once referred to
a fairly specific range of system configurations, often involving two computers
that monitor and protect each other. During the last few years, however, it has
lost much of its original meaning; vendors and users have co-opted the term to
mean whatever they want it to mean.

To us, resiliency and high availability mean that all of a system's failure
modes are known and well defined, including networks and applications.
They mean that the recovery times for all known failures have an upper
bound; we know how long a particular failure will have the system down.
Although there may be certain failures that we cannot cope with very well, we
know what they are and how to recover from them, and we have backup plans
to use if our recoveries don't work. A resilient system is one that can take a hit
to a critical component and recover and come back for more in a known,
bounded, and generally acceptable period of time.

Measuring Availability

When you discuss availability requirements with a user or project leader, he
will invariably tell you that 100 percent availability is required: "Our project is
so important that we can't have any downtime at all." But the tune usually
changes when the project leader finds out how much 100 percent availability
would cost. Then the discussion becomes a matter of money, and more of a
negotiation process.

As you can see in Table 2.1, for many applications, 99 percent uptime is adequate.
If the systems average an hour and a half of downtime per week, that
may be satisfactory. Of course, a lot of that depends on when the hour and a
half occurs. If it falls between 3:00 A.M. and 4:30 A.M. on Sunday, that is going
to be a lot more tolerable on many systems than if it occurs between 10:00 A.M.
and 11:30 A.M. on Thursday, or every weekday at 2:00 P.M. for 15 or 20 minutes.
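The conversion from an availability percentage to allowed downtime can be sketched in a few lines of Python (a minimal helper, using simple calendar arithmetic; the 99 percent and 99.99 percent figures are the chapter's own):

```python
# Translate an availability percentage into allowed downtime per period.
MIN_PER_WEEK = 7 * 24 * 60    # 10,080 minutes
MIN_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes(availability_pct: float, period_minutes: int) -> float:
    """Minutes of allowed downtime for a given availability percentage."""
    return period_minutes * (1 - availability_pct / 100)

# 99 percent uptime allows roughly an hour and a half of downtime per week:
print(round(downtime_minutes(99.0, MIN_PER_WEEK), 1))   # 100.8 minutes, about 1.7 hours
# 99.99 percent allows about 52 minutes per year:
print(round(downtime_minutes(99.99, MIN_PER_YEAR), 1))  # 52.6 minutes
```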

One point of negotiation is the hours during which 100 percent uptime may
be required. If it is only needed for a few hours a day, that goal is quite achievable.
For example, when brokerage houses trade between the hours of 9:30 A.M.
and 4:00 P.M., then during those hours, plus perhaps 3 or 4 hours on either side,
100 percent uptime is required. A newspaper might require 100 percent uptime
during production hours, but not the rest of the day. If, however, 100 percent
uptime is required 7 x 24 x 365, the costs become so prohibitive that only the
most profitable applications and large enterprises can consider it, and even
if they do, 100 percent availability is almost impossible to achieve over the
long term.

As you move progressively to higher levels of availability, costs increase
very rapidly. Consider a server (abbott) that with no special protective measures
taken, except for disk mirrors and backups, delivers 99 percent availability.
If you couple that server with another identically configured server (costello)
that is configured to take over from abbott when it fails, and that server also
offers 99 percent availability, then theoretically, you can achieve a combined
availability of 99.99 percent. Mathematically, you multiply the downtime on
abbott (1 percent) by the uptime on costello (99 percent); costello will only be in
use during abbott's 1 percent of downtime. The result is 0.99 percent. Add the
original 99 to 0.99, and you get 99.99 percent, the theoretical uptime for the
combined pair.
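The abbott-and-costello arithmetic above can be written out directly (a sketch that, like the text's theoretical figure, assumes instantaneous failover and independent failures):

```python
# Theoretical availability of a failover pair: the standby (b) only
# matters during the fraction of time the primary (a) is down.
def pair_availability(a: float, b: float) -> float:
    """Combined availability of a primary a with standby b."""
    return a + (1 - a) * b

abbott = costello = 0.99
print(round(pair_availability(abbott, costello), 6))  # 0.9999, i.e. 99.99 percent
```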

Of course, in reality 99.99 percent will not occur simply by combining two
servers. The increase in availability is not purely linear. It takes time for the
switchover (usually called a failover) to occur, and during that period, the combined
server is down. In addition, there are external failures that will affect
access to both servers, such as network connectivity or power outages. These
failures will undoubtedly decrease the overall availability figures below
99.99 percent.

However, we only use the "nines" for modeling purposes. In reality, we
believe that the nines have become an easy crutch for system and operating
system vendors, allowing them to set unrealistic expectations for uptime.

The Myth of the Nines

We've seen a number of advertisements proclaiming "five nines" or more of
availability. This is a nice generalization to make for marketing materials,
because we can measure the mean time between failures (MTBF) of a hardware
system and project its downtime over the course of a year. System availability
is based on software configurations, load, user expectations, and the
time to repair a failure. Before you aim for a target number of nines, or judge
systems based on their relative proclaimed availability, make sure you can
match the advertised number against your requirements. The following are
considerations to take into account when evaluating the desired availability:

Nines are an average. Maximum outages, in terms of the maximum time
to repair, are more important than the average uptime.

Nines only measure that which can be modeled. Load and software are
hard to model in an average case; you will need to measure your actual
availability and repair intervals for real systems, running real software.

Nines usually reflect a single system view of the world. Quick: Think of
a system that's not networked but important. Reliability has to be based
on networks of computers, and the top-to-bottom stack of components
that make up the network. The most reliable, fault-tolerant system in the
world is useless if it sits behind a misconfigured router.
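The hardware projection mentioned above is usually derived from MTBF together with the mean time to repair (MTTR), via the standard steady-state formula availability = MTBF / (MTBF + MTTR). A minimal sketch, with hypothetical figures (the 5,000-hour MTBF and 4-hour MTTR below are illustrative, not from the text):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures
    and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A server failing on average every 5,000 hours, taking 4 hours to repair:
print(round(availability(5000, 4), 4))  # 0.9992, i.e. about "three nines"
```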

Computer system vendors talk about "nines of availability," and although
nines are an interesting way to express availability, they miss some essential points.

All downtime is not created equal. If an outage drives away customers or
users, then it is much more costly than an outage that merely inconveniences
those users. But an outage that causes inconvenience is more costly to an enterprise
than an outage that is not detected by users.

Consider the cost of downtime at a retail e-commerce web site such as
amazon.com. If, during the course of a year, a single 30-minute outage
is suffered, the system has an apparently respectable uptime of 99.994 percent.
If, however, the outage occurs on a Friday evening in early December, it costs
a lot more in lost business than the same outage would if it occurred on a
Sunday at 4:00 A.M. local time in July. Availability statistics do not make a distinction
between the two.

Similarly, if an equities trading firm experiences a 30-minute outage 5 minutes
before the Federal Reserve announces a surprise change in interest rates,
it would cost the firm considerably more than the same outage would on a
Tuesday evening at 8 P.M., when no rate change, and indeed, little activity of
any kind, was in the offing.

Consider the frustration level of a customer or user who wants to use a critical
system. If the 30-minute outage comes all at once, then a user might leave
and return later or the next night, and upon returning, stay if everything is OK.
However, if the 30 minutes of downtime is spread over three consecutive
evenings at the same time, users who try to gain access each of those three
nights and find systems that are down will be very frustrated. Some of them
will go elsewhere, never to return. (Remember the rule of thumb that says it
costs 10 times more to find a new customer than it does to retain an old one.)

Many system vendors offer uptime guarantees, where they claim to guarantee
specific uptime percentages. If customers do not achieve those levels, then
the vendor is contractually bound to pay their customers money or provide
some other form of giveback. There are so many factors that are out of the control
of system vendors, and are therefore disallowed in the contracts, that those
contracts seldom have any teeth, and even more seldom pay off. Compare, for
instance, the potential reliability of a server located in a northern California
data center where, in early 2001, rolling power blackouts were a way of life,
with a server in, say, Minnesota, where the traditionally high amounts of winter
snow are expected and do not traditionally impact electric utility service.
Despite those geographical differences, system vendors offer the same uptime
contractual guarantees in both places. A system vendor cannot reasonably be
expected to guarantee the performance of a local electric power utility, wide
area network provider, or the data center cooling equipment. Usually, those
external factors are specifically excluded from any guarantees.

The other problem with the nines is that availability is a chain, and any
failed link in the chain will cause the whole chain to fail. Consider the diagram
in Figure 2.1, which shows a simple representation of a user sitting at a client
station and connected to a network over which he is working.

If the seven components in the figure (client station, network, file server and
its storage, and the application server, its application, and its storage) have
99.99 percent availability each, that does not translate to an end user seeing
99.99 percent availability.

To keep the math simple, let's assume that all seven components have exactly
the same level of expected availability, 99.99 percent. In reality, of course, different
components have different levels of expected availability, and more complex
components such as networks will often have lower levels. The other
assumption is that multiple failures do not occur at the same time (although
they can, of course, in real life); that would needlessly complicate the math.

Availability of 99.99 percent over each of seven components yields a simple formula
of 0.9999 to the seventh power, which works out to 99.93 percent. That may
not sound like a huge difference, but the difference is actually quite significant:

_ Availability of 99.99 percent spread over a year is about 52 minutes of downtime.

_ Availability of 99.93 percent spread over a year is over 6 hours of downtime.

Another way to look at the math is to consider that for all practical purposes,
the seven components will never be down at the same time. Since each
component will be responsible for 52 minutes of downtime per year (based on
99.99 percent availability), 7 times 52 is 364 minutes, or just over 6 hours per
year, or 99.93 percent.
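The serial-chain arithmetic above can be sketched directly: a request succeeds only if every component is up, so the availabilities multiply (this assumes, as the text does, independent failures that never overlap):

```python
# End-to-end availability of a serial chain of components.
MIN_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def chain_availability(components: list[float]) -> float:
    """Product of per-component availabilities."""
    avail = 1.0
    for a in components:
        avail *= a
    return avail

chain = [0.9999] * 7  # seven components at 99.99 percent each
a = chain_availability(chain)
print(round(a * 100, 2))              # 99.93 percent end to end
print(round(MIN_PER_YEAR * (1 - a)))  # about 368 minutes, over 6 hours, per year
```

Note that the exact product gives slightly more downtime than the 7 x 52 = 364-minute approximation in the text; both round to "over 6 hours."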

The actual path from user to servers is going to be much more complicated
than the one in Figure 2.1. For example, the network cloud is made up of
routers, hubs, and switches, any of which could fail and thereby lower network
availability. If the storage is mirrored, then its availability will likely be higher,
but the value will surely vary. The formulas also exclude many other components
that could cause additional downtime if they were to fail, such as electric
power or the building itself.

Consider another example. Six of the seven components in the chain deliver
99.99 percent availability, but the seventh only achieves 99 percent uptime. The
overall availability percentage for that chain of components will be just 98.94
percent. Great returns on investment can be achieved by improving the availability
of that weakest link.
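The weakest-link example works out the same way (a sketch of the text's figures: six components at 99.99 percent and one at 99 percent):

```python
# One weak link dominates the availability of the whole chain.
weak_chain = [0.9999] * 6 + [0.99]
a = 1.0
for component in weak_chain:
    a *= component
print(round(a * 100, 2))  # 98.94 percent: the 99 percent link sets the ceiling
```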

So, while some single components may be able to deliver upwards of 99.99
percent availability, it is much more difficult for an entire system, from user to
server, to deliver the same level. The more components there are in the chain
and the more complex the chain, the lower the overall availability will be.

Any bad component in the chain can lower overall availability, but there is
no way for one good component to raise it above the level of the weakest link.

Defining Downtime

Definitions for downtime vary from gentle to tough, and from simple to complex.
Easy definitions are often given in terms of failed components, such as
the server itself, disks, the network, the operating system, or key applications.
Stricter definitions may include slow server or network performance, the
inability to restore backups, or simple data inaccessibility.

We prefer a very strict definition for downtime: If a user cannot get her job
done on time, the system is down. A computer system is provided to its users
for one purpose: to allow them to complete their work in an efficient and
timely way. When circumstances prevent a user from doing this work, regardless
of the reason, the system is down.

Causes of Downtime

In Figure 2.2 and Figure 2.3, we examine two different views of the most common
causes of downtime, from surveys conducted by two different organizations.
In Figure 2.2, which comes from the computer industry analysts
Gartner/Dataquest, the greatest cause of downtime is system software failures,
but just by a little bit (27 to 23 percent) over hardware failures. In Figure 2.3,
provided by CNT, hardware failures cause 44 percent of downtime, more than
triple their estimate for software downtime (and still more than double if you
include viruses among their software causes).

The conclusion that we draw from these two very different sets of results is
that if you ask different groups of people, you'll get widely varying results.

Both surveys agree that human error is a major cause of downtime,
although they disagree on the degree of downtime that it causes. People cause
downtime for two closely related reasons.


Excerpted from Blueprints for High Availability
by Evan Marcus and Hal Stern
Copyright © 2003 by Evan Marcus, Hal Stern.
Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.


Table of Contents

About the Authors
Ch. 1 Introduction 1
Ch. 2 What to Measure 9
Ch. 3 The Value of Availability 31
Ch. 4 The Politics of Availability 61
Ch. 5 20 Key High Availability Design Principles 75
Ch. 6 Backups and Restores 105
Ch. 7 Highly Available Data Management 149
Ch. 8 SAN, NAS, and Virtualization 183
Ch. 9 Networking 203
Ch. 10 Data Centers and the Local Environment 241
Ch. 11 People and Processes 263
Ch. 12 Clients and Consumers 291
Ch. 13 Application Design 303
Ch. 14 Data and Web Services 333
Ch. 15 Local Clustering and Failover 361
Ch. 16 Failover Management and Issues 387
Ch. 17 Failover Configurations 415
Ch. 18 Data Replication 433
Ch. 19 Virtual Machines and Resource Management 465
Ch. 20 The Disaster Recovery Plan 473
Ch. 21 A Resilient Enterprise 513
Ch. 22 A Brief Look Ahead 541
Ch. 23 Parting Shots 555
Index 559
