What would you think about, if it were you, sitting there in an oversized suit,
strapped to an intricate and complex network of components, wires, circuits
and engines, all procured by the government, from the lowest bidder?
-John Glenn, on his thoughts before his first spaceflight
We have become obsessed with data. If we can measure it, in theory, we can
control it and improve it. Data collected and analyzed becomes the basis for
resource allocations and prioritization decisions, ranging from whether or not
to buy an additional 10 terabytes (TB) of disk space for marketing, to which
server pairs should be turned into clusters for improved availability. Our goal
for this chapter is to present some common measurements of availability, and
to provide a framework for interpreting those measurements. In Chapter 3,
"The Value of Availability," we'll ascribe pricing and business benefit to this data.
In this chapter we discuss the following topics:
_ How we measure availability
_ The failure modes, or typical things that can and do go wrong
_ Situations in which measurements may not be valid
Throughout this book we use resiliency in terms of overall system availability.
We see resiliency as a general term similar to high availability, but without
all the baggage that HA carries along with it. High availability once referred to
a fairly specific range of system configurations, often involving two computers
that monitor and protect each other. During the last few years, however, it has
lost much of its original meaning; vendors and users have co-opted the term to
mean whatever they want it to mean.
To us, resiliency and high availability mean that all of a system's failure
modes are known and well defined, including networks and applications.
They mean that the recovery times for all known failures have an upper
bound; we know how long a particular failure will have the system down.
Although there may be certain failures that we cannot cope with very well, we
know what they are and how to recover from them, and we have backup plans
to use if our recoveries don't work. A resilient system is one that can take a hit
to a critical component and recover and come back for more in a known,
bounded, and generally acceptable period of time.
When you discuss availability requirements with a user or project leader, he
will invariably tell you that 100 percent availability is required: "Our project is
so important that we can't have any downtime at all." But the tune usually
changes when the project leader finds out how much 100 percent availability
would cost. Then the discussion becomes a matter of money, and more of a negotiation.
As you can see in Table 2.1, for many applications, 99 percent uptime is adequate.
If the systems average an hour and a half of downtime per week, that
may be satisfactory. Of course, a lot of that depends on when the hour and a
half occurs. If it falls between 3:00 A.M. and 4:30 A.M. on Sunday, that is going
to be a lot more tolerable on many systems than if it occurs between 10:00 A.M.
and 11:30 A.M. on Thursday, or every weekday at 2:00 P.M. for 15 or 20 minutes.
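The conversion from an availability percentage to expected downtime can be sketched as follows. This is a minimal illustration, not a reproduction of Table 2.1; the function name `downtime_hours` and the list of percentages are assumptions chosen for the example. At 99 percent the weekly figure works out to about 1.7 hours, in line with the rough "hour and a half" quoted above.

```python
HOURS_PER_WEEK = 7 * 24      # 168
HOURS_PER_YEAR = 365 * 24    # 8,760 (ignores leap years)


def downtime_hours(availability_percent, period_hours):
    """Expected downtime, in hours, for a given availability over a period."""
    return (1 - availability_percent / 100) * period_hours


# Illustrative availability levels (hypothetical selection).
for pct in (99.0, 99.9, 99.99, 99.999):
    weekly_minutes = downtime_hours(pct, HOURS_PER_WEEK) * 60
    yearly_hours = downtime_hours(pct, HOURS_PER_YEAR)
    print(f"{pct:>7}%  {weekly_minutes:8.1f} min/week  {yearly_hours:8.2f} h/year")
```

Note that the result is only an average: it says nothing about when the downtime falls, which is the point the paragraph above makes.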
One point of negotiation is the hours during which 100 percent uptime may
be required. If it is only needed for a few hours a day, that goal is quite achievable.
For example, when brokerage houses trade between the hours of 9:30 A.M.
and 4:00 P.M., then during those hours, plus perhaps 3 or 4 hours on either side,
100 percent uptime is required. A newspaper might require 100 percent uptime
during production hours, but not the rest of the day. If, however, 100 percent
uptime is required 7 x 24 x 365, the costs become so prohibitive that only the
most profitable applications and large enterprises can consider it, and even
if they do, 100 percent availability is almost impossible to achieve over the long term.
As you move progressively to higher levels of availability, costs increase
very rapidly. Consider a server (abbott) that with no special protective measures
taken, except for disk mirrors and backups, delivers 99 percent availability.
If you couple that server with another identically configured server (costello)
that is configured to take over from abbott when it fails, and that server also
offers 99 percent availability, then theoretically, you can achieve a combined
availability of 99.99 percent. Mathematically, you multiply the downtime on
abbott (1 percent) by the uptime on costello (99 percent); costello will only be in
use during abbott's 1 percent of downtime. The result is 0.99 percent. Add the
original 99 to 0.99, and you get 99.99 percent, the theoretical uptime for the combined pair.
Of course, in reality 99.99 percent will not occur simply by combining two
servers. The increase in availability is not purely linear. It takes time for the
switchover (usually called a failover) to occur, and during that period, the combined
server is down. In addition, there are external failures that will affect
access to both servers, such as network connectivity or power outages. These
failures will undoubtedly decrease the overall availability figures below 99.99 percent.
However, we only use the "nines" for modeling purposes. In reality, we
believe that the nines have become an easy crutch for system and operating
system vendors, allowing them to set unrealistic expectations for uptime.
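The abbott and costello arithmetic above can be sketched in a few lines. The first part follows the text directly. The failover penalty is our own simplifying assumption for illustration: the text says only that failover takes time, so the parameters `failovers_per_year` and `failover_minutes`, and the sample values passed to them, are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def failover_pair_availability(primary, standby,
                               failovers_per_year=0, failover_minutes=0.0):
    """Availability (as a fraction) of a primary server backed by a standby.

    The standby is only in use during the primary's downtime, so the
    theoretical figure is primary + (1 - primary) * standby.
    The failover penalty is an assumed model: each failover is treated
    as a fixed outage affecting the whole pair.
    """
    theoretical = primary + (1 - primary) * standby
    failover_loss = failovers_per_year * failover_minutes / MINUTES_PER_YEAR
    return theoretical - failover_loss


ideal = failover_pair_availability(0.99, 0.99)
print(f"theoretical: {ideal * 100:.2f}%")  # 99.99%, as in the text

# Hypothetical: 12 failovers a year, 5 minutes each.
realistic = failover_pair_availability(0.99, 0.99,
                                       failovers_per_year=12,
                                       failover_minutes=5)
print(f"with failover time: {realistic * 100:.4f}%")
```

Even this sketch leaves out the shared external failures mentioned above, such as network or power outages, which would lower the figure further.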
The Myth of the Nines
We've seen a number of advertisements proclaiming "five nines" or more of
availability. This is a nice generalization to make for marketing materials,
because we can measure the mean time between failures (MTBF) of a hardware
system and project its downtime over the course of a year. System availability,
however, also depends on software configurations, load, user expectations, and the
time to repair a failure. Before you aim for a target number of nines, or judge
systems based on their relative proclaimed availability, make sure you can
match the advertised number against your requirements. The following are
considerations to take into account when evaluating the desired availability:
_ Nines are an average. Maximum outages, in terms of the maximum time
to repair, are more important than the average uptime.
_ Nines only measure that which can be modeled. Load and software are
hard to model in an average case; you will need to measure your actual
availability and repair intervals for real systems, running real software.
_ Nines usually reflect a single system view of the world. Quick: Think of
a system that's not networked but important. Reliability has to be based
on networks of computers, and the top-to-bottom stack of components
that make up the network. The most reliable, fault-tolerant system in the
world is useless if it sits behind a misconfigured router.
Computer system vendors talk about "nines of availability," and although
nines are an interesting way to express availability, they miss some essential points.
All downtime is not created equal. If an outage drives away customers or
users, then it is much more costly than an outage that merely inconveniences
those users. But an outage that causes inconvenience is more costly to an enterprise
than an outage that is not detected by users.
Consider the cost of downtime at a retail e-commerce web site such as amazon.com
or ebay.com. If, during the course of a year, a single 30-minute outage
is suffered, the system has an apparently respectable uptime of 99.994 percent.
If, however, the outage occurs on a Friday evening in early December, it costs
a lot more in lost business than the same outage would if it occurred on a
Sunday at 4:00 A.M. local time in July. Availability statistics do not make a distinction
between the two.
Similarly, if an equities trading firm experiences a 30-minute outage 5 minutes
before the Federal Reserve announces a surprise change in interest rates,
it would cost the firm considerably more than the same outage would on a
Tuesday evening at 8 P.M., when no rate change, and indeed, little activity of
any kind, was in the offing.
Consider the frustration level of a customer or user who wants to use a critical
system. If the 30-minute outage comes all at once, then a user might leave
and return later or the next night, and upon returning, stay if everything is OK.
However, if the 30 minutes of downtime is spread over three consecutive
evenings at the same time, users who try to gain access each of those three
nights and find systems that are down will be very frustrated. Some of them
will go elsewhere, never to return. (Remember the rule of thumb that says it
costs 10 times more to find a new customer than it does to retain an old one.)
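A short sketch shows why the availability statistic cannot tell these cases apart. It computes annual uptime from a list of outage durations; the function name `annual_availability` and the outage lists are illustrative assumptions. One 30-minute outage and three 10-minute outages produce the same number, whatever their timing or effect on users.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def annual_availability(outage_minutes):
    """Uptime percentage for a year, given a list of outage durations in minutes."""
    return 100 * (1 - sum(outage_minutes) / MINUTES_PER_YEAR)


single = annual_availability([30])          # one 30-minute outage
spread = annual_availability([10, 10, 10])  # three evenings, 10 minutes each

print(f"single outage:   {single:.3f}%")
print(f"spread over 3:   {spread:.3f}%")
```

The function takes only durations as input, so nothing about when an outage happened or whom it affected can influence the result.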
Many system vendors offer uptime guarantees, where they claim to guarantee
specific uptime percentages. If customers do not achieve those levels, then
the vendor is contractually bound to pay their customers money or provide
some other form of giveback. There are so many factors that are out of the control
of system vendors, and are therefore disallowed in the contracts, that those
contracts seldom have any teeth, and even more seldom pay off. Compare, for
instance, the potential reliability of a server located in a northern California
data center where, in early 2001, rolling power blackouts were a way of life,
with a server in, say, Minnesota, where the traditionally high amounts of winter
snow are expected and do not traditionally impact electric utility service.
Despite those geographical differences, system vendors offer the same uptime
contractual guarantees in both places. A system vendor cannot reasonably be
expected to guarantee the performance of a local electric power utility, wide
area network provider, or the data center cooling equipment. Usually, those
external factors are specifically excluded from any guarantees.
The other problem with the nines is that availability is a chain, and any
failed link in the chain will cause the whole chain to fail. Consider the diagram
in Figure 2.1, which shows a simple representation of a user sitting at a client
station and connected to a network over which he is working.
If the seven components in the figure (client station, network, file server and
its storage, and the application server, its application, and its storage) have
99.99 percent availability each, that does not translate to an end user seeing
99.99 percent availability.
To keep the math simple, let's assume that all seven components have exactly
the same level of expected availability, 99.99 percent. In reality, of course, different
components have different levels of expected availability, and more complex
components such as networks will often have lower levels. The other
assumption is that multiple failures do not occur at the same time (although
they can, of course, in real life); that would needlessly complicate the math.
Availability of 99.99 percent over each of seven components yields a simple formula
of 0.9999 to the seventh power, which works out to 99.93 percent. That may
not sound like a huge difference, but the difference is actually quite significant:
_ Availability of 99.99 percent spread over a year is about 52 minutes of downtime.
_ Availability of 99.93 percent spread over a year is over 6 hours of downtime.
Another way to look at the math is to consider that for all practical purposes,
the seven components will never be down at the same time. Since each
component will be responsible for 52 minutes of downtime per year (based on
99.99 percent availability), 7 times 52 is 364 minutes, or just over 6 hours per
year, or 99.93 percent.
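Both ways of looking at the math can be checked with a short sketch. It assumes, as the text does, that components fail independently and never at the same time; the names `chain_availability`, `exact`, and `approx` are ours. The exact product and the sum-of-downtimes shortcut agree to within a fraction of a minute per year. (The text's figure of 364 minutes comes from rounding each component's downtime to 52 minutes before multiplying.)

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60


def chain_availability(components):
    """Availability of a serial chain: every component must be up at once."""
    return math.prod(components)


chain = [0.9999] * 7  # seven components at 99.99 percent each

# Exact figure: 0.9999 to the seventh power.
exact = chain_availability(chain)

# Shortcut: add up each component's downtime, assuming no overlap.
approx = 1 - sum(1 - a for a in chain)

for label, value in (("product", exact), ("summed downtime", approx)):
    downtime = (1 - value) * MINUTES_PER_YEAR
    print(f"{label:>16}: {value * 100:.2f}%  ~{downtime:.0f} min/year")
```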
The actual path from user to servers is going to be much more complicated
than the one in Figure 2.1. For example, the network cloud is made up of
routers, hubs, and switches, any of which could fail and thereby lower network
availability. If the storage is mirrored, then its availability will likely be higher,
but the value will surely vary. The formulas also exclude many other components
that could cause additional downtime if they were to fail, such as electric
power or the building itself.
Consider another example. Six of the seven components in the chain deliver
99.99 percent availability, but the seventh only achieves 99 percent uptime. The
overall availability percentage for that chain of components will be just 98.94
percent. Great returns on investment can be achieved by improving the availability
of that weakest link.
So, while some single components may be able to deliver upwards of 99.99
percent availability, it is much more difficult for an entire system, from user to
server, to deliver the same level. The more components there are in the chain
and the more complex the chain, the lower the overall availability will be.
Any bad component in the chain can lower overall availability, but there is
no way for one good component to raise it above the level of the weakest link.
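The weakest-link effect can be illustrated with the same product formula. The baseline chain is the one from the text (six components at 99.99 percent, one at 99 percent). The two "improvement" scenarios are hypothetical, chosen only to compare where an investment pays off.

```python
import math


def chain_availability(components):
    """Availability of a serial chain: every component must be up at once."""
    return math.prod(components)


# Chain from the text: six strong links and one weak one.
chain = [0.9999] * 6 + [0.99]
baseline = chain_availability(chain)

# Hypothetical option A: raise the weak link from 99 to 99.9 percent.
fix_weakest = chain_availability([0.9999] * 6 + [0.999])

# Hypothetical option B: make one strong link perfect, leave the weak one alone.
fix_strong = chain_availability([0.9999] * 5 + [1.0, 0.99])

print(f"baseline:           {baseline * 100:.2f}%")
print(f"improve weak link:  {fix_weakest * 100:.2f}%")
print(f"perfect one strong: {fix_strong * 100:.2f}%")
print(f"weakest component:  {min(chain) * 100:.2f}%")
```

Even a perfect component in option B barely moves the total, and in every case the chain stays below the availability of its weakest member.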
Definitions for downtime vary from gentle to tough, and from simple to complex.
Easy definitions are often given in terms of failed components, such as
the server itself, disks, the network, the operating system, or key applications.
Stricter definitions may include slow server or network performance, the
inability to restore backups, or simple data inaccessibility.
We prefer a very strict definition for downtime: If a user cannot get her job
done on time, the system is down. A computer system is provided to its users
for one purpose: to allow them to complete their work in an efficient and
timely way. When circumstances prevent a user from doing this work, regardless
of the reason, the system is down.
Causes of Downtime
In Figure 2.2 and Figure 2.3, we examine two different views of the most common
causes of downtime, from surveys conducted by two different organizations.
In Figure 2.2, which comes from the computer industry analysts
Gartner/Dataquest, the greatest cause of downtime is system software failures,
but just by a little bit (27 to 23 percent) over hardware failures. In Figure 2.3,
provided by CNT, hardware failures cause 44 percent of downtime, more than
triple their estimate for software downtime (and still more than double if you
include viruses among their software causes).
The conclusion that we draw from these two very different sets of results is
that if you ask different groups of people, you'll get widely varying results.
Both surveys agree that human error is a major cause of downtime,
although they disagree on the degree of downtime that it causes. People cause
downtime for two closely related reasons.
Excerpted from Blueprints for High Availability
by Evan Marcus and Hal Stern
Copyright © 2003 by Evan Marcus, Hal Stern.
Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.
|Ch. 2||What to Measure|
|Ch. 3||The Value of Availability|
|Ch. 4||The Politics of Availability|
|Ch. 5||20 Key High Availability Design Principles|
|Ch. 6||Backups and Restores|
|Ch. 7||Highly Available Data Management|
|Ch. 8||SAN, NAS, and Virtualization|
|Ch. 10||Data Centers and the Local Environment|
|Ch. 11||People and Processes|
|Ch. 12||Clients and Consumers|
|Ch. 13||Application Design|
|Ch. 14||Data and Web Services|
|Ch. 15||Local Clustering and Failover|
|Ch. 16||Failover Management and Issues|
|Ch. 17||Failover Configurations|
|Ch. 18||Data Replication|
|Ch. 19||Virtual Machines and Resource Management|
|Ch. 20||The Disaster Recovery Plan|
|Ch. 21||A Resilient Enterprise|
|Ch. 22||A Brief Look Ahead|
|Ch. 23||Parting Shots|