Blueprints for High Availability: Designing Resilient Distributed Systems / Edition 1



"Rely on this book for information on the technologies and methods you'll need to design and implement high-availability systems...It will help you transform the vision of always-on networks into a reality." - Dr. Eric Schmidt, Chairman and CEO, Novell Corporation
Your system will crash! The reason could be something as complex as network congestion or something as mundane as an operating system fault. The good news is that there are steps you can take to maximize your system availability and prevent serious downtime. This authoritative book will provide you with the tools to deploy a system with confidence. The authors guide you through the building of a network that runs with high availability, resiliency, and predictability. They clearly show you how to assess the elements of a system that can fail, select the appropriate level of reliability, and provide steps for designing, implementing, and testing your solution to reduce downtime to a minimum. All the while, they help you determine how much you can afford to spend by balancing costs and benefits. This book of practical, hands-on blueprints:
* Examines what can go wrong with the various components of your system
* Provides twenty key system design principles for attaining resilience and high availability
* Discusses how to arrange disks and disk arrays for protection against hardware failures
* Looks at failovers, the software that manages them, and sorts through the myriad of different failover configurations
* Provides techniques for improving network reliability and redundancy
* Reviews techniques for replicating data and applications to other systems across a network
* Offers guidance on application recovery
* Examines Disaster Recovery

Editorial Reviews

From Barnes & Noble
The Barnes & Noble Review
Blueprints For High Availability is a relentlessly real-world guide to building highly-available distributed systems and networks. As the authors point out early and often, it's not enough to "install failover software and walk away"—you have to build your systems so that failover is the absolute last resort. Of course, your resources are limited, so this book offers excellent advice on making the best possible tradeoffs.

You'll start by reviewing a laundry list of the causes of system downtime, followed by 20 key principles of system design for high availability (eliminate single points of failure, enforce strict separation between production and development environments, reuse proven configurations...)

Next, the authors address each key component of high availability systems, in detail: disk hardware, redundant server design, failover management and configuration, redundant network services, database servers, networked file systems, replication, and more.

There are comprehensive chapters on application recovery, backup/restore, disaster recovery, and system operations—everything from vendor relationships to trouble ticketing to how you set up your rack-mounted servers. (Are yours arranged to minimize the impact of some fool spilling a 64 oz. Big Gulp on top of the rack?) If you get the idea these authors have seen it all, you'd be right.


Product Details

  • ISBN-13: 9780471356011
  • Publisher: Wiley, John & Sons, Incorporated
  • Publication date: 2/28/2000
  • Edition description: Older Edition
  • Edition number: 1
  • Pages: 368
  • Product dimensions: 7.88 (w) x 9.60 (h) x 1.04 (d)

Meet the Author

EVAN MARCUS is a Senior Systems Engineer at VERITAS Software Corporation and co-designed a key piece of the first commercial Sun-based software for High Availability. He has been the company's consultant for successful implementations of VERITAS High Availability Products around the world.

HAL STERN is a Distinguished Systems Engineer at Sun Microsystems. He has led reliability and improvement teams for several financial services clients and focuses on performance, reliability, and networked system architecture. He is also the author of Managing NFS and NIS.


Read an Excerpt

Chapter 1: Introduction

Despite predictions in the 1970s (and extending into the 1980s) that computers would make everyone's lives easier and give us all more leisure time, just the opposite seems to be taking place. Computers move faster, thanks to faster and faster CPUs, and yet the business world seems to move even faster. Computers are expected to be operational and available 7 days a week, 24 hours a day. Downtime, even for maintenance, is no longer an option.

With the unprecedented growth and acceptance of the Internet, average people expect to be able to buy clothing or office supplies on the web at 4 A.M., while in their underwear. And if they can't buy from your web site, they will buy from your competitor's. Uniform resource locators (URLs) are part of the culture; they are written on the sides of city buses, and even four-year-olds know www.disney.com.

Adding to the complexity is the globalization of the Internet and the web. Even if there are quiet times for web servers in the United States, those times are filled by users in Europe and the rest of the world. National borders and time zones essentially disappear on the web.

The amounts of data that businesses are being called on to save and manage are growing at astounding rates. The consulting organizations that monitor such things estimate that online data will grow 70 to 75 percent per year for the next few years. That data must be accessible quickly at all hours of the day or night, if not for your company's European operation, then for its U.S. personnel. As the amount of data grows, the price of storage devices continues to drop dramatically, making it feasible for companies to store all their data.

But what happens when the systems crash? What happens when disks stop turning? What about when your network stops delivering data? Does the business stop? Must your customers visit your competitor's web site to order their Barbie dolls? Should you just send your employees home? For how long? Can you recover? When? And how come this stuff never seems to happen to your competitors? (Don't worry; it does.)

The media frenzy surrounding the Y2K problem did do some good. The fear of all computers everywhere failing made the average Joe appreciate (even if just a little bit) the direct effect that computers have on his daily life. While that lesson may be quickly forgotten, at least the impact was there for a short time. As a result, the people who allocate money for the management of computer systems have a little bit more of an appreciation of what can happen when good computers do go bad.

Why an Availability Book?

The Y2K problem is certainly not the only issue that has caused problems or at least concerns with computer availability; it just happens to be the best-known example. Some of the others over the last few years include:

  • Terrorist attacks in New York, London, and other large cities, such as the 1993 bombing of New York's World Trade Center
  • Satellite outages that brought down pagers and other communication devices
  • Attacks by various computer viruses
  • Natural disasters such as floods, tornadoes, and earthquakes
  • Introduction of the euro
  • The Dow Jones Industrial Average passing 10,000 (sometimes called D10K)
  • Emergence of the Internet as a viable social force, comparable to TV or radio as a mass market influence

Obviously the impact of each of these issues has varied from negligible (D10K) to serious (virus attacks). But again, each calls attention to the importance and value of computers, and the impact of failures. Downtime on the Internet is like dead air on a television station; it's embarrassing and a mistake that you only make once.

Our Approach to the Problem Set

In this book, we will take a look at the elements of your computer systems that can fail, whether due to a major event such as the ones just listed, or due to rather mundane problems like the failure of a network router or corruption of a critical file. We will look at basic system configuration issues, including, but not limited to, physical placement of equipment, logical arrangements of disks, backing up of critical data, and migration of important services from one system to another. We will take an end-to-end perspective, because systems are an end-to-end proposition. Either everything works, or nothing works. Rarely is there any middle ground. Your users sit in front of their computer screen, trying to run applications. Either they run or they don't. Sometimes they run, but too slowly. We'll look at performance only to the extent that an exaggerated performance problem smells like an availability issue. When an application runs too slowly, that may still be an availability issue, since the user cannot get his or her job done in a timely manner.

We will also take a businesslike approach, never losing sight of the fact that every bit of protection, whether mirrored disks, backup systems, or extra manpower for system design, costs real money. We have tried to balance costs and benefits, or at least help you to balance costs and benefits. After all, no two sites or systems are the same. What may be important to you may not matter a bit to the guy down the hall. One of the hardest jobs facing designers of highly available systems is scoping out the costs of a particular level of protection, which are frequently higher than the people running the business would like. In an era where computer hardware prices shrink logarithmically, it's hard to explain to management that you can't get fault-tolerant system operation at desktop PC prices. Our goal is to help define the metrics, the rules, and the guidelines for making and justifying these cost/benefit tradeoffs. Availability is measured in "nines": 99.99 percent uptime is "four nines." Our goal is to help you choose the number of nines you can achieve with your engineering constraints, and the number you can afford within your budget and cost constraints.

We'll frequently refer to cost/benefit or cost/complexity trade-offs. Such decisions abound in technology-driven companies: Buy or build? Faster time to market or more features? Quick and dirty or morally correct but slower? Our job is to provide some guidelines for making these decisions. We'll try to avoid taking editorial stances, because there are no absolutes in engineering design...


Table of Contents

What Is Resiliency?

Twenty Key System Design Principles.

Highly Available Data Management.

Redundant Server Design.

Failover Management.

Failover Configurations and Issues.

Redundant Network Services.

Data Service Reliability.

Replication Techniques.

Application Recovery.

Backups and Restores.

System Operations.

Disaster Recovery.

Parting Shot.




First Chapter


Note: This sample chapter does not appear in its entirety.

In this chapter we will discuss:

  • What resiliency is
  • How we measure availability
  • Ways to quantify the costs associated with downtime
  • The things that can go wrong in a typical server environment

Throughout this book we use the term "resiliency" in terms of overall system availability. We see resiliency as a general term similar to "high availability," but without all the baggage that "HA" carries along with it. High availability once referred to a fairly specific range of system configurations, usually involving two computers that monitor and protect each other. During the last few years, however, it has lost much of its original meaning; vendors and users have co-opted the term to mean whatever they want it to mean.

To us, resiliency and high availability mean that all of a system's failure modes are known and well-defined, including networks and applications. They mean that the recovery times for all known failures have an upper bound; we know how long a particular failure will have the system down. While there may be certain failures that we cannot cope with very well, we know what they are and how to recover from them, and we have backup plans for use if our recoveries don't work. A resilient system is one that can take a hit to a critical component, and recover and come back for more in a known, bounded, and generally acceptable period of time.

Measuring Availability

When you discuss availability requirements with a user or project leader, he or she will invariably tell you that 100 percent availability is required: "Our project is so important that we can't have any downtime at all." But the tune usually changes when the project leader finds out how much 100 percent availability costs. Then it becomes a matter of money, and more of a negotiation process.

As you can see in Table 2.1, for many applications, 99 percent uptime is adequate. If the systems average an hour-and-a-half of downtime per week, that may be satisfactory. Of course, a lot of that depends on when the hour-and-a-half occurs. If it falls between 3:00 and 4:30 Sunday morning, that is going to be a lot more tolerable on many systems than if it occurs between 10:00 and 11:30 Thursday morning, or every weekday afternoon at 2:00 for 15 or 20 minutes.
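The relationship between an uptime percentage and the downtime it permits is simple arithmetic. The sketch below is not a reproduction of Table 2.1; it only illustrates the conversion, and the function name and the list of availability levels are chosen here for illustration. At 99 percent, the weekly allowance comes to about 1.7 hours, in line with the roughly hour-and-a-half figure above.

```python
HOURS_PER_WEEK = 7 * 24
HOURS_PER_YEAR = 365 * 24  # ignores leap years


def downtime_hours(availability_percent, period_hours):
    """Hours of downtime permitted in a period at a given availability level."""
    return (1 - availability_percent / 100) * period_hours


# Illustrative levels only; the source's table may list different rows.
for level in (99.0, 99.9, 99.99, 99.999):
    per_week_minutes = downtime_hours(level, HOURS_PER_WEEK) * 60
    per_year_hours = downtime_hours(level, HOURS_PER_YEAR)
    print(f"{level}% uptime: {per_week_minutes:.1f} min/week, "
          f"{per_year_hours:.2f} h/year")
```

The arithmetic says nothing about when the downtime falls, which, as noted above, often matters as much as the total.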

One point of negotiation is the hours during which 100 percent uptime may be required. If it is only needed for a few hours a day, then that goal is quite achievable. For example, when brokerage houses trade between the hours of 9:30 A.M. and 4:00 P.M., then during those hours, plus perhaps three or four hours on either side, 100 percent uptime is required. A newspaper might require 100 percent uptime during production hours, but not the rest of the time. If, however, 100 percent uptime is required 7 × 24 × 365, the costs become so prohibitive that only the most profitable applications and large enterprises can consider it.

As you move progressively to higher levels of availability, costs increase very rapidly. Consider a server ("abbott") that with no special measures taken, except for disk mirrors and backups, has 99 percent availability. If you couple that server with another identically configured server ("costello") that is configured to take over from abbott when it fails, and that server also offers 99 percent availability, then theoretically, you can achieve a combined availability of 99.99 percent. [Mathematically, you multiply the downtime on abbott (1 percent) by the uptime on costello (99 percent); costello will only be in use during abbott's 1 percent of downtime. The result is 0.99 percent. Add the original 99 to 0.99, and you get 99.99 percent, the theoretical uptime for the combined pair.]
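The bracketed arithmetic can be written as a short calculation. This is a minimal sketch of the idealized case only: it assumes the two servers fail independently and that the takeover is instantaneous. The function name is hypothetical.

```python
def failover_pair_availability(primary, standby):
    """Idealized availability of a primary server backed by one standby.

    Both arguments are fractions between 0 and 1. The standby only
    contributes during the primary's downtime, so its uptime is weighted
    by (1 - primary). Assumes independent failures and zero failover time.
    """
    return primary + (1 - primary) * standby


# abbott and costello, each 99 percent available on its own
combined = failover_pair_availability(0.99, 0.99)
print(f"{combined * 100:.2f} percent")
```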

Of course, in reality 99.99 percent will not occur simply by combining two servers. The increase in availability is not purely linear. It takes time for the switchover (usually called a failover) to occur, and during that period, the combined server is down. In addition, there are external failures that will affect access to both servers, such as network connectivity or power outages. These failures will undoubtedly decrease the overall availability figures.

The rule of thumb for costs is that each time you move one line down the chart, costs increase by a factor of 5 to 10, and the multiplier itself grows as you move further down the chart.

Defining Downtime

Definitions for downtime vary from gentle to tough, and from simple to complex. Easy definitions are often given in terms of failed components, such as the server itself, disks, the network, the operating system, or key applications. Stricter definitions may include slow server or network performance, the inability to restore backups, or simple data inaccessibility. We prefer a very strict definition for downtime: If a user cannot get his job done on time, the system is down.

The system is provided to the users for one service: to allow them to work in an efficient and timely way. When circumstances prevent a user from doing this work, regardless of the reason, the system is down.

Causes of Downtime

In Figure 2.1, we examine the various causes of downtime. One of the largest regions on the graph is planned downtime. It is also one of the easiest segments to reduce. Planned downtimes are the events, usually late at night, when the system administrators add hardware to the system, upgrade operating systems or other critical software, or rearrange the layout of data on disks. Sometimes planned downtime is just a preventative reboot to clean up logs, temporary directories, and memory.

Most of these events can be performed with the system up nowadays; disks can be added to hot pluggable disk arrays without interrupting services. Many (though sadly not all) critical applications can be upgraded without service interruptions. In a failover environment, which we will discuss in much more detail later, one machine in a failover pair or cluster can be upgraded while its partner continues to operate. Then, the upgraded cluster member can take over the server role, while the first machine is upgraded. The only service interruption comes during the switchover of services, called a failover. Several vendors produce disk and logical storage and volume management software that can enable on-line management of data layout and volumes, with no interruption to service at all.

The people factor is another major cause of downtime. People cause downtime for two closely related reasons. The first reason is that they sometimes make dumb or careless mistakes. The second reason that people cause downtime is that they do not always completely understand the way a system operates. The best way to combat people-caused downtime is through a combination of education and intelligent, simple system design. By sending your people to school to keep them up to date on current technologies, and by keeping good solid documentation on hand and up to date, you can reduce the amount of downtime they cause.

Possibly the most surprising region on the chart is hardware. Hardware causes just 10 percent of system outages. That means that the best RAID disks in the world and the most redundant networks can only prevent about 10 percent of your downtime. In fact, besides disk and network failures, hardware outages also include central processing unit (CPU) and memory failures, loss of power supplies, and internal system cooling.

The most obvious common causes for system outages are probably software failures. Altogether, software is responsible for 40 percent of system downtime. Software bugs are perhaps the most difficult source of failures to get out of the system. As hardware becomes more reliable, and methods are employed to reduce planned outages, their percentages will decrease, while the percentage of outages attributed to software issues will increase. As software becomes more complex, software-related outages may become more frequent on their own. Of course, as software development and debugging techniques become more sophisticated, software-related outages should become less prevalent. It will be very interesting to see whether software-related downtime increases or decreases over time.

What Is Availability?

At its simplest level, availability, whether high, low, or in between, is a measure of the time that a server is functioning normally. We offer a simple equation to calculate availability:

$$A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}$$

where A is the degree of availability expressed as a percentage, MTBF is the mean time between failures, and MTTR is the maximum time to repair or resolve a particular problem.

Some simple observations:

1. As MTTR approaches zero, A increases toward 100 percent.
2. As the MTBF gets larger, MTTR has less impact on A.

For example, if a particular system has an MTBF of 100,000 hours, and an MTTR of 1 hour, it has a rather impressive availability level of 100,000/100,001, or 99.999 percent. If you cut the MTTR to 6 minutes, or 1/10 of an hour, availability increases an extra 9, to 99.9999 percent. But to achieve this level of availability with even 6 minutes of downtime, you need a component with an actual duration between failures of 100,000 hours, which is better than 11 years.
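The two figures in this example can be checked with a few lines of code. The sketch assumes availability is computed as MTBF / (MTBF + MTTR), which is what the 100,000/100,001 ratio implies; the function name is illustrative.

```python
def availability(mtbf_hours, mttr_hours):
    """Fraction of time a component is up, given MTBF and MTTR in hours."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# MTBF of 100,000 hours with a 1-hour repair, then a 6-minute (0.1-hour) repair
for mttr in (1.0, 0.1):
    a = availability(100_000, mttr)
    print(f"MTTR {mttr} h: {a * 100:.5f} percent available")

# 100,000 hours expressed in years (365-day years)
print(f"{100_000 / (365 * 24):.1f} years between failures")
```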

Let us restate that last statistic. To achieve 99.9999 percent availability, you are permitted just 6 minutes downtime in 11.4 years. That's 6 minutes in 11.4 years over your entire system, not just the one component we happen to be examining. Given today's technology, this is unachievable for all practical purposes, and an unrealistic goal. Downtimes of less than 10 minutes per year (about 99.998 percent) are probably achievable, but it would be very difficult to get much less than that. In addition to well-designed systems, a significant degree of luck will surely be required. And you just can't plan for luck.

Luck comes in many flavors. Good luck is when your best developer happens to be working late on the night that his application brings down your critical servers, and he fixes the problem quickly. Good luck is when the water pipe in the ceiling breaks and leaks water over one side of your disk mirrors but doesn't affect the other side. Bad luck (or malice) gets you when someone on the data center tour leans on the power switch to a production server. Bad luck forces a car from the road and into the pole in front of your building where the power from both of your power companies comes together. Bad luck is that backhoe outside your window digging up the fiber cable running between your buildings, as you helplessly watch through your office window.

"M" Is for Mean

The key term in MTBF is mean time. A mean time between failures number is just that, a mean. An average. If a disk drive has an MTBF of 200,000 hours (almost 23 years), that does not mean that every disk rolling off the assembly line is guaranteed to work for exactly 23 years, and then drop dead. When the government announces life expectancy figures, it certainly doesn't mean that every man in America will die when he reaches the age of 79.6 years. For every person who doesn't make it to his 50th birthday, there is going to be someone whose picture makes it onto The Today Show for surpassing his (or her) 100th birthday.

Means are trends. If you look at all the disks that roll off the assembly line during a given period of time, the average life expectancy of a disk, before it fails, is about 23 years. That means, however, that some disks may fail the first day, and others may last 40 years (obsolescence aside). It also means that if you have a large server, with, say, 500 disks in it, on average you will lose a disk every 200,000/500 or 400 hours. Four hundred hours is only about 16 1/2 days. So, you will be replacing a disk, on average, every 2.5 weeks.
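The 500-disk estimate is a single division, sketched below. It assumes, as the paragraph does, that every disk shares the same MTBF and that failures are spread evenly over time; the function name is hypothetical.

```python
def hours_between_fleet_failures(mtbf_hours, disk_count):
    """Average hours between disk failures across a fleet of identical disks."""
    return mtbf_hours / disk_count


interval = hours_between_fleet_failures(200_000, 500)
print(f"{interval:.0f} hours, or about {interval / 24:.1f} days "
      f"({interval / (24 * 7):.1f} weeks) between disk replacements")
```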

A statistical mean is not enough to tell you very much about the particular members of a population. The mean of 8, 9, 10, 11, and 12 is 10. But the mean of 1, 1, 1, 1, and 46 is also 10. So is the mean of 12,345, -12,345, 47,000,000, -47,000,000, and 50.

The other key number is standard deviation, or sigma (σ). Without going into a long, dull explanation of standard deviation calculations (go back and look at your college statistics textbook!), sigma tells you how far the members of a population stray from the mean. For each of the three previous examples (and treating each like a complete population, for you statistics nerds), σ is: 1.414214, 18, and 29,725,411. For the sake of completeness we offer one more example: the mean of 9.99, 10.01, 9.999, 10.002, and 9.999 is still 10, and the standard deviation is 0.006419. As these comparisons illustrate, the closer the members of the population are to the mean, the lower sigma becomes. Or in other words, the lower sigma is, the more indicative of final results the mean is. (You can unclench your teeth now; we have finished talking about statistics.)
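For readers who would rather not dig out the textbook, the example data sets can be recomputed with Python's standard library. This is a minimal sketch that treats each set as a complete population, so it uses `statistics.pstdev` (the population formula) rather than the sample formula.

```python
from statistics import mean, pstdev

# The four example populations from the text; each has a mean of 10.
samples = [
    [8, 9, 10, 11, 12],
    [1, 1, 1, 1, 46],
    [12_345, -12_345, 47_000_000, -47_000_000, 50],
    [9.99, 10.01, 9.999, 10.002, 9.999],
]

for values in samples:
    # pstdev divides by N (population), not N - 1 (sample)
    print(f"mean = {mean(values):.6g}, sigma = {pstdev(values):.6g}")
```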

When looking at hardware components, therefore, you'll want MTBF figures that are associated with low standard deviations. But good luck in obtaining MTBF numbers; most hardware vendors don't like sharing such data. If you can get these numbers, however, they will tell you a lot about the quality of the vendor's hardware.

The same guidelines that apply to MTBFs also apply to MTTRs. If it takes your administrator 15 minutes to recover from a particular problem, that does not necessarily mean it will take him 15 minutes every time. Complications can set in during the repair process. Personnel change, and recovery times can increase while the new people learn old procedures. (Conversely, as an administrator becomes more adept at fixing a particular problem, repair times can decline.) System reboot times can increase over time, too, as the system gains additional components that need to be checked and/or reinstalled at boot time.

Many aspects of MTTRs can be out of your control. If you need a critical component to repair a server, and that component is on back order at the vendor, it could take days or possibly weeks to acquire it. Unless you have alternative sources or stock spare parts, there isn't anything you can do but wait, with your system down. In many shops, system administration is a separate job from network administration, and it is performed by a totally separate organization. If the system problem turns out to be network-related, you may have to wait on the network folks to find and fix the problem so you can get your system running again. Or vice versa.

Usually the amount of acceptable downtime determines the level of availability required. In general, you don't hear about local phone service outages. The phone companies have mastered the art of installing upgrades to their infrastructure without causing any perceptible interruptions to phone service. Acceptable outages in a telephone network are in the subsecond range.

On a trading floor, outages in the subminute range can be tolerated (usually not well, though) before impact is felt. Regulatory issues concerning on-time trade reporting kick in within two minutes. The market moves every few seconds, possibly making a profitable trade less interesting only a few seconds later. In a less critical application (decision support for marketing, for example), tolerable outages may be up in the one- or two-day range. You'll need to understand what downstream systems are affected by an outage. If the decision support system is driving a mass mailing that has a deadline 48 hours away, and it takes 4 hours to finish a single query, your uptime requirements are stronger than if the mailing deadline is in the next month. Design your support systems, and spend your money, accordingly.

Sometimes, user expectations enter into the equation. If your users believe that the system will be down for an hour while you fix a problem, and you can fix it in 20 minutes, you are a hero. But if the users believe it should take 10 minutes, the same 20-minute fix makes you a goat. Setting expectations drives another downstream system, namely, what the users will do next. If they're convinced they'll be live again in an hour, you have to deliver in that hour. In one of the original Star Trek movies, Captain Kirk asked Engineer Scott if he always multiplied his repair estimates by a factor of four. Scotty's response was, "Of course. How do you think I keep my reputation as a miracle worker?"


In this section, we will take a quick look at the things that can go wrong with computer systems and that can cause downtime. Some of them, especially the hardware ones, may seem incredibly obvious, but others will not.


Hardware points of failure are the most obvious ones - the failures that people will think of first when asked to provide such a list. And yet, as we saw in Figure 2.1, they only make up about 10 percent of all system outages. However, when you have a hardware outage, you may be down for a long time if you don't have redundancy built in. Waiting for parts and service people makes you a captive to the hardware failure.

The components that will cause the most failures are moving parts, especially those associated with high speeds, low tolerances, and complexity. Having all of those characteristics, disks are prime candidates for failures. Disks also have controller boards and cabling that can break or fail. Many hardware disk arrays have additional failure-prone components such as memory for caching, or hardware for mirroring or striping.

Tape drives and libraries, especially DLT tape libraries, have many moving parts, motors that stop and start, and extremely low tolerances. They also have controller boards and many of the same internal components that disk drives have, including memory for caching.

Fans are the other components with moving parts. The failure of a fan may not cause immediate system failure the way a disk drive failure will, but when a machine's cooling fails, the effects can be most unpredictable. When CPUs and memory chips overheat, systems can malfunction in subtle ways. Many systems do not have any sort of monitoring for their cooling, so cooling failures can definitely catch even the best-monitored systems by surprise.

It turns out that fans and power supplies have the worst MTBFs of all system components. Power supplies can fail hard and fast, resulting in simple downtime, or they can fail gradually. The gradual failure of a power supply can be a very nasty problem, causing subtle, sporadic failures in the CPU, memory, or backplane. Power supply failures are caused by many factors, including varying line voltage and the stress of being turned on and off.

To cover for these shortcomings, modern systems have extra fans, extra power supplies, and superior hardware diagnostics that provide for problem detection and identification as quickly as possible. Many systems can also "call home." When a component fails, the system can automatically call the service center and request maintenance. In many cases, repair people arrive on site to the complete surprise of the local staff.
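A rough way to see what the extra fans and power supplies buy is to run the standard availability arithmetic. The sketch below is in Python and assumes the textbook formula, availability = MTBF / (MTBF + MTTR), and that redundant units fail independently. The MTBF and repair-time figures are made up for illustration and are not taken from this book.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single: float, units: int) -> float:
    """Availability of `units` identical components when any one suffices.

    Assumes failures are independent, which a shared power feed or a
    shared cooling problem would violate.
    """
    return 1.0 - (1.0 - single) ** units


def annual_downtime_hours(avail: float) -> float:
    """Expected hours of downtime in a 365-day year."""
    return (1.0 - avail) * 365 * 24


# Hypothetical power supply: fails every 50,000 hours on average and
# takes 48 hours to replace while waiting for parts and service.
one_supply = availability(50_000, 48)
two_supplies = redundant_availability(one_supply, 2)

print(f"one supply:   {annual_downtime_hours(one_supply):.2f} hours down per year")
print(f"two supplies: {annual_downtime_hours(two_supplies):.4f} hours down per year")
```

Under these assumptions the second supply turns hours of expected downtime per year into well under an hour, because both units must be out at the same time before the system goes down.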

Of course, failures can also occur in system memory and in the CPU. Again, some modern systems are able to configure a failed component right out of the system without a reboot. This may or may not help intermittent failures in memory or the CPU, but it will definitely help availability when a true failure occurs.

There are other hardware components that can fail, although they do so very infrequently. These include the backplane, the various system boards, the cabinet, the mounting rack, and the system packaging.

Environmental and Physical Failures

Failures can be external to the system as well as internal. There are many components in the environment that can cause system downtime, yet these are rarely considered as potential points of failure. Most of these are data-center related, but many of them can impact your servers regardless of their placement. And in many cases, having a standby server will not suffice in these situations, as the entire environment may be affected.
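The point about standby servers can be illustrated with a small calculation. This Python sketch treats the shared environment (power, cooling, the room itself) as a component in series with a group of redundant servers. The function name and the availability figures are hypothetical, and server failures are assumed to be independent of one another.

```python
def site_availability(server: float, standbys: int, environment: float) -> float:
    """Availability of a primary server plus `standbys` spares in one room.

    The service is up only if the shared environment is up AND at least
    one of the servers is up.
    """
    at_least_one_server = 1.0 - (1.0 - server) ** (standbys + 1)
    return environment * at_least_one_server


# Hypothetical figures: each server is 99% available, the room 99.9%.
for standbys in range(4):
    result = site_availability(0.99, standbys, 0.999)
    print(f"{standbys} standby server(s): {result:.6f}")
```

However many standby servers are added, the result never rises above the availability of the environment they share, which is why a standby in the same room does not protect against these failures.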

The most obvious environmental problem is a power failure. Power failures (and brownouts) can come from your electric utility, or they can occur much more locally. A car can run into the light pole in front of your building. The failure of a circuit breaker or fuse, or even a power strip, can shut your systems down. The night cleaning crew might unplug some vital system in order to plug in a vacuum cleaner, or their plugging in the vacuum cleaner may overload a critical circuit.

The environmental cooling system can fail, causing massive overheating in all of the systems in the room. Similarly, the dehumidifying system can fail (although that failure is not going to be as damaging to the systems in the room as a cooling failure).

Most data centers are rats' nests of cables, under the floor and out the back of the racks and cabinets. Cables can break, and they can be pulled out. And, of course, a change in the laws of physics could result in copper no longer conducting electricity. (If that happens, you probably have bigger problems. . . .)

Most data centers have fire protection systems. Halon is still being removed from data centers (apparently if they get one more Halon "incident," that's it!) but the setting off of one of these fire protection systems can still be a very disruptive event. One set of problems ensues when the fire is real, and the protection systems work properly and put it out. The water or other extinguishing agent can leave a great deal of residue, and can leave the servers in the room unfit for operation. Halon works by displacing the oxygen in the room, which effectively chokes off the fire. Of course, displaced oxygen could be an issue for any human beings unfortunate enough to be in the room at the time. Inergen Systems are newer and more friendly to oxygen-breathing life and are becoming more popular. And the fire itself can cause significant damage to the environment. One certainly hopes that when a fire protection system is put into action the fire is real, but sometimes it isn't, and the fire protection system goes off when no emergency exists. This can leave the data center with real damage caused solely by a mistake.

The other end of the spectrum is when a fire event is missed by the protection system. The good news is that there will be no water or other fire protection system residue. The bad news is that your once-beautiful data center may now be an empty, smoldering shell. Or worse.

Another potential environmental problem is the structural failure of a supporting component, such as a computer rack or cabinet. Racks can collapse or topple when not properly constructed. If shelves are not properly fastened, they can come loose and crash down on the shelves beneath them. Looming above the cabinets in most data centers are dusty ceilings, usually with some cabling running through them. Ceilings can come tumbling down, raining dust and other debris onto your systems.

Many data centers have some construction underway, while active systems are operating nearby. Construction workers bring heavy-duty equipment in with them, and may not have any respect for the production systems that are in their way. Cables get kicked or cut, and cabinets get pushed slightly (or not so slightly) and can topple. While construction workers are constructing, they are also stirring up dust and possibly cutting power to various parts of the room. If they lay plastic tarps over your equipment to protect it from dust, the equipment may not receive proper ventilation and may overheat.

And then there are the true disasters: earthquakes, tornadoes, floods, bombs and other acts of war and terrorism, or even locusts.

It is important to note that some high-end fault-tolerant systems may be impacted by environmental and power issues just as badly as regular availability systems.



Steps Toward an Always-On Network

Everywhere we look, networks matter. Around the world, the global Internet and related information networks are transforming every aspect of business, government, education, and culture. With so much happening, at such astounding speed, it's easy to forget that the Internet and network computing are still in their infancy. We sense that even more dramatic changes are still to come, and that we have caught only a brief glimpse of the technology's ability to improve our lives. Compare today's computer networks, for example, to the global long-distance telephone network. A ten-key pad of buttons gives us access to virtually anyplace on earth, and a worldwide directory system helps us to quickly locate businesses and individuals. When we use this system, its incredible complexity is almost entirely invisible to us. Moreover, we expect this global network and its familiar dial tone to always be available. We take it for granted that the network is always on.

High availability systems, the subject of this book, represent one of the chief means to achieving the goal of always-on computer networks. This branch of systems engineering began with the need to ensure the availability of critical applications running within organizations. As information networks became more central to doing business, the number of applications deemed critical to doing business expanded and the demand for high availability systems increased. Today, with the walls of the organization coming down and businesses everywhere extending their networks and applications to the Internet, the availability of networked systems has become a top priority. In the context of e-business, downtime puts you out of business. Not only do you lose money the instant the system fails, you risk losing the attention and loyalty of your online customers.

The positive side of all this is that high-availability technology has dramatically improved to meet the e-business challenge. Clustering and failover software, once limited to exotic and extremely expensive hardware, is now available, at a much lower cost, for standard hardware and open, flexible operating environments. New management software makes it easier and more cost-effective for businesses to monitor and control availability. New guidelines and methodologies, such as those outlined in this book, are helping businesses and solutions designers intelligently build high availability into their networks.

For all that's state of the art in high availability, there can be no better guides than the authors of this book. Evan Marcus of VERITAS Software and Hal Stern of Sun Microsystems are true experts in the field of high availability, as well as hands-on scientists who understand the challenges faced by systems designers, solutions providers, and IT and e-business managers. If you represent any of those groups, you can rely on this book for information on the technologies and methods you'll need to design and implement high availability systems. In other words, this book will help you transform the vision of always-on networks into a reality.
