"This book covers a wide spectrum of topics relevant to implementing and managing a modern data center. The chapters are comprehensive and the flow of concepts is easy to understand."
Gain a practical knowledge of data center concepts
To create a well-designed data center (including storage and network architecture, VoIP implementation, and server consolidation), you must understand a variety of key concepts and technologies. This book explains those factors in a way that smooths the path to implementation and management. Whether you need an introduction to the technologies, a refresher course for IT managers and data center personnel, or an additional resource for advanced study, you'll find these guidelines and solutions provide a solid foundation for building reliable designs and secure data center policies.
* Understand the common causes and high costs of service outages
* Learn how to measure high availability and achieve maximum levels
* Design a data center using optimum physical, environmental, and technological elements
* Explore a modular design for cabling, Points of Distribution, and WAN connections from ISPs
* See what must be considered when consolidating data center resources
* Expand your knowledge of best practices and security
* Create a data center environment that is user- and manager-friendly
* Learn how high availability, clustering, and disaster recovery solutions can be deployed to protect critical information
* Find out how to use a single network infrastructure for IP data, voice, and storage
We gain strength, and courage, and confidence by each experience in which we really stop to look fear in the face ... we must do that which we think we cannot. - Eleanor Roosevelt
The need for high availability did not originate with the Internet or e-commerce. It has existed for thousands of years. When Greek warships or merchant ships sailed to discover new lands or business, the captains carried spare sails and oars on board. If the primary sail failed, the crew would immediately hoist a replacement and continue on their way, while they repaired damaged sails. With the advent of electronic sensors, the spare parts employed in industrial systems did not need human intervention for activation. In the early twentieth century, electric power-generating plants automatically detected problems, if any, in the primary generator and switched to a hot standby unit.
With the recent explosive growth of the Internet and our dependence on information systems, high availability has taken on a new meaning and importance. Businesses and consumers are turning to the Internet for purchasing goods and services. People conduct business anytime from their computer. They expect to buy clothes at 2 a.m. on the Web and expect the site to function properly, without problem or delay, from the first click to the last. If the Web site is slow or unavailable, they will click away to a competitor's site. Business globalization caused by the Internet adds another layer of complexity. A popular online store, with business located in Bismarck, North Dakota, may have customers in Asia who keep the seller's servers busy during quiet hours in the United States. Time zones, national borders, and peak and off-peak hours essentially disappear on the Web.
As computers get faster and cheaper, they are being used for more and more critical tasks that require 24-7 uptime. Hospitals, airlines, online banking services, and other service industries modify customer-related data in real time. The amount of online data is rapidly expanding. It is estimated that online data will grow more than 75 percent every year for the next several years. The rapidly increasing demand for online data and the constantly decreasing price of storage media have put huge amounts of critical information online.
Employees and partners depend on data being available at all times. Work hours have extended beyond the traditional 9-to-5, five days a week. Intranet services such as e-mail and internal applications must always be up and functional for work to continue. Every company has at least one business-critical server that supports the organization's day-to-day operation and health. The unavailability of critical applications translates to lost revenue, reduced customer service and customer loyalty, and well-paid, but idle, workers. A survey of 450 Fortune 1000 companies (conducted by the Strategic Research Division of Find/SVP) concluded that U.S. businesses incur about $4 billion of losses per year because of system or network downtime.
In fact, analysts estimate that every minute of Enterprise Resource Planning (ERP) downtime could cost a retailer between $10,000 and $15,000. Systems and data are not expected to be down, not even for maintenance. Downtime literally freezes customers, employees, and partners, who cannot even complete the most basic daily chores.
The requirements for reliability and availability put extreme demands on servers, network, software, and supporting infrastructure. Corporate and e-commerce sites must be capable of processing large numbers of concurrent transactions and are configured to operate 24-7. All components, including both the server hardware and software, must be configured to be redundant.
And what happens when no one can get to the applications? What happens when data is unreachable and the important servers do not want to boot up? Can you shut down your business and ask your employees to go home? Can you tell your customers to go somewhere else? How is it that no one planned for this scenario? Is it possible to recover from this? How long will it take and how much will it cost? What about reputation among customers? Will they ever come back? Why doesn't this happen to your competitors?
As you can see, it happens all the time and all around us. Following are some events that have occurred over the last few years. They expose our total dependence on computer systems and utter helplessness if critical systems are down.
* In April of 1998, AT&T had a 26-hour frame relay-network outage that hurt several business customers. In December of 1999, AT&T had an 8-hour outage that disrupted services to thousands of AT&T WorldNet dial-up users.
* In early 1999, customers of the popular online stock trading site ETrade could not place stock trade orders because the trading sites were down. At the same time, there were a few outages at The Charles Schwab Corporation because of operator errors or upgrades. Schwab later announced a plan to invest $70 million in information technology (IT) infrastructure.
* In June of 1999, eBay had a 22-hour outage that cost the company more than $3 million in credits to customers and about $6 billion (more than 20 percent) in market capitalization. In January of 2001, parts of the site were again down for another 10 hours.
* In August of 1999, MCI suffered about 10 days of partial outages and later provided 20 days of free service to 3,000 enterprise customers.
* Three outages at the Web retailer amazon.com during the busy holiday-shopping season of December 2000 cost Amazon more than $500,000 in sales loss.
* Denial-of-Service and virus-induced attacks on Internet servers continue to cause Web site outages. On July 19, 2002, a hacker defaced a page on the U.S. Army Research Laboratory's Web site with a message criticizing the Army's organization for bias toward certain nations.
* Terrorist attacks in Washington, D.C., New York, London, and cities around the world in recent years have destroyed several data centers and offices.
Businesses everywhere are faced with the challenge of minimizing downtime. At the same time, plans to enhance service availability have financial and resource-related constraints. Taking steps to increase data, system, and network availability is a delicate task. If the environment is not carefully designed and implemented, it will cost dearly (in terms of required time, money, and human resources) to build and manage.
To increase service availability, you must identify and eliminate potential causes of downtime, such as hardware failures, network glitches, software problems, and application bugs. Sometimes, poor server, application, or network performance is perceived as downtime. Service expectations are high. When someone wants to place a phone call, he or she picks up the phone and expects a dial tone within a few seconds. The call must connect within one second of dialing and there should be no dropped connections. When surfing the Web, users expect the first visual frame of a Web page within a few seconds of accessing the site. All systems, especially those related to consumers and critical operations, should always be ready and must operate with no lost transactions.
But potential causes of downtime abound. The entire IT infrastructure is made up of several links, such as user workstations, network devices, servers, applications, data, and so forth. If any link is down, the user is affected. It then does not matter if the other links in the chain are available or not. Downtime, in this book, is defined as an end user's inability to get his or her work done. This book examines ways to enhance service availability to the end user and describes techniques for improving network, data, server, and application uptime.
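The chain-of-links view can be put into numbers with a short sketch. It assumes that the links fail independently of one another, so the availability seen by the end user is the product of the individual link availabilities. The function name and the per-link figures below are hypothetical and chosen only for illustration.

```python
from math import prod


def chain_availability(link_availabilities):
    """Availability of a service that needs every link to be up at once.

    Assumes the links fail independently, so the fractions multiply.
    """
    return prod(link_availabilities)


# Hypothetical availability of each link between the user and the data.
links = {
    "workstation": 0.999,
    "network": 0.999,
    "server": 0.98,
    "application": 0.99,
}

end_to_end = chain_availability(links.values())
print(f"End-to-end availability: {end_to_end:.2%}")
```

Because every factor is less than 1, the end-to-end figure is always lower than that of the weakest link, which is why a single unreliable component dominates what the user experiences.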
Availability is the portion of time that an application or service is available to internal or external customers for productive work. The more resilient a system or environment is, the higher the availability is. An important decision is the required availability level. When you ask a user or project manager how much uptime he or she needs for the application, the reflex answer is "One-hundred percent. It must always be available at all times." But when you explain the high costs required to achieve 100 percent uptime, the conversation becomes more of a two-way negotiation. The key point is to balance downtime cost with availability configuration costs.
Another point is the time duration when 100 percent uptime is necessary. Network Operations Center (NOC) and 24-7 network monitoring applications and e-commerce Web sites require 100 percent uptime. On the other extreme are software development environments, used only when developers are accessing the system. If, on occasion, you take development systems down (especially at night or on weekends), and if you warn your users well in advance, downtime is not an issue.
Table 1-1 illustrates how little time per year is afforded for planned or unplanned downtime as availability requirements move closer to 100 percent. Suppose a server, "hubble," has no special high-availability features except for RAID-1 volumes and regular backups and has 98 percent uptime. The 2 percent downtime is too high and, therefore, it is clustered with "casper," which also has 98 percent uptime. Server "casper" is used only during the 2 percent of the time when hubble is down. Casper therefore contributes 98 percent of that 2 percent (0.98 × 2), which is 1.96 percent. Added to hubble's 98 percent, this gives a theoretical service uptime of 99.96 percent for the two-node cluster.
In reality, several other factors affect both servers, such as downtime during failover duration, power or network outages, and application bugs. These failures will decrease the theoretical combined uptime.
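The hubble and casper arithmetic can be checked with a minimal sketch. It makes the same idealized assumptions as the theoretical figure: the two servers fail independently and failover takes no time. The function name is hypothetical.

```python
def failover_pair_availability(primary, standby):
    """Theoretical availability of a two-node failover cluster.

    The standby is needed only while the primary is down, so it adds
    its own availability multiplied by the primary's downtime fraction.
    Assumes independent failures and instantaneous failover.
    """
    return primary + (1.0 - primary) * standby


# hubble and casper each have 98 percent uptime.
combined = failover_pair_availability(0.98, 0.98)
print(f"Theoretical cluster uptime: {combined:.2%}")  # 99.96%
```

The same result follows from asking when both servers are down together: 2 percent of 2 percent is 0.04 percent, which leaves 99.96 percent.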
As you move down the table, the incremental costs associated with achieving the level of availability increase exponentially. It is far more expensive to migrate from a "four-nines" to a "five-nines" (99.99 percent to 99.999 percent uptime) configuration than to move from 99 percent to 99.9 percent uptime.
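The downtime allowed at each availability level is simple arithmetic, sketched below. The figures are computed from a 365-day year and are not copied from the book's table. The function name and the list of levels are illustrative choices.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent):
    """Minutes of downtime per year allowed at a given availability level."""
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR


for level in (98.0, 99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(level)
    print(f"{level}% uptime allows {minutes:,.1f} minutes "
          f"({minutes / 60:,.1f} hours) of downtime per year")
```

Each additional nine cuts the allowance by a factor of ten: 99.9 percent leaves about 8.8 hours a year, and 99.999 percent leaves a little over 5 minutes.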
Causes of Downtime
About 80 percent of the unplanned downtime is caused by process or people issues, and 20 percent is caused by product issues. Solid processes must be in place throughout the IT infrastructure to avoid process-, people-, or product-related outages. Figure 1-1 and Table 1-2 show the various causes of downtime. As you can see, planned or scheduled downtime is one of the biggest contributors (30 percent). It is also the easiest to reduce. It includes events that are preplanned by IT (system, database, and network) administrators and usually done at night. It could be just a proactive reboot. Other planned tasks that lead to host or application outage are scheduled activities such as application or operating system upgrades, adding patches, hardware changes, and so forth.
Most of these planned events can be performed without service interruption. Disks, fans, and power supplies in some servers and disk subsystems can be changed during normal run-time, without need for power-offs. Data volumes and file systems can be increased, decreased, or checked for problems while they are online. Many applications can be upgraded while they are up, although some must be shut down before an upgrade or a configuration change.
Outages for planned activities can be avoided by having standby devices or servers in place. Server clustering and redundant devices and links help reduce service outages during planned maintenance. If the application is running in a cluster, it can be switched to another server in the cluster. After the application is upgraded, the application can be moved back. The only downtime is the time duration required to switch or failover services from one server to another. The same procedure can be used for host-related changes that require the host to be taken off-line. Apart from the failover duration, there is no other service outage.
Another major cause of downtime is people-related. It is caused by poor training, a rush to get things done, fatigue, lots of nonautomated tasks, or pressure to do several things at the same time. It could also be caused by lack of expertise, poor understanding of how systems or applications work, and poorly defined processes. You can reduce the likelihood of operator-induced outages by following properly documented procedures and best practices. Organizations must have several easy-to-understand how-tos for technical support groups and project managers. The documentation must be placed where it can be easily accessed, such as internal Web sites. It is important to spend time and money on employee training because in economically good times, talented employees are hard to recruit and harder to retain. For smooth continuity of expertise, it is necessary to recruit enough staff to cover emergencies and employee attrition and to avoid overdependence on one person.
Avoiding unplanned downtime takes more discipline than reducing planned downtime. One major contributor to unplanned downtime is software glitches. The Gartner Group estimates that U.S. companies suffer losses of up to $1 billion every year because of software failure. In another survey conducted by Ernst and Young, it was found that almost all the 310 surveyed companies had some kind of business disruption. About 30 percent of the disruptions caused losses of $100,000 or more each to the company.
When production systems fail, backups and business-continuance plans are immediately deployed and are worth their weight in gold, but the damage has already been done. Bug fixes usually arrive in reaction to the outages that the bugs cause. As operating systems and applications get more and more complex, they will have more bugs. On the other hand, software development and debugging techniques are getting more sophisticated. It will be interesting to see if the percentage of downtime attributed to software bugs increases or decreases in the future. It is best to stay informed of the latest developments and keep current on security, operating system, application, and other critical patches. Sign up for e-mail-based advisory bulletins from vendors whose products are critical to your business.
Environmental factors that can cause downtime are rare, but they happen. Power fails. Fires blaze. Floods gush. The ground below shakes. In 1998, the East Coast of the United States endured the worst hurricane season on record. At the same time, the Midwest was plagued with floods. Natural disasters strike unpredictably and adversely impact business operations. And, to add to all that, there are disasters caused by human beings, such as terrorist attacks.
The best protection is to have one or more remote, mirrored disaster recovery (DR) sites. In the past, a fully redundant system at a remote DR site was an expensive and daunting proposition. Nowadays, conditions have changed to make it very affordable:
* Hardware costs and system sizes have fallen dramatically.
* The Internet has come to provide a common network backbone.
* Operating procedures, technology, and products have made an off-site installation easy to manage remotely.
To protect against power blackouts, use uninterruptible power supplies (UPS). If an Internet connection is critical, use two Internet access providers or at least separate, fully redundant links from the same provider.
Cost of Downtime
Organizations need to cost out the financial impact caused by downtime. The result helps determine the extent of resources that must be spent to protect against outages. The total cost of a service outage is difficult to assess. Customer dissatisfaction, lost transactions, data integrity problems, and lost business revenue cannot be accurately quantified. An extended period of downtime can result in ruin and, depending on the nature of the business, the hourly cost of business outage can be several tens of thousands of dollars to a few million dollars. Table 1-3 provides some examples of downtime costs.
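A first-pass estimate of the directly measurable loss can be sketched as outage duration multiplied by a per-minute cost. The sketch below uses the analysts' range of $10,000 to $15,000 per minute of ERP downtime quoted earlier in this chapter. The function name and the one-hour outage are hypothetical. The result leaves out customer dissatisfaction, data integrity problems, and the other losses that cannot be accurately quantified.

```python
def outage_cost(minutes_down, cost_per_minute):
    """Direct revenue loss for an outage, in dollars.

    Covers only the quantifiable part of the loss. Reputation and
    customer loyalty are outside the scope of this estimate.
    """
    return minutes_down * cost_per_minute


# Hypothetical one-hour ERP outage at a retailer, priced with the
# per-minute range cited earlier in the chapter.
low = outage_cost(60, 10_000)
high = outage_cost(60, 15_000)
print(f"One-hour ERP outage: ${low:,} to ${high:,}")  # $600,000 to $900,000
```

Comparing an estimate like this with the price of redundant hardware and links gives a rough basis for deciding how much to spend on protection.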
Excerpted from Administering Data Centers by Kailash Jayaswal. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.
About the Author.
Part One: Data Center Basics.
Chapter 1: No Time for Downtime.
Chapter 2: The High-Availability Continuum.
Part Two: Data Center Architecture.
Chapter 3: Data Center Requirements.
Chapter 4: Data Center Design.
Chapter 5: Network Infrastructure in a Data Center.
Chapter 6: Data Center Maintenance.
Chapter 7: Power Distribution in a Data Center.
Chapter 8: Data Center HVAC.
Part Three: Data Center Consolidation.
Chapter 9: Reasons for Data Center Consolidation.
Chapter 10: Data Center Consolidation Phases.
Part Four: Data Center Servers.
Chapter 11: Server Performance Metrics.
Chapter 12: Server Capacity Planning.
Chapter 13: Best Practices in IT.
Chapter 14: Server Security.
Chapter 15: Server Administration.
Chapter 16: Device Naming.
Chapter 17: Load Balancing.
Chapter 18: Fault Tolerance.
Chapter 19: RAID.
Part Five: Data Storage Technologies.
Chapter 20: Data Storage Solutions.
Chapter 21: Storage Area Networks.
Chapter 22: Configuring a SAN.
Chapter 23: Using SANs for High Availability.
Chapter 24: IP-Based Storage Communications.
Part Six: Data Center Clusters.
Chapter 25: Cluster Architecture.
Chapter 26: Cluster Requirements.
Chapter 27: Designing Cluster-Friendly Applications.
Part Seven: Network Design, Implementation, and Security.
Chapter 28: Network Devices.
Chapter 29: Network Protocols.
Chapter 30: IP Addressing.
Chapter 31: Network Technologies.
Chapter 32: Network Topologies.
Chapter 33: Network Design.
Chapter 34: Designing Fault-Tolerant Networks.
Chapter 35: Internet Access Technologies and VPNs.
Chapter 36: Firewalls.
Chapter 37: Network Security.
Part Eight: Disaster Recovery.
Chapter 38: Disaster Recovery.
Chapter 39: DR Architectures.
Part Nine: Future Considerations.
Chapter 40: Voice over IP and Converged Infrastructure.
Chapter 41: What’s Next.
Part Ten: Appendix.
Appendix A: Storage and Networking Solutions.