Windows NT Clustering Blueprints

Windows NT Clustering Blueprints

by Mark A. Sportack, James L. Mohler
COPY: Windows NT Clustering Blueprints is your one-stop guide to understanding essential clustering concepts and implementations. With this easy-to-understand guide, you'll have all the information you need to take your clustering skills to new heights. Written by an industry expert with years of real-world experience, this in-depth resource takes you through the


COPY: Windows NT Clustering Blueprints is your one-stop guide to understanding essential clustering concepts and implementations. With this easy-to-understand guide, you'll have all the information you need to take your clustering skills to new heights. Written by an industry expert with years of real-world experience, this in-depth resource takes you through the essentials of Windows NT clustering. You'll get extensive coverage of how Microsoft's Cluster Server operates, as well as how to build a cluster. Also covered are close looks at how a fail-over works, Windows NT Server, complex clusters, and clustering for competitive advantage. With its easy-to-follow approach, you will expand your knowledge of Windows NT clustering. BULLETS: - Gain a thorough understanding of clustering - Examine implementation issues such as programming for parallel environments - Get expert tips and advice on clustering for availability, scaleability, and competitive advantage - Master the intricacies of Microsoft's Cluster Server - Uncover the essentials of network-friendly clustering and cluster-friendly networking - Find out what hybridized clusters are and how to use them effectively - Learn how to build a new NT cluster as well as how to cluster an existing application

  • Approachably written, providing an introduction to the set of technologies that is encompassed by clustering
  • Provides an unbiased and non-marketing approach to the benefits, limitations and trade-offs of this branch of parallel computing

Product Details

Macmillan Publishing Company, Incorporated
Publication date:
Edition description:
1 ED
Product dimensions:
7.66(w) x 9.45(h) x 1.08(d)

Related Subjects

Read an Excerpt

Windows NT Clustering Blueprints -- Ch 3 -- Clustering for Scalability

[Figures are not included in this sample chapter]

Windows NT Clustering Blueprints

- 3 -

Clustering for Scalability

Scalable clusters could become the Grail of distributed parallel computing. Unlike
clustering for high reliability through redundancy and topological variation, scalability
is a much more complex goal. Achieving massive scalability might not be possible
with today's technology.

This chapter defines scalability, examines its implications for clustering, and
identifies the chronic performance bottlenecks that impede scalability in today's
systems. It also covers new and emerging technologies that can eliminate many of
these bottlenecks, and presents specific recommendations for developing clusters
that can be readily scaled.

3.1. Introduction

Scalability refers to the capability to expand a system. Unlike availability in
clusters, scalability is not a function of topological variation. Rather, scalability
is the product of the physical platform's overall design. It has two distinct components:
capacity and performance.

Capacity scalability is the capability to add system resources such as RAM, CPUs,
storage facilities, and so on, to an existing system or cluster. As explained in
Chapter 1, "Introduction to Clustering," scalability comes in many forms.
Computer systems tend to be internally scalable--up to the limit of their fixed capacity.

Clusters composed of either uniprocessor or multiprocessor systems can be both
internally scalable and externally scalable. External scalability means adding additional
self-contained resources (additional computer systems) to the cluster.

Yet even this cluster-level scalability has capacity limits. Most cluster management
products, whether planned or available, have finite limits on the extent to which
the cluster can scale. These limits, though rooted in software, are no different
in their effects on the internal scalability of a computer system's hardware than
are architectural limits. They are simply different expressions of capacity limitations.
Capacity limits are imposed by the inefficiencies of the technologies forming either
a system or cluster. In other words, capacity limitations tend to be a function of
performance limitations.

This chapter explores some of the more critical aspects of massive scalability
and the emerging standards, technologies, and architectures that can make massively
scalable clusters a reality.

3.2. Exploring System Bottlenecks

Every computer system contains fundamental performance constrictions known as
bottlenecks. The problem with bottlenecks is that they are relative: what
constitutes a problematic performance constriction for one application might not
be problematic for another application. Figure 3.1 illustrates the conceptual relationship
between an application's performance requirements and the system's capacity to satisfy
those requirements.

Figure 3.1. Conceptual view of a system bottleneck.

The bottleneck illustrated in Figure 3.1 can represent any one of a computer system's
physical resources: CPU, memory, I/O, and so on. Consequently, a system's performance
is likely to contain a series of bottlenecks.

Historically, one of the biggest limitations on computing performance for all
application types has been input/output (I/O) technologies. Because there
are enormous disparities between the speeds of I/O technologies and CPUs, I/O is
the most expensive function a computer can perform. The reason for this is the amount
of CPU cycles that are, or can be, wasted just waiting for the I/O request to complete.

Many workarounds to common bottlenecks have been developed, including paging,
multiprogramming, and multithreading. These techniques minimize the effects of performance
disparities without really addressing them. Each technique's effectiveness varies
with the application type. Different applications have different degrees of hardware
intensity. Scientific applications, for example, tend to be more CPU-intensive than
anything else. Therefore a processing platform built for the scientific community
should be optimized for CPU-intensive operations.

NOTE: Paging, multiprogramming, and multithreading are different techniques
for using the finite resources of a system more effectively. Although each has a
common goal, they take very different approaches.

Paging involves the temporary swapping of memory-resident data and/or instructions
to a disk cache. Ostensibly, this occurs because the paged matter was in a wait state,
or a higher priority instruction came along. Either way, the operating system cached
it to disk to make better use of the CPU's time. Because of the speed mismatch between
memory and disk drives, excessive paging can actually decrease system performance.

Multiprogramming, also known as multitasking, is a commonly used technique that
enables two or more different applications to execute simultaneously. This is possible
because of the speed mismatches that exist between the CPU, memory, and the various
I/O devices. In its simplest form, while one program's instruction is awaiting the
completion of an I/O request, it is preempted by an instruction from the other program.
For more information on multiprogramming, please refer to Chapter 10, "Programming
for Parallel Environments."

Multithreading is multitasking, but within a single program.

Business applications--Executive Information Systems (EIS), for example--tend
to be more disk- and I/O-intensive. Other decision support systems, including data
mining and warehousing, can be both I/O- and CPU-intensive. The point is that workarounds
are not panaceas. They can provide relief only if they are well matched to the needs
of the application, and if they help to mask the limitations of the hardware from
that application.

Extending this logic reveals the quandary: hardware platforms are frequently generic,
with limited potential for customization. Thus, developing a cluster platform that
is capable of massive scalability requires a system whose basic components are as
closely matched as possible.

Massive scalability requires that performance bottlenecks be significantly reduced
or eliminated. Figure 3.2 illustrates conceptually a computer system that is constructed
by a perfectly matched set of technologies. Each component is capable of the same
level of performance as every other component in the system; consequently, there
are no bottlenecks.

Unfortunately, the constitution of a perfect system depends on its intended function.
A perfect system for an I/O-intensive application might be absolutely punishing to
a CPU-intensive application. Thus, a universally perfect system, with no bottlenecks,
might not ever be commercially available or affordable.

Given this caveat, the single greatest performance bottleneck in most systems
is its I/O mechanism. Several concepts are emerging that promise to improve the scalability
of clusters either by avoiding the I/O bus entirely or by at least changing the paradigm
for accommodating I/O requests. These concepts establish new input/output models
and are presented in the rest of this section.

Figure 3.2. Conceptual view of a perfect system with no bottlenecks.

3.2.1. No Remote Memory Access (NORMA)

The No Remote Memory Access, or NORMA, I/O model is the conventional form
of memory access that most people are familiar with. In fact, this model might be
so familiar that it is taken for granted. NORMA requires that all requests for access
to memory originate locally and pass through the local I/O bus. Figure 3.3 illustrates
this model.

Figure 3.3. The NORMA I/O model.

Traditional LANs use the NORMA I/O model. They access the I/O bus through a network
interface card
(NIC). The NIC is responsible for translating between two different
data transport formats: the bus and the LAN frames.

3.2.2. Non-Uniform Memory Access (NUMA)

Non-Uniform Memory Access, or NUMA, is a hybridized I/O model. It still
requires requests for memory access to pass through the I/O bus, but this is satisfied
in a non-uniform manner. It is non-uniform because requests for memory access do
not have to originate locally and, therefore, require a non-uniform amount of time
to satisfy. An external switching mechanism enables remote hosts--other cluster nodes--to
directly address and access the memory resources of any given host.

Figure 3.4 illustrates the NUMA model.

Figure 3.4. NUMA provides for direct memory access by foreign hosts.

Specific examples of technologies that use the NUMA model are ServerNet and PCI-SCI.
Each of these technologies is described in more detail in this chapter.

3.2.3. Cache Coherent Non-Uniform Memory Access (ccNUMA)

Cache Coherent Non-Uniform Memory Access is also referred to as ccNUMA.
It is a bit more avant-garde than both NORMA and NUMA because it is a direct memory
access methodology. ccNUMA bypasses the I/O bus entirely.

Figure 3.5 illustrates the direct memory access of the ccNUMA I/O model.

Figure 3.5. ccNUMA provides for direct memory access.

ccNUMA is more an emerging technology than a fully developed product. By completely
bypassing the I/O bus, it offers the greatest relief from today's processing bottlenecks.
Unfortunately, it is also the least mature of the three I/O models, and will take
quite some time to become generally available in low-end computing platforms.

3.2.4. I/O Model Implementation

Understanding these I/O models is important when you're developing a highly scalable
cluster. Because it distributes a parallel processing environment across multiple
computers, a cluster is intrinsically more I/O-intensive than a comparable amount
of computing power contained in a single chassis. A cluster requires I/O to accept
incoming requests, to access the requested data stored on disks, and to transmit
the requested data.

What is more important, a cluster relies on I/O for its own management. Load balancing,
fail-overs, and other intracluster messaging all rely on I/O. These characteristics
require the highest level of performance possible. Failure to provide a robust network--a
Cluster Area Network (CAN)--in this area will directly impede the capability
of the cluster to manage itself.

Regardless of which I/O model is chosen, it should be hidden from the application
by the operating system or the cluster management utilities. This will ensure identical
application behavior, regardless of the physical platform.

Building in meaningful platform-independence will enable you to "future-proof"
your cluster. As more sophisticated clustering hardware platforms become available
(such as those that utilize NUMA or ccNUMA I/O models), you will be free to upgrade
the physical platform without having to worry about re-engineering the application

As cluster-specific technologies and protocols emerge, these I/O models will become
more familiar. They will be one of the keys to developing massively scalable clustering
platforms by making the cluster area network a semi-internal function of each clustered

3.3. Scalable Coherent Interface (SCI)

One recent standard designed to overcome the performance limitations of I/O was
defined in the IEEE/ANSI standards numbered 1596-1992. Cumulatively, they define
a Scalable Coherent Interface (SCI). SCI is an interface that uses the NUMA
I/O model. It is designed to provide scalability for clusters built from very high
performance multiprocessor systems. Using SCI, these clusters theoretically can grow
to a maximum of 65,536 nodes.

NOTE: SCI is intended as a replacement for traditional Local Area Network
technologies in cluster interconnect networks. Improvements in network efficiency
in this area directly affect the efficiency of cluster management.

SCI enables massively scalable clustering by focusing on performance scalability,
which operates on the theory that capacity limitations are a function of performance
limitations. Expanding the performance envelope is accomplished by providing a reliable,
efficient, high-bandwidth intracluster message passing mechanism. This will enable
hardware manufacturers to develop products with greater capacities so that more nodes
can be networked together in a cluster without the performance limitations of traditional

SCI provides its efficiency and bandwidth via direct mapping of protected network
memory across a highly specialized and efficient communications vehicle. This vehicle
functionally extends the I/O bus of a computer to other external computers. Avoiding
the high overhead conversion from I/O bus format to LAN frames enables the cluster's
performance to be much more scalable. This scalability is possible because the disparity
between the performance capacities of the CPU and network is substantially reduced.

Clustered nodes can be networked together directly, without traditional Local
Area Networking, in one of two basic topologies: ring or switched.

Ring topologies, even an SCI ring, suffer from a basic flaw. Each connection in
the ring constitutes a single point of failure. The failure of any one of them results
in the failure of the entire ring. The risks inherent in this topology increase directly
with the size of the cluster. It might be acceptable for small clusters of only a
few nodes, but the risks become unacceptable as the number of nodes increases. For
example, imagine a massive cluster of 8,000 nodes, each containing 8 CPUs that support
a mission critical application. This cluster would contain 64,000 CPUs, the largest
number that SCI can support. This cluster would contain 8,000 single points of failure.
This would probably be considered an unacceptable level of risk by any rational person.

The SCI ring topology is illustrated in Figure 3.6.

Figure 3.6. SCI ring topology.

A better alternative is the switched topology. Switched topologies, in general,
avoid many of the risks inherent in other network topologies. Each node enjoys its
own dedicated connection to the switch. The switch also enables any-to-any connectivity
through a hardware level mechanism. This ensures that switched connections occur
at wire speed.

The SCI switched topology is illustrated in Figure 3.7.

Figure 3.7. SCI switched topology.

In theory, each of the cluster node connections to the SCI switch is also a single
point of failure. The difference between the switched and ring implementations of
this technology is that if a switched connection fails, the cluster "fails-over"
to using the surviving nodes. In a ring topology failure, the entire cluster fails.

Early proponents of SCI include Novell, Sun Microsystems, and Data General. Each
of these companies uses SCI as a component in its clustering products.

3.3.1. SCI Addressing

SCI uses a 64-bit address that is unique to SCI. Up to 16 bits of the address
are used for node identification. Mathematically, this limits the number of potential
nodes in the cluster to 65,536. This maximum is defined by the address architecture
of SCI. It should not be misconstrued as a practical upper limit on the size of a

The remainder of the bits in the SCI address constitutes a single field that identifies
a physical memory address in the remote node. This memory address field contains
the actual page address in memory, as well as some attributes, that the remote node
is requesting to access.

Figure 3.8 illustrates the SCI address structure.

Figure 3.8. SCI address structure.

The memory address space can be segmented to contain both a physical memory address
and a network control register address. Local memory addresses are mapped to remote
memory addresses by SCI page descriptors. These descriptors are cached in the SCI
adapter card and paged in on demand from PCI memory, allowing the PCI memory to be
protected, yet remain accessible from the other nodes on the SCI network.

3.3.2. PCI-SCI Adapter

The SCI interface is available for the PCI bus. Support for other bus architectures
might be developed in the future. However, the PCI-SCI adapter bridges the gap between
many contemporary processing platforms that support NT, and the future. In theory,
when the SCI adapter becomes available for different bus types, SCI will be able
to integrate disparate processing platforms into one cluster. This assumes that the
application, cluster management software, network operating system, and so on are
similarly platform-independent.

The PCI-SCI adapter is a 32-bit card that operates at 33 MHz. It provides a translating
interface between the PCI bus interface and the SCI interface. Figure 3.9 illustrates
the functions of this adapter.

This adapter also features a 16-bit, 100 MHz path for linking to other cluster
hosts. This path contains Link In and Link Out paths that can be used
to build the SCI ring. A scatter/gather engine is used to access the processor's
Direct Memory Access (DMA) channels and to move data between the PCI and SCI

Figure 3.9. Functionality of the PCI-SCI adapter.

An address translation table, known as an SCI Page Table map, is used to map local
SCI addresses to a remote one. This translation table resides in the SCI switch,
and is updated by each of the SCI adapters' scatter/gather engines.

3.3.3. SCI Fault Tolerance

SCI was designed to be a cluster-enabling technology, so it should come as no
surprise that it contains many native fault tolerance features.

Two basic diagnostic mechanisms are built in to SCI. First, the SCI adapter card
can be detected via the loss of communications with other nodes on the SCI network.
The second diagnostic tool is designed to detect software failures by monitoring
a heartbeat signal in the remote node's memory. In combination, these tools enable
the SCI network, and the cluster it interconnects, to detect and recover from failures.

For example, a failure in an SCI network undergoes the following diagnostic routine.
If the SCI connection to a remote node is "up," but that node can't be
communicated with, the heartbeat is checked. If the system itself had experienced
a hardware failure, then, depending on the nature of the failure, either the communications
link itself or the heartbeat would both cease. If the communications link remains
up and the heartbeat continues to "beat," that node has suffered a software
failure. The cluster can then take the appropriate remedial actions.

Other built-in fault tolerance mechanisms can be found in the topological choices
supported by SCI. Small clusters might find that rings are the best approach. These
are inexpensive, but contain single points of failure. Consequently, they are not
very fault tolerant.

The switched topology offers more per-node fault tolerance by eliminating the
multiple single points of failure that would exist in an SCI ring. The customer can
select the configuration that provides whatever degree of fault tolerance they require
or can afford.

3.4. ServerNet

Tandem Computers, Inc. has developed a semi-internal network specifically for
use as a cluster area network. Because it is semi-internal, they regard it as a System
Area Network
(SAN). Tandem's trade name for this network technology is ServerNet.

ServerNet is designed to be the cluster area network for next generation, fault
tolerant, parallel servers. It features the NUMA I/O model, and is designed to be
the foundation for scalable clustering.

Like SCI, ServerNet provides the means for protected remote access of memory via
DMA channels. It also provides for reliable, efficient, high-bandwidth message passing
between clustered nodes without the overhead associated with network technologies
that use the NORMA I/O model. An interesting twist is that ServerNet supports queuing
of multiple interrupts with data. This minimizes the amount of overhead that would
otherwise be required to get an interrupt and then pass data. This is known as interrupt
with data
, but is for use only with relatively small transfers.

To help ensure ServerNet becomes a de facto industry standard, Tandem has decided
to license this technology to Original Equipment Manufacturers (OEMs).

3.4.1. ServerNet Addressing

ServerNet uses a 52-bit address. The first 20 bits identify the destination node.
This provides a maximum of one million possible nodes in a ServerNet. As with SCI,
this should be considered a mathematically possible maximum only and not a practical

The remaining 32 bits are the Address Validation and Translation Table
(AVTT) descriptor address. This address is used by the receiving, or destination,
node to locate a descriptor in the AVTT.

The AVTT descriptor contains the following fields:

  • PCI page number

  • Upper and lower access bounds

  • Source node ID

  • Access rights

  • Low-order, or last, 12 bits of AVTT address (This is an offset into the 4K page
    used by the ServerNet operation. It identifies whether the packet contains a READ
    or WRITE instruction.)

The ServerNet address structure is illustrated in Figure 3.10.

Figure 3.10. ServerNet address structure.

3.4.2. PCI ServerNet Adapter

The PCI-based version of the ServerNet adapter is a 32-bit card that operates
at 33 MHz. It also contains two 8-bit ports, labeled "X" and "Y,"
that operate at 50 MHz. These ports provide the basis for redundant network topologies.

This adapter contains two Block Transfer Engines (BTEs) that move data
between PCI and ServerNet addresses. The Address Validation and Translation Table
(AVTT) is used to validate and translate the addressing between ServerNet and PCI

Figure 3.11 shows a logical view of the ServerNet adapter's functionality.

Figure 3.11. Logical view of the ServerNet PCI adapter card function.

3.4.3. Fault Tolerance in ServerNet Networks

ServerNet provides several mechanisms for supporting fault tolerant operation
of a cluster. A failure of the adapter card itself is detected by other ServerNet
devices by a communications failure.

As happens with the SCI mechanism, software failures can be detected through the
use of a heartbeat that is kept in remote memory and monitored.

ServerNet also supports development of redundant network topologies through the
use of its X and Y ports. Figure 3.12 illustrates two separate ServerNet networks,
the X and Y networks, that provide complete path redundancy for intracluster communications.
These topological variations are at the customer's discretion.

Figure 3.12. Redundant X and Y System Area Networks in a single cluster.

3.5. Virtual Interface Architecture (VIA)

A relative newcomer to the field of low-end cluster scalability is Virtual
Interface Architecture
(VIA). VIA is still in draft stage, and is being reviewed
by players in the low-end computing industry. Detailed information about VIA is not
yet available, and even the information presented here is subject to change.

VIA was launched by Microsoft, COMPAQ, and Intel as a draft specification. The
goal of this draft was to define a standard that would encompass all future clusters
constructed of low-end computing components.

Although VIA includes provisions for supporting clusters of RISC-based com-puters,
this initiative is distinctly aimed at x86-based servers. The goal is to make low-end
clusters of x86 servers scalable enough to become competition for mid-range, and
even mainframe, computers. Because of these provisions for RISC support, VIA, in
theory, will enable the construction of mixed architecture clusters. x86 and RISC
machines will be able to interoperate in VIA-compatible clusters. This will facilitate
the extension of low-end computers into the markets traditionally served by more
expensive mid-range computers.

The breadth of its mission makes VIA an intentionally broad and generic proposal
for a System Area Network. If it is successful, it will help speed the maturation
of standardized clusters constructed from low-end commodity components. This, in
turn, will expedite the development of cluster-aware commercial applications. In
the absence of an industry standard, such applications would be slow to emerge.

3.5.1. An "Architecture," Not a "Product"

VIA, theoretically, will describe an interface architecture that allows for low-
latency, high-bandwidth connections between clustered nodes. VIA is not a product,
so eventually existing products might be modified for VIA compatibility. At least
one manufacturer of SCI adapters is already gearing up to produce VIA-compatible
PCI-SCI adapters.

Similarly, other manufacturers are investigating the potential to develop VIA-compatible
ATM, Fast Ethernet, Gigabit Ethernet, and Fibre Channel adapters. These manufacturers
are less interested in becoming VIA-compliant than in seeking other avenues for demonstrating
the potential of their products.

Regardless of their motivations, VIA could help network technologies that were
previously reserved for high-end systems into the affordable range of options for
low-end computing systems and clusters.

3.5.2. Architecture by Committee?

Standardizing a far-flung collection of components can have its risks. These risks
are magnified because the specification is being drafted by a large committee. Techno-politics
are sure to be responsible for more than a little of the finished specification because
the players are motivated to protect their own interests.

Even in this era of "open" computing, manufacturers have always tried
to differentiate their product by adding bells and whistles. Sometimes, these features
compromised the openness of the product. Thus customers could purchase open products
but still find themselves tied to a single vendor, depending on how the products
were implemented.

The nature of clusters, particularly VIA-compatible clusters, should preclude
this behavior. The architecture must allow for disparate architectures to be so tightly
coupled that they function as a logical, single computer. It seems logical, then,
that the architecture be fairly loose so it can be as open as possible.

Loose architectures bring the risk of suboptimal performance levels. At least
one participating manufacturer openly doubts the capability of VIA-compatible products
to match the performance of its proprietary technology. This manufacturer isn't doubting
that architecturally correct products will work; it's doubting that it will work
as well as noncompliant products.

The draft VIA specification is currently being reviewed by more than 50 companies.
These companies comprise the full spectrum of manufacturers in low-end computing.
They include Hewlett-Packard, Novell, Oracle, Santa Cruz Operation, and many others.
VIA appears to have momentum, but so did the ATM Forum in 1993. Only time, and the
market, will tell whether VIA is ultimately successful and effective in satisfying
its mission.

3.6. Clustering for Future Scalability

The goal of any future scalable cluster should be that it is easy to administer
and use as a single, self-contained system. It should also be as easy to program
as any multiprocessor system. These relatively simple goals can be deceptively difficult
to achieve.

The following basic guidelines should help to future-proof the scalability of
your cluster.

3.6.1. Networks

The network is probably the easiest component of a cluster to scale up. All the
major Local Area Network(LAN) manufacturers presently make switching hubs that feature
modular upgrades and multitopology support. In the short term, these hubs will make
excellent platforms for supporting a cluster that is likely to experience significant
increases in use.

Recognize, however, that conventional Local Area Networks contain a fair bit of
latency and overhead relative to the demands of intracluster communications. The
best way to proactively accommodate future scalability is to deploy the highest bandwidth,
lowest latency networks possible for intracluster communications.

Another way to extend the life of conventional LAN technologies in a cluster is
to deploy one for each of the cluster's network functional areas. By using one LAN
for client access, and another for intracluster communications, both networks will
be able to better serve their intended purpose. Thus, they will both be able to serve
longer as the cluster scales upwards.

In the long term, as the semi-internal bus extension networks of NUMA and ccNUMA
become stable and well supported, consideration should be given to using them instead
of a formal Local Area Network technology for intracluster communications. Given
this as a likely future migration path, it is logical to establish whether or not
the products you use in assembling your cluster will support this change in I/O models.

3.6.2. Platforms

Processing platforms can be easily built for future scalability, up to a point.
Developing a cluster with lightly loaded symmetrical multiprocessors creates the
opportunity to provide future scalability by adding memory, CPUs, and disks as they
are needed. This is the easy way to provide scalability and is directly analogous
to using modularly upgradable switching hubs. Both provide the capability to add
capacity incrementally--without a forklift.

Depending on your choice of cluster management software, an SMP cluster can be
scaled beyond the internal limitations of its nodes by adding additional nodes. Again,
the degree to which this form of scalability can be utilized will depend directly
on your choice of cluster management software. Most of the products available for
low-end computing are focused on reliability. Scalability has distinct and finite
upper limits because of the performance issues described throughout this chapter.
These limits will be eased somewhat over time as the cluster management software

In the long term, the only way to provide high levels of scalability is to eliminate
bottlenecks. This must begin with a thorough understanding of the clustered applications
resource needs and end with implementation of technologies that are well suited to
those needs. Given the intrinsic relationship between the performance of cluster
management and I/O, eliminating I/O bottlenecks will be a priority for any cluster,
regardless of the application.

Improving the throughput of I/O functions can be done by building your cluster
with the most robust bus architecture possible. Then, further improvements can be
gained by streamlining external I/O functions using either a NUMA or ccNUMA technology
to extend that I/O bus to other clustered nodes.

3.6.3. Cluster Topologies

Cluster topologies, too, can directly affect the future scalability of your cluster.
Shared disk topologies, although relatively inexpensive to build and maintain, are
not very scalable. The complexities of trying to share disks across more than two
or three clustered nodes become obvious quickly. The increased competition for disk
access will also adversely affect the performance of a clustered application.

The bottom line is avoid shared disk clusters, if your future goal is scalability.
Shared nothing clusters are much more versatile and scalable, though they are a bit
more expensive to purchase and operate.

Some additional topological customization can also make any cluster slightly more
scalable. Chapter 4, "Complex Clusters," presents several ways that the
functions of a clustered application can be split among two or more nested or layered
clusters. This functional segregation permits each cluster's hardware to be optimized
for its particular function. This increases the effectiveness with which it can perform,
and might well forestall the need for reinvestment as the cluster's workload increases.

3.6.4. System-Level Software

The system-level software in a cluster includes operating systems and cluster
management software. Both are equally important to the future scalability of a clustered
application. They should work together to hide component failures from the users
and applications. They should also be well matched to each other's performance levels.
Otherwise, it is possible that one or the other will become a new bottleneck. Unfortunately,
this type of bottleneck will become apparent only after the cluster has been implemented
and actually starts to experience increased usage.

Make sure that both the operating system and the cluster management software are
capable of handling the projected work load. More importantly, verify that both are
capable of handling the specific types of processing that the cluster will support.
Different application types can have very different processing and resource requirements.

3.6.5. Application Software

Finally, consider the application software itself. Building the most technically
elegant cluster will do your users no good if the application software is not cluster-aware!
Whether you install shrink-wrapped software or have an in-house custom development
group, the application software must be aware of the features and services of the
cluster. This software must be able to interact with either the operating system
or the cluster management software. Thus, it is critical that they use industry standard
Application Programming Interfaces (APIs).

More importantly, the application should be engineered with future scalability
in mind. Each application can have its own performance bottlenecks that must be minimized.
The application must be designed not only to process in parallel, but to process
in distributed parallel.

3.7. Summary

There is no one thing that can be done to ensure the future scalability of your
cluster. Rather, scalability is the aggregate benefit of many underlying factors.
Each one must be identified, understood, and either eliminated or worked around in
such a way that the resource intensity of the application and cluster is reduced.
Minimizing resource intensity is the surest way to build future scalability into
any application, clustered or otherwise.

© Copyright, Macmillan Computer Publishing. All rights reserved.

Meet the Author

Customer Reviews

Average Review:

Write a Review

and post it to your social network


Most Helpful Customer Reviews

See all customer reviews >