Advanced Computer Architecture and Parallel Processing / Edition 1 available in Hardcover
- Pub. Date:
Computer architecture deals with the physical configuration, logical structure, formats, protocols, and operational sequences for processing data, controlling the configuration, and controlling the operations over a computer. It also encompasses word lengths, instruction codes, and the interrelationships among the main parts of a computer or group of computers. This two-volume set offers a comprehensive coverage of the field of computer organization and architecture.
|Series:||Wiley Series on Parallel and Distributed Computing Series , #30|
|Product dimensions:||6.40(w) x 9.53(h) x 0.74(d)|
About the Author
HESHAM EL-REWINI, PHD, PE, is a full professor and chairman of the Department of Computer Sciences and Engineering at Southern Methodist University (SMU). He has co-authored several books, published numerous research papers in journals and conference proceedings, and chaired many international conferences.
MOSTAFA ABD-EL-BARR, PHD, PEnG, is a professor and chairman of the Department of Information Science at Kuwait University. He has co-authored two other books, published more than 120 research papers in journals and conference proceedings, and served as chair for a number of international conferences and symposia.
Read an Excerpt
Advanced Computer Architecture and Parallel Processing
By Hesham El-Rewini M. Abd-El-Barr
John Wiley & SonsCopyright © 2005 John Wiley & Sons, Inc.
All right reserved.
Chapter OneIntroduction to Advanced Computer Architecture and Parallel Processing
Computer architects have always strived to increase the performance of their computer architectures. High performance may come from fast dense circuitry, packaging technology, and parallelism. Single-processor supercomputers have achieved unheard of speeds and have been pushing hardware technology to the physical limit of chip manufacturing. However, this trend will soon come to an end, because there are physical and architectural bounds that limit the computational power that can be achieved with a single-processor system. In this book we will study advanced computer architectures that utilize parallelism via multiple processing units.
Parallel processors are computer systems consisting of multiple processing units connected via some interconnection network plus the software needed to make the processing units work together. There are two major factors used to categorize such systems: the processing units themselves, and the interconnection network that ties them together. The processing units can communicate and interact with each other using either shared memory or message passing methods. The interconnection network for shared memory systems can be classified as bus-based versus switch-based. In message passing systems, the interconnection network is divided into static and dynamic. Static connections have a fixed topology that does not change while programs are running. Dynamic connections create links on the fly as the program executes.
The main argument for using multiprocessors is to create powerful computers by simply connecting multiple processors. A multiprocessor is expected to reach faster speed than the fastest single-processor system. In addition, a multiprocessor consisting of a number of single processors is expected to be more cost-effective than building a high-performance single processor. Another advantage of a multiprocessor is fault tolerance. If a processor fails, the remaining processors should be able to provide continued service, albeit with degraded performance.
1.1 FOUR DECADES OF COMPUTING
Most computer scientists agree that there have been four distinct paradigms or eras of computing. These are: batch, time-sharing, desktop, and network. Table 1.1 is modified from a table proposed by Lawrence Tesler. In this table, major characteristics of the different computing paradigms are associated with each decade of computing, starting from 1960.
1.1.1 Batch Era
By 1965 the IBM System/360 mainframe dominated the corporate computer centers. It was the typical batch processing machine with punched card readers, tapes and disk drives, but no connection beyond the computer room. This single mainframe established large centralized computers as the standard form of computing for decades. The IBM System/360 had an operating system, multiple programming languages, and 10 megabytes of disk storage. The System/360 filled a room with metal boxes and people to run them. Its transistor circuits were reasonably fast. Power users could order magnetic core memories with up to one megabyte of 32-bit words. This machine was large enough to support many programs in memory at the same time, even though the central processing unit had to switch from one program to another.
1.1.2 Time-Sharing Era
The mainframes of the batch era were firmly established by the late 1960s when advances in semiconductor technology made the solid-state memory and integrated circuit feasible. These advances in hardware technology spawned the minicomputer era. They were small, fast, and inexpensive enough to be spread throughout the company at the divisional level. However, they were still too expensive and difficult to use to hand over to end-users. Minicomputers made by DEC, Prime, and Data General led the way in defining a new kind of computing: time-sharing. By the 1970s it was clear that there existed two kinds of commercial or business computing: (1) centralized data processing mainframes, and (2) time-sharing minicomputers. In parallel with small-scale machines, supercomputers were coming into play. The first such supercomputer, the CDC 6600, was introduced in 1961 by Control Data Corporation. Cray Research Corporation introduced the best cost/performance supercomputer, the Cray-1, in 1976.
1.1.3 Desktop Era
Personal computers (PCs), which were introduced in 1977 by Altair, Processor Technology, North Star, Tandy, Commodore, Apple, and many others, enhanced the productivity of end-users in numerous departments. Personal computers from Compaq, Apple, IBM, Dell, and many others soon became pervasive, and changed the face of computing.
Local area networks (LAN) of powerful personal computers and workstations began to replace mainframes and minis by 1990. The power of the most capable big machine could be had in a desktop model for one-tenth of the cost. However, these individual desktop computers were soon to be connected into larger complexes of computing by wide area networks (WAN).
1.1.4 Network Era
The fourth era, or network paradigm of computing, is in full swing because of rapid advances in network technology. Network technology outstripped processor technology throughout most of the 1990s. This explains the rise of the network paradigm listed in Table 1.1. The surge of network capacity tipped the balance from a processor-centric view of computing to a network-centric view.
The 1980s and 1990s witnessed the introduction of many commercial parallel computers with multiple processors. They can generally be classified into two main categories: (1) shared memory, and (2) distributed memory systems. The number of processors in a single machine ranged from several in a shared memory computer to hundreds of thousands in a massively parallel system. Examples of parallel computers during this era include Sequent Symmetry, Intel iPSC, nCUBE, Intel Paragon, Thinking Machines (CM-2, CM-5), MsPar (MP), Fujitsu (VPP500), and others.
1.1.5 Current Trends
One of the clear trends in computing is the substitution of expensive and specialized parallel machines by the more cost-effective clusters of workstations. A cluster is a collection of stand-alone computers connected using some interconnection network. Additionally, the pervasiveness of the Internet created interest in network computing and more recently in grid computing. Grids are geographically distributed platforms of computation. They should provide dependable, consistent, pervasive, and inexpensive access to high-end computational facilities.
1.2 FLYNN'S TAXONOMY OF COMPUTER ARCHITECTURE
The most popular taxonomy of computer architecture was defined by Flynn in 1966. Flynn's classification scheme is based on the notion of a stream of information. Two types of information flow into a processor: instructions and data. The instruction stream is defined as the sequence of instructions performed by the processing unit. The data stream is defined as the data traffic exchanged between the memory and the processing unit. According to Flynn's classification, either of the instruction or data streams can be single or multiple. Computer architecture can be classified into the following four distinct categories:
single-instruction single-data streams (SISD);
single-instruction multiple-data streams (SIMD);
multiple-instruction single-data streams (MISD); and
multiple-instruction multiple-data streams (MIMD).
Conventional single-processor von Neumann computers are classified as SISD systems. Parallel computers are either SIMD or MIMD. When there is only one control unit and all processors execute the same instruction in a synchronized fashion, the parallel machine is classified as SIMD. In a MIMD machine, each processor has its own control unit and can execute different instructions on different data. In the MISD category, the same stream of data flows through a linear array of processors executing different instruction streams. In practice, there is no viable MISD machine; however, some authors have considered pipelined machines (and perhaps systolic-array computers) as examples for MISD. Figures 1.1, 1.2, and 1.3 depict the block diagrams of SISD, SIMD, and MIMD, respectively.
An extension of Flynn's taxonomy was introduced by D. J. Kuck in 1978. In his classification, Kuck extended the instruction stream further to single (scalar and array) and multiple (scalar and array) streams. The data stream in Kuck's classification is called the execution stream and is also extended to include single (scalar and array) and multiple (scalar and array) streams. The combination of these streams results in a total of 16 categories of architectures.
1.3 SIMD ARCHITECTURE
The SIMD model of parallel computing consists of two parts: a front-end computer of the usual von Neumann style, and a processor array as shown in Figure 1.4. The processor array is a set of identical synchronized processing elements capable of simultaneously performing the same operation on different data. Each processor in the array has a small amount of local memory where the distributed data resides while it is being processed in parallel. The processor array is connected to the memory bus of the front end so that the front end can randomly access the local processor memories as if it were another memory. Thus, the front end can issue special commands that cause parts of the memory to be operated on simultaneously or cause data to move around in the memory. A program can be developed and executed on the front end using a traditional serial programming language. The application program is executed by the front end in the usual serial way, but issues commands to the processor array to carry out SIMD operations in parallel. The similarity between serial and data parallel programming is one of the strong points of data parallelism. Synchronization is made irrelevant by the lock-step synchronization of the processors. Processors either do nothing or exactly the same operations at the same time. In SIMD architecture, parallelism is exploited by applying simultaneous operations across large sets of data. This paradigm is most useful for solving problems that have lots of data that need to be updated on a wholesale basis. It is especially powerful in many regular numerical calculations.
There are two main configurations that have been used in SIMD machines (see Fig. 1.5). In the first scheme, each processor has its own local memory. Processors can communicate with each other through the interconnection network. If the interconnection network does not provide direct connection between a given pair of processors, then this pair can exchange data via an intermediate processor. The ILLIAC IV used such an interconnection scheme. The interconnection network in the ILLIAC IV allowed each processor to communicate directly with four neighboring processors in an 8 x 8 matrix pattern such that the i th processor can communicate directly with the (i - 1)th, (i + 1)th, (i - 8)th, and (i + 8)th processors. In the second SIMD scheme, processors and memory modules communicate with each other via the interconnection network. Two processors can transfer data between each other via intermediate memory module(s) or possibly via intermediate processor(s). The BSP (Burroughs' Scientific Processor) used the second SIMD scheme.
1.4 MIMD ARCHITECTURE
Multiple-instruction multiple-data streams (MIMD) parallel architectures are made of multiple processors and multiple memory modules connected together via some interconnection network. They fall into two broad categories: shared memory or message passing. Figure 1.6 illustrates the general architecture of these two categories. Processors exchange information through their central shared memory in shared memory systems, and exchange information through their interconnection network in message passing systems.
A shared memory system typically accomplishes interprocessor coordination through a global memory shared by all processors. These are typically server systems that communicate through a bus and cache memory controller. The bus/ cache architecture alleviates the need for expensive multiported memories and interface circuitry as well as the need to adopt a message-passing paradigm when developing application software. Because access to shared memory is balanced, these systems are also called SMP (symmetric multiprocessor) systems. Each processor has equal opportunity to read/write to memory, including equal access speed. Commercial examples of SMPs are Sequent Computer's Balance and Symmetry, Sun Microsystems multiprocessor servers, and Silicon Graphics Inc. multiprocessor servers.
A message passing system (also referred to as distributed memory) typically combines the local memory and processor at each node of the interconnection network. There is no global memory, so it is necessary to move data from one local memory to another by means of message passing. This is typically done by a Send/Receive pair of commands, which must be written into the application software by a programmer. Thus, programmers must learn the message-passing paradigm, which involves data copying and dealing with consistency issues. Commercial examples of message passing architectures c. 1990 were the nCUBE, iPSC/2, and various Transputer-based systems. These systems eventually gave way to Internet connected systems whereby the processor/memory nodes were either Internet servers or clients on individuals' desktop.
It was also apparent that distributed memory is the only way efficiently to increase the number of processors managed by a parallel and distributed system. If scalability to larger and larger systems (as measured by the number of processors) was to continue, systems had to use distributed memory techniques. These two forces created a conflict: programming in the shared memory model was easier, and designing systems in the message passing model provided scalability. The distributed-shared memory (DSM) architecture began to appear in systems like the SGI Origin2000, and others. In such systems, memory is physically distributed; for example, the hardware architecture follows the message passing school of design, but the programming model follows the shared memory school of thought. In effect, software covers up the hardware. As far as a programmer is concerned, the architecture looks and behaves like a shared memory machine, but a message passing architecture lives underneath the software. Thus, the DSM machine is a hybrid that takes advantage of both design schools.
1.4.1 Shared Memory Organization
A shared memory model is one in which processors communicate by reading and writing locations in a shared memory that is equally accessible by all processors. Each processor may have registers, buffers, caches, and local memory banks as additional memory resources. A number of basic issues in the design of shared memory systems have to be taken into consideration. These include access control, synchronization, protection, and security. Access control determines which process accesses are possible to which resources. Access control models make the required check for every access request issued by the processors to the shared memory, against the contents of the access control table. The latter contains flags that determine the legality of each access attempt. If there are access attempts to resources, then until the desired access is completed, all disallowed access attempts and illegal processes are blocked. Requests from sharing processes may change the contents of the access control table during execution. The flags of the access control with the synchronization rules determine the system's functionality. Synchronization constraints limit the time of accesses from sharing processes to shared resources. Appropriate synchronization ensures that the information flows properly and ensures system functionality. Protection is a system feature that prevents processes from making arbitrary access to resources belonging to other processes. Sharing and protection are incompatible; sharing allows access, whereas protection restricts it.
Excerpted from Advanced Computer Architecture and Parallel Processing by Hesham El-Rewini M. Abd-El-Barr Copyright © 2005 by John Wiley & Sons, Inc.. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.
Table of Contents
1. Introduction to Advanced Computer Architecture and Parallel Processing 1
1.1 Four Decades of Computing 2
1.2 Flynn’s Taxonomy of Computer Architecture 4
1.3 SIMD Architecture 5
1.4 MIMD Architecture 6
1.5 Interconnection Networks 11
1.6 Chapter Summary 15
2. Multiprocessors Interconnection Networks 19
2.1 Interconnection Networks Taxonomy 19
2.2 Bus-Based Dynamic Interconnection Networks 20
2.3 Switch-Based Interconnection Networks 24
2.4 Static Interconnection Networks 33
2.5 Analysis and Performance Metrics 41
2.6 Chapter Summary 45
3. Performance Analysis of Multiprocessor Architecture 51
3.1 Computational Models 51
3.2 An Argument for Parallel Architectures 55
3.3 Interconnection Networks Performance Issues 58
3.4 Scalability of Parallel Architectures 63
3.5 Benchmark Performance 67
3.6 Chapter Summary 72
4. Shared Memory Architecture 77
4.1 Classification of Shared Memory Systems 78
4.2 Bus-Based Symmetric Multiprocessors 80
4.3 Basic Cache Coherency Methods 81
4.4 Snooping Protocols 83
4.5 Directory Based Protocols 89
4.6 Shared Memory Programming 96
4.7 Chapter Summary 99
5. Message Passing Architecture 103
5.1 Introduction to Message Passing 103
5.2 Routing in Message Passing Networks 105
5.3 Switching Mechanisms in Message Passing 109
5.4 Message Passing Programming Models 114
5.5 Processor Support for Message Passing 117
5.6 Example Message Passing Architectures 118
5.7 Message Passing Versus Shared Memory Architectures 122
5.8 Chapter Summary 123
6. Abstract Models 127
6.1 The PRAM Model and Its Variations 127
6.2 Simulating Multiple Accesses on an EREW PRAM 129
6.3 Analysis of Parallel Algorithms 131
6.4 Computing Sum and All Sums 133
6.5 Matrix Multiplication 136
6.6 Sorting 139
6.7 Message Passing Model 140
6.8 Leader Election Problem 146
6.9 Leader Election in Synchronous Rings 147
6.10 Chapter Summary 154
7. Network Computing 157
7.1 Computer Networks Basics 158
7.2 Client/Server Systems 161
7.3 Clusters 166
7.4 Interconnection Networks 170
7.5 Cluster Examples 175
7.6 Grid Computing 177
7.7 Chapter Summary 178
8. Parallel Programming in the Parallel Virtual Machine 181
8.1 PVM Environment and Application Structure 181
8.2 Task Creation 185
8.3 Task Groups 188
8.4 Communication Among Tasks 190
8.5 Task Synchronization 196
8.6 Reduction Operations 198
8.7 Work Assignment 200
8.8 Chapter Summary 201
9. Message Passing Interface (MPI) 205
9.1 Communicators 205
9.2 Virtual Topologies 209
9.3 Task Communication 213
9.4 Synchronization 217
9.5 Collective Operations 220
9.6 Task Creation 225
9.7 One-Sided Communication 228
9.8 Chapter Summary 231
10 Scheduling and Task Allocation 235
10.1 The Scheduling Problem 235
10.2 Scheduling DAGs without Considering Communication 238
10.3 Communication Models 242
10.4 Scheduling DAGs with Communication 244
10.5 The NP-Completeness of the Scheduling Problem 248
10.6 Heuristic Algorithms 250
10.7 Task Allocation 256
10.8 Scheduling in Heterogeneous Environments 262
What People are Saying About This
"This outstanding treatise...allows students and professionals to become familiar with the inner workings of an inherently complex architecture." (CHOICE, July 2005)