- Shopping Bag ( 0 items )
"Once in a great while, a landmark computer-science book is published. Computer Architecture: A Quantitative Approach, Second Edition, is such a book. In an era of fluff computer books that are, quite properly, remaindered within weeks of publication, this book will stand the test of time, becoming lovingly dog-eared in the hands of anyone who designs computers or has concerns about the performance of computer programs." - Robert Bernecky, Dr. Dobb's Journal, April 1998
Computer Architecture: A Quantitative Approach was the first book to focus on computer architecture as a modern science. Its publication in 1990 inspired a new approach to studying and understanding computer design. Now, the second edition explores the next generation of architectures and design techniques with view to the future.
A basis for modern computer architecture
As the authors explain in their preface to the Second Edition, computer architecture itself has undergone significant change since 1990. Concentrating on currently predominant and emerging commercial systems, the Hennessy and Patterson have prepared entirely new chapters covering additional advanced topics:
This book continues the style of the first edition, with revised sections on Fallacies and Pitfalls, Putting It All Together and Historical Perspective, and contains entirely new sections on Crosscutting Issues. The focus on fundamental techniques for designing real machines and the attention to maximizing cost/performance are crucial to both students and working professionals. Anyone involved in building computers, from palmtops to supercomputers, will profit from the expertise offered by Hennessy and Patterson.
Expanded, updated and completely revised, presents advanced concepts and analysis in the context of the most recent real machines. Describes basic principles while emphasizing cost/performance tradeoffs and good engineering design. Not an architectural survey, focuses on core concepts and instruction set principles via a generic RISC machine, DLX, and Intel 80x86 (IBM 360 and VAX have been eliminated). Discusses pipelining and parallelism, memory-hierarchy design, storage systems, interconnection networks and multiprocessors. Individual chapters are logically connected with real machine examples and contain sections on fallacies, pitfalls and architectural traps. A university text, supplementary material available. Good companion volume which enhances and elucidates related topics is Computer Organization & Design : The Hardware/Software Interface by the same authors.
When I reviewed the first edition of this book in the October, 1990 Programmer's Bookshelf column of Dr. Dobb's Journal, I wrote:
Computer Architecture: A Quantitative Approach is a tour-de-force on several levels. The book is a masterpiece of technical writing -- Hennessy and Patterson's clear, direct style is absorbing and effective, and their enthusiasm for their subject is contagious. The design and production, too, are impeccable. Furthermore, because the book presents a hardheaded and pragmatic approach to computer design, based on real examples, real measurements, and lessons learned from the successes and misadventures of the past, it should revolutionize the teaching of computer architecture and implementation.
Although this book was not written primarily for programmers, it is a thorough and extraordinarily wide-ranging education in that magical interface between the programmer's intentions and the electron's actions. It should be read by every software craftsman who cares about wringing the last drop of performance from his machine.
Flowery words, without a doubt -- the type of glowing review that one sometimes regrets a few years later. In this case, however, the book (and my review) have stood the test of time. It is even more clear now that Computer Architecture is one of the all-time greats.
The second edition is about 200 pages longer than the first and has a significant shift in focus, although the structure is much the same. The first edition contained fascinating explanations of the IBM 360 and DEC VAX architectures as the archetypes of "mainframes" and "minicomputers" respectively. You'll never find a clearer explanation of IBM's "channels" anywhere. The second edition jettisons almost all the IBM and VAX material, and substitutes discussions of the HP PA-RISC and Motorola PowerPC for the Intel i860 and Motorola 88000, but considerably expands the coverage of high-performance instruction execution strategies, multiprocessing, caches, and magnetic storage. Comprehensive sections on processor interconnection, networking, and 64-bit architectures have been added.
Computer Architecture: A Quantitative Approach, Second Edition is a must-buy for every serious programmer, engineer, computer science student, and technical library. But hold on to your copy of the first edition as well; it may come in very handy in a few decades when your grandchildren ask you "What were those giant VAX boxes anyway?"--Dr. Dobb's Electronic Review of Computer Books
The question is this: Where does the I/O occur in the computer-between the I/O device and the cache or between the I/O device and main memory? If input puts data into the cache and output reads data from the cache, both I/O and the CPU see the same data, and the problem is solved. The difficulty in this approach is that it interferes with the CPU. I/O competing with the CPU for cache access will cause the CPU to stall for I/O. Input will also interfere with the cache by displacing some information with the new data that is unlikely to be accessed by the CPU soon. For example, on a page fault the CPU may need to access a few words in a page, but a program is not likely to access every word of the page if it were loaded into the cache. Given the integration of caches onto the same integrated circuit, it is also difficult for that interface to be visible.
The goal for the I/O system in a computer with a cache is to prevent the stale data problem while interfering with the CPU as little as possible. Many systems, therefore, prefer that I/O occur directly to main memory, with main memory
FIGURE 5.46 The cache-coherency problem. A' and B refer to the cached copiesof A and B in memory. (a) shows cache and main memory in a coherent state. In (b) we assume a write-back cache when the CPU writes 550 into A. Now A' has the value but the value in memory has the old, stale value of 100. If an output used the value of A from memory, it would get the stale data. In (c) the I/O system inputs 440 into the memory copy of B, so now B, in the cache has the old, stale data acting as an I/O buffer. If a write-through cache is used, then memory has an upto-date copy of the information, and there is no stale-data issue for output. (This is a reason many machines use write through.) Input requires some extra work. The software solution is to guarantee that no blocks of the I/O buffer designated for input are in the cache. In one approach, a buffer page is marked as noncachable; the operating system always inputs to such a page. In another approach, the operating system flushes the buffer addresses from the cache after the input occurs. A hardware solution is to check the I/O addresses on input to see if they are in the cache; to avoid slowing down the cache to check addresses, sometimes a duplicate set of tags are used to allow checking of I/O addresses in parallel with processor cache accesses. If there is a match of I/O addresses in the cache, the cache entries are invalidated to avoid stale data. All these approaches can also be used for output with write-back caches. More about this is found in Chapter 6.
The cache-coherency problem applies to multiprocessors as well as I/O. Unlike I/O, where multiple data copies are a rare event-one to be avoided whenever possible-a program running on multiple processors will want to have copies of the same data in several caches. Performance of a multiprocessor program depends on the performance of the system when sharing data. The protocols to maintain coherency for multiple processors are called cache-coherency protocols and are described in Chapter 8.
Let's really start at the beginning, when the Alpha is turned on. Hardware on the chip loads the instruction cache from an external PROM. This initialization allows the 8-KB instruction cache to omit a valid bit, for there are always valid instructions in the cache; they just might not be the ones your program is interested in. The hardware does clear the valid bits in the data cache. The PC is set to the kseg segment so that the instruction addresses are not translated, thereby avoiding the TLB.
One of the first steps is to update the instruction TLB with valid page table entries (PTEs) for this process. Kernel code updates the TLB with the contents of the appropriate page table entry for each page to be mapped. The instruction TLB has eight entries for 8-KB pages and four for 4-MB pages. (The 4-MB pages are used by large programs such as the operating system or data bases that will likely touch most of their code.) A miss in the TLB invokes the Privileged Architecture Library (PAL code) software that updates the TLB. PAL code is simply machine language routines with some implementation-specific extensions to allow access to low-level hardware, such as the TLB. PAL code runs with exceptions disabled, and instruction accesses are not checked for memory management violations, allowing PAL code to fill the TLB.
Once the operating system is ready to begin executing a user process, it sets the PC to the appropriate address in segment segO. We are now ready to follow memory hierarchy in action: Figure 5.47 is labeled with the steps of this narrative. The page frame portion of this address is sent to the TLB (step 1), while the 8-bit index from the page offset is sent to the direct-mapped 8-KB (256 32-byte blocks) instruction cache (step 2). The fully associative TLB simultaneously searches all 12 entries to find a match between the address and a valid PTE (step 3). In addition to translating the address, the TLB checks to see if the PTE demands that this access result in an exception. An exception might occur if either this access violates the protection on the page or ifthe page is not in main memory. If there is no exception, and if the translated physical address matches the tag in the instruction cache (step 4), then the proper 8 bytes of the 32-byte block are furnished to the CPU using the lower bits of the page offset (step 5), and the instruction stream access is done.
A miss, on the other hand, simultaneously starts an access to the second-level cache (step 6) and checks the prefetch instruction stream buffer (step 7). If the desired instruction is found in the stream buffer (step 8), the critical 8 bytes are sent to the CPU, the full 32-byte block of the stream buffer is written into the instruction cache (step 9), and the request to the second-level cache is canceled. Steps 6 to 9 take just a single clock cycle.
If the instruction is not in the prefetch stream buffer, the second-level cache continues trying to fetch the block. The 21064 microprocessor is designed to work with direct-mapped second-level caches from 128 KB to 8 MB with a miss penalty between 3 and 16 clock cycles. For this section we use the memory system of the DEC 3000 model 800 Alpha AXP. It has a 2-MB (65,536 32-byte blocks) second-level cache, so the 29-bit block address is divided into a 13-bit tag and a 16-bit index (step 10). The cache reads the tag from that index and if it matches (step 11), the cache returns the critical 16 bytes in the first 5 clock cycles and the other 16 bytes in the next 5 clock cycles (step 12). The path between the first- and second-level cache is 128 bits wide (16 bytes). At the same time, a request is made for the next sequential 32-byte block, which is loaded into the instruction stream buffer in the next 10 clock cycles (step 13).
The instruction stream does not rely on the TLB for address translation. It simply increments the physical address of the miss by 32 bytes, checking to make sure that the new address is within the same page. If the incremented address crosses a page boundary, then the prefetch is suppressed.
If the instruction is not found in the secondary cache, the translated physical address is sent to memory (step 14). The DEC 3000 model 800 divides memory into four memory mother boards (MMB), each of which contains two to eight SIMMs (single inline memory modules). The SIMMs come with eight DRAMs for information plus two DRAMs for error protection per side, and the options are single- or double-sided SIMMs using I-Mbit, 4-Mbit, or 16-Mbit DRAMs. Hence the memory capacity of the model 800 is 8 MB (4 x 2 x 8 x I x 1/8) to 1024 MB (4 x 8 x 8 x 16 x 2/8), always organized 256 bits wide. The average time to transfer 32 bytes from memory to the secondary cache is 36 clock cycles after the processor makes the request. The second-level cache loads this data 16 bytes at a time.
Since the second-level cache is a write-back cache, any miss can lead to the old block being written back to memory. The 21064 places this "victim" block into a victim buffer to get out of the way of new data (step 15). The new data are loaded into the cache as soon as they arrive (step 16), and then the old data are written from the victim buffer (step 17). There is a single block in the victim buffer, so a second miss would need to stall until the victim buffer empties.
Suppose this initial instruction is a load. It will send the page frame of its data address to the data TLB (step 18) at the same time as the 8-bit index from the page offset is sent to the data cache (step 19). The data TLB is a fully associative cache containing 32 PTEs, each of which represents page sizes from 8 KB to 4 MB. A TLB miss will trap to PAL code to load the valid PTE for this address. In the worst case, the page is not in memory, and the operating system gets the page from disk (step 20). Since millions of instructions could execute during a page fault, the operating system will swap in another process if there is something waiting to run.
Assuming that we have a valid PTE in the data TLB (step 21), the cache tag and the physical page frame are compared (step 22), with a match sending the desired 8 bytes from the 32-byte block to the CPU (step 23). A miss goes to the second-level cache, which proceeds exactly like an instruction miss.
Suppose the instruction is a store instead of a load. The page frame portion of the data address is again sent to the data TLB and the data cache (steps 18 and 19), which checks for protection violations as well as translates the address. The physical address is then sent to the data cache (steps 21 and 22). Since the data cache uses write through, the store data are simultaneously sent to the write buffer (step 24) and the data cache (step 25). As explained on page 425, the 21064 pipelines write hits. The data address of this store is checked for a match, and at the same time the data from the previous write hit are written to the cache (step 26). If the address check was a hit, then the data from this store are placed in the write pipeline buffer. On a miss, the data are just sent to the write buffer since the data cache does not allocate on a write miss.
The write buffer takes over now. It has four entries, each containing a whole cache block. If the buffer is full, then the CPU must stall until a block is written to the second-level cache. If the buffer is not full, the CPU continues and the address of the word is presented to the write buffer (step 27). It checks to see if the word matches any block already in the buffer so that a sequence of writes can be stitched together into a full block, thereby optimizing use of the write bandwidth between the first- and second-level cache.
All writes are eventually passed on to the second-level cache. If a write is a hit, then the data are written to the cache (step 28). Since the second-level cache uses write back, it cannot pipeline writes: a full 32-byte block write takes 5 clock cycles to check the address and 10 clock cycles to write the data. A write of 16 bytes or less takes 5 clock cycles to check the address and 5 clock cycles to write the data. In either case the cache marks the block as dirty.
If the access to the second-level cache is a miss, the victim block is checked to see if it is dirty; if so, it is placed in the victim buffer as before (step 15). If the new data are a full block, then the data are simply written and marked dirty. A partial block write results in an access to main memory since the second-level cache policy is to allocate on a write miss....
|1||Fundamentals of Computer Design|
|2||Instruction Set Principles and Examples|
|4||Advanced Pipelining and Instruction-Level Parallelism|
|Appendix A: Computer Arithmetic|
|Appendix B: Vector Processors|
|Appendix C: Survey of RISC Architectures|
|Appendix D: An Alternative to RISC: The Intel 80x86|
|Appendix E: Implementing Coherence Protocols|
Computer technology has made incredible progress in the roughly 55 years since the first general-purpose electronic computer was created. Today, less than a thousand dollars will purchase a personal computer that has more performance, more main memory, and more disk storage than a computer bought in 1980 for 1 million dollars. This rapid rate of improvement has come both from advances in the technology used to build computers and from innovation in computer design.
Although technological improvements have been fairly steady, progress arising from better computer architectures has been much less consistent. During the first 25 years of electronic computers, both forces made a major contribution; but beginning in about 1970, computer designers became largely dependent upon integrated circuit technology. During the 1970s, performance continued to improve at about 25% to 30% per year for the mainframes and minicomputers that dominated the industry.
The late 1970s saw the emergence of the microprocessor. The ability of the microprocessor to ride the improvements in integrated circuit technology more closely than the less integrated mainframes and minicomputers led to a higher rate of improvement—roughly 35% growth per year in performance.
This growth rate, combined with the cost advantages of a mass-produced microprocessor, led to an increasing fraction of the computer business being based on microprocessors. In addition, two significant changes in the computer marketplace made it easier than ever before to be commercially successful with a new architecture. First, the virtual elimination of assembly language programming reduced the need for object-code compatibility. Second, the creation of standardized, vendor-independent operating systems, such as UNIX and its clone, Linux, lowered the cost and risk of bringing out a new architecture.
These changes made it possible to successfully develop a new set of architectures, called RISC (Reduced Instruction Set Computer) architectures, in the early 1980s. The RISC-based machines focused the attention of designers on two critical performance techniques, the exploitation of instruction-level parallelism (initially through pipelining and later through multiple instruction issue) and the use of caches (initially in simple forms and later using more sophisticated organizations and optimizations). The combination of architectural and organizational enhancements has led to 20 years of sustained growth in performance at an annual rate of over 50%. Figure 1.1 shows the effect of this difference in performance growth rates.
The effect of this dramatic growth rate has been twofold. First, it has signifi- cantly enhanced the capability available to computer users. For many applications, the highest-performance microprocessors of today outperform the supercomputer of less than 10 years ago.
Second, this dramatic rate of improvement has led to the dominance of microprocessor-based computers across the entire range of the computer design. Workstations and PCs have emerged as major products in the computer industry. Minicomputers, which were traditionally made from off-the-shelf logic or from gate arrays, have been replaced by servers made using microprocessors. Mainframes have been almost completely replaced with multiprocessors consisting of small numbers of off-the-shelf microprocessors. Even high-end supercomputers are being built with collections of microprocessors.
Freedom from compatibility with old designs and the use of microprocessor technology led to a renaissance in computer design, which emphasized both architectural innovation and efficient use of technology improvements. This renaissance is responsible for the higher performance growth shown in Figure 1.1—a rate that is unprecedented in the computer industry. This rate of growth has compounded so that by 2001, the difference between the highestperformance microprocessors and what would have been obtained by relying solely on technology, including improved circuit design, was about a factor of 15.
Figure 1.1 Growth in microprocessor performance since the mid-1980s has been substantially higher than in earlier years as shown by plotting SPECint performance. This chart plots relative performance as measured by the SPECint benchmarks with base of one being a VAX 11/780. Since SPEC has changed over the years, performance of newer machines is estimated by a scaling factor that relates the performance for two different versions of SPEC (e.g., SPEC92 and SPEC95). Prior to the mid-1980s, microprocessor performance growth was largely technology driven and averaged about 35% per year. The increase in growth since then is attributable to more advanced architectural and organizational ideas. By 2001 this growth led to a difference in performance of about a factor of 15. Performance for floating-point-oriented calculations has increased even faster.
In the last few years, the tremendous improvement in integrated circuit capability has allowed older, less-streamlined architectures, such as the x86 (or IA-32) architecture, to adopt many of the innovations first pioneered in the RISC designs. As we will see, modern x86 processors basically consist of a front end that fetches and decodes x86 instructions and maps them into simple ALU, memory access, or branch operations that can be executed on a RISC-style pipelined processor. Beginning in the late 1990s, as transistor counts soared, the overhead (in transistors) of interpreting the more complex x86 architecture became negligible as a percentage of the total transistor count of a modern microprocessor.
This text is about the architectural ideas and accompanying compiler improvements that have made this incredible growth rate possible. At the center of this dramatic revolution has been the development of a quantitative approach to computer design and analysis that uses empirical observations of programs, experimentation, and simulation as its tools. It is this style and approach to computer design that is reflected in this text.
Sustaining the recent improvements in cost and performance will require continuing innovations in computer design, and we believe such innovations will be founded on this quantitative approach to computer design. Hence, this book has been written not only to document this design style, but also to stimulate you to contribute to this progress.
The Art of Computer Programming,
Preface to Volume 1 (of 7) (1968)
Why We Wrote This Book
This quote from Donald Knuth has never been truer, at least, not for us! Although this is officially the second edition of Computer Architecture: A Quantitative Approach, it is really our third book in a series that began with the first edition, continued with Computer Organization and Design: The Hardware/Software Interface, and brings us now to this book, one that we have largely written from scratch. Since a major goal of our efforts has been to present concepts in the context of the most recent machines, there was remarkably little from the first edition that could be preserved intact.
This said, we once again offer an enthusiastic welcome to anyone who came along with us the first time, as well as to those who are joining us now. Either way, we can promise the same quantitative approach to, and analysis of, real machines.
As with the first edition, our goal has been to describe the basic principles underlying what will be tomorrow's technological developments. For readers who are new to our series with this book, we hope to demonstrate what we stated about computer architecture in our preface to the first edition: It is not a dreary science of paper machines that will never work. No! It's a discipline of keen intellectualinterest, requiring the balance of marketplace forces to cost/performance, leading to glorious failures and some notable successes.
Our primary objective in writing our first book was to change the way people learn and think about computer architecture. We feel this goal is still valid and important. The field is changing daily and must be studied with real examples and measurements on real machines, rather than simply as a collection of definitions and designs that will never need to be realized.
We have strived to produce a new edition that will continue to be as relevant for professional engineers and architects as it is for those involved in advanced computer architecture and design courses. As much as its predecessor, this edition aims to dem ystify computer architecture through an emphasis on cost/performance tradeoffs and good engineering design. We believe that the field has continued to mature and move toward the rigorous quantitative foundation of long-established scientific and engineeri ng disciplines. Our greatest satisfaction derives from the fact that the principles described in our first edition in 1990 could be applied successfully to help predict the landscape of computing technology that exists today. We hope that our new book wil l allow readers to apply the fundamentals for similar results as we look forward to the coming decades.
The second edition of Computer Architecture: A Quantitative Approach should have been easy to write. After all, our approach hasn''t changed and we sought to continue our focus on the basic principles of computer design. Unfortunately for our task as authors, though the principles haven''t changed, the examples that are relevant today and the conclusions that may be drawn have changed a great deal.
Consequently, we started with a careful evaluation of the first edition content from some very helpful readers who are gratefully acknowledged in the section that follows. We also thought carefully about what we would need to do to fulfill the bookOs ori ginal objectives, given the current state of computing. Being pragmatists, we started with a selective view of what would need to be revised, but there was no way to avoid the fact that everything needed to be reexamined. In the end we can offer you somet hing that is virtually new.
We created fresh drafts, we condensed, we rewrote, we revised, and we eliminated less essential discussions. We enlarged the discussion of the topics we feel are the most exciting and valuable new ideas of computer engineering and science, and we reduced the coverage of other topics. In addition, we benefited from the help of another beta class-testing.
Despite these efforts or perhaps because of them, the second edition was actually harder to write than the first. We blame our workload on society''s hunger for faster computers. The microprocessors being shipped today are the most sophisticated computers ever built, and feeding an expectation of doubled performance every 18 months has meant introducing new and extremely sophisticated implementation techniques.
Perhaps most important, we did not want to produce a new edition unless we were certain that it could meet the standard of contribution that we set for ourselves in our first book. In the final cut, the basic concepts and approach remain, but are now viewed with the added perspective of several years of progress in the field.
Topic Selection and Organization
As before, we have taken a conservative approach to topic selection, for there are many more interesting ideas in the field than can reasonably be covered in a treatment of basic principles. We have steered away from a comprehensive survey of every archit ecture a reader might encounter. Instead, our presentation focuses on core concepts likely to be found in any new machine. The key criterion remains that of selecting ideas that have been examined and utilized successfully enough to permit their discussion in quantitative terms.
Again, our hope is that the design principles and quantitative data in this book will delimit discussions of architecture styles to terms like ''faster and cheaper,' in contrast to previous debates.
Our first dilemma in determining the new topic selections was that topics requiring only a few pages in the first edition have since exploded in their importance. Secondly, topics that we excluded previously have matured to a point where they can be disc ussed based on our quantitative criteria and their success in the marketplace.
As a final filter, we evaluated the extent of introductory material necessary for this presentation independent of our decision making for the first edition. Again, our reviewers and readers were immensely helpful. To the extent that readers felt certain discussions were less necessary, we eliminated them in favor of new material. For example, many of our readers learned the basics of our approach in the first edition or from Computer Organization and Design: The Hardware/Software Interface.
Our intent has always been to focus on material that is not available in equivalent form from other sources, so we continue to emphasize advanced content wherever possible. With the availability at this writing of a complete introduction in Computer Orga nization and Design, the focus on advanced topics was somewhat easier to maintain than in our first book. An Overview of the Content In keeping with our focus on advanced topics, we consolidated the first five chapters of our earlier edition into two. The basic content of the first edition Chapters 1 (Fundamentals of Computer Design) and 2 (Performance and Cost) are introduced in the opening chapter, Fundamentals of Computer Design. These ideas work better together than they did in isolation.
The two chapters on instruction sets from the previous edition are now presented in our second chapter, Instruction Set Principles and Examples, and its companion piece, Appendix D, An Alternative to RISC: The Intel 80x86. We rewrote this material to sho w how program measurements led to the design of our generic RISC machine, DLX. Because of the dominance of the 80x86 design, we no longer cover the VAX or IBM 360 instruction sets. The 80x86 appendix represents increased coverage and can form the basis of a careful study of this architecture and other computers of its class.
This de-emphasis on the history of instruction sets follows the continuing evolution of the content of computer architecture courses. In the 1950s and 1960s, course content consisted primarily of computer arithmetic. In the 1970s and early 1980s, it was large instruction set design. Today, it is largely high-performance processor implementation, memory hierarchy, input/output, and multiprocessors. Hence the organization of the remaining chapters and appendices.
Chapters 3 and 4 cover all aspects of instruction execution in high-performance processors: pipelining, superscalar execution, branch prediction, and dynamic scheduling. This encompasses so much material that, even with the introduction to pipelining ava ilable in Computer Organization and Design, we needed an intermediate and an advanced chapter.
Chapter 5 discusses the remarkable number of optimizations in cache design that have occurred over the last six years. A partial list of topics new to this edition includes victim caches, pseudo-associative caches, prefetching, cache-oriented compiler op timizations, nonblocking caches, and pipelined caches.
Chapters 6 and 7 represent the increasing importance of input/output in today''s computers. Chapter 6, Storage Systems, concentrates on storage I/O, presenting trends in magnetic disks and tapes. It describes redundant arrays of inexpensive disks, or RAID, and gives a comparison of disk I/O performance for a wide range of UNIX systems. What was covered in three pages on interconnection networks in the first edition has expanded into Chapter 7. This new chapter takes the novel approach of covering in a com mon framework the interconnection networks found in local area networks, wide area networks, and massively parallel processors.
Chapter 8, Multiprocessors, and Appendix E, Implementing Coherence Protocols, expand the 15 pages from the first edition into an extensive discussion of the issues involved with shared-memory multiprocessors. Snooping bus protocols and directory-based protocols are fully covered, with Appendix E describing in detail why these protocols work and providing insight into some of the complexities of real coherency algorithms.
This brings us to Appendices A through C. Appendix A updates computer arithmetic, including fused multiply-add and the infamous Pentium floating point error that was front page news in 1994. Vector processors, covered previously as Chapter 7 of the first edition, has been revised as Appendix B in this presentation, due in part to the 100-plus pages on instruction level parallelism in the main text and in part reflecting a decline in the popularity of vector architectures. Appendix C updates the the first edition RISC appendix, replacing coverage of the Intel i860 and Motorola 88000 with that of the HP PA-RISC and IBM/Motorola PowerPC and giving the 64-bit instruction set architectures of MIPS, PowerPC, and SPARC.
There is no single best order in which to approach these chapters. We wrote the text so that it can be covered in several ways, the only real restriction being that some chapters should be read in sequence, namely, Chapters 2, 3, and 4 (pipelining) and Chapters 6 and 7 (storage systems, interconnection networks). Readers should start with Chapter 1 and read Chapter 5 (memory-hierarchy design) before Chapter 8 (multiprocessors), but the rest of the material can be covered in any order.
In summary, about 70% of the pages are new to this edition. As the second edition is also about 20% longer than the first, you can see why it was as much work as writing a new book.
Readers interested in a gentle introduction to computer architecture should read Computer Organization and Design: The Hardware/Software Interface. If you already have the first edition of Computer Architecture and are not interested in advanced processo r design, advanced cache design, storage systems, interconnection networks, multiprocessors, 64-bit RISC architectures, or the 80x86, then you can save both money and space on your bookshelf. Obviously, we think owners of the first edition should be inter ested in many of these topics or we wouldnOt have written about them!
Chapter Structure and Exercises
The material we have selected has been stretched upon a consistent framework that is followed in each chapter. We start by explaining the ideas of a chapter. These ideas are followed by a Crosscutting Issues section, a feature new to the second edition t hat shows how the ideas covered in one chapter interact with those given in other chapters. This is followed by a Putting It All Together section that ties these ideas together by showing how they are used in a real machine.
Next in the sequence is Fallacies and Pitfalls, which lets readers learn from the mistakes of others. We show examples of common misunderstandings and architectural traps that are difficult to avoid even when you know they are lying in wait for you. Each chapter ends with a Concluding Remarks section, followed by a Historical Perspective and References section that attempts to give proper credit for the ideas in the chapter and a sense of the history surrounding the inventions. We like to think of this a s presenting the human drama of computer design. It also supplies references that the student of architecture may want to pursue. If you have time, we recommend reading some of the classic papers in the field that are mentioned in these sections. It is bo th enjoyable and educational to hear the ideas directly from the creators.
Each chapter ends with exercises, over 200 in total, which vary from one-minute reviews to term projects. Brackets for each exercise (  1 minute (read and understand) Supplements An Instructor''s Manual with fully worked-out solutions to the exercises in the book is available from the publisher only to instructors teaching from this book. If you are not an instructor and would like access to solutions, there is also a Solutions Manual available for sale from the publisher that consists of a selected subset of the solutions in the Instructor''s Manual. Software to accompany this book, including DLX compilers, assemblers, and simulators, cache simulation tools, and traces, is available to readers at the Morgan Kaufmann home page on the World Wide Web at http://mkp.com. This page also contains a list of errata, postscript versions of the numb ered figures in the book, and pointers to related material that readers may enjoy. In response to your continued support, the publisher will add new materials and establish links to other sites on a regular basis. Finally, it is possible to make money while reading this book. Talk about cost/performance! If you read the Acknowledgments that follow, you will see that we went to great lengths to correct mistakes. Since a book goes through many printings, we have the opportunity to make even more corrections. If you uncover any remaining resilient bugs, please contact the publisher by electronic mail. The first reader to report an error with a fix that we incorporate in a future printing will be re warded with a $1.00 bounty. Please check the errata sheet on the home page to see if the bug has already been reported. We process the bugs and send the checks about once a year, so please be patient. Concluding Remarks Once again this book is a true co-authorship, with each of us writing half the chapters and half the appendices. We can''t imagine how long it would have taken without someone else doing half the work, offering inspiration when the task seemed hopeless, providing the key insight to explain a difficult concept, supplying reviews over the weekend of 100-page chapters, and commiserating when the weight of our other obligations made it hard to pick up the pen. Thus, once again we share equally the blame for what you are about to read.
 15 to 20 minutes for full answer
 1 hour for full written answer
 Short programming project: less than 1 full day of programming
 Significant programming project: 2 weeks of elapsed time
 Term project (2 to 4 weeks by two people)
[Discussion] Topic for discussion with others interested in computer architecture
 1 minute (read and understand)
An Instructor''s Manual with fully worked-out solutions to the exercises in the book is available from the publisher only to instructors teaching from this book.
If you are not an instructor and would like access to solutions, there is also a Solutions Manual available for sale from the publisher that consists of a selected subset of the solutions in the Instructor''s Manual. Software to accompany this book, including DLX compilers, assemblers, and simulators, cache simulation tools, and traces, is available to readers at the Morgan Kaufmann home page on the World Wide Web at http://mkp.com. This page also contains a list of errata, postscript versions of the numb ered figures in the book, and pointers to related material that readers may enjoy. In response to your continued support, the publisher will add new materials and establish links to other sites on a regular basis.
Finally, it is possible to make money while reading this book. Talk about cost/performance! If you read the Acknowledgments that follow, you will see that we went to great lengths to correct mistakes. Since a book goes through many printings, we have the opportunity to make even more corrections. If you uncover any remaining resilient bugs, please contact the publisher by electronic mail. The first reader to report an error with a fix that we incorporate in a future printing will be re warded with a $1.00 bounty. Please check the errata sheet on the home page to see if the bug has already been reported. We process the bugs and send the checks about once a year, so please be patient.
Once again this book is a true co-authorship, with each of us writing half the chapters and half the appendices. We can''t imagine how long it would have taken without someone else doing half the work, offering inspiration when the task seemed hopeless, providing the key insight to explain a difficult concept, supplying reviews over the weekend of 100-page chapters, and commiserating when the weight of our other obligations made it hard to pick up the pen. Thus, once again we share equally the blame for what you are about to read.
I am very lucky to have studied computer architecture under Prof. David Patterson at U.C. Berkeley more than 20 years ago. I enjoyed the courses I took from him, in the early days of RISC architecture. Since leaving Berkeley to help found Sun Microsystems, I have used the ideas from his courses and many more that are described in this important book.
The good news today is that this book covers incredibly important and contemporary material. The further good news is that much exciting and challenging work remains to be done, and that working from Computer Architecture: A Quantitative Approach is a great way to start.
The most successful architectural projects that I have been involved in have always started from simple ideas, with advantages explainable using simple numerical models derived from hunches and rules of thumb. The continuing rapid advances in computing technology and new applications ensure that we will need new similarly simple models to understand what is possible in the future, and that new classes of applications will stress systems in different and interesting ways.
The quantitative approach introduced in Chapter 1 is essential to understanding these issues. In particular, we expect to see, in the near future, much more emphasis on minimizing power to meet the demands of a given application, across all sizes of systems; much remains to be learned in this area.
I have worked with many different instruction sets in my career. I first programmed a PDP-8, whose instruction set was so simple that a friend easily learned to disassemble programs just by glancing at the hole punches in paper tape! I wrote a lot of code in PDP-11 assembler, including an interpreter for the Pascal programming language and for the VAX (which was used as an example in the first edition of this book); the success of the VAX led to the widespread use of UNIX on the early Internet.
The PDP-11 and VAX were very conventional complex instruction set (CISC) computer architectures, with relatively compact instruction sets that proved nearly impossible to pipeline. For a number of years in public talks I used the performance of the VAX 11/780 as the baseline; its speed was extremely well known because faster implementations of the architecture were so long delayed. VAX performance stalled out just as the x86 and 680x0 CISC architectures were appearing in microprocessors; the strong economic advantages of microprocessors led to their overwhelming dominance. Then the simpler reduced instruction set (RISC) computer architectures—pioneered by John Cocke at IBM; promoted and named by Patterson and Hennessy; and commercialized in POWER PC, MIPS, and SPARC—were implemented as microprocessors and permitted highperformance pipeline implementations through the use of their simple registeroriented instruction sets. A downside of RISC was the larger code size of programs and resulting greater instruction fetch bandwidth, a cost that could be seen to be acceptable using the techniques of Chapter 1 and by believing in the future CMOS technology trends promoted in the now-classic views of Carver Mead. The kind of clear-thinking approach to the present problems and to the shape of future computing advances that led to RISC architecture is the focus of this book.
Chapter 2 (and various appendices) presents interesting examples of contemporary and important historical instruction set architecture. RISC architecture—the focus of so much work in the last twenty years—is by no means the final word here. I worked on the design of the SPARC architecture and several implementations for a decade, but more recently have worked on two different styles of processor: picoJava, which implemented most of the Java Virtual Machine instructions—a compact, high-level, bytecoded instruction set—and MAJC, a very simple and multithreaded VLIW for Java and media-intensive applications. These two architectures addressed different and new market needs: for lowpower chips to run embedded devices where space and power are at a premium, and for high performance for a given amount of power and cost where parallel applications are possible. While neither has achieved widespread commercial success, I expect that the future will see many opportunities for different ISAs, and an in-depth knowledge of history here often gives great guidance—the relationships between key factors, such as the program size, execution speed, and power consumption, returning to previous balances that led to great designs in the past.
Chapters 3 and 4 describe instruction-level parallelism (ILP): the ability to execute more than one instruction at a time. This has been aided greatly, in the last 20 years, by techniques such as RISC and VLIW (very long instruction word) computing. But as later chapters here point out, both RISC and especially VLIW as practiced in the Intel itanium architecture are very power intensive. In our attempts to extract more instruction-level parallelism, we are running up against the fact that the complexity of a design that attempts to execute N instructions simultaneously grows like N2: the number of transistors and number of watts to produce each result increases dramatically as we attempt to execute many instructions of arbitrary programs simultaneously. There is thus a clear countertrend emerging: using simpler pipelines with more realistic levels of ILP while exploiting other kinds of parallelism by running both multiple threads of execution per processor and, often, multiple processors on a single chip. The challenge for designers of high-performance systems of the future is to understand when simultaneous execution is possible, but then to use these techniques judiciously in combination with other, less granular techniques that are less power intensive and complex.
In graduate school I would often joke that cache memories were the only great idea in computer science. But truly, where you put things affects profoundly the design of computer systems. Chapter 5 describes the classical design of cache and main memory hierarchies and virtual memory. And now, new, higher-level programming languages like Java support much more reliable software because they insist on the use of garbage collection and array bounds checking, so that security breaches from "buffer overflow" and insidious bugs from false sharing of memory do not creep into large programs. It is only languages, such as Java, that insist on the use of automatic storage management that can implement true software components. But garbage collectors are notoriously hard on memory hierarchies, and the design of systems and language implementations to work well for such areas is an active area of research, where much good work has been done but much exciting work remains.
Java also strongly supports thread-level parallelism—a key to simple, powerefficient, and high-performance system implementations that avoids the N2 problem discussed earlier but brings challenges of its own. A good foundational understanding of these issues can be had in Chapter 6. Traditionally, each processor was a separate chip, and keeping the various processors synchronized was expensive, both because of its impact on the memory hierarchy and because the synchronization operations themselves were very expensive. The Java language is also trying to address these issues: we tried, in the Java Language Specification, which I coauthored, to write a description of the memory model implied by the language. While this description turned out to have (fixable) technical problems, it is increasingly clear that we need to think about the memory hierarchy in the design of languages that are intended to work well on the newer system platforms. We view the Java specification as a first step in much good work to be done in the future.
As Chapter 7 describes, storage has evolved from being connected to individual computers to being a separate network resource. This is reminiscent of computer graphics, where graphics processing that was previously done in a host processor often became a separate function as the importance of graphics increased. All this is likely to change radically in the coming years—massively parallel host processors are likely to be able to do graphics better than dedicated outboard graphics units, and new breakthroughs in storage technologies, such as memories made from molecular electronics and other atomic-level nanotechnologies, should greatly reduce both the cost of storage and the access time. The resulting dramatic decreases in storage cost and access time will strongly encourage the use of multiple copies of data stored on individual computing nodes, rather than shared over a network. The "wheel of reincarnation," familiar from graphics, will appear in storage.
Chapter 8 provides a great foundational description of computer interconnects and networks. My model of these comes from Andy Bechtolsheim, another of the cofounders of Sun, who famously said, "Ethernet always wins."More modestly stated: given the need for a new networking interconnect, and despite its shortcomings, adapted versions of the Ethernet protocols seem to have met with overwhelming success in the marketplace. Why? Factors such as the simplicity and familiarity of the protocols are obvious, but quite possibly the most likely reason is that the people who are adapting Ethernet can get on with the job at hand rather than arguing about details that, in the end, aren’t dispositive. This lesson can be generalized to apply to all the areas of computer architecture discussed in this book.
One of the things I remember Dave Patterson saying many years ago is that for each new project you only get so many "cleverness beans." That is, you can be very clever in a few areas of your design, but if you try to be clever in all of them, the design will probably fail to achieve its goals—or even fail to work or to be finished at all. The overriding lesson that I have learned in 20 plus years of working on these kinds of designs is that you must choose what is important and focus on that; true wisdom is to know what to leave out. A deep knowledge of what has gone before is key to this ability.
And you must also choose your assumptions carefully. Many years ago I attended a conference in Hawaii (yes, it was a boondoggle, but read on) where Maurice Wilkes, the legendary computer architect, gave a speech. What he said, paraphrased in my memory, is that good research often consists of assuming something that seems untrue or unlikely today will become true and investigating the consequences of that assumption. And if the unlikely assumption indeed then becomes true in the world, you will have done timely and sometimes, then, even great research! So, for example, the research group at Xerox PARC assumed that everyone would have access to a personal computer with a graphics display connected to others by an internetwork and the ability to print inexpensively using Xerography. How true all this became, and how seminally important their work was!
In our time, and in the field of computer architecture, I think there are a number of assumptions that will become true. Some are not controversial, such as that Moore’s Law is likely to continue for another decade or so and that the complexity of large chip designs is reaching practical limits, often beyond the point of positive returns for additional complexity. More controversially, perhaps, molecular electronics is likely to greatly reduce the cost of storage and probably logic elements as well, optical interconnects will greatly increase the bandwidth and reduce the error rates of interconnects, software will continue to be unreliable because it is so difficult, and security will continue to be important because its absence is so debilitating.
Taking advantage of the strong positive trends detailed in this book and using them to mitigate the negative ones will challenge the next generation of computer architects, to design a range of systems of many shapes and sizes.
Computer architecture design problems are becoming more varied and interesting. Now is an exciting time to be starting out or reacquainting yourself with the latest in this field, and this book is the best place to start. See you in the chips!
Posted February 2, 2002
Posted December 11, 2000
I studied processor design using two books - this one and 'Computer Organization and Design' by the same authors (Hennessy and Patterson). I still remember an intense joy of understanding when I read chapter about pipeline organization. I actually designed a little processor using these two books as the main source of information. This book also contains many stories about computer industry veterans - stories of invention and success. Note: The target audience is hardware engineers or EDA/CAD/CAE people. This book is NOT for software/CS people - I saw some software students who did not understand why do they need to study it and what this book is talking about. But hardware student with some Verilog / VHDL experience can instantly implement many ideas described in this book.Was this review helpful? Yes NoThank you for your feedback. Report this reviewThank you, this review has been flagged.
Posted June 3, 2000
In an attempt to be a complete authority on different architecture schemes, this book did not prioritize content. There was not enough emphasis on what has become industry standards. Many of the architecture designs covered have been abandoned by the industry for more than a decade. This book is good at covering a broad range of information, but I'd prefer that it contain more information on what is actually found in today's computers.Was this review helpful? Yes NoThank you for your feedback. Report this reviewThank you, this review has been flagged.