The Barnes & Noble Review
Computer Architecture: A Quantitative Approach, Fourth Edition doesn't just illuminates the cutting edge in computer architecture; it captures the passion and excitement that still drives the discipline.
Yes, the field has matured: For instance, traditional sequential processing no longer delivers major performance boosts. Hence, this new edition focuses far more heavily on parallel processing at the thread and data levels, not just the instruction level. These techniques are crucial to today's single-chip multicores, and tomorrow's.
That's just one way Hennessey and Patterson have updated their classic. Error rates are rising as process sizes keep shrinking, so there's more prominent coverage of dependability. Power efficiency gets more important every year, so that coverage has been beefed up, too. What hasn't changed? The book thoroughly demystifies the trade-offs that are central to architecture and design, making it as valuable to experienced professionals as it is to newcomers. Bill Camarda, from the December 2006 Read Only
Text reference for computer architecture and design presents the critical tools to analyze uniprocessor computers. It shows the practicing engineer how technology changes over time and offers the empirical constants needed for design. A baseline is established for analysis and comparisons by using the most important machines in each class: mainframe (IBM 360), mini (DEC VAX), and micro/PC (Intel 80x86). With this foundation, the coming mainline of simpler pipelined and parallel processors is shown. Annotation c. Book News, Inc., Portland, OR (booknews.com)
Rebirth of an Instant Classic
When I reviewed the first edition of this book in the October, 1990 Programmer's Bookshelf column of Dr. Dobb's Journal, I wrote:
Computer Architecture: A Quantitative Approach
is a tour-de-force on several levels. The book is a masterpiece of technical writing -- Hennessy and Patterson's clear, direct style is absorbing and effective, and their enthusiasm for their subject is contagious. The design and production, too, are impeccable. Furthermore, because the book presents a hardheaded and
pragmatic approach to computer design, based on real examples, real
measurements, and lessons learned from the successes and misadventures of the past, it should revolutionize the teaching of computer architecture and implementation.
Although this book was not written primarily for programmers, it is a thorough and extraordinarily wide-ranging education in that magical interface between the programmer's intentions and the electron's actions. It should be read by every software craftsman who cares about wringing the last drop of performance from his machine.
Flowery words, without a doubt -- the type of glowing review that one sometimes regrets a few years later. In this case, however, the book (and my review) have stood the test of time. It is even more clear now that Computer Architecture is one of the all-time greats.
The second edition is about 200 pages longer than the first and has a significant shift in focus, although the structure is much the same. The first edition contained fascinating explanations of the IBM 360 and DEC VAX architectures as the archetypes of "mainframes" and "minicomputers" respectively. You'll never find a clearer explanation of IBM's "channels" anywhere. The second edition jettisons almost all the IBM and VAX material, and substitutes discussions of the HP
PA-RISC and Motorola PowerPC for the Intel i860 and Motorola 88000, but considerably expands the coverage of high-performance instruction execution strategies, multiprocessing, caches, and magnetic storage.
Comprehensive sections on processor interconnection, networking, and 64-bit architectures have been added.
Computer Architecture: A Quantitative Approach, Second Edition is
a must-buy for every serious programmer, engineer, computer science student, and technical library. But hold on to your copy of the first edition as well; it may come in very handy in a few decades when your grandchildren ask you "What were those giant VAX boxes anyway?"--Dr. Dobb's Electronic Review of Computer Books
From the Publisher
"What has made this book an enduring classic is that each edition is not an update, but an extensive revision that presents the most current information and unparalleled insight into this fascinating and fast changing field. For me, after over twenty years in this profession, it is also another opportunity to experience that student-grade admiration for two remarkable teachers." From the Foreword by Luiz André Barroso, Google, Inc.
Read an Excerpt
Chapter 5: I/O And Consistency of Cached DataBecause of caches, data can be found in memory and in the cache. As long as the CPU is the sole device changing or reading the data and the cache stands between the CPU and memory, there is little danger in the CPU seeing the old or stale copy. I/O devices give the opportunity for other devices to cause copies to be inconsistent or for other devices to read the stale copies. Figure 5.46 illustrates the problem, generally referred to as the cache-coherency problem.
The question is this: Where does the I/O occur in the computer-between the I/O device and the cache or between the I/O device and main memory? If input puts data into the cache and output reads data from the cache, both I/O and the CPU see the same data, and the problem is solved. The difficulty in this approach is that it interferes with the CPU. I/O competing with the CPU for cache access will cause the CPU to stall for I/O. Input will also interfere with the cache by displacing some information with the new data that is unlikely to be accessed by the CPU soon. For example, on a page fault the CPU may need to access a few words in a page, but a program is not likely to access every word of the page if it were loaded into the cache. Given the integration of caches onto the same integrated circuit, it is also difficult for that interface to be visible.
The goal for the I/O system in a computer with a cache is to prevent the stale data problem while interfering with the CPU as little as possible. Many systems, therefore, prefer that I/O occur directly to main memory, with main memory
FIGURE 5.46 The cache-coherency problem. A' and B refer to the cached copiesof A and B in memory. (a) shows cache and main memory in a coherent state. In (b) we assume a write-back cache when the CPU writes 550 into A. Now A' has the value but the value in memory has the old, stale value of 100. If an output used the value of A from memory, it would get the stale data. In (c) the I/O system inputs 440 into the memory copy of B, so now B, in the cache has the old, stale data acting as an I/O buffer. If a write-through cache is used, then memory has an upto-date copy of the information, and there is no stale-data issue for output. (This is a reason many machines use write through.) Input requires some extra work. The software solution is to guarantee that no blocks of the I/O buffer designated for input are in the cache. In one approach, a buffer page is marked as noncachable; the operating system always inputs to such a page. In another approach, the operating system flushes the buffer addresses from the cache after the input occurs. A hardware solution is to check the I/O addresses on input to see if they are in the cache; to avoid slowing down the cache to check addresses, sometimes a duplicate set of tags are used to allow checking of I/O addresses in parallel with processor cache accesses. If there is a match of I/O addresses in the cache, the cache entries are invalidated to avoid stale data. All these approaches can also be used for output with write-back caches. More about this is found in Chapter 6.
The cache-coherency problem applies to multiprocessors as well as I/O. Unlike I/O, where multiple data copies are a rare event-one to be avoided whenever possible-a program running on multiple processors will want to have copies of the same data in several caches. Performance of a multiprocessor program depends on the performance of the system when sharing data. The protocols to maintain coherency for multiple processors are called cache-coherency protocols and are described in Chapter 8.
5.10 Putting It All Together the Alpha AXP 21064 Memory Hierarchy Thus far we have given glimpses of the Alpha AXP 21064 memory hierarchy; this section unveils the full design and shows the performance of its components for the SPEC92 programs. Figure 5.47 gives the overall picture of this design.
Let's really start at the beginning, when the Alpha is turned on. Hardware on the chip loads the instruction cache from an external PROM. This initialization allows the 8-KB instruction cache to omit a valid bit, for there are always valid instructions in the cache; they just might not be the ones your program is interested in. The hardware does clear the valid bits in the data cache. The PC is set to the kseg segment so that the instruction addresses are not translated, thereby avoiding the TLB.
One of the first steps is to update the instruction TLB with valid page table entries (PTEs) for this process. Kernel code updates the TLB with the contents of the appropriate page table entry for each page to be mapped. The instruction TLB has eight entries for 8-KB pages and four for 4-MB pages. (The 4-MB pages are used by large programs such as the operating system or data bases that will likely touch most of their code.) A miss in the TLB invokes the Privileged Architecture Library (PAL code) software that updates the TLB. PAL code is simply machine language routines with some implementation-specific extensions to allow access to low-level hardware, such as the TLB. PAL code runs with exceptions disabled, and instruction accesses are not checked for memory management violations, allowing PAL code to fill the TLB.
Once the operating system is ready to begin executing a user process, it sets the PC to the appropriate address in segment segO. We are now ready to follow memory hierarchy in action: Figure 5.47 is labeled with the steps of this narrative. The page frame portion of this address is sent to the TLB (step 1), while the 8-bit index from the page offset is sent to the direct-mapped 8-KB (256 32-byte blocks) instruction cache (step 2). The fully associative TLB simultaneously searches all 12 entries to find a match between the address and a valid PTE (step 3). In addition to translating the address, the TLB checks to see if the PTE demands that this access result in an exception. An exception might occur if either this access violates the protection on the page or ifthe page is not in main memory. If there is no exception, and if the translated physical address matches the tag in the instruction cache (step 4), then the proper 8 bytes of the 32-byte block are furnished to the CPU using the lower bits of the page offset (step 5), and the instruction stream access is done.
A miss, on the other hand, simultaneously starts an access to the second-level cache (step 6) and checks the prefetch instruction stream buffer (step 7). If the desired instruction is found in the stream buffer (step 8), the critical 8 bytes are sent to the CPU, the full 32-byte block of the stream buffer is written into the instruction cache (step 9), and the request to the second-level cache is canceled. Steps 6 to 9 take just a single clock cycle.
If the instruction is not in the prefetch stream buffer, the second-level cache continues trying to fetch the block. The 21064 microprocessor is designed to work with direct-mapped second-level caches from 128 KB to 8 MB with a miss penalty between 3 and 16 clock cycles. For this section we use the memory system of the DEC 3000 model 800 Alpha AXP. It has a 2-MB (65,536 32-byte blocks) second-level cache, so the 29-bit block address is divided into a 13-bit tag and a 16-bit index (step 10). The cache reads the tag from that index and if it matches (step 11), the cache returns the critical 16 bytes in the first 5 clock cycles and the other 16 bytes in the next 5 clock cycles (step 12). The path between the first- and second-level cache is 128 bits wide (16 bytes). At the same time, a request is made for the next sequential 32-byte block, which is loaded into the instruction stream buffer in the next 10 clock cycles (step 13).
The instruction stream does not rely on the TLB for address translation. It simply increments the physical address of the miss by 32 bytes, checking to make sure that the new address is within the same page. If the incremented address crosses a page boundary, then the prefetch is suppressed.
If the instruction is not found in the secondary cache, the translated physical address is sent to memory (step 14). The DEC 3000 model 800 divides memory into four memory mother boards (MMB), each of which contains two to eight SIMMs (single inline memory modules). The SIMMs come with eight DRAMs for information plus two DRAMs for error protection per side, and the options are single- or double-sided SIMMs using I-Mbit, 4-Mbit, or 16-Mbit DRAMs. Hence the memory capacity of the model 800 is 8 MB (4 x 2 x 8 x I x 1/8) to 1024 MB (4 x 8 x 8 x 16 x 2/8), always organized 256 bits wide. The average time to transfer 32 bytes from memory to the secondary cache is 36 clock cycles after the processor makes the request. The second-level cache loads this data 16 bytes at a time.
Since the second-level cache is a write-back cache, any miss can lead to the old block being written back to memory. The 21064 places this "victim" block into a victim buffer to get out of the way of new data (step 15). The new data are loaded into the cache as soon as they arrive (step 16), and then the old data are written from the victim buffer (step 17). There is a single block in the victim buffer, so a second miss would need to stall until the victim buffer empties.
Suppose this initial instruction is a load. It will send the page frame of its data address to the data TLB (step 18) at the same time as the 8-bit index from the page offset is sent to the data cache (step 19). The data TLB is a fully associative cache containing 32 PTEs, each of which represents page sizes from 8 KB to 4 MB. A TLB miss will trap to PAL code to load the valid PTE for this address. In the worst case, the page is not in memory, and the operating system gets the page from disk (step 20). Since millions of instructions could execute during a page fault, the operating system will swap in another process if there is something waiting to run.
Assuming that we have a valid PTE in the data TLB (step 21), the cache tag and the physical page frame are compared (step 22), with a match sending the desired 8 bytes from the 32-byte block to the CPU (step 23). A miss goes to the second-level cache, which proceeds exactly like an instruction miss.
Suppose the instruction is a store instead of a load. The page frame portion of the data address is again sent to the data TLB and the data cache (steps 18 and 19), which checks for protection violations as well as translates the address. The physical address is then sent to the data cache (steps 21 and 22). Since the data cache uses write through, the store data are simultaneously sent to the write buffer (step 24) and the data cache (step 25). As explained on page 425, the 21064 pipelines write hits. The data address of this store is checked for a match, and at the same time the data from the previous write hit are written to the cache (step 26). If the address check was a hit, then the data from this store are placed in the write pipeline buffer. On a miss, the data are just sent to the write buffer since the data cache does not allocate on a write miss.
The write buffer takes over now. It has four entries, each containing a whole cache block. If the buffer is full, then the CPU must stall until a block is written to the second-level cache. If the buffer is not full, the CPU continues and the address of the word is presented to the write buffer (step 27). It checks to see if the word matches any block already in the buffer so that a sequence of writes can be stitched together into a full block, thereby optimizing use of the write bandwidth between the first- and second-level cache.
All writes are eventually passed on to the second-level cache. If a write is a hit, then the data are written to the cache (step 28). Since the second-level cache uses write back, it cannot pipeline writes: a full 32-byte block write takes 5 clock cycles to check the address and 10 clock cycles to write the data. A write of 16 bytes or less takes 5 clock cycles to check the address and 5 clock cycles to write the data. In either case the cache marks the block as dirty.
If the access to the second-level cache is a miss, the victim block is checked to see if it is dirty; if so, it is placed in the victim buffer as before (step 15). If the new data are a full block, then the data are simply written and marked dirty. A partial block write results in an access to main memory since the second-level cache policy is to allocate on a write miss....