Programming Massively Parallel Processors: A Hands-on Approach


Programming Massively Parallel Processors: A Hands-on Approach shows both student and professional alike the basic concepts of parallel programming and GPU architecture. Various techniques for constructing parallel programs are explored in detail. Case studies demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs.

Describes computational thinking techniques that will enable you to think about problems in ways that are amenable to high-performance parallel computing

Utilizes CUDA (Compute Unified Device Architecture), NVIDIA's software development tool created specifically for massively parallel environments

Shows you how to achieve both high performance and high reliability using the CUDA programming model as well as OpenCL


Editorial Reviews

From the Publisher
"For those interested in the GPU path to parallel enlightenment, this new book from David Kirk and Wen-mei Hwu is a godsend, as it introduces CUDA™, a C-like data parallel language, and Tesla™, the architecture of the current generation of NVIDIA GPUs. In addition to explaining the language and the architecture, they define the nature of data parallel problems that run well on the heterogeneous CPU-GPU hardware ... This book is a valuable addition to the recently reinvigorated parallel computing literature." - David Patterson, Director of The Parallel Computing Research Laboratory and the Pardee Professor of Computer Science, U.C. Berkeley. Co-author of Computer Architecture: A Quantitative Approach

"Written by two teaching pioneers, this book is the definitive practical reference on programming massively parallel processors—a true technological gold mine. The hands-on learning included is cutting-edge, yet very readable. This is a most rewarding read for students, engineers, and scientists interested in supercharging computational resources to solve today's and tomorrow's hardest problems." - Nicolas Pinto, MIT, NVIDIA Fellow, 2009

"I have always admired Wen-mei Hwu's and David Kirk's ability to turn complex problems into easy-to-comprehend concepts. They have done it again in this book. This joint venture of a passionate teacher and a GPU evangelizer tackles the trade-off between the simple explanation of the concepts and the in-depth analysis of the programming techniques. This is a great book to learn both massive parallel programming and CUDA." - Mateo Valero, Director, Barcelona Supercomputing Center

"The use of GPUs is having a big impact in scientific computing. David Kirk and Wen-mei Hwu's new book is an important contribution towards educating our students on the ideas and techniques of programming for massively parallel processors." - Mike Giles, Professor of Scientific Computing, University of Oxford

"This book is the most comprehensive and authoritative introduction to GPU computing yet. David Kirk and Wen-mei Hwu are the pioneers in this increasingly important field, and their insights are invaluable and fascinating. This book will be the standard reference for years to come." - Hanspeter Pfister, Harvard University

"This is a vital and much-needed text. GPU programming is growing by leaps and bounds. This new book will be very welcomed and highly useful across inter-disciplinary fields." - Shannon Steinfadt, Kent State University

"GPUs have hundreds of cores capable of delivering transformative performance increases across a wide range of computational challenges. The rise of these multi-core architectures has raised the need to teach advanced programmers a new and essential skill: how to program massively parallel processors." –


Product Details

  • ISBN-13: 9780123814722
  • Publisher: Elsevier Science
  • Publication date: 2/5/2010
  • Series: Applications of GPU Computing Series
  • Edition description: Older Edition
  • Pages: 280
  • Sales rank: 985,624
  • Product dimensions: 7.40 (w) x 9.10 (h) x 0.80 (d)

Meet the Author

David B. Kirk is well recognized for his contributions to graphics hardware and algorithm research. By the time he began his studies at Caltech, he had already earned B.S. and M.S. degrees in mechanical engineering from MIT and worked as an engineer for Raster Technologies and Hewlett-Packard's Apollo Systems Division, and after receiving his doctorate, he joined Crystal Dynamics, a video-game manufacturing company, as chief scientist and head of technology. In 1997, he took the position of Chief Scientist at NVIDIA, a leader in visual computing technologies, and he is currently an NVIDIA Fellow.

At NVIDIA, Kirk led graphics-technology development for some of today's most popular consumer-entertainment platforms, playing a key role in providing mass-market graphics capabilities previously available only on workstations costing hundreds of thousands of dollars. For his role in bringing high-performance graphics to personal computers, Kirk received the 2002 Computer Graphics Achievement Award from the Association for Computing Machinery and the Special Interest Group on Graphics and Interactive Technology (ACM SIGGRAPH) and, in 2006, was elected to the National Academy of Engineering, one of the highest professional distinctions for engineers.

Kirk holds 50 patents and patent applications relating to graphics design and has published more than 50 articles on graphics technology, won several best-paper awards, and edited the book Graphics Gems III. A technological "evangelist" who cares deeply about education, he has supported new curriculum initiatives at Caltech and has been a frequent university lecturer and conference keynote speaker worldwide.

Wen-mei Hwu is CTO of MulticoreWare and a professor at the University of Illinois at Urbana-Champaign, specializing in compiler design, computer architecture, computer microarchitecture, and parallel processing. He currently holds the Walter J. ("Jerry") Sanders III-Advanced Micro Devices Endowed Chair in Electrical and Computer Engineering in the Coordinated Science Laboratory. He is a PI for the petascale Blue Waters system, co-director of the Intel- and Microsoft-funded Universal Parallel Computing Research Center (UPCRC), and PI for the world's first NVIDIA CUDA Center of Excellence. At the Illinois Coordinated Science Lab, Dr. Hwu leads the IMPACT Research Group and is director of the OpenIMPACT project, which has delivered new compiler and computer architecture technologies to the computer industry since 1987. He previously edited GPU Computing Gems, a similar work focusing on NVIDIA CUDA.


Read an Excerpt

Programming Massively Parallel Processors

A Hands-on Approach
By David B. Kirk and Wen-mei W. Hwu


Copyright © 2013 David B. Kirk/NVIDIA Corporation and Wen-mei Hwu
All rights reserved.

ISBN: 978-0-12-391418-7

Chapter One



1.1 Heterogeneous Parallel Computing
1.2 Architecture of a Modern GPU
1.3 Why More Speed or Parallelism?
1.4 Speeding Up Real Applications
1.5 Parallel Programming Languages and Models
1.6 Overarching Goals
1.7 Organization of the Book
References

Microprocessors based on a single central processing unit (CPU), such as those in the Intel Pentium family and the AMD Opteron family, drove rapid performance increases and cost reductions in computer applications for more than two decades. These microprocessors brought GFLOPS, or giga (10^9) floating-point operations per second, to the desktop and TFLOPS, or tera (10^12) floating-point operations per second, to cluster servers. This relentless drive for performance improvement has allowed application software to provide more functionality, have better user interfaces, and generate more useful results. The users, in turn, demand even more improvements once they become accustomed to these improvements, creating a positive (virtuous) cycle for the computer industry.

This drive, however, has slowed since 2003 due to energy consumption and heat dissipation issues that limited the increase of the clock frequency and the level of productive activities that can be performed in each clock period within a single CPU. Since then, virtually all microprocessor vendors have switched to models where multiple processing units, referred to as processor cores, are used in each chip to increase the processing power. This switch has exerted a tremendous impact on the software developer community [Sutter2005].

Traditionally, the vast majority of software applications are written as sequential programs, as described by von Neumann in his seminal report in 1945 [vonNeumann1945]. The execution of these programs can be understood by a human sequentially stepping through the code. Historically, most software developers have relied on the advances in hardware to increase the speed of their sequential applications under the hood; the same software simply runs faster as each new generation of processors is introduced. Computer users have also become accustomed to the expectation that these programs run faster with each new generation of microprocessors. Such expectation is no longer valid from this day onward. A sequential program will only run on one of the processor cores, which will not become significantly faster than those in use today. Without performance improvement, application developers will no longer be able to introduce new features and capabilities into their software as new microprocessors are introduced, reducing the growth opportunities of the entire computer industry.

Rather, the applications software that will continue to enjoy performance improvement with each new generation of microprocessors will be parallel programs, in which multiple threads of execution cooperate to complete the work faster. This new, dramatically escalated incentive for parallel program development has been referred to as the concurrency revolution [Sutter2005]. The practice of parallel programming is by no means new. The high-performance computing community has been developing parallel programs for decades. These programs run on large-scale, expensive computers. Only a few elite applications can justify the use of these expensive computers, thus limiting the practice of parallel programming to a small number of application developers. Now that all new microprocessors are parallel computers, the number of applications that need to be developed as parallel programs has increased dramatically. There is now a great need for software developers to learn about parallel programming, which is the focus of this book.


Since 2003, the semiconductor industry has settled on two main trajectories for designing microprocessors [Hwu2008]. The multicore trajectory seeks to maintain the execution speed of sequential programs while moving into multiple cores. The multicores began with two-core processors, with the number of cores increasing with each semiconductor process generation. A current exemplar is the recent Intel Core i7™ microprocessor with four processor cores, each of which is an out-of-order, multiple-instruction-issue processor implementing the full x86 instruction set, supporting hyper-threading with two hardware threads, designed to maximize the execution speed of sequential programs. In contrast, the many-thread trajectory focuses more on the execution throughput of parallel applications. The many-thread processors began with a large number of threads, and once again, the number of threads increases with each generation. A current exemplar is the NVIDIA GTX680 graphics processing unit (GPU) with 16,384 threads, executing in a large number of simple, in-order pipelines.

Many-thread processors, especially the GPUs, have led the race of floating-point performance since 2003. As of 2012, the ratio of peak floating-point calculation throughput between many-thread GPUs and multicore CPUs is about 10 to 1. These are not necessarily application speeds, but merely the raw speeds that the execution resources can potentially support in these chips: 1.5 teraflops versus 150 gigaflops double precision in 2012.

Such a large performance gap between parallel and sequential execution has amounted to a significant "electrical potential" build-up, and at some point, something will have to give. We have reached that point now. To date, this large performance gap has already motivated many application developers to move the computationally intensive parts of their software to GPUs for execution. Not surprisingly, these computationally intensive parts are also the prime target of parallel programming—when there is more work to do, there is more opportunity to divide the work among cooperating parallel workers.

One might ask why there is such a large peak-performance gap between many-thread GPUs and general-purpose multicore CPUs. The answer lies in the differences in the fundamental design philosophies between the two types of processors, as illustrated in Figure 1.1. The design of a CPU is optimized for sequential code performance. It makes use of sophisticated control logic to allow instructions from a single thread to execute in parallel or even out of their sequential order while maintaining the appearance of sequential execution. More importantly, large cache memories are provided to reduce the instruction and data access latencies of large complex applications. Neither control logic nor cache memories contribute to the peak calculation speed. As of 2012, the high-end general-purpose multicore microprocessors typically have six to eight large processor cores and multiple megabytes of on-chip cache memories designed to deliver strong sequential code performance.

Memory bandwidth is another important issue. The speed of many applications is limited by the rate at which data can be delivered from the memory system into the processors. Graphics chips have been operating at approximately six times the memory bandwidth of contemporaneously available CPU chips. In late 2006, GeForce 8800 GTX, or simply G80, was capable of moving data at about 85 gigabytes per second (GB/s) in and out of its main dynamic random-access memory (DRAM) because of graphics frame buffer requirements and the relaxed memory model (the way various system software, applications, and input/output (I/O) devices expect their memory accesses to work). The more recent GTX680 chip supports about 200 GB/s. In contrast, general-purpose processors have to satisfy requirements from legacy operating systems, applications, and I/O devices that make memory bandwidth more difficult to increase. As a result, CPUs will continue to be at a disadvantage in terms of memory bandwidth for some time.
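To make the bandwidth limit concrete, the following is a rough sketch of how a memory-bound ceiling on calculation speed could be estimated. The bandwidth figures (about 85 GB/s for G80 and about 200 GB/s for GTX680) are taken from the paragraph above. The 4-byte operand size, the ratio of one floating-point operation per operand fetched, and the function name are assumptions made for illustration; they do not come from the excerpt.

```python
def bandwidth_bound_gflops(bandwidth_gb_per_s, bytes_per_operand=4, flops_per_operand=1.0):
    """Estimate an upper bound on GFLOPS for a kernel limited by DRAM bandwidth.

    Assumes (hypothetically) that every operand is fetched from DRAM exactly
    once and that `flops_per_operand` operations are performed on it.
    """
    if bytes_per_operand <= 0:
        raise ValueError("bytes_per_operand must be positive")
    # Billions of operands that can be delivered per second.
    giga_operands_per_s = bandwidth_gb_per_s / bytes_per_operand
    return giga_operands_per_s * flops_per_operand


if __name__ == "__main__":
    # Bandwidths quoted in the text; everything else is an assumption.
    for chip, bandwidth in [("G80", 85.0), ("GTX680", 200.0)]:
        bound = bandwidth_bound_gflops(bandwidth)
        print(f"{chip}: at most {bound:.2f} GFLOPS when limited by memory bandwidth")
```

Under these assumptions the ceiling is 21.25 GFLOPS at 85 GB/s and 50 GFLOPS at 200 GB/s, far below the peak arithmetic rates quoted earlier. Raising `flops_per_operand`, that is, reusing each fetched value for more calculations, raises the ceiling proportionally.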

The design philosophy of GPUs is shaped by the fast-growing video game industry that exerts tremendous economic pressure for the ability to perform a massive number of floating-point calculations per video frame in advanced games. This demand motivates GPU vendors to look for ways to maximize the chip area and power budget dedicated to floating-point calculations. The prevailing solution is to optimize for the execution throughput of massive numbers of threads. The design saves chip area and power by allowing pipelined memory channels and arithmetic operations to have long latency. The reduced area and power of the memory access hardware and arithmetic units allows the designers to have more of them on a chip and thus increase the total execution throughput.

The application software is expected to be written with a large number of parallel threads. The hardware takes advantage of the large number of threads to find work to do when some of them are waiting for long-latency memory accesses or arithmetic operations. Small cache memories are provided to help control the bandwidth requirements of these applications so that multiple threads that access the same memory data do not need to all go to the DRAM. This design style is commonly referred to as throughput-oriented design since it strives to maximize the total execution throughput of a large number of threads while allowing individual threads to take a potentially much longer time to execute.
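The way a large number of threads hides long latencies can be sketched with a deliberately simplified model. Assume each thread alternates between a fixed number of cycles of computation and one long-latency memory access, and that the hardware can switch to another ready thread at no cost. The cycle counts, the function names, and the model itself are illustrative assumptions, not figures or methods from the excerpt.

```python
import math


def threads_to_hide_latency(memory_latency_cycles, compute_cycles_per_access):
    """Smallest thread count that keeps the execution unit busy in this model.

    While one thread waits `memory_latency_cycles`, the other threads must
    together supply at least that many cycles of computation.
    """
    if memory_latency_cycles < 0 or compute_cycles_per_access <= 0:
        raise ValueError("latency must be >= 0 and compute cycles must be > 0")
    return math.ceil(memory_latency_cycles / compute_cycles_per_access) + 1


def pipeline_utilization(num_threads, memory_latency_cycles, compute_cycles_per_access):
    """Fraction of cycles spent computing, capped at 1.0 (fully busy)."""
    busy_cycles = num_threads * compute_cycles_per_access
    period = compute_cycles_per_access + memory_latency_cycles
    return min(1.0, busy_cycles / period)


if __name__ == "__main__":
    latency, work = 400, 10  # hypothetical cycle counts
    needed = threads_to_hide_latency(latency, work)
    for n in (1, 8, needed):
        util = pipeline_utilization(n, latency, work)
        print(f"{n:3d} threads -> utilization {util:.2%}")
```

With these hypothetical numbers a single thread keeps the unit busy only a small fraction of the time, while 41 threads keep it fully occupied. Each individual thread still takes just as long to finish, which is the trade-off that throughput-oriented design accepts.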

The CPUs, on the other hand, are designed to minimize the execution latency of a single thread. Large last-level on-chip caches are designed to capture frequently accessed data and convert some of the long-latency memory accesses into short-latency cache accesses. The arithmetic units and operand data delivery logic are also designed to minimize the effective latency of operation at the cost of increased use of chip area and power. By reducing the latency of operations within the same thread, the CPU hardware reduces the execution latency of each individual thread. However, the large cache memory, low-latency arithmetic units, and sophisticated operand delivery logic consume chip area and power that could be otherwise used to provide more arithmetic execution units and memory access channels. This design style is commonly referred to as latency-oriented design.

It should be clear now that GPUs are designed as parallel, throughput-oriented computing engines and they will not perform well on some tasks on which CPUs are designed to perform well. For programs that have one or very few threads, CPUs with lower operation latencies can achieve much higher performance than GPUs. When a program has a large number of threads, GPUs with higher execution throughput can achieve much higher performance than CPUs. Therefore, one should expect that many applications use both CPUs and GPUs, executing the sequential parts on the CPU and numerically intensive parts on the GPUs. This is why the CUDA programming model, introduced by NVIDIA in 2007, is designed to support joint CPU–GPU execution of an application. The demand for supporting joint CPU–GPU execution is further reflected in more recent programming models such as OpenCL (see Chapter 14), OpenACC (see Chapter 15), and C++ AMP (see Chapter 18).
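One standard way to reason about splitting an application between the CPU and the GPU is Amdahl's law, which the excerpt does not state; it is brought in here only as an illustration. The sketch assumes that a fraction of the original run time is parallelizable and is accelerated by some factor on the GPU, while the rest stays sequential on the CPU. The function name and the sample numbers are hypothetical.

```python
def overall_speedup(parallel_fraction, parallel_speedup):
    """Amdahl's law: whole-application speedup when only part is accelerated.

    parallel_fraction: share of the original run time that is parallelizable (0..1).
    parallel_speedup:  factor by which that share is accelerated (> 0).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if parallel_speedup <= 0:
        raise ValueError("parallel_speedup must be positive")
    sequential_time = 1.0 - parallel_fraction
    accelerated_time = parallel_fraction / parallel_speedup
    return 1.0 / (sequential_time + accelerated_time)


if __name__ == "__main__":
    # Hypothetical scenarios: the GPU runs the parallel part 100x faster.
    for fraction in (0.30, 0.90, 0.99):
        speedup = overall_speedup(fraction, 100.0)
        print(f"{fraction:.0%} parallel -> {speedup:.1f}x overall")
```

In this model the sequential part quickly dominates: with 90% of the time parallelizable, even a 100x acceleration of that part yields less than a 10x overall gain. That is one reason the sequential portion should run on the processor with the lowest latency for that kind of code.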


Excerpted from Programming Massively Parallel Processors by David B. Kirk and Wen-mei W. Hwu. Copyright © 2013 by David B. Kirk/NVIDIA Corporation and Wen-mei Hwu. Excerpted by permission of ELSEVIER. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.


Table of Contents

Preface

Acknowledgments

Dedication

Chapter 1 Introduction
1.1 GPUs as Parallel Computers
1.2 Architecture of a Modern GPU
1.3 Why More Speed or Parallelism?
1.4 Parallel Programming Languages and Models
1.5 Overarching Goals
1.6 Organization of the Book

Chapter 2 History of GPU Computing
2.1 Evolution of Graphics Pipelines
2.1.1 The Era of Fixed-Function Graphics Pipelines
2.1.2 Evolution of Programmable Real-Time Graphics
2.1.3 Unified Graphics and Computing Processors
2.1.4 GPGPU: An Intermediate Step
2.2 GPU Computing
2.2.1 Scalable GPUs
2.2.2 Recent Developments
2.3 Future Trends

Chapter 3 Introduction to CUDA
3.1 Data Parallelism
3.2 CUDA Program Structure
3.3 A Matrix-Matrix Multiplication Example
3.4 Device Memories and Data Transfer
3.5 Kernel Functions and Threading
3.6 Summary
3.6.1 Function declarations
3.6.2 Kernel launch
3.6.3 Predefined variables
3.6.4 Runtime API

Chapter 4 CUDA Threads
4.1 CUDA Thread Organization
4.2 Using blockIdx and threadIdx
4.3 Synchronization and Transparent Scalability
4.4 Thread Assignment
4.5 Thread Scheduling and Latency Tolerance
4.6 Summary
4.7 Exercises

Chapter 5 CUDA Memories
5.1 Importance of Memory Access Efficiency
5.2 CUDA Device Memory Types
5.3 A Strategy for Reducing Global Memory Traffic
5.4 Memory as a Limiting Factor to Parallelism
5.5 Summary
5.6 Exercises

Chapter 6 Performance Considerations
6.1 More on Thread Execution
6.2 Global Memory Bandwidth
6.3 Dynamic Partitioning of SM Resources
6.4 Data Prefetching
6.5 Instruction Mix
6.6 Thread Granularity
6.7 Measured Performance and Summary
6.8 Exercises

Chapter 7 Floating Point Considerations
7.1 Floating-Point Format
7.1.1 Normalized Representation of M
7.1.2 Excess Encoding of E
7.2 Representable Numbers
7.3 Special Bit Patterns and Precision
7.4 Arithmetic Accuracy and Rounding
7.5 Algorithm Considerations
7.6 Summary
7.7 Exercises

Chapter 8 Application Case Study: Advanced MRI Reconstruction
8.1 Application Background
8.2 Iterative Reconstruction
8.3 Computing F^Hd
Step 1 Determine the Kernel Parallelism Structure
Step 2 Getting Around the Memory Bandwidth Limitation
Step 3 Using Hardware Trigonometry Functions
Step 4 Experimental Performance Tuning
8.4 Final Evaluation
8.5 Exercises

Chapter 9 Application Case Study: Molecular Visualization and Analysis
9.1 Application Background
9.2 A Simple Kernel Implementation
9.3 Instruction Execution Efficiency
9.4 Memory Coalescing
9.5 Additional Performance Comparisons
9.6 Using Multiple GPUs
9.7 Exercises

Chapter 10 Parallel Programming and Computational Thinking
10.1 Goals of Parallel Programming
10.2 Problem Decomposition
10.3 Algorithm Selection
10.4 Computational Thinking
10.5 Exercises

Chapter 11 A Brief Introduction to OpenCL
11.1 Background
11.2 Data Parallelism Model
11.3 Device Architecture
11.4 Kernel Functions
11.5 Device Management and Kernel Launch
11.6 Electrostatic Potential Map in OpenCL
11.7 Summary
11.8 Exercises

Chapter 12 Conclusion and Future Outlook
12.1 Goals Revisited
12.2 Memory Architecture Evolution
12.2.1 Large Virtual and Physical Address Spaces
12.2.2 Unified Device Memory Space
12.2.3 Configurable Caching and Scratch Pad
12.2.4 Enhanced Atomic Operations
12.2.5 Enhanced Global Memory Access
12.3 Kernel Execution Control Evolution
12.3.1 Function Calls within Kernel Functions
12.3.2 Exception Handling in Kernel Functions
12.3.3 Simultaneous Execution of Multiple Kernels
12.3.4 Interruptible Kernels
12.4 Core Performance
12.4.1 Double-Precision Speed
12.4.2 Better Control Flow Efficiency
12.5 Programming Environment
12.6 A Bright Outlook

Appendix A Matrix Multiplication Host-Only Version Source Code
A.1 matrixmul.cu
A.2 matrixmul_gold.cpp
A.3 matrixmul.h
A.4 assist.h
A.5 Expected Output

Appendix B GPU Compute Capabilities
B.1 GPU Compute Capability Tables
B.2 Memory Coalescing Variations

Index


Customer Reviews

  • Posted June 18, 2010

    Okay, but far from perfect

    This is the best book on CUDA programming, but mostly that's because it's the only book out there at this time. My two main issues with the book are that it's incomplete and not optimally organized.

    An example of the incompleteness is that texture device memory is never discussed in the book, although it shows up in some schematic device drawings.

    As for the organization, this is an example of Don Knuth's saying that "premature optimization is the root of all evil." I recognize that optimization is a major point in GPU computing, and ultimately necessary, but just like you won't start an introduction to C with a discussion of loop unrolling and cache prefetching, this book should put more emphasis on general principles first and deal with optimization later.
