GPU Computing Gems Jade Edition
By Wen-mei W. Hwu
MORGAN KAUFMANN PUBLISHERS
Copyright © 2012 Elsevier, Inc.
All rights reserved.
ISBN: 978-0-12-385964-8
Chapter One
Large-Scale GPU Search
Tim Kaldewey and Andrea Di Blas
With P-ary search we developed a novel scalable parallel search algorithm that optimally leverages Single Instruction Multiple Data (SIMD) architectures like graphics processing units (GPUs). It outperforms conventional search algorithms like binary search in terms of throughput and response time by up to two orders of magnitude. Moreover, P-ary search scales with the number of threads/cores used to collaboratively answer an individual search query. While response time for conventional search algorithms tends to increase with workload size, P-ary search provides nearly constant query response time, independent of workload. Finally, P-ary search is particularly suited for data structures like B-trees, which are widely used for database indexes.
1.1 INTRODUCTION
As search engines like Google and Yahoo! demonstrate every day, efficiently searching for information in large data collections has become an indispensable operation of the information age. Although disciplines like information retrieval have developed highly efficient algorithms to manage large quantities of data, at the system level searching large data sets is I/O bound. Rapidly growing main memory sizes eliminate disk I/O as a bottleneck, even for traditionally data-intensive applications like search. While the front lines may have shifted from disk to main memory I/O, the basic problem remains: processors are significantly faster than storage. On today's systems accessing main memory takes hundreds of compute cycles.
While memory bandwidth has been increasing continuously, we hit the "memory wall" for latency more than a decade ago. Moreover, the increases in memory bandwidth over the past decade are a result of the multi/many-core era, with GPUs being a poster child, offering memory bandwidth well in excess of 100 GB/s. However, achieving peak bandwidth requires concurrent (parallel) memory access from all cores. The obvious use of increasingly parallel architectures for large-scale search applications is to execute multiple search queries simultaneously, which often results in significant increases in throughput. Query response time, on the other hand, remains memory-latency bound, unless we abandon conventional serial algorithms and go back to the drawing board.
To grasp the implications of porting a memory-bound application like search to the GPU, we first conducted a brief memory performance analysis (see Section 1.2). Conventional threading models and the memory access pattern produced by concurrent search queries do not map well to the GPU's memory subsystem (see Section 1.3.2). With P-ary search we demonstrate how parallel memory access combined with the superior synchronization capabilities of SIMD architectures like GPUs can be leveraged to compensate for memory latency (see the section titled "P-ary Search: Parallel Search from Scratch"). P-ary search outperforms conventional search algorithms not only in terms of throughput but also response time (see Section 1.4). Implemented on a GPU it can outperform a similarly priced CPU by up to three times, and is compatible with existing index data structures like inverted lists and B-trees. We expect the underlying concepts of P-ary search — parallel memory access and exploiting efficient SIMD thread synchronization — to be applicable to an entire class of memory-bound applications (see Section 1.5).
1.2 MEMORY PERFORMANCE
With more than six times the memory bandwidth of contemporary CPUs, GPUs are leading the trend toward throughput computing. On the other hand, traditional search algorithms besides linear scan are latency bound, since their iterations are data dependent. However, as large database systems usually serve many queries concurrently, both metrics, latency and bandwidth, are relevant.
Memory latency is mainly a function of where the requested piece of data is located in the memory hierarchy. Comparing CPU and GPU memory latency in terms of elapsed clock cycles shows that global memory accesses on the GPU take approximately 1.5 times as long as main memory accesses on the CPU, and more than twice as long in terms of absolute time (Table 1.1). Although shared memory does not operate the same way as the L1 cache on the CPU, its latency is comparable.
Memory bandwidth, on the other hand, depends on multiple factors, such as sequential or random access pattern, read/write ratio, word size, and concurrency. The effects of word size and read/write behavior on memory bandwidth are similar to the ones on the CPU: larger word sizes achieve better performance than small ones, and reads are faster than writes. On the other hand, the impact of concurrency and data access pattern requires additional consideration when porting memory-bound applications to the GPU.
Little's Law, a general principle for queuing systems, can be used to derive how many concurrent memory operations are required to fully utilize memory bandwidth. It states that the average number of units L inside a system equals the product of the average arrival rate λ and the average time W a unit spends in the system: L = λW. Applying Little's Law to memory, the number of outstanding requests must match the product of latency and bandwidth. For our GTX 285 GPU the latency is 500 clock cycles, and the peak bandwidth is 128 bytes per clock cycle (the physical bus width is 512 bits, or a 64-byte memory block, and two of these blocks are transferred per clock cycle), so:
outstanding reads = latency x bandwidth / request size = 500 cc x 128 B/cc / (4 B/request) = 16K requests
assuming 4-byte reads as in the code in Section 1.4. This can be achieved using different combinations of number of threads and outstanding requests per thread. In our example, we could make full use of the global memory by having 1K threads issue 16 independent reads each, or 2K threads issue eight reads each, and so on.
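As a concrete illustration, the following is a minimal CUDA sketch, not taken from the chapter, of a kernel in which every thread issues several independent reads; the kernel name, the constant READS_PER_THREAD, and the output array sink are assumptions made for illustration only.

// Minimal sketch (not from the chapter): each thread issues several
// independent 4-byte reads so the memory system can keep them in flight
// simultaneously. All names (saturate_reads, READS_PER_THREAD, sink) are
// illustrative assumptions; sink has one element per thread.
#define READS_PER_THREAD 16

__global__ void saturate_reads(const int *data, int *sink, size_t n)
{
    unsigned int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int stride = gridDim.x * blockDim.x;
    int acc = 0;

    // The loads below have no data dependencies among each other, so the
    // hardware can overlap them, unlike the single dependent load per
    // iteration issued by binary search.
    #pragma unroll
    for (int i = 0; i < READS_PER_THREAD; ++i) {
        size_t idx = (size_t)tid + (size_t)i * stride;
        if (idx < n)
            acc += data[idx];   // independent, coalescible read
    }
    sink[tid] = acc;            // write the sum so the reads are not optimized away
}

Launched with, for example, 4 blocks of 256 threads, such a kernel keeps roughly 1K x 16 = 16K reads outstanding, matching the estimate above; 2K threads with eight reads each would do the same.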
The plots in Figure 1.1 show the case in which each thread has only one outstanding memory request. This serves as a baseline example, mimicking the behavior of conventional search algorithms that at any given time have at most one outstanding memory request per search (thread), due to data dependencies. Obviously, if there are no constraints on issuing more than one read per thread, it is much more efficient to issue multiple reads per thread to maximize memory utilization. In our case, to saturate memory bandwidth we need at least 16,000 threads, for instance as 64 blocks of 256 threads each, where we observe a local peak. The maximum bandwidth of 150 GB/s is not reached here because the number of threads cannot compensate for some overhead required to manage threads and blocks. As the number of threads increases, the bandwidth takes a small hit before reaching its peak (Figure 1.1a). Although there are many options to launch 16,000 or more threads, only certain configurations can achieve memory bandwidth close to the maximum. Using fewer than 30 blocks is guaranteed to leave some of the 30 streaming multiprocessors (SMs) idle, and using more blocks than can actively fit on the SMs will leave some blocks waiting until others finish and might create some load imbalance. Having multiple threads per block is always desirable to improve efficiency, but a block cannot have more than 512 threads. Considering 4-byte reads as in our experiments, fewer than 16 threads per block cannot fully use memory coalescing as described below.
Returning to Little's Law, we notice that it assumes that the full bandwidth is utilized, meaning that all 64 bytes transferred with each memory block are useful bytes actually requested by an application, and not bytes that are transferred just because they belong to the same memory block. When any amount of data is accessed, with a minimum of one single byte, the entire 64-byte block that the data belongs to is actually transferred. To make sure that all bytes transferred are useful, it is necessary that accesses are coalesced, i.e., requests from different threads are presented to the memory management unit (MMU) in such a way that they can be packed into accesses that will use an entire 64-byte block. If, for example, the MMU can only find 10 threads that read 10 4-byte words from the same block, 40 bytes will actually be used and 24 will be discarded. It is clear that coalescing is extremely important to achieve high memory utilization, and that it is much easier when the access pattern is regular and contiguous. The experimental results in Figure 1.1b confirm that random-access memory bandwidth is significantly lower than in the coalesced case. A more comprehensive explanation of memory architecture, coalescing, and optimization techniques can be found in Nvidia's CUDA Programming Guide.
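To make the difference tangible, the following sketch, again with assumed names rather than code from the chapter, contrasts a coalescible access pattern with a strided one that wastes most of every 64-byte transaction.

// Illustrative sketch (assumed names): coalesced versus strided reads.
__global__ void copy_coalesced(const int *in, int *out, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];    // consecutive threads read consecutive 4-byte words,
                           // so a warp's requests pack into full 64-byte blocks
}

__global__ void copy_strided(const int *in, int *out, size_t n, size_t stride)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];   // threads scatter across memory blocks,
                                         // so most transferred bytes are discarded
}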
1.3 SEARCHING LARGE DATA SETS
The problem of searching is not theoretically difficult in itself, but quickly searching a large data set offers an exquisite example of how finding the right combination of algorithms and data structures for a specific architecture can dramatically improve performance.
1.3.1 Data Structures
Searching for specific values in unsorted data leaves no other choice than scanning the entire data set. The time to search an unsorted data set in memory is determined by sequential memory bandwidth, because the access pattern is nothing else but sequential reads. Performance is identical to that of coalesced reads (Figure 1.1a), which peak at about 150 GB/s on the GPU when using many thread blocks and many threads per block. On the other hand, searching sorted data or indexes can be implemented much more efficiently.
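A brute-force scan over unsorted data can be written so that its reads stay sequential and coalesced, for instance with a grid-stride loop as in the following sketch; the kernel and variable names are assumptions for illustration, and found_idx is presumed to be initialized on the host to a value larger than any valid index.

// Sketch (assumed names): brute-force search of unsorted data. The
// grid-stride loop keeps the access pattern sequential and coalesced, so
// throughput is bounded by the sequential read bandwidth discussed above.
__global__ void scan_unsorted(const int *data, int n, int key, int *found_idx)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        if (data[i] == key)
            atomicMin(found_idx, i);   // record the smallest matching position
    }
}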
Databases and many other applications use indexes, stored as sorted lists or B-trees, to accelerate searches. For example, searching a sorted list using binary search requires O(log2(n)) memory accesses as opposed to O(n) using linear search, for a data set of size n. However, the data access pattern of binary search is not amenable to caching and prefetching, as each iteration's memory access is data dependent and distant (Figure 1.2). Although each access incurs the full memory latency, this approach is orders of magnitude faster on sorted data. For example, assuming a memory latency of 500 clock cycles, searching a 512 MB data set of 32-bit integers (128M entries) in the worst case takes log2(128M) x 500 cc = 27 x 500 cc = 13,500 cc, as opposed to millions of cycles when scanning the entire data set. B-trees group pivot elements as nodes and store them linearly, which makes them more amenable to caching (Figure 1.2). However, using the same threading model as one would on the CPU, assigning one thread to one search query, inevitably results in threads diverging quickly such that memory accesses across threads cannot be coalesced.
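The excerpt does not show the chapter's index layout, but a linearized B-tree node along the lines described above could, as a purely illustrative assumption, look like this:

// Purely illustrative assumption (not the chapter's actual layout): the pivot
// keys of one B-tree node are stored contiguously, so a node can be fetched
// with a single coalesced 64-byte transaction.
#define NODE_KEYS 16                 // 16 x 4-byte keys = one 64-byte memory block

struct BTreeNode {
    int keys[NODE_KEYS];             // sorted pivot elements of this node
    int children[NODE_KEYS + 1];     // offsets of child nodes in a flat node array
};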
1.3.2 Limitations of Conventional Search Algorithms
Implementing a search function in CUDA can be done by simply adding a __global__ function qualifier to a textbook implementation of binary search. Obviously, the performance of such a naive implementation cannot even keep up with a basic CPU implementation. Although optimizations like manual caching, using vector data types, and inlining functions can significantly improve response time and throughput, the basic problem remains: algorithms like binary search were designed for serial machines and not for massively parallel architectures like the GPU. For instance, each iteration of binary search depends on the outcome of a single load and compare, and response time becomes simply a matter of memory latency (Table 1.1) multiplied by the number of iterations required to find the search key, log2(n). As the GPU has higher memory latency than the CPU, we cannot expect any improvements in response time.
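A naive kernel of this kind might look like the following sketch; it is not the chapter's code, and the kernel and parameter names are assumptions. One thread answers one query, and every iteration stalls on a single data-dependent load.

// Naive sketch (assumed names): a textbook binary search marked __global__.
// Each thread resolves one query; every iteration waits on one data-dependent,
// uncoalesced global load.
__global__ void binary_search_naive(const int *sorted, int n,
                                    const int *keys, int *result, int num_queries)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= num_queries) return;

    int key = keys[q];
    int lo = 0, hi = n;              // search the half-open range [lo, hi)
    int pos = -1;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        int pivot = sorted[mid];     // full memory latency on every iteration
        if (pivot == key) { pos = mid; break; }
        if (pivot < key)  lo = mid + 1;
        else              hi = mid;
    }
    result[q] = pos;                 // index of the key, or -1 if not found
}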
Adopting the CPU threading model and simply running multiple searches concurrently, mapping one search query to one thread, makes matters worse, for two reasons. First, given the GPU's SIMD architecture, all threads within a thread block have to "wait" until all of them complete their searches. With increasing block size, the likelihood for one of the threads to require worst-case runtime increases. Second, while all threads start with the same pivot element(s), they quickly diverge as each thread is assigned a different search key. The resulting memory access patterns are not amenable to caching or coalescing. Moreover, the large number of small memory accesses is likely to lead to contention, thus introducing additional latency. On the other hand, achieving high memory bandwidth, and thus high application throughput, requires a high level of concurrency, which translates to large workloads. Using conventional algorithms, one has to choose whether to optimize for throughput or for response time, as they appear to be conflicting goals.
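For completeness, the host-side launch for the one-query-per-thread model sketched above could look as follows; the block size of 256 is an arbitrary illustrative choice.

// Host-side launch of the naive kernel above, mapping one query to one thread.
void launch_naive_search(const int *d_sorted, int n,
                         const int *d_keys, int *d_result, int num_queries)
{
    int threads = 256;                                   // arbitrary block size
    int blocks  = (num_queries + threads - 1) / threads; // one thread per query
    binary_search_naive<<<blocks, threads>>>(d_sorted, n, d_keys, d_result, num_queries);
    cudaDeviceSynchronize();                             // wait until all searches finish
}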
(Continues...)