GPU Computing Gems Emerald Edition

by Elsevier Science

GPU Computing Gems Emerald Edition offers practical techniques in parallel computing using graphics processing units (GPUs) to enhance scientific research. The first volume in Morgan Kaufmann's Applications of GPU Computing Series, this book offers the latest insights and research in computer vision, electronic design automation, and emerging data-intensive applications. It also covers life sciences, medical imaging, ray tracing and rendering, scientific simulation, signal and audio processing, statistical modeling, video and image processing.

This book is intended to help those who are facing the challenge of programming systems to effectively use GPUs to achieve efficiency and performance goals. It offers developers a window into diverse application areas, and the opportunity to gain insights from others' algorithm work that they may apply to their own projects. Readers will learn from the leading researchers in parallel programming, who have gathered their solutions and experience in one volume under the guidance of expert area editors. Each chapter is written to be accessible to researchers from other domains, allowing knowledge to cross-pollinate across the GPU spectrum. Many examples leverage NVIDIA's CUDA parallel computing architecture, the most widely-adopted massively parallel programming solution. The insights and ideas as well as practical hands-on skills in the book can be immediately put to use.

Computer programmers, software engineers, hardware engineers, and computer science students will find this volume a helpful resource. For useful source code discussed throughout the book, the editors invite readers to the following website: …

  • Covers the breadth of industry from scientific simulation and electronic design automation to audio / video processing, medical imaging, computer vision, and more
  • Many examples leverage NVIDIA's CUDA parallel computing architecture, the most widely-adopted massively parallel programming solution
  • Offers insights and ideas as well as practical "hands-on" skills you can immediately put to use

Editorial Reviews

From the Publisher
Praise for GPU Computing Gems: Emerald Edition: "GPU computing is becoming an outstanding field in high performance computing. Due to its easiness, the CUDA approach enables programmers to take advantage of GPU-acceleration very quickly… My research in complex science as well as applications in high frequency trading benefited significantly from GPU computing." —Dr. Tobias Preis, ETH Zurich, Switzerland

"This book is an important reference for everyone working on GPU/CUDA, and contains definitive work in a selection of fields. The patterns of CUDA parallelization it describes can often be adapted to applications in other fields." —Dr. Ming Ouyang, Assistant Professor - Director Visualization and Intensive Graphics Lab, University of Louisville

"Diving into the world of GPU computing has never been more important these days. GPU Computing Gems: Emerald Edition takes you through the looking glass into this fascinating world." —Martin Eisemann, Computer Graphics Lab, TU Braunschweig

"…an outstanding collection of vignettes of how to program GPUs for a breathtaking range of applications." —Dr. Amitabh Varshney, Director, Institute for Advanced Computer Studies, University of Maryland

"The book features a useful index that might help readers mine the gems in search of a solution to a specific algorithmic problem. The index is accompanied by online resources containing source code samples-and further information-for some of the chapters. A second volume with another 30 chapters of GPGPU application reports, somewhat more focused on generic algorithms and programming techniques, is currently in the pipeline and scheduled to appear as the "Jade Edition" sometime this month." —Computing in Science and Engineering

"The book is an excellent selection of important papers describing various applications of GPUs. As such, I believe it would be a valuable addition to the bookshelf of any researcher in modeling and simulation…This is not a substitute for a more detailed text on massively parallel programming...Instead, it is a nice practical addition to that text." —Computing Reviews, August 2012

"...the perfect companion to Programming Massively Parallel Processors by Hwu & Kirk." -Nicolas Pinto, Research Scientist at Harvard & MIT, NVIDIA Fellow 2009-2010

Product Details

Publisher: Elsevier Science
Series: Applications of GPU Computing Series
Sold by: Barnes & Noble
File size: 21 MB

Read an Excerpt

GPU Computing Gems Emerald Edition

By Wen-mei W. Hwu

Morgan Kaufmann

Copyright © 2011 NVIDIA Corporation and Wen-mei W. Hwu
All rights reserved.

ISBN: 978-0-12-384989-2

Chapter One

GPU-Accelerated Computation and Interactive Display of Molecular Orbitals

John E. Stone, David J. Hardy, Jan Saam, Kirby L. Vandivort, Klaus Schulten

In this chapter, we present several graphics processing unit (GPU) algorithms for evaluating molecular orbitals on three-dimensional lattices, as is commonly done for molecular visualization. For each kernel, we describe necessary design trade-offs, applicability to various problem sizes, and performance on different generations of GPU hardware. We then demonstrate the appropriate and effective use of fast on-chip GPU memory subsystems for access to key data structures, show several GPU kernel optimization principles, and explore the application of advanced techniques such as dynamic kernel generation and just-in-time (JIT) kernel compilation.


The GPU kernels described here form the basis for the high-performance molecular orbital display algorithms in VMD, a popular molecular visualization and analysis tool. VMD (Visual Molecular Dynamics) is a software system designed for displaying, animating, and analyzing large biomolecular systems. More than 33,000 users have registered and downloaded the most recent VMD software, version 1.8.7. Due to its versatility and user-extensibility, VMD is also capable of displaying other large datasets, such as sequence data, results of quantum chemistry calculations, and volumetric data. While VMD is designed to run on a diverse range of hardware — laptops, desktops, clusters, and supercomputers — it is primarily used as a scientific workstation application for interactive 3-D visualization and analysis. For computations that run too long for interactive use, VMD can also be used in a batch mode to render movies for later use. A motivation for using GPU acceleration in VMD is to make slow batch-mode jobs fast enough for interactive use, thereby drastically improving the productivity of scientific investigations. With CUDA-enabled GPUs widely available in desktop PCs, such acceleration can have a broad impact on the VMD user community. To date, multiple aspects of VMD have been accelerated with the NVIDIA Compute Unified Device Architecture (CUDA), including electrostatic potential calculation, ion placement, molecular orbital calculation and display, and imaging of gas migration pathways in proteins.

Visualization of molecular orbitals (MOs) is a helpful step in analyzing the results of quantum chemistry calculations. The key challenge involved in the display of molecular orbitals is the rapid evaluation of these functions on a three-dimensional lattice; the resulting data can then be used for plotting isocontours or isosurfaces for visualization as shown in Fig. 1.1, and for other types of analyses. Most existing software packages that render MOs perform calculations on the CPU and have not been heavily optimized. Thus, they require runtimes of tens to hundreds of seconds depending on the complexity of the molecular system and spatial resolution of the MO discretization and subsequent surface plots.

With sufficient performance (two orders of magnitude faster than traditional CPU algorithms), a fast real-space lattice computation enables interactive display of even very large electronic structures and makes it possible to smoothly animate trajectories of orbital dynamics. Prior to the use of the GPU, this could be accomplished only through extensive batch-mode precalculation and preloading of time-varying lattice data into memory, making it impractical for everyday interactive visualization tasks. Efficient single-GPU algorithms are capable of evaluating molecular orbital lattices up to 186 times faster than a single CPU core (see Table 1.1), enabling MOs to be rapidly computed and animated on the fly for the first time. A multi-GPU version of our algorithm has been benchmarked at up to 419 times the performance of a single CPU core (see Table 1.2).


Since our target application is visualization focused, we are concerned with achieving interactive rendering performance while maintaining sufficient accuracy. The CUDA programming language enables GPU hardware features — inaccessible in existing programmable shading languages — to be exploited for higher performance, and it enables the use of multiple GPUs to accelerate computation further. Another advantage of using CUDA is that the results can be used for nonvisualization purposes.

Our approach combines several performance enhancement strategies. First, we use the host CPU to carefully organize input data and coefficients, eliminating redundancies and enforcing a sorted ordering that benefits subsequent GPU memory traversal patterns. The evaluation of molecular orbitals on a 3-D lattice is performed on one or more GPUs; the 3-D lattice is decomposed into 2-D planar slices, each of which is assigned to a GPU and computed. The workload is dynamically scheduled across the pool of GPUs to balance load on GPUs of varying capability. Depending on the specific attributes of the problem, one of three hand-coded GPU kernels is algorithmically selected to optimize performance. The three kernels are designed to use different combinations of GPU memory systems to yield peak memory bandwidth and arithmetic throughput depending on whether the input data can fit into constant memory, shared memory, or L1/L2 cache (in the case of recently released NVIDIA "Fermi" GPUs). One useful optimization involves the use of zero-copy memory access techniques based on the CUDA mapped host memory feature to eliminate latency associated with calls to cudaMemcpy(). Another optimization involves dynamically generating a problem-specific GPU kernel "on the fly" using just-in-time (JIT) compilation techniques, thereby eliminating various sources of overhead that exist in the three general precoded kernels.
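The slice decomposition and dynamic scheduling described above can be illustrated with a small host-side sketch. This is not the VMD implementation: Python threads stand in for GPUs, and the names `compute_slice` and `schedule_slices` are hypothetical. It shows only the scheduling pattern, in which each worker repeatedly pulls the next unprocessed 2-D slice from a shared queue, so a faster worker naturally ends up processing more slices.

```python
import queue
import threading


def compute_slice(z_index, nx, ny):
    """Hypothetical stand-in for a GPU kernel launch: fill one 2-D slice.

    Each lattice point simply records its slice index so the result is easy
    to check; a real kernel would evaluate the orbital at each point.
    """
    return [[float(z_index) for _ in range(nx)] for _ in range(ny)]


def schedule_slices(nz, nx, ny, num_workers):
    """Evaluate nz planar slices using workers that pull from a shared queue."""
    work = queue.Queue()
    for z in range(nz):
        work.put(z)

    lattice = [None] * nz   # one 2-D slice per z index
    owner = [None] * nz     # which worker computed each slice

    def worker(worker_id):
        while True:
            try:
                z = work.get_nowait()
            except queue.Empty:
                return      # no slices left, this worker is done
            lattice[z] = compute_slice(z, nx, ny)
            owner[z] = worker_id

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lattice, owner
```

Because no slice is assigned to a worker in advance, the split of work adapts at run time, which is the property that matters when the devices in the pool differ in capability.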


A molecular orbital (MO) represents a statistical state in which an electron can be found in a molecule, where the MO's spatial distribution is correlated with the associated electron's probability density. Visualization of MOs is an important task for understanding the chemistry of molecular systems. MOs appeal to the chemist's intuition, and inspection of the MOs aids in explaining chemical reactivities. Some popular software tools with these capabilities include MacMolPlt, Molden, Molekel, and VMD.

The calculations required for visualizing MOs are computationally demanding, and existing quantum chemistry visualization programs are fast enough to interactively compute MOs only for small molecules on a relatively coarse lattice. At the time of this writing, only VMD and MacMolPlt support multicore CPUs, and only VMD uses GPUs to accelerate MO computations. A great opportunity exists to improve upon the capabilities of existing tools in terms of interactivity, visual display quality, and scalability to larger and more complex molecular systems.

1.3.1 Mathematical Background

In this section we provide a short introduction to MOs, basis sets, and their underlying equations. Interested readers are directed to seek further details from computational chemistry texts and review articles. Quantum chemistry packages solve the electronic Schrödinger equation HΨ = EΨ for a given system. Molecular orbitals are the solutions produced by these packages. MOs are the eigenfunctions Ψ_ν for expression of the molecular wavefunction Ψ, with H the Hamiltonian operator and E the system energy. The wavefunction determines molecular properties; for instance, the one-electron density is ρ(r) = |Ψ(r)|². The visualization of the molecular orbitals resulting from quantum chemistry calculations requires evaluating the wavefunction on a 3-D lattice so that isovalue surfaces can be computed and displayed. With minor modifications, the algorithms and approaches we present for evaluating the wavefunction can be adapted to compute other molecular properties such as charge density, the molecular electrostatic potential, or multipole moments.

Each MO Ψ_ν can be expressed as a linear combination over a set of K basis functions Φ_k,

\[ \Psi_\nu = \sum_{k=1}^{K} c_{\nu k} \, \Phi_k, \tag{1.1} \]

where c_νk are coefficients contained in the quantum chemistry calculation output files, and used as input for our algorithms. The basis functions used by the vast majority of quantum chemical calculations are atom-centered functions that approximate the solution of the Schrödinger equation for a single hydrogen atom with one electron, so-called atomic orbitals. For increased computational efficiency, Gaussian type orbitals (GTOs) are used to model the basis functions, rather than the exact solutions for the hydrogen atom:

\[ \Phi^{\mathrm{GTO}}_{ijk}(\mathbf{R}, \zeta) = N_{\zeta ijk} \, x^{i} y^{j} z^{k} \, e^{-\zeta R^{2}}. \tag{1.2} \]

The exponential factor ζ is defined by the basis set; i, j, and k are used to modulate the functional shape; and N_ζijk is a normalization factor that follows from the basis set definition. The displacement from a basis function's center (nucleus) to a point in space is represented by the vector R = {x, y, z} of length R = |R|.
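As a rough illustration, a single primitive GTO of the form N · x^i · y^j · z^k · exp(-ζR²) can be evaluated as in the sketch below. The function names `gto_norm` and `gto` are hypothetical, and the normalization used is the standard one for Cartesian Gaussians, which is an assumption about the basis set convention rather than something taken from any particular quantum chemistry package.

```python
import math


def gto_norm(zeta, i, j, k):
    """Normalization of a Cartesian Gaussian (assumed standard convention)."""
    l = i + j + k
    # Part depending on the exponent zeta and the shell type l.
    radial = (2.0 * zeta / math.pi) ** 0.75 * math.sqrt((8.0 * zeta) ** l)
    # Part depending only on the angular momentum indices (i, j, k).
    angular = math.sqrt(
        math.factorial(i) * math.factorial(j) * math.factorial(k)
        / (math.factorial(2 * i) * math.factorial(2 * j) * math.factorial(2 * k))
    )
    return radial * angular


def gto(x, y, z, zeta, i, j, k):
    """Value of one primitive GTO at displacement (x, y, z) from its center."""
    r2 = x * x + y * y + z * z
    return gto_norm(zeta, i, j, k) * x**i * y**j * z**k * math.exp(-zeta * r2)
```

For example, `gto(0.0, 0.0, 0.0, 1.0, 0, 0, 0)` returns the bare normalization factor of an s-type function with ζ = 1, since the polynomial and exponential factors are both 1 at the center.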

The exponential term in Eq. 1.2 determines the radial decay of the function. Composite basis functions known as contracted GTOs (CGTOs) are composed of a linear combination of P individual GTO primitives in order to accurately describe the radial behavior of atomic orbitals.

\[ \Phi^{\mathrm{CGTO}}_{ijk}(\mathbf{R}, \{c_p\}, \{\zeta_p\}) = \sum_{p=1}^{P} c_p \, \Phi^{\mathrm{GTO}}_{ijk}(\mathbf{R}, \zeta_p). \tag{1.3} \]

The set of contraction coefficients {c_p} and associated exponents {ζ_p} defining the CGTO are contained in the quantum chemistry simulation output.
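The contraction can be sketched as a sum over primitives that share the same polynomial factor x^i · y^j · z^k and differ only in their exponents. In this minimal example the per-primitive normalization factors are assumed to have been folded into the contraction coefficients beforehand, and the names `cgto_radial` and `cgto` are hypothetical.

```python
import math


def cgto_radial(r2, coeffs, exponents):
    """Sum of c_p * exp(-zeta_p * R^2) over the P primitives of one CGTO."""
    return sum(c * math.exp(-zeta * r2) for c, zeta in zip(coeffs, exponents))


def cgto(x, y, z, coeffs, exponents, i, j, k):
    """Contracted GTO at displacement (x, y, z) from its center.

    Assumes normalization is already folded into `coeffs`.
    """
    r2 = x * x + y * y + z * z
    return x**i * y**j * z**k * cgto_radial(r2, coeffs, exponents)
```

Keeping the radial sum in its own function reflects the fact that it depends only on the squared distance and the primitive data, not on the indices i, j, and k.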

CGTOs are classified into different shells based on the sum l = i + j + k of the exponents of the x, y, and z factors. The shells are designated by letters s, p, d, f, and g for l = 0, 1, 2, 3, 4, respectively, where we explicitly list here the most common shell types but note that higher-numbered shells are occasionally used. The set of indices for a shell is also referred to as the angular momenta of that shell. We establish an alternative indexing of the angular momenta based on the shell number l and a systematic indexing m over the possible combinations with l = i + j + k, where M_l = (l + 1)(l + 2)/2 counts the number of combinations and m = 0, ..., M_l - 1 references the set {(i, j, k): i + j + k = l}.
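The indexing over angular momenta can be made concrete by enumerating every triple (i, j, k) with i + j + k = l. The sketch below uses hypothetical function names, and the ordering it produces is only one possible choice; quantum chemistry packages each define their own ordering, which must match the order of the wavefunction coefficients.

```python
def angular_momenta(l):
    """Return all (i, j, k) with i + j + k == l, in one fixed (assumed) order."""
    return [(i, j, l - i - j)
            for i in range(l, -1, -1)
            for j in range(l - i, -1, -1)]


def shell_size(l):
    """Number of angular momentum combinations for shell number l."""
    return (l + 1) * (l + 2) // 2
```

For the p shell (l = 1) this yields the three combinations (1, 0, 0), (0, 1, 0), and (0, 0, 1), and for the d shell (l = 2) it yields six.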

The linear combination defining the MO Ψ_ν must also sum contributions from each of the N atoms of the molecule and the L_n shells of each atom n. The entire expression, now described in terms of the data output from a QM package, for an MO wavefunction evaluated at a point r in space then becomes

\[ \Psi_\nu(\mathbf{r}) = \sum_{n=1}^{N} \sum_{l=0}^{L_n - 1} \sum_{m=0}^{M_l - 1} c_{\nu n l m} \, \Phi^{\mathrm{CGTO}}_{n l m}(\mathbf{R}_n, \{c\}, \{\zeta\}), \tag{1.4} \]

where we have replaced c_νk by c_νnlm, with the vectors R_n = r - r_n connecting the position r_n of the nucleus of atom n to the desired spatial coordinate r. We have dropped the subscript p from the set of contraction coefficients {c} and exponents {ζ} with the understanding that each CGTO requires an additional summation over the primitives, as expressed in Eq. 1.3.
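A direct, unoptimized reading of this nested sum over atoms, shells, and angular momenta might look like the following CPU reference sketch. The dictionary layout (`pos`, `shells`, `l`, `coeffs`, `exponents`, `wf`) and the function names are hypothetical, normalization factors are assumed to be folded into the coefficients, and the ordering of angular momenta is an arbitrary choice made for this example.

```python
import math


def angular_momenta(l):
    """All (i, j, k) with i + j + k == l, in an assumed fixed order."""
    return [(i, j, l - i - j)
            for i in range(l, -1, -1)
            for j in range(l - i, -1, -1)]


def eval_mo(point, atoms):
    """Evaluate the MO at one point by summing over atoms, shells, and m."""
    psi = 0.0
    for atom in atoms:
        # Displacement R_n = r - r_n from the nucleus to the lattice point.
        x = point[0] - atom["pos"][0]
        y = point[1] - atom["pos"][1]
        z = point[2] - atom["pos"][2]
        r2 = x * x + y * y + z * z
        for shell in atom["shells"]:
            for m, (i, j, k) in enumerate(angular_momenta(shell["l"])):
                # Contracted GTO: sum over primitives times x^i y^j z^k.
                radial = sum(c * math.exp(-zeta * r2)
                             for c, zeta in zip(shell["coeffs"], shell["exponents"]))
                psi += shell["wf"][m] * x**i * y**j * z**k * radial
    return psi


def eval_lattice(origin, spacing, dims, atoms):
    """Evaluate the MO on a regular lattice, indexed as [iz][iy][ix]."""
    nx, ny, nz = dims
    ox, oy, oz = origin
    return [[[eval_mo((ox + ix * spacing, oy + iy * spacing, oz + iz * spacing), atoms)
              for ix in range(nx)]
             for iy in range(ny)]
            for iz in range(nz)]
```

Note that this version recomputes the sum over primitives for every angular momentum index m of a shell, even though that sum depends only on the distance and the shell's primitives.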

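The normalization factor N_ζijk in Eq. 1.2 can be factored into a first part η_ζl that depends on both the exponent ζ and shell type l = i + j + k and a second part η_ijk (= η_lm in terms of our alternative indexing) that depends only on the angular momentum,

\[ N_{\zeta ijk} = \eta_{\zeta l} \cdot \eta_{ijk} = \left( \frac{2\zeta}{\pi} \right)^{3/4} \sqrt{(8\zeta)^{l}} \cdot \sqrt{\frac{i! \, j! \, k!}{(2i)! \, (2j)! \, (2k)!}}. \tag{1.5} \]
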
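The separation of the normalization factor in Eq. 1.5 allows us to factor the summation over the primitives from the summation over the array of wavefunction coefficients. Combining Eqs. 1.2–1.4 and rearranging terms gives

\[ \Psi_\nu(\mathbf{r}) = \sum_{n=1}^{N} \sum_{l=0}^{L_n - 1} \left( \sum_{m=0}^{M_l - 1} c'_{\nu n l m} \, \omega_{lm} \right) \left( \sum_{p=1}^{P} c'_p \, e^{-\zeta_p R_n^{2}} \right), \tag{1.6} \]

where ω_lm = x^i y^j z^k is the polynomial factor for the angular momentum index m of shell l, c'_νnlm = c_νnlm η_lm, and c'_p = c_p η_ζl.
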
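The benefit of separating the sum over primitives from the sum over wavefunction coefficients can be seen in a short sketch: for each shell, the radial sum is computed once and then multiplied by the sum over angular momenta. This is only a CPU reference under stated assumptions, not one of the chapter's GPU kernels; the dictionary layout and function names are hypothetical, and both normalization parts are assumed to be premultiplied into `coeffs` and `wf`.

```python
import math


def angular_momenta(l):
    """All (i, j, k) with i + j + k == l, in an assumed fixed order."""
    return [(i, j, l - i - j)
            for i in range(l, -1, -1)
            for j in range(l - i, -1, -1)]


def eval_mo_factored(point, atoms):
    """Evaluate the MO at one point using the factored form of the sum."""
    psi = 0.0
    for atom in atoms:
        dx = point[0] - atom["pos"][0]
        dy = point[1] - atom["pos"][1]
        dz = point[2] - atom["pos"][2]
        r2 = dx * dx + dy * dy + dz * dz
        for shell in atom["shells"]:
            # Sum over primitives: evaluated once per shell, not once per m.
            radial = sum(c * math.exp(-zeta * r2)
                         for c, zeta in zip(shell["coeffs"], shell["exponents"]))
            # Sum over angular momenta: coefficient times x^i y^j z^k.
            angular = sum(w * dx**i * dy**j * dz**k
                          for w, (i, j, k) in zip(shell["wf"],
                                                  angular_momenta(shell["l"])))
            psi += radial * angular
    return psi
```

For a shell with M_l angular momentum combinations and P primitives, this reduces the number of exponential evaluations per lattice point from M_l · P to P.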
Excerpted from GPU Computing Gems Emerald Edition by Wen-mei W. Hwu. Copyright © 2011 by NVIDIA Corporation and Wen-mei W. Hwu. Excerpted by permission of Morgan Kaufmann. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.

Meet the Author

Wen-mei W. Hwu is a Professor and holds the Sanders-AMD Endowed Chair in the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign. His research interests are in the areas of architecture, implementation, compilation, and algorithms for parallel computing. He is the chief scientist of the Parallel Computing Institute and director of the IMPACT research group (www.impact.crhc.illinois.edu). He is a co-founder and CTO of MulticoreWare. For his contributions in research and teaching, he received the ACM SigArch Maurice Wilkes Award, the ACM Grace Murray Hopper Award, the Tau Beta Pi Daniel C. Drucker Eminent Faculty Award, the ISCA Influential Paper Award, the IEEE Computer Society B. R. Rau Award and the Distinguished Alumni Award in Computer Science of the University of California, Berkeley. He is a fellow of IEEE and ACM. He directs the UIUC CUDA Center of Excellence and serves as one of the principal investigators of the NSF Blue Waters Petascale computer project. Dr. Hwu received his Ph.D. degree in Computer Science from the University of California, Berkeley.
