CUDA Application Design and Development

Paperback (Print)

As the computer industry retools to leverage massively parallel graphics processing units (GPUs), this book is designed to meet the needs of working software developers who need to understand GPU programming with CUDA and increase efficiency in their projects. CUDA Application Design and Development starts with an introduction to parallel computing concepts for readers with no previous parallel experience, and focuses on issues of immediate importance to working software developers: achieving high performance, maintaining competitiveness, analyzing CUDA benefits versus costs, and determining application lifespan.

The book then details the thought behind CUDA and teaches how to create, analyze, and debug CUDA applications. Throughout, the focus is on software engineering issues: how to use CUDA in the context of existing application code, with existing compilers, languages, software tools, and industry-standard API libraries.

Using an approach refined in a series of well-received articles at Dr. Dobb's Journal, author Rob Farber takes the reader step-by-step from fundamentals to implementation, moving from language theory to practical coding.

  • Includes multiple examples building from simple to more complex applications in four key areas: machine learning, visualization, vision recognition, and mobile computing
  • Addresses the foundational issues for CUDA development: multi-threaded programming and the different memory hierarchy
  • Includes teaching chapters designed to give a full understanding of CUDA tools, techniques and structure.
  • Presents CUDA techniques in the context of the hardware they are implemented on as well as other styles of programming that will help readers bridge into the new material

Editorial Reviews

From the Publisher

The book by Rob Farber on CUDA Application Design and Development is required reading for anyone who wants to understand and efficiently program CUDA for scientific and visual programming. It provides a hands-on exposure to the details in a readable and easy to understand form. Jack Dongarra, Innovative Computing Laboratory, EECS Department, University of Tennessee

GPUs have the potential to take computational simulations to new levels of scale and detail. Many scientists are already realising these benefits, tackling larger and more complex problems that are not feasible on conventional CPU-based systems. This book provides the tools and techniques for anyone wishing to join these pioneers, in an accessible though thorough text that a budding CUDA programmer would do well to keep close to hand. Dr. George Beckett, EPCC, University of Edinburgh

With his book, Farber takes us on a journey to the exciting world of programming multi-core processor machines with CUDA. Farber's pragmatic approach is effective in guiding the reader across challenges and their solutions. Farber's broader presentation of parallel programming with CUDA ranging from CUDA in Cloud and Cluster environments to CUDA for real problems and applications helps the reader learning about the unique opportunities this parallel programming language can offer to the scientific community. This book is definitely a must for students, teachers, and developers! Michela Taufer, Assistant Professor, Department of Computer and Information Sciences, University of Delaware

Rob Farber has written an enlightening and accessible book on the application to CUDA for real research tasks, with an eye to developing scalable and distributed GPU applications. He supplies clear and usable code examples combined with insight about _why_ one should use a particular approach. This is an excellent book filled with practical advice for experienced CUDA programmers and ground-up guidance for beginners wondering if CUDA can accelerate their time to solution. Paul A. Navrátil, Manager, Visualization Software, Texas Advanced Computing Center

The book provides a solid introduction to the CUDA programming language starting with the basics and progressively exposing the reader to advanced concepts through the well annotated implementation of real-world applications. It makes a first-rate presentation of CUDA, its use in the implementation of portable and efficient applications and the underlying architecture of GPGPU/CPU systems with particular emphasis on memory hierarchies. This is complemented by a thorough presentation both of the CUDA Tool Suite and of techniques for the parallelisation of applications. Farber's book is a valuable addition to the bookshelves of both the advanced and novice CUDA programmer. Francis Wray, Independent Consultant and Visiting Professor at the Faculty of Computing, Information Systems and Mathematics at the University of Kingston

At a brisk pace, "CUDA Application Design and Development" will take one from the basics of CUDA programming to the level where real-time video processing becomes a stroll in the park. Along the way, the reader can get a clear understanding of how the hybrid CPU-GPU computing idea can be capitalized on, and how a 500-GPU configuration can be used in large scale machine learning problems. Wasting no time on obscure issues of little relevance, the book provides an excellent account of the CUDA execution model, memory access issues, opportunities to increase parallelism in a program, and how advanced profiling can squeeze performance out of a code. Rob provides a snapshot of everything that is relevant in CUDA based GPU computing in a style honed through a long series of Dr. Dobb’s articles that have delighted scores of CUDA programmers. His followers will be delighted once again. Dan Negrut, Associate Professor, University of Wisconsin-Madison, NVIDIA CUDA Fellow


Product Details

  • ISBN-13: 9780123884268
  • Publisher: Elsevier Science
  • Publication date: 11/14/2011
  • Pages: 336
  • Sales rank: 1,427,435
  • Product dimensions: 7.46 (w) x 9.18 (h) x 0.82 (d)

Meet the Author

Rob Farber has served as a scientist at the Irish Center for High-End Computing, U.S. national labs in Los Alamos, Berkeley, and the Pacific Northwest, and external faculty at the Santa Fe Institute. His articles have appeared in Dr. Dobb's Journal and Scientific Computing, among others.

Read an Excerpt

CUDA Application Design and Development

By Rob Farber


Copyright © 2011 NVIDIA Corporation and Rob Farber
All rights reserved.

ISBN: 978-0-12-388432-9

Chapter One

First Programs and How to Think in CUDA

The purpose of this chapter is to introduce the reader to CUDA (the parallel computing architecture developed by NVIDIA) and differentiate CUDA from programming conventional single and multicore processors. Example programs and instructions will show the reader how to compile and run programs as well as how to adapt them to their own purposes. The CUDA Thrust and runtime APIs (Application Programming Interface) will be used and discussed. Three rules of GPGPU programming will be introduced as well as Amdahl's law, Big-O notation, and the distinction between data-parallel and task-parallel programming. Some basic GPU debugging tools will be introduced, but for the most part NVIDIA has made debugging CUDA code identical to debugging any other C or C++ application. Where appropriate, references to introductory materials will be provided to help novice readers. At the end of this chapter, the reader will be able to write and debug massively parallel programs that concurrently utilize both a GPGPU and the host processor(s) within a single application that can handle a million threads of execution.

At the end of the chapter, the reader will have a basic understanding of:

* How to create, build, and run CUDA applications.

* Criteria to decide which CUDA API to use.

* Amdahl's law and how it relates to GPU computing.

* Three rules of high-performance GPU computing.

* Big-O notation and the impact of data transfers.

* The difference between task-parallel and data-parallel programming.

* Some GPU-specific capabilities of the Linux, Mac, and Windows CUDA debuggers.

* The CUDA memory checker and how it can find out-of-bounds and misaligned memory errors.


Source code for all the examples in this book can be downloaded from A wiki (a website collaboratively developed by a community of users) is available to share information, make comments, and find teaching material; it can be reached at any of the following aliases on

* My name:

* The title of this book as one word: CUDAapplicationdesignanddevelopment.

* The name of my series: supercomputingforthemasses.


Programming a sequential processor requires writing a program that specifies each of the tasks needed to compute some result. See Example 1.1, "seqSerial.cpp, a sequential C++ program":

Example 1.1

    //seqSerial.cpp
    #include <iostream>
    #include <vector>
    using namespace std;

    int main()
    {
      const int N=50000;

      // task 1: create the array
      vector<int> a(N);

      // task 2: fill the array
      for(int i=0; i < N; i++) a[i]=i;

      // task 3: calculate the sum of the array
      int sumA=0;
      for(int i=0; i < N; i++) sumA += a[i];

      // task 4: calculate the sum of 0 .. N-1
      int sumCheck=0;
      for(int i=0; i < N; i++) sumCheck += i;

      // task 5: check the results agree
      if(sumA == sumCheck) cout << "Test Succeeded!" << endl;
      else {cerr << "Test FAILED!" << endl; return(1);}

      return(0);
    }

Example 1.1 performs five tasks:

1. It creates an integer array.

2. A for loop fills the array a with integers from 0 to N-1.

3. The sum of the integers in the array is computed.

4. A separate for loop computes the sum of the integers by an alternate method.

5. A comparison checks that the two sums are the same and reports the success of the test.

Notice that the processor runs each task consecutively one after the other. Inside of tasks 2–4, the processor iterates through the loop starting with the first index. Once all the tasks have finished, the program exits. This is an example of a single thread of execution, which is illustrated in Figure 1.1 for task 2 as a single thread fills the first three elements of array a.
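
Task 4 computes its check value with a loop, but the same value also follows from the closed form N(N-1)/2. The following is a minimal sketch, not part of the book's example, that evaluates the closed form in 64-bit arithmetic and tests whether the result fits in a 32-bit signed integer, the usual size of the int used for sumA and sumCheck. The helper names sumBelow and fitsInInt are hypothetical.

```cpp
#include <cstdint>
#include <limits>

// Closed-form sum of the integers 0 .. n-1, computed in 64-bit arithmetic.
// (Hypothetical helper; Example 1.1 uses a loop instead.)
std::int64_t sumBelow(std::int64_t n) {
    return n * (n - 1) / 2;
}

// True when the sum of 0 .. n-1 can be stored in a 32-bit signed integer.
bool fitsInInt(std::int64_t n) {
    return sumBelow(n) <= std::numeric_limits<std::int32_t>::max();
}
```

For N=50000 the sum is 1,249,975,000, which fits. Assuming a 32-bit int, the sums in Example 1.1 would first overflow at N=65537.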

This program can be compiled and executed with the following commands:

* Linux and Cygwin users (Example 1.2, "Compiling with g++"):

Example 1.2

    g++ seqSerial.cpp -o seqSerial
    ./seqSerial

* Utilizing the command-line interface for Microsoft Visual Studio users (Example 1.3, "Compiling with the Visual Studio Command-Line Interface"):

Example 1.3

    cl.exe seqSerial.cpp -o seqSerial.exe
    seqSerial.exe

* Of course, all CUDA users (Linux, Windows, MacOS, Cygwin) can utilize the NVIDIA nvcc compiler regardless of platform (Example 1.4, "Compiling with nvcc"):

Example 1.4

    nvcc seqSerial.cpp -o seqSerial
    ./seqSerial

In all cases, the program will print "Test Succeeded!"

For comparison, let's create and run our first CUDA program, in C++. (Note: CUDA supports both C and C++ programs. For simplicity, the following example was written in C++ using the Thrust data-parallel API as will be discussed in greater depth in this chapter.) CUDA programs utilize the file extension suffix ".cu" to indicate CUDA source code. See Example 1.5, "A Massively Parallel CUDA Code Using the Thrust API":

Example 1.5

    //
    #include <iostream>
    using namespace std;

    #include <thrust/reduce.h>
    #include <thrust/sequence.h>
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>

    int main()
    {
      const int N=50000;

      // task 1: create the array
      thrust::device_vector<int> a(N);

      // task 2: fill the array
      thrust::sequence(a.begin(), a.end(), 0);

      // task 3: calculate the sum of the array
      int sumA= thrust::reduce(a.begin(),a.end(), 0);

      // task 4: calculate the sum of 0 .. N-1
      int sumCheck=0;
      for(int i=0; i < N; i++) sumCheck += i;

      // task 5: check the results agree
      if(sumA == sumCheck) cout << "Test Succeeded!" << endl;
      else { cerr << "Test FAILED!" << endl; return(1);}

      return(0);
    }

Example 1.5 is compiled with the NVIDIA nvcc compiler under Windows, Linux, and MacOS. If nvcc is not available on your system, download and install the free CUDA tools, driver, and SDK (Software Development Kit) from the NVIDIA CUDA Zone. See Example 1.6, "Compiling and Running the Example":

Example 1.6

    nvcc -o seqCuda
    ./seqCuda

Again, running the program will print "Test Succeeded!"

Congratulations: you just created a CUDA application that uses 50,000 software threads of execution and ran it on a GPU! (The actual number of threads that run concurrently on the hardware depends on the capabilities of the GPGPU in your system.)

Aside from a few calls to the CUDA Thrust API (prefaced by thrust:: in this example), the CUDA code looks almost identical to the sequential C++ code. The thrust:: calls in the example perform parallel operations.

Unlike the single-threaded execution illustrated in Figure 1.1, the code in Example 1.5 utilizes many threads to perform a large number of concurrent operations as is illustrated in Figure 1.2 for task 2 when filling array a.
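
To make the similarity to sequential C++ concrete, the following host-only sketch writes tasks 1 through 3 with the C++ standard library instead of Thrust. It is an illustration rather than code from the book: std::iota stands in for thrust::sequence, std::accumulate stands in for thrust::reduce, and the function name sumSequenceOnHost is hypothetical. Unlike the Thrust calls, these standard-library calls run on the host processor in a single thread.

```cpp
#include <numeric>
#include <vector>

// Host-only counterpart of tasks 1-3 from Example 1.5.
// (sumSequenceOnHost is a hypothetical name, not from the book.)
int sumSequenceOnHost(int n) {
    std::vector<int> a(n);                          // task 1: create the array
    std::iota(a.begin(), a.end(), 0);               // task 2: fill with 0 .. n-1
    return std::accumulate(a.begin(), a.end(), 0);  // task 3: sum the array
}
```

Each standard-library call lines up with one Thrust call taking the same iterator-range arguments, which is why so little of the source changes when the work moves to the GPU.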


CUDA offers several APIs to use when programming. They are from highest to lowest level:

1. The data-parallel C++ Thrust API

2. The runtime API, which can be used in either C or C++

3. The driver API, which can be used with either C or C++

Regardless of the API or mix of APIs used in an application, CUDA can be called from other high-level languages such as Python, Java, FORTRAN, and many others. The calling conventions and details necessary to correctly link vary with each language.

Which API to use depends on the amount of control the developer wishes to exert over the GPU. Higher-level APIs like the C++ Thrust API are convenient, as they do more for the programmer, but they also make some decisions on behalf of the programmer. In general, Thrust has been shown to deliver high computational performance, generality, and convenience. It also makes code development quicker and can produce easier to read source code that many will argue is more maintainable. Without modification, programs written in Thrust will most certainly maintain or show improved performance as Thrust matures in future releases. Many Thrust methods like reduction perform significant work, which gives the Thrust API developers much freedom to incorporate features in the latest hardware that can improve performance. Thrust is an example of a well-designed API that is simple yet general and that has the ability to be adapted to improve performance as the technology evolves.

A disadvantage of a high-level API like Thrust is that it can isolate the developer from the hardware and expose only a subset of the hardware capabilities. In some circumstances, the C++ interface can become too cumbersome or verbose. Scientific programmers in particular may feel that the clarity of simple loop structures can get lost in the C++ syntax.

Use a high-level interface first, and drop down to a lower-level API when you think the additional programming effort will deliver greater performance or when you need some lower-level capability to better support your application. The CUDA runtime in particular was designed to give the developer access to all the programmable features of the GPGPU with a few simple yet elegant and powerful syntactic additions to the C language. As a result, CUDA runtime code can sometimes be the cleanest and easiest API to read; plus, it can be extremely efficient. An important aspect of the lowest-level driver interface is that it can provide very precise control over both queuing and data transfers.

Expect code size to increase when using the lower-level interfaces, as the developer must make more API calls and/or specify more parameters for each call. In addition, the developer needs to check for runtime errors and version incompatibilities. In many cases when using low-level APIs, it is not unusual for more lines of the application code to be focused on the details of the API interface than on the actual work of the task.

Happily, modern CUDA developers are not restricted to using just a single API in an application, which was not the case prior to the CUDA 3.2 release in 2010. Modern versions of CUDA allow developers to use any of the three APIs in their applications whenever they choose. Thus, an initial code can be written in a high-level API such as Thrust and then refactored to use some special characteristic of the runtime or driver API.

Let's use this ability to mix various levels of API calls to highlight and make more explicit the parallel nature of the sequential fill task (task 2) from our previous examples. Example 1.7, "Using the CUDA Runtime to Fill an Array with Sequential Integers," also gives us a chance to introduce the CUDA runtime API:

Example 1.7

    //
    #include <iostream>
    using namespace std;

    #include <thrust/reduce.h>
    #include <thrust/sequence.h>
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>

    __global__ void fillKernel(int *a, int n)
    {
      int tid = blockIdx.x*blockDim.x + threadIdx.x;
      if (tid < n) a[tid] = tid;
    }

    void fill(int* d_a, int n)
    {
      int nThreadsPerBlock= 512;
      int nBlocks= n/nThreadsPerBlock + ((n%nThreadsPerBlock)?1:0);

      fillKernel <<< nBlocks, nThreadsPerBlock >>> (d_a, n);
    }

    int main()
    {
      const int N=50000;

      // task 1: create the array
      thrust::device_vector<int> a(N);

      // task 2: fill the array using the runtime
      fill(thrust::raw_pointer_cast(&a[0]),N);

      // task 3: calculate the sum of the array
      int sumA= thrust::reduce(a.begin(),a.end(), 0);

      // task 4: calculate the sum of 0 .. N-1
      int sumCheck=0;
      for(int i=0; i < N; i++) sumCheck += i;

      // task 5: check the results agree
      if(sumA == sumCheck) cout << "Test Succeeded!" << endl;
      else { cerr << "Test FAILED!" << endl; return(1);}

      return(0);
    }
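
The block-count arithmetic and the bounds check in Example 1.7 can be traced on the host without a GPU. The sketch below is an illustration under stated assumptions, not the book's code: ordinary nested loops stand in for the blocks and threads that the device would run concurrently, and the names blocksFor and emulateFill are hypothetical.

```cpp
#include <vector>

// Number of blocks needed to cover n elements, using the same
// rounding-up expression as fill() in Example 1.7.
int blocksFor(int n, int nThreadsPerBlock) {
    return n / nThreadsPerBlock + ((n % nThreadsPerBlock) ? 1 : 0);
}

// Host-side emulation of fillKernel (hypothetical helper). The loops
// visit every (block, thread) pair one after another; on a GPU these
// pairs would execute concurrently.
// Returns how many emulated threads failed the bounds check.
int emulateFill(std::vector<int>& a, int nThreadsPerBlock) {
    const int n = static_cast<int>(a.size());
    const int nBlocks = blocksFor(n, nThreadsPerBlock);
    int idle = 0;
    for (int blockIdx = 0; blockIdx < nBlocks; ++blockIdx) {
        for (int threadIdx = 0; threadIdx < nThreadsPerBlock; ++threadIdx) {
            // Same global index computation as the kernel.
            int tid = blockIdx * nThreadsPerBlock + threadIdx;
            if (tid < n) a[tid] = tid;  // same guard as the kernel
            else ++idle;                // launched, but past the end of the array
        }
    }
    return idle;
}
```

With N=50000 and 512 threads per block, the expression yields 98 blocks, or 50,176 threads in total. The last 176 threads fall past the end of the array, which is why the kernel needs the tid < n test.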


Excerpted from CUDA Application Design and Development by Rob Farber Copyright © 2011 by NVIDIA Corporation and Rob Farber. Excerpted by permission of MORGAN KAUFMANN. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.


Table of Contents

1. First Programs and How to Think in CUDA

2. CUDA for Machine Learning and Optimization

3. The CUDA Tool Suite: Profiling a PCA/NLPCA Functor

4. The CUDA Execution Model

5. CUDA Memory

6. Efficiently Using GPU Memory

7. Techniques to Increase Parallelism

8. CUDA for All GPU and CPU Applications

9. Mixing CUDA and Rendering

10. CUDA in a Cloud and Cluster Environments

11. CUDA for Real Problems: Monte Carlo, Modeling, and More

12. Application Focus on Live Streaming Video
