CUDA Application Design and Development
By Rob Farber
MORGAN KAUFMANN
Copyright © 2011 NVIDIA Corporation and Rob Farber
All rights reserved.
ISBN: 978-0-12-388432-9
Chapter One
First Programs and How to Think in CUDA
The purpose of this chapter is to introduce the reader to CUDA (the parallel computing architecture developed by NVIDIA) and to differentiate CUDA programming from programming conventional single-core and multicore processors. Example programs and instructions will show the reader how to compile and run them as well as how to adapt them to their own purposes. The CUDA Thrust and runtime APIs (Application Programming Interfaces) will be used and discussed. Three rules of GPGPU programming will be introduced, as well as Amdahl's law, Big-O notation, and the distinction between data-parallel and task-parallel programming. Some basic GPU debugging tools will be introduced, but for the most part NVIDIA has made debugging CUDA code identical to debugging any other C or C++ application. Where appropriate, references to introductory materials will be provided to help novice readers. At the end of this chapter, the reader will be able to write and debug massively parallel programs that concurrently utilize both a GPGPU and the host processor(s) within a single application and that can handle a million threads of execution.
At the end of the chapter, the reader will have a basic understanding of:
* How to create, build, and run CUDA applications.
* Criteria to decide which CUDA API to use.
* Amdahl's law and how it relates to GPU computing.
* Three rules of high-performance GPU computing.
* Big-O notation and the impact of data transfers.
* The difference between task-parallel and data-parallel programming.
* Some GPU-specific capabilities of the Linux, Mac, and Windows CUDA debuggers.
* The CUDA memory checker and how it can find out-of-bounds and misaligned memory errors.
SOURCE CODE AND WIKI
Source code for all the examples in this book can be downloaded from http://booksite.mkp.com/9780123884268. A wiki (a website collaboratively developed by a community of users) is available to share information, make comments, and find teaching material; it can be reached at any of the following aliases on gpucomputing.net:
* My name: http://gpucomputing.net/RobFarber.
* The title of this book as one word: http://gpucomputing.net/CUDAapplicationdesignanddevelopment.
* The name of my series: http://gpucomputing.net/supercomputingforthemasses.
DISTINGUISHING CUDA FROM CONVENTIONAL PROGRAMMING WITH A SIMPLE EXAMPLE
Programming a sequential processor requires writing a program that specifies each of the tasks needed to compute some result. See Example 1.1, "seqSerial.cpp, a sequential C++ program":
Example 1.1
//seqSerial.cpp
#include <iostream>
#include <vector>
using namespace std;

int main()
{
  const int N=50000;

  // task 1: create the array
  vector<int> a(N);

  // task 2: fill the array
  for(int i=0; i < N; i++) a[i]=i;

  // task 3: calculate the sum of the array
  int sumA=0;
  for(int i=0; i < N; i++) sumA += a[i];

  // task 4: calculate the sum of 0 .. N-1
  int sumCheck=0;
  for(int i=0; i < N; i++) sumCheck += i;

  // task 5: check the results agree
  if(sumA == sumCheck) cout << "Test Succeeded!" << endl;
  else {cerr << "Test FAILED!" << endl; return(1);}

  return(0);
}
Example 1.1 performs five tasks:
1. It creates an integer array.
2. A for loop fills the array a with integers from 0 to N-1.
3. The sum of the integers in the array is computed.
4. A separate for loop computes the sum of the integers by an alternate method.
5. A comparison checks that the two independently computed sums agree and reports the success of the test.
Notice that the processor runs the tasks one after the other. Inside tasks 2–4, the processor iterates through the loop starting with the first index. Once all the tasks have finished, the program exits. This is an example of a single thread of execution, which is illustrated in Figure 1.1 for task 2 as a single thread fills the first three elements of array a.
This program can be compiled and executed with the following commands:
* Linux and Cygwin users (Example 1.2, "Compiling with g++"):
Example 1.2
g++ seqSerial.cpp -o seqSerial
./seqSerial
* Microsoft Visual Studio users utilizing the command-line interface (Example 1.3, "Compiling with the Visual Studio Command-Line Interface"):
Example 1.3
cl.exe seqSerial.cpp -o seqSerial.exe
seqSerial.exe
* Of course, all CUDA users (Linux, Windows, MacOS, Cygwin) can utilize the NVIDIA nvcc compiler regardless of platform (Example 1.4, "Compiling with nvcc"):
Example 1.4
nvcc seqSerial.cpp -o seqSerial
./seqSerial
In all cases, the program will print "Test Succeeded!"
For comparison, let's create and run our first CUDA program, seqCuda.cu, in C++. (Note: CUDA supports both C and C++ programs. For simplicity, the following example was written in C++ using the Thrust data-parallel API, which will be discussed in greater depth later in this chapter.) CUDA programs utilize the file extension ".cu" to indicate CUDA source code. See Example 1.5, "A Massively Parallel CUDA Code Using the Thrust API":
Example 1.5
//seqCuda.cu
#include <iostream>
using namespace std;

#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

int main()
{
  const int N=50000;

  // task 1: create the array
  thrust::device_vector<int> a(N);

  // task 2: fill the array
  thrust::sequence(a.begin(), a.end(), 0);

  // task 3: calculate the sum of the array
  int sumA= thrust::reduce(a.begin(),a.end(), 0);

  // task 4: calculate the sum of 0 .. N-1
  int sumCheck=0;
  for(int i=0; i < N; i++) sumCheck += i;

  // task 5: check the results agree
  if(sumA == sumCheck) cout << "Test Succeeded!" << endl;
  else { cerr << "Test FAILED!" << endl; return(1);}

  return(0);
}
Example 1.5 is compiled with the NVIDIA nvcc compiler under Windows, Linux, and MacOS. If nvcc is not available on your system, download and install the free CUDA tools, driver, and SDK (Software Development Kit) from the NVIDIA CUDA Zone (http://developer.nvidia.com). See Example 1.6, "Compiling and Running the Example":
Example 1.6
nvcc seqCuda.cu -o seqCuda
./seqCuda
Again, running the program will print "Test Succeeded!"
Congratulations: you just created a CUDA application that uses 50,000 software threads of execution and ran it on a GPU! (The actual number of threads that run concurrently on the hardware depends on the capabilities of the GPGPU in your system.)
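The runtime API makes it easy to find out what those capabilities are. A minimal sketch along the following lines (it is not one of the chapter's numbered examples) queries the first CUDA device with cudaGetDeviceProperties() and prints a few of the properties that bound how much work can run concurrently:

//deviceProps.cu: a minimal sketch, not one of the chapter's numbered examples
#include <iostream>
#include <cuda_runtime.h>
using namespace std;

int main()
{
  cudaDeviceProp prop;

  // query the first CUDA-capable device in the system
  if(cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
    cerr << "cudaGetDeviceProperties failed!" << endl; return(1);
  }
  cout << "Device: " << prop.name << endl;
  cout << "Multiprocessors: " << prop.multiProcessorCount << endl;
  cout << "Max threads per block: " << prop.maxThreadsPerBlock << endl;
  return(0);
}

It compiles and runs just like the other examples, for instance with nvcc deviceProps.cu -o deviceProps.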
Aside from a few calls to the CUDA Thrust API (prefaced by thrust:: in this example), the CUDA code looks almost identical to the sequential C++ code. The Thrust calls that create, fill, and sum the device vector (tasks 1 through 3) are the lines that perform parallel operations on the GPU.
Unlike the single-threaded execution illustrated in Figure 1.1, the code in Example 1.5 utilizes many threads to perform a large number of concurrent operations as is illustrated in Figure 1.2 for task 2 when filling array a.
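Thrust also makes it straightforward to move data between the GPU and the host when you want to examine results with ordinary C++ code. The following minimal sketch (again, not one of the chapter's numbered examples) fills a device_vector on the GPU and then copies it into a host_vector simply by assignment:

//copyBack.cu: a minimal sketch showing a device-to-host copy with Thrust
#include <iostream>
using namespace std;

#include <thrust/sequence.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

int main()
{
  const int N=50000;

  thrust::device_vector<int> a(N);          // lives in GPU memory
  thrust::sequence(a.begin(), a.end(), 0);  // filled in parallel on the GPU

  // assigning a device_vector to a host_vector transfers the data to the host
  thrust::host_vector<int> h = a;
  cout << "first element " << h[0]
       << ", last element " << h[N-1] << endl;   // prints 0 and 49999
  return(0);
}

Keep in mind that each such assignment moves data between GPU and host memory, a cost the chapter returns to when discussing Big-O notation and the impact of data transfers.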
CHOOSING A CUDA API
CUDA offers several APIs to use when programming. They are, from highest to lowest level:
1. The data-parallel C++ Thrust API
2. The runtime API, which can be used in either C or C++
3. The driver API, which can be used with either C or C++
Regardless of the API or mix of APIs used in an application, CUDA can be called from other high-level languages such as Python, Java, FORTRAN, and many others. The calling conventions and details necessary to correctly link vary with each language.
Which API to use depends on the amount of control the developer wishes to exert over the GPU. Higher-level APIs like the C++ Thrust API are convenient, as they do more for the programmer, but they also make some decisions on behalf of the programmer. In general, Thrust has been shown to deliver high computational performance, generality, and convenience. It also makes code development quicker and can produce easier-to-read source code that many will argue is more maintainable. Without modification, programs written in Thrust will maintain or show improved performance as Thrust matures in future releases. Many Thrust methods like reduction perform significant work, which gives the Thrust API developers much freedom to incorporate features in the latest hardware that can improve performance. Thrust is an example of a well-designed API that is simple yet general and that has the ability to be adapted to improve performance as the technology evolves.
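As an illustration of that generality (a minimal sketch rather than one of the chapter's numbered examples), the same thrust::reduce call that summed the array in Example 1.5 can compute a maximum simply by passing a different initial value and binary operator:

//thrustMax.cu: a minimal sketch of thrust::reduce with a different operator
#include <iostream>
using namespace std;

#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/functional.h>
#include <thrust/device_vector.h>

int main()
{
  const int N=50000;

  thrust::device_vector<int> a(N);
  thrust::sequence(a.begin(), a.end(), 0);

  // same parallel reduction as Example 1.5, but with thrust::maximum
  int maxA = thrust::reduce(a.begin(), a.end(), 0, thrust::maximum<int>());
  cout << "max is " << maxA << endl;   // prints 49999
  return(0);
}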
A disadvantage of a high-level API like Thrust is that it can isolate the developer from the hardware and expose only a subset of the hardware capabilities. In some circumstances, the C++ interface can become too cumbersome or verbose. Scientific programmers in particular may feel that the clarity of simple loop structures can get lost in the C++ syntax.
Use a high-level interface first, and drop down to a lower-level API when you think the additional programming effort will deliver greater performance or when you need some lower-level capability to better support your application. The CUDA runtime in particular was designed to give the developer access to all the programmable features of the GPGPU with a few simple yet elegant and powerful syntactic additions to the C language. As a result, CUDA runtime code can sometimes be the cleanest and easiest API to read; plus, it can be extremely efficient. An important aspect of the lowest-level driver interface is that it can provide very precise control over both queuing and data transfers.
Expect code size to increase when using the lower-level interfaces, as the developer must make more API calls and/or specify more parameters for each call. In addition, the developer needs to check for runtime errors and version incompatibilities. In many cases when using low-level APIs, it is not unusual for more lines of the application code to be focused on the details of the API interface than on the actual work of the task.
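As a small illustration of that overhead (a sketch of a common error-checking idiom, not code from this chapter's examples), every runtime API call returns a cudaError_t that the developer is responsible for checking:

// a minimal error-checking sketch: a common idiom, not one of the chapter's examples
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// wrap every runtime call so failures are reported with file and line
#define CUDA_CHECK(call)                                              \
  do {                                                                \
    cudaError_t err = (call);                                         \
    if (err != cudaSuccess) {                                         \
      fprintf(stderr, "CUDA error %s at %s:%d\n",                     \
              cudaGetErrorString(err), __FILE__, __LINE__);           \
      exit(EXIT_FAILURE);                                             \
    }                                                                 \
  } while (0)

int main()
{
  int *d_a;
  CUDA_CHECK( cudaMalloc((void**)&d_a, 50000*sizeof(int)) );  // every call is checked
  CUDA_CHECK( cudaFree(d_a) );
  return 0;
}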
Happily, modern CUDA developers are not restricted to using just a single API in an application, which was not the case prior to the CUDA 3.2 release in 2010. Modern versions of CUDA allow developers to use any of the three APIs in their applications whenever they choose. Thus, an initial version of the code can be written in a high-level API such as Thrust and then refactored to use some special characteristic of the runtime or driver API.
Let's use this ability to mix various levels of API calls to highlight and make more explicit the parallel nature of the sequential fill task (task 2) from our previous examples. Example 1.7, "Using the CUDA Runtime to Fill an Array with Sequential Integers," also gives us a chance to introduce the CUDA runtime API:
Example 1.7
//seqRuntime.cu
#include <iostream>
using namespace std;

#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

__global__ void fillKernel(int *a, int n)
{
  int tid = blockIdx.x*blockDim.x + threadIdx.x;
  if (tid < n) a[tid] = tid;
}

void fill(int* d_a, int n)
{
  int nThreadsPerBlock= 512;
  int nBlocks= n/nThreadsPerBlock + ((n%nThreadsPerBlock)?1:0);

  fillKernel <<< nBlocks, nThreadsPerBlock >>> (d_a, n);
}

int main()
{
  const int N=50000;

  // task 1: create the array
  thrust::device_vector<int> a(N);

  // task 2: fill the array using the runtime
  fill(thrust::raw_pointer_cast(&a[0]),N);

  // task 3: calculate the sum of the array
  int sumA= thrust::reduce(a.begin(),a.end(), 0);

  // task 4: calculate the sum of 0 .. N-1
  int sumCheck=0;
  for(int i=0; i < N; i++) sumCheck += i;

  // task 5: check the results agree
  if(sumA == sumCheck) cout << "Test Succeeded!" << endl;
  else { cerr << "Test FAILED!" << endl; return(1);}

  return(0);
}
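Note how the launch configuration in fill() is computed: with N = 50,000 and 512 threads per block, integer division gives 97 blocks with a nonzero remainder, so fillKernel is launched with 98 blocks, or 50,176 threads in total. The if (tid < n) test in the kernel keeps the 176 surplus threads from writing past the end of the array.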
(Continues...)
Excerpted from CUDA Application Design and Development by Rob Farber Copyright © 2011 by NVIDIA Corporation and Rob Farber. Excerpted by permission of MORGAN KAUFMANN. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.