CUDA Application Design and Development
By Rob Farber
MORGAN KAUFMANN
Copyright © 2011 NVIDIA Corporation and Rob Farber
All rights reserved.
ISBN: 978-0-12-388432-9
Chapter One
First Programs and How to Think in CUDA
The purpose of this chapter is to introduce the reader to CUDA (the parallel computing architecture developed by NVIDIA) and to differentiate CUDA programming from programming conventional single-core and multicore processors. Example programs and instructions will show the reader how to compile and run them as well as how to adapt them to their own purposes. The CUDA Thrust and runtime APIs (Application Programming Interfaces) will be used and discussed. Three rules of GPGPU programming will be introduced, as well as Amdahl's law, Big-O notation, and the distinction between data-parallel and task-parallel programming. Some basic GPU debugging tools will be introduced, but for the most part NVIDIA has made debugging CUDA code identical to debugging any other C or C++ application. Where appropriate, references to introductory materials will be provided to help novice readers. At the end of this chapter, the reader will be able to write and debug massively parallel programs that concurrently utilize both a GPGPU and the host processor(s) within a single application and that can handle a million threads of execution.
At the end of the chapter, the reader will have a basic understanding of:
* How to create, build, and run CUDA applications.
* Criteria to decide which CUDA API to use.
* Amdahl's law and how it relates to GPU computing.
* Three rules of high-performance GPU computing.
* Big-O notation and the impact of data transfers.
* The difference between task-parallel and data-parallel programming.
* Some GPU-specific capabilities of the Linux, Mac, and Windows CUDA debuggers.
* The CUDA memory checker and how it can find out-of-bounds and misaligned memory errors.
SOURCE CODE AND WIKI
Source code for all the examples in this book can be downloaded from http://booksite.mkp.com/9780123884268. A wiki (a website collaboratively developed by a community of users) is available to share information, make comments, and find teaching material; it can be reached at any of the following aliases on gpucomputing.net:
* My name: http://gpucomputing.net/RobFarber.
* The title of this book as one word: http://gpucomputing.net/CUDAapplicationdesignanddevelopment.
* The name of my series: http://gpucomputing.net/supercomputingforthemasses.
DISTINGUISHING CUDA FROM CONVENTIONAL PROGRAMMING WITH A SIMPLE EXAMPLE
Programming a sequential processor requires writing a program that specifies each of the tasks needed to compute some result. See Example 1.1, "seqSerial.cpp, a sequential C++ program":
Example 1.1
//seqSerial.cpp
#include <iostream>
#include <vector>
using namespace std;

int main()
{
  const int N=50000;

  // task 1: create the array
  vector<int> a(N);

  // task 2: fill the array
  for(int i=0; i < N; i++) a[i]=i;

  // task 3: calculate the sum of the array
  int sumA=0;
  for(int i=0; i < N; i++) sumA += a[i];

  // task 4: calculate the sum of 0 .. N-1
  int sumCheck=0;
  for(int i=0; i < N; i++) sumCheck += i;

  // task 5: check the results agree
  if(sumA == sumCheck) cout << "Test Succeeded!" << endl;
  else {cerr << "Test FAILED!" << endl; return(1);}

  return(0);
}
Example 1.1 performs five tasks:
1. It creates an integer array.
2. A for loop fills the array a with integers from 0 to N-1.
3. The sum of the integers in the array is computed.
4. A separate for loop computes the sum of the integers by an alternate method.
5. A comparison checks that the two independently computed sums agree and reports the success of the test.
Notice that the processor runs the tasks one after the other. Inside tasks 2–4, the processor iterates through the loop starting with the first index. Once all the tasks have finished, the program exits. This is an example of a single thread of execution, which is illustrated in Figure 1.1 for task 2 as a single thread fills the first three elements of array a.
This program can be compiled and executed with the following commands:
* Linux and Cygwin users (Example 1.2, "Compiling with g++"):
Example 1.2
g++ seqSerial.cpp -o seqSerial
./seqSerial
* Microsoft Visual Studio users utilizing the command-line interface (Example 1.3, "Compiling with the Visual Studio Command-Line Interface"):
Example 1.3
cl.exe seqSerial.cpp -o seqSerial.exe
seqSerial.exe
* Of course, all CUDA users (Linux, Windows, MacOS, Cygwin) can utilize the NVIDIA nvcc compiler regardless of platform (Example 1.4, "Compiling with nvcc"):
Example 1.4
nvcc seqSerial.cpp -o seqSerial
./seqSerial
In all cases, the program will print "Test Succeeded!"
For comparison, let's create and run our first CUDA program, seqCuda.cu, in C++. (Note: CUDA supports both C and C++ programs. For simplicity, the following example was written in C++ using the Thrust data-parallel API, which will be discussed in greater depth later in this chapter.) CUDA programs utilize the file extension ".cu" to indicate CUDA source code. See Example 1.5, "A Massively Parallel CUDA Code Using the Thrust API":
Example 1.5
//seqCuda.cu
#include <iostream>
using namespace std;

#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

int main()
{
  const int N=50000;

  // task 1: create the array
  thrust::device_vector<int> a(N);

  // task 2: fill the array
  thrust::sequence(a.begin(), a.end(), 0);

  // task 3: calculate the sum of the array
  int sumA= thrust::reduce(a.begin(),a.end(), 0);

  // task 4: calculate the sum of 0 .. N-1
  int sumCheck=0;
  for(int i=0; i < N; i++) sumCheck += i;

  // task 5: check the results agree
  if(sumA == sumCheck) cout << "Test Succeeded!" << endl;
  else { cerr << "Test FAILED!" << endl; return(1);}

  return(0);
}
Example 1.5 is compiled with the NVIDIA nvcc compiler under Windows, Linux, and MacOS. If nvcc is not available on your system, download and install the free CUDA tools, driver, and SDK (Software Development Kit) from the NVIDIA CUDA Zone (http://developer.nvidia.com). See Example 1.6, "Compiling and Running the Example":
Example 1.6
nvcc seqCuda.cu -o seqCuda
./seqCuda
Again, running the program will print "Test Succeeded!"
Congratulations: you just created a CUDA application that uses 50,000 software threads of execution and ran it on a GPU! (The actual number of threads that run concurrently on the hardware depends on the capabilities of the GPGPU in your system.)
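The runtime API makes it easy to find out what those capabilities are. A minimal sketch along the following lines (it is not one of the chapter's numbered examples) queries the first CUDA device with cudaGetDeviceProperties() and prints a few of the properties that bound how much work can run concurrently:

//deviceProps.cu: a minimal sketch, not one of the chapter's numbered examples
#include <iostream>
#include <cuda_runtime.h>
using namespace std;

int main()
{
  cudaDeviceProp prop;

  // query the first CUDA-capable device in the system
  if(cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
    cerr << "cudaGetDeviceProperties failed!" << endl; return(1);
  }
  cout << "Device: " << prop.name << endl;
  cout << "Multiprocessors: " << prop.multiProcessorCount << endl;
  cout << "Max threads per block: " << prop.maxThreadsPerBlock << endl;
  return(0);
}

It compiles and runs just like the other examples, for instance with nvcc deviceProps.cu -o deviceProps.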
Aside from a few calls to the CUDA Thrust API (prefaced by thrust:: in this example), the CUDA code looks almost identical to the sequential C++ code. The Thrust calls that create, fill, and sum the device vector (tasks 1 through 3) are the lines that perform parallel operations on the GPU.
Unlike the single-threaded execution illustrated in Figure 1.1, the code in Example 1.5 utilizes many threads to perform a large number of concurrent operations as is illustrated in Figure 1.2 for task 2 when filling array a.
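Thrust also makes it straightforward to move data between the GPU and the host when you want to examine results with ordinary C++ code. The following minimal sketch (again, not one of the chapter's numbered examples) fills a device_vector on the GPU and then copies it into a host_vector simply by assignment:

//copyBack.cu: a minimal sketch showing a device-to-host copy with Thrust
#include <iostream>
using namespace std;

#include <thrust/sequence.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

int main()
{
  const int N=50000;

  thrust::device_vector<int> a(N);          // lives in GPU memory
  thrust::sequence(a.begin(), a.end(), 0);  // filled in parallel on the GPU

  // assigning a device_vector to a host_vector transfers the data to the host
  thrust::host_vector<int> h = a;
  cout << "first element " << h[0]
       << ", last element " << h[N-1] << endl;   // prints 0 and 49999
  return(0);
}

Keep in mind that each such assignment moves data between GPU and host memory, a cost the chapter returns to when discussing Big-O notation and the impact of data transfers.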
CHOOSING A CUDA API
CUDA offers several APIs to use when programming. They are, from highest to lowest level:
1. The data-parallel C++ Thrust API
2. The runtime API, which can be used in either C or C++
3. The driver API, which can be used with either C or C++
Regardless of the API or mix of APIs used in an application, CUDA can be called from other high-level languages such as Python, Java, FORTRAN, and many others. The calling conventions and details necessary to correctly link vary with each language.
Which API to use depends on the amount of control the developer wishes to exert over the GPU. Higher-level APIs like the C++ Thrust API are convenient, as they do more for the programmer, but they also make some decisions on behalf of the programmer. In general, Thrust has been shown to deliver high computational performance, generality, and convenience. It also makes code development quicker and can produce easier-to-read source code that many will argue is more maintainable. Without modification, programs written in Thrust will maintain or show improved performance as Thrust matures in future releases. Many Thrust methods like reduction perform significant work, which gives the Thrust API developers much freedom to incorporate features in the latest hardware that can improve performance. Thrust is an example of a well-designed API that is simple yet general and that has the ability to be adapted to improve performance as the technology evolves.
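As an illustration of that generality (a minimal sketch rather than one of the chapter's numbered examples), the same thrust::reduce call that summed the array in Example 1.5 can compute a maximum simply by passing a different initial value and binary operator:

//thrustMax.cu: a minimal sketch of thrust::reduce with a different operator
#include <iostream>
using namespace std;

#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/functional.h>
#include <thrust/device_vector.h>

int main()
{
  const int N=50000;

  thrust::device_vector<int> a(N);
  thrust::sequence(a.begin(), a.end(), 0);

  // same parallel reduction as Example 1.5, but with thrust::maximum
  int maxA = thrust::reduce(a.begin(), a.end(), 0, thrust::maximum<int>());
  cout << "max is " << maxA << endl;   // prints 49999
  return(0);
}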
A disadvantage of a high-level API like Thrust is that it can isolate the developer from the hardware and expose only a subset of the hardware capabilities. In some circumstances, the C++ interface can become too cumbersome or verbose. Scientific programmers in particular may feel that the clarity of simple loop structures can get lost in the C++ syntax.
Use a high-level interface first, and drop down to a lower-level API when you think the additional programming effort will deliver greater performance or when you need some lower-level capability to better support your application. The CUDA runtime in particular was designed to give the developer access to all the programmable features of the GPGPU with a few simple yet elegant and powerful syntactic additions to the C language. As a result, CUDA runtime code can sometimes be the cleanest and easiest API to read; plus, it can be extremely efficient. An important aspect of the lowest-level driver interface is that it can provide very precise control over both queuing and data transfers.
Expect code size to increase when using the lower-level interfaces, as the developer must make more API calls and/or specify more parameters for each call. In addition, the developer needs to check for runtime errors and version incompatibilities. In many cases when using low-level APIs, it is not unusual for more lines of the application code to be focused on the details of the API interface than on the actual work of the task.
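As a small illustration of that overhead (a sketch of a common error-checking idiom, not code from this chapter's examples), every runtime API call returns a cudaError_t that the developer is responsible for checking:

// a minimal error-checking sketch: a common idiom, not one of the chapter's examples
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// wrap every runtime call so failures are reported with file and line
#define CUDA_CHECK(call)                                              \
  do {                                                                \
    cudaError_t err = (call);                                         \
    if (err != cudaSuccess) {                                         \
      fprintf(stderr, "CUDA error %s at %s:%d\n",                     \
              cudaGetErrorString(err), __FILE__, __LINE__);           \
      exit(EXIT_FAILURE);                                             \
    }                                                                 \
  } while (0)

int main()
{
  int *d_a;
  CUDA_CHECK( cudaMalloc((void**)&d_a, 50000*sizeof(int)) );  // every call is checked
  CUDA_CHECK( cudaFree(d_a) );
  return 0;
}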
Happily, modern CUDA developers are not restricted to using just a single API in an application, which was not the case prior to the CUDA 3.2 release in 2010. Modern versions of CUDA allow developers to use any of the three APIs in their applications whenever they choose. Thus, an initial version of the code can be written in a high-level API such as Thrust and then refactored to use some special characteristic of the runtime or driver API.
Let's use this ability to mix various levels of API calls to highlight and make more explicit the parallel nature of the sequential fill task (task 2) from our previous examples. Example 1.7, "Using the CUDA Runtime to Fill an Array with Sequential Integers," also gives us a chance to introduce the CUDA runtime API:
Example 1.7
//seqRuntime.cu
#include <iostream>
using namespace std;

#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

__global__ void fillKernel(int *a, int n)
{
  int tid = blockIdx.x*blockDim.x + threadIdx.x;
  if (tid < n) a[tid] = tid;
}

void fill(int* d_a, int n)
{
  int nThreadsPerBlock= 512;
  int nBlocks= n/nThreadsPerBlock + ((n%nThreadsPerBlock)?1:0);

  fillKernel <<< nBlocks, nThreadsPerBlock >>> (d_a, n);
}

int main()
{
  const int N=50000;

  // task 1: create the array
  thrust::device_vector<int> a(N);

  // task 2: fill the array using the runtime
  fill(thrust::raw_pointer_cast(&a[0]),N);

  // task 3: calculate the sum of the array
  int sumA= thrust::reduce(a.begin(),a.end(), 0);

  // task 4: calculate the sum of 0 .. N-1
  int sumCheck=0;
  for(int i=0; i < N; i++) sumCheck += i;

  // task 5: check the results agree
  if(sumA == sumCheck) cout << "Test Succeeded!" << endl;
  else { cerr << "Test FAILED!" << endl; return(1);}

  return(0);
}
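Note how the launch configuration in fill() is computed: with N = 50,000 and 512 threads per block, integer division gives 97 blocks with a nonzero remainder, so fillKernel is launched with 98 blocks, or 50,176 threads in total. The if (tid < n) test in the kernel keeps the 176 surplus threads from writing past the end of the array.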
(Continues...)
Excerpted from CUDA Application Design and Development by Rob Farber Copyright © 2011 by NVIDIA Corporation and Rob Farber. Excerpted by permission of MORGAN KAUFMANN. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.