Heterogeneous Computing with OpenCL

Paperback (Print)
Buy New
Buy New from BN.com
Used and New from Other Sellers
Used and New from Other Sellers
from $39.99
Usually ships in 1-2 business days
(Save 42%)
Other sellers (Paperback)
  • All (5) from $39.99   
  • New (1) from $195.00   
  • Used (4) from $39.99   


Heterogeneous Computing with OpenCL teaches OpenCL and parallel programming for complex systems that may include a variety of device architectures: multi-core CPUs, GPUs, and fully-integrated Accelerated Processing Units (APUs) such as AMD Fusion technology. Designed to work on multiple platforms and with wide industry support, OpenCL will help you more effectively program for a heterogeneous future.

Written by leaders in the parallel computing and OpenCL communities, this book will give you hands-on OpenCL experience to address a range of fundamental parallel algorithms. The authors explore memory spaces, optimization techniques, graphics interoperability, extensions, and debugging and profiling. Intended to support a parallel programming course, Heterogeneous Computing with OpenCL includes detailed examples throughout, plus additional online exercises and other supporting materials.

  • Explains principles and strategies to learn parallel programming with OpenCL, from understanding the four abstraction models to thoroughly testing and debugging complete applications.
  • Covers image processing, web plugins, particle simulations, video editing, performance optimization, and more.
  • Shows how OpenCL maps to an example target architecture and explains some of the tradeoffs associated with mapping to various architectures
  • Addresses a range of fundamental programming techniques, with multiple examples and case studies that demonstrate OpenCL extensions for a variety of hardware platforms
Read More Show Less

Editorial Reviews

From the Publisher
With parallel computing now in the mainstream, this book provides an excellent reference on the state-of-the-art techniques in accelerating applications on CPU-GPU systems.

-David A. Bader, Georgia Institute of Technology

Read More Show Less

Product Details

  • ISBN-13: 9780123877666
  • Publisher: Elsevier Science
  • Publication date: 8/31/2011
  • Pages: 296
  • Sales rank: 1,547,684
  • Product dimensions: 7.50 (w) x 9.10 (h) x 0.60 (d)

Meet the Author

Benedict R. Gaster is a software architect working on programming models for next-generation heterogeneous processors, in particular looking at high-level abstractions for parallel programming on the emerging class of processors that contain both CPUs and accelerators such as GPUs. Benedict has contributed extensively to the OpenCL's design and has represented AMD at the Khronos Group open standard consortium. Benedict has a Ph.D in computer science for his work on type systems for extensible records and variants.

Lee Howes has spent the last two years working at AMD and currently focuses on programming models for the future of heterogeneous computing. Lee's interests lie in declaratively representing mappings of iteration domains to data and in communicating complicated architectural concepts and optimizations succinctly to a developer audience, both through programming model improvements and education. Lee has a Ph.D. in computer science from Imperial College London for work in this area.

David Kaeli received a BS and PhD in Electrical Engineering from Rutgers University, and an MS in Computer Engineering from Syracuse University. He is the Associate Dean of Undergraduate Programs in the College of Engineering and a Full Processor on the ECE faculty at Northeastern University, Boston, MA where he directs the Northeastern University Computer Architecture Research Laboratory (NUCAR). Prior to joining Northeastern in 1993, Kaeli spent 12 years at IBM, the last 7 at T.J. Watson Research Center, Yorktown Heights, NY.

Dr. Kaeli has co-authored more than 200 critically reviewed publications. His research spans a range of areas including microarchitecture to back-end compilers and software engineering. He leads a number of research projects in the area of GPU Computing. He presently serves as the Chair of the IEEE Technical Committee on Computer Architecture. Dr. Kaeli is an IEEE Fellow and a member of the ACM.

Perhaad Mistry is a PhD candidate at Northeastern University. He received a BS in Electronics Engineering from University of Mumbai and an MS in Computer Engineering from Northeastern University in Boston. He is presently a member of the Northeastern University Computer Architecture Research Laboratory (NUCAR) and is advised by Dr. David Kaeli.

Perhaad works on a variety of parallel computing projects. He has designed scalable data structures for the physics simulations for GPGPU platforms and has also implemented medical reconstruction algorithms for heterogeneous devices. His present research focuses on the design of profiling tools for heterogeneous computing, He is studying the potential of using standards like OpenCL for building tools that simplify parallel programming and performance analysis across the variety of heterogeneous devices available today.

Dana Schaa received a BS in Computer Engineering from Cal Poly, San Luis Obispo, and an MS in Electrical and Computer Engineering from Northeastern University, where he is also currently a Ph. D. candidate. His research interests include parallel programming models and abstractions, particularly for GPU architectures. He has developed GPU-based implementations of several medical imaging research projects ranging from real-time visualization to image reconstruction in distributed, heterogeneous environments. Dana married his wonderful wife Jenny in 2010, and they live together in Boston with their charming cats.

Read More Show Less

Read an Excerpt

Heterogeneous Computing with OpenCL

By Benedict Gaster Lee Howes David R. Kaeli Perhaad Mistry Dana Schaa


Copyright © 2012 Advanced Micro Devices, Inc.
All right reserved.

ISBN: 978-0-12-387767-3

Chapter One

Introduction to Parallel Programming


Today's computing environments are becoming more multifaceted, exploiting the capabilities of a range of multi-core microprocessors, central processing units (CPUs), digital signal processors, reconfigurable hardware (FPGAs), and graphic processing units (GPUs). Presented with so much heterogeneity, the process of developing efficient software for such a wide array of architectures poses a number of challenges to the programming community.

Applications possess a number of workload behaviors, ranging from control intensive (e.g., searching, sorting, and parsing) to data intensive (e.g., image processing, simulation and modeling, and data mining). Applications can also be characterized as compute intensive (e.g., iterative methods, numerical methods, and financial modeling), where the overall throughput of the application is heavily dependent on the computational efficiency of the underlying hardware. Each of these workload classes typically executes most efficiently on a specific style of hardware architecture. No single architecture is best for running all classes of workloads, and most applications possess a mix of the workload characteristics. For instance, control-intensive applications tend to run faster on superscalar CPUs, where significant die real estate has been devoted to branch prediction mechanisms, whereas data-intensive applications tend to run fast on vector architectures, where the same operation is applied to multiple data items concurrently.


The Open Computing Language (OpenCL) is a heterogeneous programming framework that is managed by the nonprofit technology consortium Khronos Group. OpenCL is a framework for developing applications that execute across a range of device types made by different vendors. It supports a wide range of levels of parallelism and efficiently maps to homogeneous or heterogeneous, single- or multiple-device systems consisting of CPUs, GPUs, and other types of devices limited only by the imagination of vendors. The OpenCL definition offers both a device-side language and a host management layer for the devices in a system. The device-side language is designed to efficiently map to a wide range of memory systems. The host language aims to support efficient plumbing of complicated concurrent programs with low overhead. Together, these provide the developer with a path to efficiently move from algorithm design to implementation.

OpenCL provides parallel computing using task-based and data-based parallelism. It currently supports CPUs that include x86, ARM, and PowerPC, and it has been adopted into graphics card drivers by both AMD (called the Accelerated Parallel Processing SDK) and NVIDIA. Support for OpenCL is rapidly expanding as a wide range of platform vendors have adopted OpenCL and support or plan to support it for their hardware platforms. These vendors fall within a wide range of market segments, from the embedded vendors (ARM and Imagination Technologies) to the HPC vendors (AMD, Intel, NVIDIA, and IBM). The architectures supported include multi-core CPUs, throughput and vector processors such as GPUs, and fine-grained parallel devices such as FPGAs.

Most important, OpenCL's cross-platform, industrywide support makes it an excellent programming model for developers to learn and use, with the confidence that it will continue to be widely available for years to come with ever-increasing scope and applicability.


This book is the first of its kind to present OpenCL programming in a fashion appropriate for the classroom. The book is organized to address the need for teaching parallel programming on current system architectures using OpenCL as the target language, and it includes examples for CPUs, GPUs, and their integration in the accelerated processing unit (APU). Another major goal of this text is to provide a guide to programmers to develop well-designed programs in OpenCL targeting parallel systems. The book leads the programmer through the various abstractions and features provided by the OpenCL programming environment. The examples offer the reader a simple introduction and more complicated optimizations, and they suggest further development and goals at which to aim. It also discusses tools for improving the development process in terms of profiling and debugging such that the reader need not feel lost in the development process.

The book is accompanied by a set of instructor slides and programming examples, which support the use of this text by an OpenCL instructor. Please visit http://heterogeneouscomputingwithopencl.org/ for additional information.


Most applications are first programmed to run on a single processor. In the field of high-performance computing, classical approaches have been used to accelerate computation when provided with multiple computing resources. Standard approaches include "divide-and-conquer" and "scatter–gather" problem decomposition methods, providing the programmer with a set of strategies to effectively exploit the parallel resources available in high-performance systems. Divide-and-conquer methods iteratively break a problem into subproblems until the subproblems fit well on the computational resources provided. Scatter–gather methods send a subset of the input data set to each parallel resource and then collect the results of the computation and combine them into a result data set. As before, the partitioning takes account of the size of the subsets based on the capabilities of the parallel resources. Figure 1.1 shows how popular applications such as sorting and a vector–scalar multiply can be effectively mapped to parallel resources to accelerate processing.

The programming task becomes increasingly challenging when faced with the growing parallelism and heterogeneity present in contemporary parallel processors. Given the power and thermal limits of complementary metal-oxide semiconductor (CMOS) technology, microprocessor vendors find it difficult to scale the frequency of these devices to derive more performance and have instead decided to place multiple processors, sometimes specialized, on a single chip. In doing so, the problem of extracting parallelism from an application is left to the programmer, who must decompose the underlying algorithms in the applications and map them efficiently to a diverse variety of target hardware platforms.

In the past 5 years, parallel computing devices have been increasing in number and processing capabilities. GPUs have also appeared on the computing scene and are providing new levels of processing capability at very low cost. Driven by the demand for real-time three-dimensional graphics rendering, a highly data-parallel problem, GPUs have evolved rapidly as very powerful, fully programmable, task and data-parallel architectures. Hardware manufacturers are now combining CPUs and GPUs on a single die, ushering in a new generation of heterogeneous computing. Compute-intensive and data-intensive portions of a given application, called kernels, may be offloaded to the GPU, providing significant performance per watt and raw performance gains, while the host CPU continues to execute nonkernel tasks.

Many systems and phenomena in both the natural world and the man-made world present us with different classes of parallelism and concurrency:

• Molecular dynamics

• Weather and ocean patterns

• Multimedia systems

• Tectonic plate drift

• Cell growth

• Automobile assembly lines

• Sound and light wave propagation

Parallel computing, as defined by Almasi and Gottlieb (1989), is "a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently (i.e., in parallel)." The degree of parallelism that can be achieved is dependent on the inherent nature of the problem at hand (remember that there exists significant parallelism in the world), and the skill of the algorithm or software designer is to identify the different forms of parallelism present in the underlying problem. We begin with a discussion of two simple examples to demonstrate inherent parallel computation: vector multiplication and text searching.

Our first example carries out multiplication of the elements of two arrays A and B, each with N elements, storing the result of each multiply in a corresponding array C. Figure 1.2 shows the computation we would like to carry out. The serial Cþþ program for code would look as follows:

for (i = 0; i<N; i++) C[i] = A[i] * B[i];

This code possesses significant parallelism but very little arithmetic intensity. The computation of every element in C is independent of every other element. If we were to parallelize this code, we could choose to generate a separate execution instance to perform the computation of each element of C. This code possesses significant datalevel parallelism because the same operation is applied across all of A and B to produce C. We could also view this breakdown as a simple form of task parallelism where each task operates on a subset of the same data; however, task parallelism generalizes further to execution on pipelines of data or even more sophisticated parallel interactions. Figure 1.3 shows an example of task parallelism in a pipeline to support filtering of images in frequency space using an FFT.

Let us consider a second example. The computation we are trying to carry out is to find the number of occurrences of a string of characters in a body of text (Figure 1.4). Assume that the body of text has already been parsed into a set of N words. We could choose to divide the task of comparing the string against the N potential matches into N comparisons (i.e., tasks), where each string of characters is matched against the text string. This approach, although rather naïve in terms of search efficiency, is highly parallel. The process of the text string being compared against the set of potential words presents N parallel tasks, each carrying out the same set of operations. There is even further parallelism within a single comparison task, where the matching on a character-by-character basis presents a finer-grained degree of parallelism. This example exhibits both data-level parallelism (we are going to be performing the same operation on multiple data items) and task-level parallelism (we can compare the string to all words concurrently).

Once the number of matches is determined, we need to accumulate them to provide the total number of occurrences. Again, this summing can exploit parallelism. In this step, we introduce the concept of "reduction," where we can utilize the availability of parallel resources to combine partials sums in a very efficient manner. Figure 1.5 shows the reduction tree, which illustrates this summation process in log N steps.


Here, we discuss concurrency and parallel processing models so that when attempting to map an application developed in OpenCL to a parallel platform, we can select the right model to pursue. Although all of the following models can be supported in OpenCL, the underlying hardware may restrict which model will be practical to use.

Concurrency is concerned with two or more activities happening at the same time. We find concurrency in the real world all the time—for example, carrying a child in one arm while crossing a road or, more generally, thinking about something while doing something else with one's hands.

When talking about concurrency in terms of computer programming, we mean a single system performing multiple tasks independently. Although it is possible that concurrent tasks may be executed at the same time (i.e., in parallel), this is not a requirement. For example, consider a simple drawing application, which is either receiving input from the user via the mouse and keyboard or updating the display with the current image. Conceptually, receiving and processing input are different operations (i.e., tasks) from updating the display. These tasks can be expressed in terms of concurrency, but they do not need to be performed in parallel. In fact, in the case in which they are executing on a single core of a CPU, they cannot be performed in parallel. In this case, the application or the operating system should switch between the tasks, allowing both some time to run on the core.


Excerpted from Heterogeneous Computing with OpenCL by Benedict Gaster Lee Howes David R. Kaeli Perhaad Mistry Dana Schaa Copyright © 2012 by Advanced Micro Devices, Inc.. Excerpted by permission of MORGAN KAUFMANN PUBLISHERS. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.

Read More Show Less

Table of Contents

  1. Introduction to Parallel Programming
  2. Introduction to OpenCL
  3. OpenCL Device Architectures
  4. Basic OpenCL Examples
  5. Understanding OpenCL's Concurrency and Execution Model
  6. Dissecting a CPU/GPU OpenCL Implementation
  7. OpenCL Case Study: Convolution
  8. OpenCL Case Study: Video Processing
  9. OpenCL Case Study: Histogram
  10. OpenCL Case Study: Mixed Particle Simulation
  11. OpenCL Extensions
  12. OpenCL Profiling and Debugging
  13. WebCL
Read More Show Less

Customer Reviews

Average Rating 5
( 1 )
Rating Distribution

5 Star


4 Star


3 Star


2 Star


1 Star


Your Rating:

Your Name: Create a Pen Name or

Barnes & Noble.com Review Rules

Our reader reviews allow you to share your comments on titles you liked, or didn't, with others. By submitting an online review, you are representing to Barnes & Noble.com that all information contained in your review is original and accurate in all respects, and that the submission of such content by you and the posting of such content by Barnes & Noble.com does not and will not violate the rights of any third party. Please follow the rules below to help ensure that your review can be posted.

Reviews by Our Customers Under the Age of 13

We highly value and respect everyone's opinion concerning the titles we offer. However, we cannot allow persons under the age of 13 to have accounts at BN.com or to post customer reviews. Please see our Terms of Use for more details.

What to exclude from your review:

Please do not write about reviews, commentary, or information posted on the product page. If you see any errors in the information on the product page, please send us an email.

Reviews should not contain any of the following:

  • - HTML tags, profanity, obscenities, vulgarities, or comments that defame anyone
  • - Time-sensitive information such as tour dates, signings, lectures, etc.
  • - Single-word reviews. Other people will read your review to discover why you liked or didn't like the title. Be descriptive.
  • - Comments focusing on the author or that may ruin the ending for others
  • - Phone numbers, addresses, URLs
  • - Pricing and availability information or alternative ordering information
  • - Advertisements or commercial solicitation


  • - By submitting a review, you grant to Barnes & Noble.com and its sublicensees the royalty-free, perpetual, irrevocable right and license to use the review in accordance with the Barnes & Noble.com Terms of Use.
  • - Barnes & Noble.com reserves the right not to post any review -- particularly those that do not follow the terms and conditions of these Rules. Barnes & Noble.com also reserves the right to remove any review at any time without notice.
  • - See Terms of Use for other conditions and disclaimers.
Search for Products You'd Like to Recommend

Recommend other products that relate to your review. Just search for them below and share!

Create a Pen Name

Your Pen Name is your unique identity on BN.com. It will appear on the reviews you write and other website activities. Your Pen Name cannot be edited, changed or deleted once submitted.

Your Pen Name can be any combination of alphanumeric characters (plus - and _), and must be at least two characters long.

Continue Anonymously
Sort by: Showing 1 Customer Reviews
  • Posted October 16, 2011


    Are you a programmer or software engineer who needs help in leveraging the power and flexibility of the OpenCL programming standard. If you are, then this book is for you! Authors Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry, and Dana Schaa, have done an outstanding job of writing a book which aims to teach students how to program heterogeneous environments. Gaster, Howes, Kaeli, Mistry and Schaa, begin by introducing you to OpenCL, including key concepts, such as kernels, platforms, and devices. In addition, the authors present some of the architectures that OpenCL does or might target, including x86 CPUs, GPUs, and APUs. They then introduce the basic matrix multiplication, image rotation, and convolution implementations to help the reader learn OpenCL by example. Then, they discuss the concurrency and execution in the OpenCL programming model. The authors continue by showing you how OpenCL maps to an example architecture. In addition, the authors present a case study that accelerates a convolution algorithm. They then present another case study that targets video processing, by utilizing OpenCL to build performant image processing effects that can be applied to video streams. The authors then further present another case study examining how to optimize the performance of a histogramming application. Then, the authors discuss how to leverage a heterogeneous CPU-GPU environment. They continue by showing you how to use OpenCL extensions using the device fission and double precision extensions as examples. Next, the authors introduce you to the reader; as well as, debugging and analyzing OpenCL programs. Finally, they provide an overview of performance trade-offs with regards to WebCL. This most excellent book is an attempt by the authors to show developers and students how to leverage the OpenCL framework to build interesting and useful applications. Perhaps more importantly, the authors hope that the reader will embrace this new programming framework and explore the full benefits of heterogeneous computing that it provides!

    1 out of 1 people found this review helpful.

    Was this review helpful? Yes  No   Report this review
Sort by: Showing 1 Customer Reviews

If you find inappropriate content, please report it to Barnes & Noble
Why is this product inappropriate?
Comments (optional)