Heterogeneous Computing with OpenCL
By Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry, and Dana Schaa
MORGAN KAUFMANN PUBLISHERS
Copyright © 2012 Advanced Micro Devices, Inc.
All rights reserved.
ISBN: 978-0-12-387767-3
    Chapter One 
Introduction to Parallel Programming
  
  INTRODUCTION  
Today's computing environments are becoming more multifaceted, exploiting the capabilities of a range of multi-core microprocessors, central processing units (CPUs), digital signal processors, reconfigurable hardware (FPGAs), and graphics processing units (GPUs). Presented with so much heterogeneity, the process of developing efficient software for such a wide array of architectures poses a number of challenges to the programming community.
   Applications possess a number of workload behaviors, ranging from control intensive (e.g., searching, sorting, and parsing) to data intensive (e.g., image processing, simulation and modeling, and data mining). Applications can also be characterized as compute intensive (e.g., iterative methods, numerical methods, and financial modeling), where the overall throughput of the application is heavily dependent on the computational efficiency of the underlying hardware. Each of these workload classes typically executes most efficiently on a specific style of hardware architecture. No single architecture is best for running all classes of workloads, and most applications possess a mix of these workload characteristics. For instance, control-intensive applications tend to run faster on superscalar CPUs, where significant die real estate has been devoted to branch prediction mechanisms, whereas data-intensive applications tend to run faster on vector architectures, where the same operation is applied to multiple data items concurrently.
  
  OPENCL  
  The Open Computing Language (OpenCL) is a heterogeneous programming  framework that is managed by the nonprofit technology consortium Khronos  Group. OpenCL is a framework for developing applications that execute across  a range of device types made by different vendors. It supports a wide range of  levels of parallelism and efficiently maps to homogeneous or heterogeneous,  single- or multiple-device systems consisting of CPUs, GPUs, and other types of devices  limited only by the imagination of vendors. The OpenCL definition offers both  a device-side language and a host management layer for the devices in a system.  The device-side language is designed to efficiently map to a wide range of memory  systems. The host language aims to support efficient plumbing of complicated concurrent  programs with low overhead. Together, these provide the developer with a path to  efficiently move from algorithm design to implementation.  
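   To make this split concrete, the following is a minimal sketch, not taken from the book, of the two halves OpenCL defines. The string passed to the program-creation call is written in the device-side language, and the surrounding calls are the host management layer that selects a device, builds the kernel, and launches it. The kernel name, buffer size, and scale factor are illustrative, and error checking is omitted for brevity.

    #include <CL/cl.h>
    #include <stdio.h>

    /* Device-side language: a kernel that scales each element of an array. */
    const char *kernelSrc =
        "__kernel void scale(__global float *data, float factor) {\n"
        "    int i = get_global_id(0);\n"
        "    data[i] = data[i] * factor;\n"
        "}\n";

    int main(void)
    {
        float data[1024];
        for (int i = 0; i < 1024; i++) data[i] = (float)i;

        /* Host management layer: pick a platform and device, then create a
           context and a command queue for that device. */
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        /* Copy the input into a device buffer, compile the kernel source,
           and set its arguments. */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    sizeof(data), data, NULL);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &kernelSrc, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(prog, "scale", NULL);
        float factor = 2.0f;
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        clSetKernelArg(kernel, 1, sizeof(float), &factor);

        /* Launch one work-item per element, then read the result back. */
        size_t global = 1024;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
        printf("data[2] = %f\n", data[2]);   /* expect 4.0 */
        return 0;
    }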
   OpenCL provides parallel computing using task-based and data-based parallelism. It currently supports CPUs that include x86, ARM, and PowerPC, and it has been adopted into graphics card drivers by both AMD (as part of its Accelerated Parallel Processing SDK) and NVIDIA. Support for OpenCL is rapidly expanding as a wide range of platform vendors have adopted OpenCL and support or plan to support it for their hardware platforms. These vendors fall within a wide range of market segments, from the embedded vendors (ARM and Imagination Technologies) to the HPC vendors (AMD, Intel, NVIDIA, and IBM). The architectures supported include multi-core CPUs, throughput and vector processors such as GPUs, and fine-grained parallel devices such as FPGAs.
     Most important, OpenCL's cross-platform, industrywide support makes it an  excellent programming model for developers to learn and use, with the confidence  that it will continue to be widely available for years to come with ever-increasing  scope and applicability.  
  
  THE GOALS OF THIS BOOK  
   This book is the first of its kind to present OpenCL programming in a fashion appropriate for the classroom. The book is organized to address the need for teaching parallel programming on current system architectures using OpenCL as the target language, and it includes examples for CPUs, GPUs, and their integration in the accelerated processing unit (APU). Another major goal of this text is to provide a guide to programmers to develop well-designed programs in OpenCL targeting parallel systems. The book leads the programmer through the various abstractions and features provided by the OpenCL programming environment. The examples offer the reader a simple introduction and more complicated optimizations, and they suggest further development and goals at which to aim. The book also discusses profiling and debugging tools that improve the development process, so that the reader need not feel lost.
     The book is accompanied by a set of instructor slides and programming examples,  which support the use of this text by an OpenCL instructor. Please visit  http://heterogeneouscomputingwithopencl.org/ for additional information.  
  
  THINKING PARALLEL  
   Most applications are first programmed to run on a single processor. In the field of high-performance computing, classical approaches have been used to accelerate computation when provided with multiple computing resources. Standard approaches include "divide-and-conquer" and "scatter–gather" problem decomposition methods, providing the programmer with a set of strategies to effectively exploit the parallel resources available in high-performance systems. Divide-and-conquer methods iteratively break a problem into subproblems until the subproblems fit well on the computational resources provided. Scatter–gather methods send a subset of the input data set to each parallel resource and then collect the results of the computation and combine them into a result data set. As with divide-and-conquer, the partitioning sizes each subset according to the capabilities of the parallel resources. Figure 1.1 shows how popular applications such as sorting and a vector–scalar multiply can be effectively mapped to parallel resources to accelerate processing.
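   As a simple illustration (not from the book) of the scatter–gather pattern applied to the vector–scalar multiply of Figure 1.1, the sketch below splits the input across a handful of worker threads, lets each scale its own chunk, and then joins the workers to gather the completed result. The function names and chunking scheme are assumptions made for the example.

    #include <algorithm>
    #include <thread>
    #include <vector>

    // Each worker scales only the subset of the data it was handed.
    static void scale_chunk(const float *in, float *out, std::size_t count, float s)
    {
        for (std::size_t i = 0; i < count; ++i)
            out[i] = in[i] * s;
    }

    // Scatter the input across 'workers' threads, then gather by joining them.
    void scale_scatter_gather(const std::vector<float> &in, std::vector<float> &out,
                              float s, unsigned workers)
    {
        std::vector<std::thread> pool;
        std::size_t chunk = (in.size() + workers - 1) / workers;
        for (unsigned w = 0; w < workers; ++w) {
            std::size_t begin = w * chunk;
            if (begin >= in.size()) break;
            std::size_t count = std::min(chunk, in.size() - begin);     // scatter
            pool.emplace_back(scale_chunk, in.data() + begin,
                              out.data() + begin, count, s);
        }
        for (auto &t : pool) t.join();                                   // gather
    }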
     The programming task becomes increasingly challenging when faced with the  growing parallelism and heterogeneity present in contemporary parallel processors.  Given the power and thermal limits of complementary metal-oxide semiconductor  (CMOS) technology, microprocessor vendors find it difficult to scale the frequency  of these devices to derive more performance and have instead decided to place multiple  processors, sometimes specialized, on a single chip. In doing so, the problem of  extracting parallelism from an application is left to the programmer, who must decompose  the underlying algorithms in the applications and map them efficiently to a  diverse variety of target hardware platforms.  
   In the past 5 years, parallel computing devices have been increasing in number and processing capabilities. GPUs have also appeared on the computing scene and are providing new levels of processing capability at very low cost. Driven by the demand for real-time three-dimensional graphics rendering, a highly data-parallel problem, GPUs have evolved rapidly as very powerful, fully programmable, task- and data-parallel architectures. Hardware manufacturers are now combining CPUs and GPUs on a single die, ushering in a new generation of heterogeneous computing. Compute-intensive and data-intensive portions of a given application, called kernels, may be offloaded to the GPU, providing significant performance per watt and raw performance gains, while the host CPU continues to execute nonkernel tasks.
     Many systems and phenomena in both the natural world and the man-made world  present us with different classes of parallelism and concurrency:  
   • Molecular dynamics
   • Weather and ocean patterns
   • Multimedia systems
   • Tectonic plate drift
   • Cell growth
   • Automobile assembly lines
   • Sound and light wave propagation
  
  Parallel computing, as defined by Almasi and Gottlieb (1989), is "a form of computation  in which many calculations are carried out simultaneously, operating on the  principle that large problems can often be divided into smaller ones, which are then  solved concurrently (i.e., in parallel)." The degree of parallelism that can be achieved  is dependent on the inherent nature of the problem at hand (remember that there exists  significant parallelism in the world), and the skill of the algorithm or software  designer is to identify the different forms of parallelism present in the underlying  problem. We begin with a discussion of two simple examples to demonstrate inherent  parallel computation: vector multiplication and text searching.  
   Our first example carries out multiplication of the elements of two arrays A and B, each with N elements, storing the result of each multiply in a corresponding array C. Figure 1.2 shows the computation we would like to carry out. The serial C++ code would look as follows:
    for (int i = 0; i < N; i++)
        C[i] = A[i] * B[i];
  
   This code possesses significant parallelism but very little arithmetic intensity. The computation of every element in C is independent of every other element. If we were to parallelize this code, we could choose to generate a separate execution instance to perform the computation of each element of C. This code possesses significant data-level parallelism because the same operation is applied across all of A and B to produce C. We could also view this breakdown as a simple form of task parallelism where each task operates on a subset of the same data; however, task parallelism generalizes further to execution on pipelines of data or even more sophisticated parallel interactions. Figure 1.3 shows an example of task parallelism in a pipeline to support filtering of images in frequency space using an FFT.
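   In OpenCL, that breakdown might look like the following device kernel, a minimal sketch rather than the book's listing: one work-item (execution instance) is launched per element, and each computes a single entry of C. The kernel name and the assumption that exactly N work-items are enqueued are illustrative.

    __kernel void vec_mul(__global const float *A,
                          __global const float *B,
                          __global float *C)
    {
        int i = get_global_id(0);   /* unique index of this execution instance */
        C[i] = A[i] * B[i];         /* same operation applied to every element */
    }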
   Let us consider a second example. The computation we are trying to carry out is to find the number of occurrences of a string of characters in a body of text (Figure 1.4). Assume that the body of text has already been parsed into a set of N words. We could choose to divide the task of comparing the search string against the N potential matches into N comparisons (i.e., tasks), where in each task the search string is matched against one word. This approach, although rather naïve in terms of search efficiency, is highly parallel. The process of comparing the search string against the set of potential words presents N parallel tasks, each carrying out the same set of operations. There is even further parallelism within a single comparison task, where the matching on a character-by-character basis presents a finer-grained degree of parallelism. This example exhibits both data-level parallelism (we are going to be performing the same operation on multiple data items) and task-level parallelism (we can compare the string to all words concurrently).
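   A hedged sketch of how the per-word comparison could be expressed as an OpenCL kernel follows; it is not the book's implementation. Each work-item performs one of the N comparisons, matching the search string against one word character by character. The fixed word length and the packed layout of the words array are assumptions made for the example.

    #define WORD_LEN 16   /* assumed maximum word length, including terminator */

    __kernel void match_words(__global const char *words,    /* N words, WORD_LEN apart */
                              __global const char *pattern,  /* the search string */
                              __global int *match)           /* one flag per word */
    {
        int w = get_global_id(0);
        __global const char *word = words + w * WORD_LEN;

        int same = 1;
        for (int c = 0; c < WORD_LEN; ++c) {        /* character-by-character match */
            if (word[c] != pattern[c]) { same = 0; break; }
            if (pattern[c] == '\0') break;          /* both strings ended together */
        }
        match[w] = same;
    }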
   Once the number of matches is determined, we need to accumulate them to provide the total number of occurrences. Again, this summing can exploit parallelism. In this step, we introduce the concept of "reduction," where we can utilize the availability of parallel resources to combine partial sums in a very efficient manner. Figure 1.5 shows the reduction tree, which illustrates this summation process in log N steps.
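   A minimal sketch of one stage of such a reduction in OpenCL is shown below, under the assumptions that the work-group size is a power of two and that the host supplies a local scratch buffer; it is illustrative rather than the book's code. At every step, half of the active work-items each add in a partial sum from the other half, so a group of size M produces its total in log2(M) steps, and the per-group totals can then be summed by a second pass or on the host.

    __kernel void reduce_sum(__global const int *values,     /* e.g., per-word match flags */
                             __global int *group_sums,       /* one total per work-group */
                             __local int *scratch)           /* work-group scratch space */
    {
        int lid = get_local_id(0);

        scratch[lid] = values[get_global_id(0)];     /* each work-item loads one value */
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Tree reduction: halve the number of active work-items each step. */
        for (int stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if (lid == 0)
            group_sums[get_group_id(0)] = scratch[0];  /* one partial total per group */
    }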
  
  CONCURRENCY AND PARALLEL PROGRAMMING MODELS  
  Here, we discuss concurrency and parallel processing models so that when attempting  to map an application developed in OpenCL to a parallel platform, we can select  the right model to pursue. Although all of the following models can be supported in  OpenCL, the underlying hardware may restrict which model will be practical to use.  
     Concurrency is concerned with two or more activities happening at the same  time. We find concurrency in the real world all the time—for example, carrying a  child in one arm while crossing a road or, more generally, thinking about something  while doing something else with one's hands.  
     When talking about concurrency in terms of computer programming, we mean a  single system performing multiple tasks independently. Although it is possible that  concurrent tasks may be executed at the same time (i.e., in parallel), this is not a requirement.  For example, consider a simple drawing application, which is either receiving  input from the user via the mouse and keyboard or updating the display with  the current image. Conceptually, receiving and processing input are different operations  (i.e., tasks) from updating the display. These tasks can be expressed in terms of  concurrency, but they do not need to be performed in parallel. In fact, in the case in  which they are executing on a single core of a CPU, they cannot be performed in  parallel. In this case, the application or the operating system should switch between  the tasks, allowing both some time to run on the core.  
  (Continues...)  
     
 