Uh-oh, it looks like your Internet Explorer is out of date.

For a better shopping experience, please upgrade now.

An Introduction to Parallel Programming

An Introduction to Parallel Programming

by Peter Pacheco

See All Formats & Editions

An Introduction to Parallel Programming is the first undergraduate text to directly address compiling and running parallel programs on the new multi-core and cluster architecture. It explains how to design, debug, and evaluate the performance of distributed and shared-memory programs. The author Peter Pacheco uses a tutorial approach to show students how to


An Introduction to Parallel Programming is the first undergraduate text to directly address compiling and running parallel programs on the new multi-core and cluster architecture. It explains how to design, debug, and evaluate the performance of distributed and shared-memory programs. The author Peter Pacheco uses a tutorial approach to show students how to develop effective parallel programs with MPI, Pthreads, and OpenMP, starting with small programming examples and building progressively to more challenging ones. The text is written for students in undergraduate parallel programming or parallel computing courses designed for the computer science major or as a service course to other departments; professionals with no background in parallel computing.

  • Takes a tutorial approach, starting with small programming examples and building progressively to more challenging examples
  • Focuses on designing, debugging and evaluating the performance of distributed and shared-memory programs
  • Explains how to develop parallel programs using MPI, Pthreads, and OpenMP programming models

Editorial Reviews

From the Publisher
"Pacheco succeeds in introducing the reader to the key issues and considerations in parallel programming. The simplicity of the examples allows the reader to focus on parallel programming aspects rather than application logic. Including both MPI and Pthreads/OpenMP is a good way to illustrate the differences between message passing and shared-memory programming models. The discussions about analyzing the scalability and efficiency of the resulting parallel programs present a key aspect of developing real parallel programs. Finally, working through the same examples using all three facilities helps make this even more concrete." —W. Hu, ComputingReviews.com

"This is a well-written book, appropriately targeted at junior undergraduates. Being easily digestible, it makes the difficult task of parallel programming come across a lot less daunting than I have seen in other texts. Admittedly, it is light on theory; however, the most memorable lessons in parallel programming are those learned from mistakes made. With over 100 programming exercises, learning opportunities abound." —Bernard Kuc, ACM’s Computing Reviews.com

"With the coming of multicore processors and the cloud, parallel computing is most certainly not a niche area off in a corner of the computing world. Parallelism has become central to the efficient use of resources, and this new textbook by Peter Pacheco will go a long way toward introducing students early in their academic careers to both the art and practice of parallel computing." —Duncan Buell, Department of Computer Science and Engineering, University of South Carolina

"An Introduction to Parallel Programming illustrates fundamental programming principles in the increasingly important area of shared memory programming using Pthreads and OpenMP and distributed memory programming using MPI. More importantly, it emphasizes good programming practices by indicating potential performance pitfalls. These topics are presented in the context of a variety of disciplines including computer science, physics and mathematics. The chapters include numerous programming exercises that range from easy to very challenging. This is an ideal book for students or professionals looking to learn parallel programming skills or to refresh their knowledge." —Leigh Little, Department of Computational Science, The College at Brockport, The State University of New York

"An Introduction to Parallel Programming is a well written, comprehensive book on the field of parallel computing. Students and practitioners alike will appreciate the relevant, up-to-date information. Peter Pacheco’s very accessible writing style combined with numerous interesting examples keeps the reader’s attention. In a field that races forward at a dizzying pace, this book hangs on for the wild ride covering the ins and outs of parallel hardware and software." —Kathy J. Liszka, Department of Computer Science, University of Akron

"Parallel computing is the future and this book really helps introduce this complicated subject with practical and useful examples." —Andrew N. Sloss FBCS, Consultant Engineer, ARM, Author of ARM System Developer’s Guide

Product Details

Elsevier Science
Publication date:
Sold by:
Barnes & Noble
File size:
5 MB

Read an Excerpt

An Introduction to Parallel Programming

By Peter S. Pacheco


Copyright © 2011 Elsevier Inc.
All right reserved.

ISBN: 978-0-08-092144-0

Chapter One

Why Parallel Computing?

From 1986 to 2002 the performance of microprocessors increased, on average, 50% per year [27]. This unprecedented increase meant that users and software developers could often simply wait for the next generation of microprocessors in order to obtain increased performance from an application program. Since 2002, however, single-processor performance improvement has slowed to about 20% per year. This difference is dramatic: at 50% per year, performance will increase by almost a factor of 60 in 10 years, while at 20%, it will only increase by about a factor of 6.

Furthermore, this difference in performance increase has been associated with a dramatic change in processor design. By 2005, most of the major manufacturers of microprocessors had decided that the road to rapidly increasing performance lay in the direction of parallelism. Rather than trying to continue to develop ever-faster monolithic processors, manufacturers started putting multiple complete processors on a single integrated circuit.

This change has a very important consequence for software developers: simply adding more processors will not magically improve the performance of the vast majority of serial programs, that is, programs that were written to run on a single processor. Such programs are unaware of the existence of multiple processors, and the performance of such a program on a system with multiple processors will be effectively the same as its performance on a single processor of the multiprocessor system.

All of this raises a number of questions:

1. Why do we care? Aren't single processor systems fast enough? After all, 20% per year is still a pretty significant performance improvement.

2. Why can't microprocessor manufacturers continue to develop much faster single processor systems? Why build parallel systems? Why build systems with multiple processors?

3. Why can't we write programs that will automatically convert serial programs into parallel programs, that is, programs that take advantage of the presence of multiple processors?

Let's take a brief look at each of these questions. Keep in mind, though, that some of the answers aren't carved in stone. For example, 20% per year may be more than adequate for many applications.


The vast increases in computational power that we've been enjoying for decades now have been at the heart of many of the most dramatic advances in fields as diverse as science, the Internet, and entertainment. For example, decoding the human genome, ever more accurate medical imaging, astonishingly fast and accurate Web searches, and ever more realistic computer games would all have been impossible without these increases. Indeed, more recent increases in computational power would have been difficult, if not impossible, without earlier increases. But we can never rest on our laurels. As our computational power increases, the number of problems that we can seriously consider solving also increases. The following are a few examples:

Climate modeling. In order to better understand climate change, we need far more accurate computer models, models that include interactions between the atmosphere, the oceans, solid land, and the ice caps at the poles. We also need to be able to make detailed studies of how various interventions might affect the global climate.

Protein folding. It's believed that misfolded proteins may be involved in diseases such as Huntington's, Parkinson's, and Alzheimer's, but our ability to study configurations of complex molecules such as proteins is severely limited by our current computational power.

Drug discovery. There are many ways in which increased computational power can be used in research into new medical treatments. For example, there are many drugs that are effective in treating a relatively small fraction of those suffering from some disease. It's possible that we can devise alternative treatments by careful analysis of the genomes of the individuals for whom the known treatment is ineffective. This, however, will involve extensive computational analysis of genomes.

• Energy research. Increased computational power will make it possible to program much more detailed models of technologies such as wind turbines, solar cells, and batteries. These programs may provide the information needed to construct far more efficient clean energy sources.

Data analysis. We generate tremendous amounts of data. By some estimates, the quantity of data stored worldwide doubles every two years [28], but the vast majority of it is largely useless unless it's analyzed. As an example, knowing the sequence of nucleotides in human DNA is, by itself, of little use. Understanding how this sequence affects development and how it can cause disease requires extensive analysis. In addition to genomics, vast quantities of data are generated by particle colliders such as the Large Hadron Collider at CERN, medical imaging, astronomical research, and Web search engines—to name a few.

These and a host of other problems won't be solved without vast increases in computational power.


Much of the tremendous increase in single processor performance has been driven by the ever-increasing density of transistors—the electronic switches—on integrated circuits. As the size of transistors decreases, their speed can be increased, and the overall speed of the integrated circuit can be increased. However, as the speed of transistors increases, their power consumption also increases. Most of this power is dissipated as heat, and when an integrated circuit gets too hot, it becomes unreliable. In the first decade of the twenty-first century, air-cooled integrated circuits are reaching the limits of their ability to dissipate heat [26].

Therefore, it is becoming impossible to continue to increase the speed of integrated circuits. However, the increase in transistor density can continue—at least for a while. Also, given the potential of computing to improve our existence, there is an almost moral imperative to continue to increase computational power. Finally, if the integrated circuit industry doesn't continue to bring out new and better products, it will effectively cease to exist.

How then, can we exploit the continuing increase in transistor density? The answer is parallelism. Rather than building ever-faster, more complex, monolithic processors, the industry has decided to put multiple, relatively simple, complete processors on a single chip. Such integrated circuits are called multicore processors, and core has become synonymous with central processing unit, or CPU. In this setting a conventional processor with one CPU is often called a single-core system.


Most programs that have been written for conventional, single-core systems cannot exploit the presence of multiple cores. We can run multiple instances of a program on a multicore system, but this is often of little help. For example, being able to run multiple instances of our favorite game program isn't really what we want—we want the program to run faster with more realistic graphics. In order to do this, we need to either rewrite our serial programs so that they're parallel, so that they can make use of multiple cores, or write translation programs, that is, programs that will automatically convert serial programs into parallel programs. The bad news is that researchers have had very limited success writing programs that convert serial programs in languages such as C and C++ into parallel programs.

This isn't terribly surprising. While we can write programs that recognize common constructs in serial programs, and automatically translate these constructs into efficient parallel constructs, the sequence of parallel constructs may be terribly inefficient. For example, we can view the multiplication of two n × n matrices as a sequence of dot products, but parallelizing a matrix multiplication as a sequence of parallel dot products is likely to be very slow on many systems.

An efficient parallel implementation of a serial program may not be obtained by finding efficient parallelizations of each of its steps. Rather, the best parallelization may be obtained by stepping back and devising an entirely new algorithm.

As an example, suppose that we need to compute n values and add them together. We know that this can be done with the following serial code:

sum = 0; for (i = 0; i < n; i++) { x = Compute next value(...); sum += x; }

Now suppose we also have p cores and p is much smaller than n. Then each core can form a partial sum of approximately n/p values:

my sum = 0; my first i = ...; my last i = ...; for (my i = my first i; my i < my last i; my i++) { my x = Compute next value(...); my sum += my x; }

Here the prefix my_indicates that each core is using its own, private variables, and each core can execute this block of code independently of the other cores.

After each core completes execution of this code, its variable my sum will store the sum of the values computed by its calls to Compute next value. For example, if there are eight cores, n = 24, and the 24 calls to Compute next value return the values

1,4,3, 9,2,8, 5,1,1, 6,2,7, 2,5,0, 4,1,8, 6,5,1, 2,3,9,

then the values stored in my sum might be

Core 0 1 2 3 4 5 6 7 my_sum 8 19 7 15 7 13 12 14

Here we're assuming the cores are identified by nonnegative integers in the range 0,1, ..., p - 1, where p is the number of cores.

When the cores are done computing their values of my sum, they can form a global sum by sending their results to a designated "master" core, which can add their results:

if (I'm the master core) { sum = my x; for each core other than myself { receive value from core; sum += value; }

} else { send my x to the master; }

In our example, if the master core is core 0, it would add the values 8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95.

But you can probably see a better way to do this—especially if the number of cores is large. Instead of making the master core do all the work of computing the final sum, we can pair the cores so that while core 0 adds in the result of core 1, core 2 can add in the result of core 3, core 4 can add in the result of core 5 and so on. Then we can repeat the process with only the even-ranked cores: 0 adds in the result of 2, 4 adds in the result of 6, and so on. Now cores divisible by 4 repeat the process, and so on. See Figure 1.1. The circles contain the current value of each core's sum, and the lines with arrows indicate that one core is sending its sum to another core. The plus signs indicate that a core is receiving a sum from another core and adding the received sum into its own sum.

For both "global" sums, the master core (core 0) does more work than any other core, and the length of time it takes the program to complete the final sum should be the length of time it takes for the master to complete. However, with eight cores, the master will carry out seven receives and adds using the first method, while with the second method it will only carry out three. So the second method results in an improvement of more than a factor of two. The difference becomes much more dramatic with large numbers of cores. With 1000 cores, the first method will require 999 receives and adds, while the second will only require 10, an improvement of almost a factor of 100!

The first global sum is a fairly obvious generalization of the serial global sum: divide the work of adding among the cores, and after each core has computed its part of the sum, the master core simply repeats the basic serial addition—if there are p cores, then it needs to add p values. The second global sum, on the other hand, bears little relation to the original serial addition.

The point here is that it's unlikely that a translation program would "discover" the second global sum. Rather there would more likely be a predefined efficient global sum that the translation program would have access to. It could "recognize" the original serial loop and replace it with a precoded, efficient, parallel global sum.

We might expect that software could be written so that a large number of common serial constructs could be recognized and efficiently parallelized, that is, modified so that they can use multiple cores. However, as we apply this principle to ever more complex serial programs, it becomes more and more difficult to recognize the construct, and it becomes less and less likely that we'll have a precoded, efficient parallelization.


Excerpted from An Introduction to Parallel Programming by Peter S. Pacheco Copyright © 2011 by Elsevier Inc.. Excerpted by permission of MORGAN KAUFMANN. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.

Meet the Author

Peter Pacheco received a PhD in mathematics from Florida State University. After

completing graduate school, he became one of the first professors in UCLA’s “Program

in Computing,” which teaches basic computer science to students at the College

of Letters and Sciences there. Since leaving UCLA, he has been on the faculty of

the University of San Francisco. At USF Peter has served as chair of the computer

science department and is currently chair of the mathematics department.

His research is in parallel scientific computing. He has worked on the development

of parallel software for circuit simulation, speech recognition, and the simulation

of large networks of biologically accurate neurons. Peter has been teaching

parallel computing at both the undergraduate and graduate levels for nearly twenty

years. He is the author of Parallel Programming with MPI, published by Morgan

Kaufmann Publishers.

Customer Reviews

Average Review:

Post to your social network


Most Helpful Customer Reviews

See all customer reviews