Expert guidance for those programming today’s dual-core processor PCs
As PC processors grow from one or two cores to eight and beyond, there is an urgent need for programmers to master concurrent programming. This book dives deep into the latest technologies available to programmers for creating professional parallel applications using C#, .NET 4, and Visual Studio 2010. The book covers task-based programming, coordination data structures, PLINQ, thread pools, the asynchronous programming model, and more. It also teaches other parallel programming techniques, such as SIMD and vectorization.
Master the tools and technology you need to develop thread-safe concurrent applications for multi-core systems, with Professional Parallel Programming with C#.
WHAT'S IN THIS CHAPTER?
* Working with shared-memory multicore
* Understanding the differences between shared-memory multicore and distributed-memory systems
* Working with parallel programming and multicore programming in shared-memory architectures
* Understanding hardware threads and software threads
* Understanding Amdahl's Law
* Considering Gustafson's Law
* Working with lightweight concurrency models
* Creating successful task-based designs
* Understanding the differences between interleaved concurrency, concurrency, and parallelism
* Parallelizing tasks and minimizing critical sections
* Understanding rules for parallel programming for multicore architectures
* Preparing for NUMA architectures
This chapter introduces the new task-based programming model that allows you to add parallelism to your applications. Parallelism is essential to exploiting modern shared-memory multicore architectures. The chapter describes the new lightweight concurrency models and important concepts related to concurrency and parallelism, and it provides the background you need for the next 10 chapters.
WORKING WITH SHARED-MEMORY MULTICORE
In 2005, Herb Sutter published an article in Dr. Dobb's Journal titled "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software" (www.gotw.ca/publications/concurrency-ddj.htm). He argued that developers need to start designing software with concurrency in mind in order to fully exploit the continuing exponential gains in microprocessor throughput. Microprocessor manufacturers are adding processing cores instead of increasing clock frequency, so software developers can no longer rely on the free-lunch performance gains that those increases in clock frequency used to provide.
Most machines today have at least a dual-core microprocessor. However, quad-core and octo-core microprocessors, with four and eight cores, respectively, are quite popular on servers, advanced workstations, and even high-end mobile computers. More cores in a single microprocessor are right around the corner. Modern microprocessors offer new multicore architectures, so it is very important to prepare software designs and code to exploit these architectures. The different kinds of applications generated with Visual C# 2010 and .NET Framework 4 run on one or many central processing units (CPUs), the main microprocessors. Each of these microprocessors can have a different number of cores, capable of executing instructions.
You can think of a multicore microprocessor as many interconnected microprocessors in a single package. All the cores have access to the main memory, as illustrated in Figure 1-1. Thus, this architecture is known as shared-memory multicore. Sharing memory in this way can easily lead to a performance bottleneck.
Multicore microprocessors have many different complex micro-architectures, designed to offer more parallel-execution capabilities, improve overall throughput, and reduce potential bottlenecks. At the same time, multicore microprocessors try to shrink power consumption and generate less heat. Therefore, many modern microprocessors can increase or reduce the frequency for each core according to its workload, and they can even put cores to sleep when they are not in use. Windows 7 and Windows Server 2008 R2 support a new feature called Core Parking. When this feature is active and some cores aren't in use, these operating systems put those idle cores to sleep. When the cores are necessary again, the operating systems wake the sleeping cores.
Modern microprocessors work with dynamic frequencies for each of their cores. Because the cores don't work with a fixed frequency, it is difficult to predict the performance for a sequence of instructions. For example, Intel Turbo Boost Technology increases the frequency of the active cores. The process of increasing the frequency for a core is also known as overclocking.
If a single core is under a heavy workload, this technology allows it to run at higher frequencies while the other cores are idle. If many cores are under heavy workloads, they will run at higher frequencies, but not as high as the one achieved by a single core alone. The microprocessor cannot keep all the cores overclocked for long, because doing so consumes more power and raises the temperature faster. The average clock frequency for all the cores under heavy workloads is therefore lower than the frequency a single core can achieve. As a result, under certain situations, some code can run at higher frequencies than other code, which can make measuring real performance gains a challenge.
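Because clock frequencies change dynamically, a single cold wall-clock measurement can be skewed by Turbo Boost and by JIT compilation. The following sketch shows one hedged approach, not a recipe from the book: run the workload once as a warm-up, then time a second run with Stopwatch. The MeasureMilliseconds helper and the square-root workload are illustrative placeholders.

```csharp
using System;
using System.Diagnostics;

class TimingSketch
{
    // Runs the workload once as a warm-up (letting the JIT compile it and
    // the core frequency settle), then times a second run.
    static double MeasureMilliseconds(Action work)
    {
        work(); // warm-up run; result discarded
        Stopwatch stopwatch = Stopwatch.StartNew();
        work();
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds;
    }

    static void Main()
    {
        double elapsed = MeasureMilliseconds(() =>
        {
            double sum = 0;
            for (int i = 1; i <= 1000000; i++)
            {
                sum += Math.Sqrt(i);
            }
        });
        Console.WriteLine("Elapsed: {0:F2} ms", elapsed);
    }
}
```

Even with a warm-up, repeating the measurement several times and comparing the spread gives a better picture than a single number.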
Differences Between Shared-Memory Multicore and Distributed-Memory Systems
Distributed-memory computer systems are composed of many microprocessors with their own private memory, as illustrated in Figure 1-2. Each microprocessor can be in a different computer, with different types of communication channels between them. Examples of communication channels are wired and wireless networks. If a job running on one of the microprocessors requires remote data, it has to communicate with the corresponding remote microprocessor through the communication channel. One of the most popular communications protocols used to program parallel applications for distributed-memory computer systems is Message Passing Interface (MPI). It is possible to use MPI to take advantage of shared-memory multicore with C# and .NET Framework. However, MPI's main focus is to help develop applications that run on clusters, so it adds significant overhead that isn't necessary in shared-memory multicore, where all the cores can access the memory without the need to send messages.
Figure 1-3 shows a distributed-memory computer system with three machines. Each machine has a quad-core microprocessor, and a shared-memory architecture for these cores. This way, the private memory for each microprocessor acts as a shared memory for its four cores.
A distributed-memory system forces you to think about the distribution of the data, because each message to retrieve remote data can introduce significant latency. Because you can add new machines (nodes) to increase the number of microprocessors in the system, distributed-memory systems can offer great scalability.
Parallel Programming and Multicore Programming
Traditional sequential code, where instructions run one after the other, doesn't take advantage of multiple cores because the serial instructions run on only one of the available cores. Sequential code written with Visual C# 2010 won't take advantage of multiple cores if it doesn't use the new features offered by .NET Framework 4 to split the work across many cores. There isn't an automatic parallelization of existing sequential code.
Parallel programming is a form of programming in which the code takes advantage of the parallel execution possibilities offered by the underlying hardware. Parallel programming runs many instructions at the same time. As previously explained, there are many different kinds of parallel architectures, and their detailed analysis would require a complete book dedicated to the topic.
Multicore programming is a form of programming in which the code takes advantage of the multiple execution cores to run many instructions in parallel. Multicore and multiprocessor computers offer more than one processing core in a single machine. Hence, the goal is to do more in less time by distributing the work to be done in the available cores.
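To make the contrast concrete, here is a minimal sketch of the same work written sequentially and then distributed across the available cores with the Parallel.For method that .NET Framework 4 introduces (covered in detail in Chapter 2). The square-root loop body is just an illustrative placeholder.

```csharp
using System;
using System.Threading.Tasks;

class SequentialVersusParallel
{
    static void Main()
    {
        double[] results = new double[100000];

        // Sequential version: all iterations run on a single core.
        for (int i = 0; i < results.Length; i++)
        {
            results[i] = Math.Sqrt(i);
        }

        // Parallel version: .NET Framework 4 partitions the iteration
        // space across the available cores. The split is explicit;
        // there is no automatic parallelization of the loop above.
        Parallel.For(0, results.Length, i =>
        {
            results[i] = Math.Sqrt(i);
        });

        Console.WriteLine("Last result: {0}", results[results.Length - 1]);
    }
}
```

Note that each iteration here is independent of the others, which is what makes the loop safe to distribute; later chapters deal with the cases where iterations share state.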
Modern microprocessors can execute the same instruction on multiple data, a capability Michael J. Flynn classified as Single Instruction, Multiple Data (SIMD) in the taxonomy he proposed in 1966. This way, you can take advantage of these vector processors to reduce the time needed to execute certain algorithms.
This book covers two areas of parallel programming in great detail: shared-memory multicore programming and the usage of vector-processing capabilities. The overall goal is to reduce the execution time of the algorithms. The additional processing power enables you to add new features to existing software, as well.
UNDERSTANDING HARDWARE THREADS AND SOFTWARE THREADS
A multicore microprocessor has more than one physical core — real independent processing units that make it possible to run instructions at the same time, in parallel. In order to take advantage of multiple physical cores, it is necessary to run many processes or to run more than one thread in a single process, creating multithreaded code.
However, each physical core can offer more than one hardware thread, also known as a logical core or logical processor. Microprocessors with Intel Hyper-Threading Technology (HT or HTT) offer multiple architectural states per physical core. For example, a microprocessor with four physical cores and HT duplicates the architectural states per physical core and offers eight hardware threads. This technique is known as simultaneous multithreading (SMT), and it uses the additional architectural states to optimize and increase parallel execution at the microprocessor's instruction level. SMT isn't restricted to just two hardware threads per physical core; for example, a core could offer four hardware threads. This doesn't mean that each hardware thread represents a physical core. SMT can offer performance improvements for multithreaded code under certain scenarios. Subsequent chapters provide several examples of these performance improvements.
Each running program in Windows is a process. Each process creates and runs one or more threads, known as software threads to differentiate them from the previously explained hardware threads. A process has at least one thread, the main thread. An operating system scheduler shares out the available processing resources fairly among all the processes and threads it has to run. The Windows scheduler assigns processing time to each software thread. When the Windows scheduler runs on a multicore microprocessor, it has to assign time from a hardware thread, supported by a physical core, to each software thread that needs to run instructions. As an analogy, you can think of each hardware thread as a swim lane and each software thread as a swimmer.
Each software thread shares the private, unique memory space of its parent process. However, it has its own stack, registers, and private local storage.
Windows recognizes each hardware thread as a schedulable logical processor. Each logical processor can run code for a software thread. A process that runs code in multiple software threads can take advantage of hardware threads and physical cores to run instructions in parallel. Figure 1-4 shows software threads running on hardware threads and on physical cores. Windows scheduler can decide to reassign one software thread to another hardware thread to load-balance the work done by each hardware thread. Because there are usually many other software threads waiting for processing time, load balancing will make it possible for these other threads to run their instructions by organizing the available resources. Figure 1-5 shows Windows Task Manager displaying eight hardware threads (logical cores and their workloads).
Load balancing refers to the practice of distributing work from software threads among hardware threads so that the workload is fairly shared across all the hardware threads. However, achieving perfect load balance depends on the parallelism within the application, the workload, the number of software threads, the available hardware threads, and the load-balancing policy.
Windows Task Manager and Windows Resource Monitor show the CPU usage history graphics for hardware threads. For example, if you have a microprocessor with four physical cores and eight hardware threads, these tools will display eight independent graphics.
Windows runs hundreds of software threads by assigning chunks of processing time to each available hardware thread. You can use Windows Resource Monitor to view the number of software threads for a specific process in the Overview tab. The CPU panel displays the image name for each process and the number of associated software threads in the Threads column, as shown in Figure 1-6 where the vlc.exe process has 32 software threads.
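Besides using Windows Resource Monitor, you can query the same information programmatically. This short sketch (not from the book) uses the System.Diagnostics.Process class to report how many software threads the current process is running:

```csharp
using System;
using System.Diagnostics;

class SoftwareThreadCount
{
    static void Main()
    {
        // Every running program is a process with at least one software
        // thread (the main thread); the Threads collection lists them all.
        Process currentProcess = Process.GetCurrentProcess();
        Console.WriteLine("{0} is running {1} software threads.",
            currentProcess.ProcessName,
            currentProcess.Threads.Count);
    }
}
```

Even a trivial console application usually reports more than one thread, because the CLR itself creates helper threads such as the finalizer thread.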
Core Parking is a Windows kernel power manager and kernel scheduler technology designed to improve the energy efficiency of multicore systems. It constantly tracks the workload of every hardware thread relative to all the others and can decide to put some of them into sleep mode.
Core Parking dynamically scales the number of hardware threads that are in use based on workload. When the workload for one of the hardware threads is lower than a certain threshold value, the Core Parking algorithm will try to reduce the number of hardware threads that are in use by parking some of the hardware threads in the system. In order to make this algorithm efficient, the kernel scheduler gives preference to unparked hardware threads when it schedules software threads. The kernel scheduler will try to let the parked hardware threads become idle, and this will allow them to transition into a lower-power idle state.
Core Parking tries to intelligently schedule work between threads that are running on multiple hardware threads in the same physical core on systems with microprocessors that include HT. This scheduling decision decreases power consumption.
Windows Server 2008 R2 supports the complete Core Parking technology. However, Windows 7 also uses the Core Parking algorithm and infrastructure to balance processor performance between hardware threads with microprocessors that include HT. Figure 1-7 shows Windows Resource Monitor displaying the activity of eight hardware threads, with four of them parked.
Regardless of the number of parked hardware threads, the number of hardware threads returned by .NET Framework 4 functions will be the total number, not just the unparked ones. Core Parking technology doesn't limit the number of hardware threads available to run software threads in a process.
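In C#, the usual way to read this total is Environment.ProcessorCount, which reports logical processors (hardware threads), not physical cores, and includes parked hardware threads in the count:

```csharp
using System;

class HardwareThreadCount
{
    static void Main()
    {
        // Environment.ProcessorCount returns the number of logical
        // processors (hardware threads). On a quad-core microprocessor
        // with Hyper-Threading it returns 8, regardless of how many of
        // those hardware threads Core Parking has currently parked.
        Console.WriteLine("Hardware threads: {0}", Environment.ProcessorCount);
    }
}
```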
Under a light workload, a system with eight hardware threads can behave like a system with only two, and then spin up the reserved hardware threads as the workload increases. In some cases, Core Parking can introduce additional latency when scheduling many software threads that try to run code in parallel. Therefore, it is very important to consider the resulting latency when measuring parallel performance.
Excerpted from Professional Parallel Programming with C# by Gastón Hillar Copyright © 2011 by John Wiley & Sons, Ltd. Excerpted by permission of John Wiley & Sons. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
CHAPTER 1: TASK-BASED PROGRAMMING 1
Working with Shared-Memory Multicore 2
Differences Between Shared-Memory Multicore and Distributed-Memory Systems 3
Parallel Programming and Multicore Programming 4
Understanding Hardware Threads and Software Threads 5
Understanding Amdahl’s Law 10
Considering Gustafson’s Law 13
Working with Lightweight Concurrency 16
Creating Successful Task-Based Designs 17
Designing With Concurrency in Mind 18
Understanding the Differences between Interleaved Concurrency, Concurrency, and Parallelism 19
Parallelizing Tasks 19
Minimizing Critical Sections 21
Understanding Rules for Parallel Programming for Multicore 22
Preparing for NUMA and Higher Scalability 22
Deciding the Convenience of Going Parallel 27
CHAPTER 2: IMPERATIVE DATA PARALLELISM 29
Launching Parallel Tasks 30
System.Threading.Tasks.Parallel Class 31
No Specific Execution Order 33
Advantages and Trade-Offs 37
Interleaved Concurrency and Concurrency 38
Transforming Sequential Code to Parallel Code 40
Detecting Parallelizable Hotspots 40
Measuring Speedups Achieved by Parallel Execution 43
Understanding the Concurrent Execution 45
Parallelizing Loops 45
Refactoring an Existing Sequential Loop 48
Measuring Scalability 50
Working with Embarrassingly Parallel Problems 52
Working with Partitions in a Parallel Loop 54
Optimizing the Partitions According to the Number of Cores 56
Working with IEnumerable Sources of Data 58
Exiting from Parallel Loops 60
Understanding ParallelLoopState 62
Analyzing the Results of a Parallel Loop Execution 63
Catching Exceptions that Occur Inside Parallel Loops 64
Specifying the Desired Degree of Parallelism 66
Counting Hardware Threads 69
Logical Cores Aren’t Physical Cores 70
Using Gantt Charts to Detect Critical Sections 71
CHAPTER 3: IMPERATIVE TASK PARALLELISM 73
Creating and Managing Tasks 74
Understanding a Task’s Status and Lifecycle 77
TaskStatus: Initial States 77
TaskStatus: Final States 78
Using Tasks to Parallelize Code 78
Starting Tasks 79
Visualizing Tasks Using Parallel Tasks and Parallel Stacks 80
Waiting for Tasks to Finish 85
Forgetting About Complex Threads 85
Cancelling Tasks Using Tokens 86
Handling Exceptions Thrown by Tasks 91
Returning Values from Tasks 92
Chaining Multiple Tasks Using Continuations 95
Mixing Parallel and Sequential Code with Continuations 97
Working with Complex Continuations 97
Programming Complex Parallel Algorithms with Critical Sections Using Tasks 100
Preparing the Code for Concurrency and Parallelism 101
CHAPTER 4: CONCURRENT COLLECTIONS 103
Understanding the Features Offered by Concurrent Collections 104
Understanding a Parallel Producer-Consumer Pattern 111
Working with Multiple Producers and Consumers 115
Designing Pipelines by Using Concurrent Collections 120
Transforming Arrays and Unsafe Collections into Concurrent Collections 128
Cancelling Operations on a BlockingCollection 142
Implementing a Filtering Pipeline with Many BlockingCollection Instances 144
CHAPTER 5: COORDINATION DATA STRUCTURES 157
Using Cars and Lanes to Understand the Concurrency Nightmares 158
Undesired Side Effects 158
Race Conditions 159
A Lock-Free Algorithm with Atomic Operations 161
A Lock-Free Algorithm with Local Storage 162
Understanding New Synchronization Mechanisms 163
Working with Synchronization Primitives 164
Synchronizing Concurrent Tasks with Barriers 165
Barrier and ContinueWhenAll 171
Catching Exceptions in all Participating Tasks 172
Working with Timeouts 173
Working with a Dynamic Number of Participants 178
Working with Mutual-Exclusion Locks 179
Working with Monitor 182
Working with Timeouts for Locks 184
Refactoring Code to Avoid Locks 187
Using Spin Locks as Mutual-Exclusion Lock Primitives 190
Working with Timeouts 193
Working with Spin-Based Waiting 194
Spinning and Yielding 197
Using the Volatile Modifier 200
Working with Lightweight Manual Reset Events 201
Working with ManualResetEventSlim to Spin and Wait 201
Working with Timeouts and Cancellations 206
Working with ManualResetEvent 210
Limiting Concurrency to Access a Resource 211
Working with SemaphoreSlim 212
Working with Timeouts and Cancellations 216
Working with Semaphore 216
Simplifying Dynamic Fork and Join Scenarios with CountdownEvent 219
Working with Atomic Operations 223
CHAPTER 6: PLINQ: DECLARATIVE DATA PARALLELISM 229
Transforming LINQ into PLINQ 230
ParallelEnumerable and Its AsParallel Method 232
AsOrdered and the orderby Clause 233
Specifying the Execution Mode 237
Understanding Partitioning in PLINQ 237
Performing Reduction Operations with PLINQ 242
Creating Custom PLINQ Aggregate Functions 245
Concurrent PLINQ Tasks 249
Cancelling PLINQ 253
Specifying the Desired Degree of Parallelism 255
Measuring Scalability 257
Working with ForAll 259
Differences Between foreach and ForAll 261
Measuring Scalability 261
Configuring How Results Are Returned by Using WithMergeOptions 264
Handling Exceptions Thrown by PLINQ 266
Using PLINQ to Execute MapReduce Algorithms 268
Designing Serial Stages Using PLINQ 271
Locating Processing Bottlenecks 273
CHAPTER 7: VISUAL STUDIO 2010 TASK DEBUGGING CAPABILITIES 275
Taking Advantage of Multi-Monitor Support 275
Understanding the Parallel Tasks Debugger Window 279
Viewing the Parallel Stacks Diagram 286
Following the Concurrent Code 294
Debugging Anonymous Methods 304
Viewing Methods 305
Viewing Threads in the Source Code 307
Detecting Deadlocks 310
CHAPTER 8: THREAD POOLS 317
Going Downstairs from the Tasks Floor 317
Understanding the New CLR 4 Thread Pool Engine 319
Understanding Global Queues 319
Waiting for Worker Threads to Finish Their Work 329
Tracking a Dynamic Number of Worker Threads 336
Using Tasks Instead of Threads to Queue Jobs 340
Understanding the Relationship Between Tasks and the Thread Pool 343
Understanding Local Queues and the Work-Stealing Algorithm 347
Specifying a Custom Task Scheduler 353
CHAPTER 9: ASYNCHRONOUS PROGRAMMING MODEL 361
Mixing Asynchronous Programming with Tasks 362
Working with TaskFactory.FromAsync 363
Programming Continuations After Asynchronous Methods End 368
Combining Results from Multiple Concurrent Asynchronous Operations 369
Performing Asynchronous WPF UI Updates 371
Performing Asynchronous Windows Forms UI Updates 379
Creating Tasks that Perform EAP Operations 385
Working with TaskCompletionSource 394
CHAPTER 10: PARALLEL TESTING AND TUNING 399
Preparing Parallel Tests 399
Working with Performance Profiling Features 404
Measuring Concurrency 406
Solutions to Common Patterns 416
Serialized Execution 416
Lock Contention 419
Lock Convoys 420
Partitioning Problems 428
Workstation Garbage-Collection Overhead 431
Working with the Server Garbage Collector 434
I/O Bottlenecks 434
Main Thread Overload 435
Understanding False Sharing 438
CHAPTER 11: VECTORIZATION, SIMD INSTRUCTIONS, AND ADDITIONAL PARALLEL LIBRARIES 443
Understanding SIMD and Vectorization 443
From MMX to SSE4.x and AVX 446
Using the Intel Math Kernel Library 447
Working with Multicore-Ready, Highly Optimized Software Functions 455
Mixing Task-Based Programming with External Optimized Libraries 456
Generating Pseudo-Random Numbers in Parallel 457
Using Intel Integrated Performance Primitives 461
APPENDIX A: .NET 4 PARALLELISM CLASS DIAGRAMS 469
Task Parallel Library 469
System.Threading.Tasks.Parallel Classes and Structures 469
Task Classes, Enumerations, and Exceptions 471
Data Structures for Coordination in Parallel Programming 472
Concurrent Collection Classes: System.Collections.Concurrent 474
Lightweight Synchronization Primitives 476
Lazy Initialization Classes 477
Thread and ThreadPool Classes and Their Exceptions 479
Signaling Classes 479
Threading Structures, Delegates, and Enumerations 480
BackgroundWorker Component 486
APPENDIX B: CONCURRENT UML MODELS 487
Structure Diagrams 487
Class Diagram 487
Component Diagram 489
Deployment Diagram 489
Package Diagram 489
Behavior Diagrams 489
Activity Diagram 491
Use Case Diagram 491
Interaction Diagrams 493
Interaction Overview Diagram 493
Sequence Diagram 494
APPENDIX C: PARALLEL EXTENSIONS EXTRAS 497
Inspecting Parallel Extensions Extras 497
Coordination Data Structures 502
Parallel Algorithms 513
Task Schedulers 517
Posted September 13, 2011
This is a great book, but it is not for the faint of heart. It's a high-level programming book geared toward teaching programmers how to best manage parallel programming techniques. I've dabbled a bit in background processes, but that is nothing compared to what's discussed in this book. And just reading the examples is not enough. Putting these concepts into your own code is where the understanding is going to come in and the mythical light bulb is going to suddenly turn on for you. If you have already started working with parallel programming, this book will increase your skills and help you master the subject!
Posted April 6, 2011
I wasn't sure what to think about this book when I got it, but as soon as I started reading it I knew that it was going to be a great reference.
The author starts by explaining that parallel programming is not going to solve every performance problem. In fact, it won't solve most of them. The book attempts to clearly explain how to determine if/when parallel programming is going to be the right solution. The author provides a lot of data to explain what type of gains you can expect (or not). In fact, the author wanted to make sure this point was so clearly understood that it was almost annoying.
The book starts by going over the TPL, PLINQ, Exception handling in parallel code and parallel friendly collections. Later on you get coverage of the Visual Studio parallel debugging tools and a look at how thread pooling works in .NET 4.
Overall this book does a great job of explaining parallel theories and how the TPL works, and you can get up and running with just the first 4-5 chapters, but you get so much more advanced information later in the book. It's really worth keeping around.