GPU Computing Gems Jade Edition
By Wen-mei W. Hwu
MORGAN KAUFMANN PUBLISHERS
Copyright © 2012 Elsevier, Inc.
All rights reserved.
ISBN: 978-0-12-385964-8
Chapter One
Large-Scale GPU Search
Tim Kaldewey and Andrea Di Blas
With P-ary search we developed a novel scalable parallel search algorithm that optimally leverages Single Instruction Multiple Data (SIMD) architectures like graphics processing units (GPUs). It outperforms conventional search algorithms like binary search in terms of throughput and response time by up to two orders of magnitude. Moreover, P-ary search scales with the number of threads/cores used to collaboratively answer an individual search query. While response time for conventional search algorithms tends to increase with workload size, P-ary search provides nearly constant query response time, independent of workload. Finally, P-ary search is particularly suited for data structures like B-trees, which are widely used for database indexes.
1.1 INTRODUCTION
As search engines like Google and Yahoo! demonstrate every day, efficiently searching for information in large data collections has become an indispensable operation of the information age. Although disciplines like information retrieval have developed highly efficient algorithms to manage large quantities of data, at the system level searching large data sets is I/O bound. Rapidly growing main memory sizes eliminate disk I/O as a bottleneck, even for traditionally data-intensive applications like search. While the front lines may have shifted from disk to main memory I/O, the basic problem remains: processors are significantly faster than storage. On today's systems accessing main memory takes hundreds of compute cycles.
While memory bandwidth has been increasing continuously, we hit the "memory wall" for latency more than a decade ago. Moreover, the increases in memory bandwidth over the past decade are a result of the multi/many-core era, with GPUs being a poster child, offering memory bandwidth well in excess of 100 GB/s. However, achieving peak bandwidth requires concurrent (parallel) memory access from all cores. The obvious use of increasingly parallel architectures for large-scale search applications is to execute multiple search queries simultaneously, which often results in significant increases in throughput. Query response time, on the other hand, remains memory-latency bound, unless we abandon conventional serial algorithms and go back to the drawing board.
To grasp the implications of porting a memory-bound application like search to the GPU, we first conducted a brief memory performance analysis (see Section 1.2). Conventional threading models and the memory access pattern produced by concurrent search queries do not map well to the GPU's memory subsystem (see Section 1.3.2). With P-ary search we demonstrate how parallel memory access combined with the superior synchronization capabilities of SIMD architectures like GPUs can be leveraged to compensate for memory latency (see the section titled "P-ary Search: Parallel Search from Scratch"). P-ary search outperforms conventional search algorithms not only in terms of throughput but also response time (see Section 1.4). Implemented on a GPU it can outperform a similarly priced CPU by up to three times, and is compatible with existing index data structures like inverted lists and B-trees. We expect the underlying concepts of P-ary search — parallel memory access and exploiting efficient SIMD thread synchronization — to be applicable to an entire class of memory-bound applications (see Section 1.5).
1.2 MEMORY PERFORMANCE
With more than six times the memory bandwidth of contemporary CPUs, GPUs are leading the trend toward throughput computing. On the other hand, traditional search algorithms besides linear scan are latency bound, since their iterations are data dependent. However, as large database systems usually serve many queries concurrently, both metrics, latency and bandwidth, are relevant.
Memory latency is mainly a function of where the requested piece of data is located in the memory hierarchy. Comparing CPU and GPU memory latency in terms of elapsed clock cycles shows that global memory accesses on the GPU take approximately 1.5 times as long as main memory accesses on the CPU, and more than twice as long in terms of absolute time (Table 1.1). Although shared memory does not operate the same way as the L1 cache on the CPU, its latency is comparable.
Memory bandwidth, on the other hand, depends on multiple factors, such as sequential or random access pattern, read/write ratio, word size, and concurrency. The effects of word size and read/write behavior on memory bandwidth are similar to the ones on the CPU: larger word sizes achieve better performance than small ones, and reads are faster than writes. On the other hand, the impact of concurrency and data access pattern requires additional consideration when porting memory-bound applications to the GPU.
Little's Law, a general principle for queuing systems, can be used to derive how many concurrent memory operations are required to fully utilize memory bandwidth. It states that the average number of units L inside a system equals the product of the average arrival rate λ and the average time W a unit spends in the system: L = λW. Applying Little's Law to memory, the number of outstanding requests must match the product of latency and bandwidth. For our GTX 285 GPU the latency is 500 clock cycles, and the peak bandwidth is 128 bytes per clock cycle (the physical bus width is 512 bits, or a 64-byte memory block, and two of these blocks are transferred per clock cycle), so:
outstanding reads = latency x bandwidth / request size = 500 cc x 128 B/cc / (4 B/request) = 16K requests
assuming 4-byte reads as in the code in Section 1.4. This can be achieved using different combinations of number of threads and outstanding requests per thread. In our example, we could make full use of the global memory by having 1K threads issue 16 independent reads each, or 2K threads issue eight reads each, and so on.
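As a concrete illustration, the following is a minimal CUDA sketch, not taken from the chapter, of a kernel in which every thread issues several independent reads; the kernel name, the constant READS_PER_THREAD, and the output array sink are assumptions made for illustration only.

// Minimal sketch (not from the chapter): each thread issues several
// independent 4-byte reads so the memory system can keep them in flight
// simultaneously. All names (saturate_reads, READS_PER_THREAD, sink) are
// illustrative assumptions; sink has one element per thread.
#define READS_PER_THREAD 16

__global__ void saturate_reads(const int *data, int *sink, size_t n)
{
    unsigned int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int stride = gridDim.x * blockDim.x;
    int acc = 0;

    // The loads below have no data dependencies among each other, so the
    // hardware can overlap them, unlike the single dependent load per
    // iteration issued by binary search.
    #pragma unroll
    for (int i = 0; i < READS_PER_THREAD; ++i) {
        size_t idx = (size_t)tid + (size_t)i * stride;
        if (idx < n)
            acc += data[idx];   // independent, coalescible read
    }
    sink[tid] = acc;            // write the sum so the reads are not optimized away
}

Launched with, for example, 4 blocks of 256 threads, such a kernel keeps roughly 1K x 16 = 16K reads outstanding, matching the estimate above; 2K threads with eight reads each would do the same.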
The plots in Figure 1.1 show the case in which each thread has only one outstanding memory request. This serves as a baseline example, mimicking the behavior of conventional search algorithms that at any given time have at most one outstanding memory request per search (thread), due to data dependencies. Obviously, if there are no constraints on issuing more than one read per thread, it is much more efficient to issue multiple reads per thread to maximize memory utilization. In our case, to saturate memory bandwidth we need at least 16,000 threads, for instance as 64 blocks of 256 threads each, where we observe a local peak. The maximum bandwidth of 150 GB/s is not reached here because the number of threads cannot compensate for some overhead required to manage threads and blocks. As the number of threads increases, the bandwidth takes a small hit before reaching its peak (Figure 1.1a). Although there are many options to launch 16,000 or more threads, only certain configurations can achieve memory bandwidth close to the maximum. Using fewer than 30 blocks is guaranteed to leave some of the 30 streaming multiprocessors (SMs) idle, and using more blocks than can actively fit on the SMs will leave some blocks waiting until others finish and might create some load imbalance. Having multiple threads per block is always desirable to improve efficiency, but a block cannot have more than 512 threads. Considering 4-byte reads as in our experiments, fewer than 16 threads per block cannot fully use memory coalescing as described below.
Returning to Little's Law, we notice that it assumes that the full bandwidth is utilized, meaning that all 64 bytes transferred with each memory block are useful bytes actually requested by an application, and not bytes that are transferred just because they belong to the same memory block. When any amount of data is accessed, with a minimum of one single byte, the entire 64-byte block that the data belongs to is actually transferred. To make sure that all bytes transferred are useful, it is necessary that accesses are coalesced, i.e., requests from different threads are presented to the memory management unit (MMU) in such a way that they can be packed into accesses that will use an entire 64-byte block. If, for example, the MMU can only find 10 threads that read 10 4-byte words from the same block, 40 bytes will actually be used and 24 will be discarded. It is clear that coalescing is extremely important to achieve high memory utilization, and that it is much easier when the access pattern is regular and contiguous. The experimental results in Figure 1.1b confirm that random-access memory bandwidth is significantly lower than in the coalesced case. A more comprehensive explanation of memory architecture, coalescing, and optimization techniques can be found in Nvidia's CUDA Programming Guide.
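To make the difference tangible, the following sketch, again with assumed names rather than code from the chapter, contrasts a coalescible access pattern with a strided one that wastes most of every 64-byte transaction.

// Illustrative sketch (assumed names): coalesced versus strided reads.
__global__ void copy_coalesced(const int *in, int *out, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];    // consecutive threads read consecutive 4-byte words,
                           // so a warp's requests pack into full 64-byte blocks
}

__global__ void copy_strided(const int *in, int *out, size_t n, size_t stride)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];   // threads scatter across memory blocks,
                                         // so most transferred bytes are discarded
}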
1.3 SEARCHING LARGE DATA SETS
The problem of searching is not theoretically difficult in itself, but quickly searching a large data set offers an exquisite example of how finding the right combination of algorithms and data structures for a specific architecture can dramatically improve performance.
1.3.1 Data Structures
Searching for specific values in unsorted data leaves no other choice than scanning the entire data set. The time to search an unsorted data set in memory is determined by sequential memory bandwidth, because the access pattern is nothing else but sequential reads. Performance is identical to that of coalesced reads (Figure 1.1a), which peak at about 150 GB/s on the GPU when using many thread blocks and many threads per block. On the other hand, searching sorted data or indexes can be implemented much more efficiently.
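A brute-force scan over unsorted data can be written so that its reads stay sequential and coalesced, for instance with a grid-stride loop as in the following sketch; the kernel and variable names are assumptions for illustration, and found_idx is presumed to be initialized on the host to a value larger than any valid index.

// Sketch (assumed names): brute-force search of unsorted data. The
// grid-stride loop keeps the access pattern sequential and coalesced, so
// throughput is bounded by the sequential read bandwidth discussed above.
__global__ void scan_unsorted(const int *data, int n, int key, int *found_idx)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        if (data[i] == key)
            atomicMin(found_idx, i);   // record the smallest matching position
    }
}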
Databases and many other applications use indexes, stored as sorted lists or B-trees, to accelerate searches. For example, searching a sorted list using binary search requires O(log2(n)) memory accesses as opposed to O(n) using linear search, for a data set of size n. However, the data access pattern of binary search is not amenable to caching and prefetching, as each iteration's memory access is data dependent and distant (Figure 1.2). Although each access incurs the full memory latency, this approach is orders of magnitude faster on sorted data. For example, assuming a memory latency of 500 clock cycles, searching a 512 MB data set of 32-bit integers (128M entries) in the worst case takes log2(128M) x 500 cc = 27 x 500 cc = 13,500 cc, as opposed to millions of cycles when scanning the entire data set. B-trees group pivot elements as nodes and store them linearly, which makes them more amenable to caching (Figure 1.2). However, using the same threading model as one would on the CPU, assigning one thread to one search query, inevitably results in threads diverging quickly such that memory accesses across threads cannot be coalesced.
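The excerpt does not show the chapter's index layout, but a linearized B-tree node along the lines described above could, as a purely illustrative assumption, look like this:

// Purely illustrative assumption (not the chapter's actual layout): the pivot
// keys of one B-tree node are stored contiguously, so a node can be fetched
// with a single coalesced 64-byte transaction.
#define NODE_KEYS 16                 // 16 x 4-byte keys = one 64-byte memory block

struct BTreeNode {
    int keys[NODE_KEYS];             // sorted pivot elements of this node
    int children[NODE_KEYS + 1];     // offsets of child nodes in a flat node array
};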
1.3.2 Limitations of Conventional Search Algorithms
Implementing a search function in CUDA can be done by simply adding a __global__ function qualifier to a textbook implementation of binary search. Obviously, the performance of such a naive implementation cannot even keep up with a basic CPU implementation. Although optimizations like manual caching, using vector data types, and inlining functions can significantly improve response time and throughput, the basic problem remains: algorithms like binary search were designed for serial machines and not for massively parallel architectures like the GPU. For instance, each iteration of binary search depends on the outcome of a single load and compare, and response time becomes simply a matter of memory latency (Table 1.1) multiplied by the number of iterations required to find the search key, log2(n). As the GPU has higher memory latency than the CPU, we cannot expect any improvements in response time.
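A naive kernel of this kind might look like the following sketch; it is not the chapter's code, and the kernel and parameter names are assumptions. One thread answers one query, and every iteration stalls on a single data-dependent load.

// Naive sketch (assumed names): a textbook binary search marked __global__.
// Each thread resolves one query; every iteration waits on one data-dependent,
// uncoalesced global load.
__global__ void binary_search_naive(const int *sorted, int n,
                                    const int *keys, int *result, int num_queries)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= num_queries) return;

    int key = keys[q];
    int lo = 0, hi = n;              // search the half-open range [lo, hi)
    int pos = -1;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        int pivot = sorted[mid];     // full memory latency on every iteration
        if (pivot == key) { pos = mid; break; }
        if (pivot < key)  lo = mid + 1;
        else              hi = mid;
    }
    result[q] = pos;                 // index of the key, or -1 if not found
}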
Adopting the CPU threading model and simply running multiple searches concurrently, mapping one search query to one thread, makes matters worse, for two reasons. First, given the GPU's SIMD architecture, all threads within a thread block have to "wait" until all of them complete their searches. With increasing block size, the likelihood for one of the threads to require worst-case runtime increases. Second, while all threads start with the same pivot element(s), they quickly diverge as each thread is assigned a different search key. The resulting memory access patterns are not amenable to caching or coalescing. Moreover, the large number of small memory accesses is likely to lead to contention, thus introducing additional latency. On the other hand, achieving high memory bandwidth, and thus high application throughput, requires a high level of concurrency, which translates to large workloads. Using conventional algorithms, one has to choose whether to optimize for throughput or for response time, as they appear to be conflicting goals.
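For completeness, the host-side launch for the one-query-per-thread model sketched above could look as follows; the block size of 256 is an arbitrary illustrative choice.

// Host-side launch of the naive kernel above, mapping one query to one thread.
void launch_naive_search(const int *d_sorted, int n,
                         const int *d_keys, int *d_result, int num_queries)
{
    int threads = 256;                                   // arbitrary block size
    int blocks  = (num_queries + threads - 1) / threads; // one thread per query
    binary_search_naive<<<blocks, threads>>>(d_sorted, n, d_keys, d_result, num_queries);
    cudaDeviceSynchronize();                             // wait until all searches finish
}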
(Continues...)