High Performance VLSI Signal Processing: Innovative Algorithms and Architectures / Edition 1 available in Hardcover
- Pub. Date:
Electrical Engineering/Signal Processing HighPerformance VLSI Signal Processing Innovative Architectures and Algorithms Volume 1 Algorithms and Architectures The first volume in a two-volume set, High-Performance VLSI Signal Processing: Innovative Architectures and Algorithms brings together the most innovative papers in the field, focused introductory material, and extensive references. The editors present timely coverage of algorithm and design methodologies with an emphasis on today's rapidly-evolving high-speed architectures for VLSI implementations. These volumes will serve as vital resources for engineers who want a comprehensive knowledge of the extremely interdisciplinary field of high-performance VLSI processing. The editors provide a practical understanding of the merits of total system design through an insightful, synergistic presentation of methodology, architecture, and infrastructure. Each volume features:
• Major papers that span the wide range of research areas in the field
• Chapter introductions, including historical perspectives
• Numerous applications-oriented design examples
• Coverage of current and future technological trends
• Thorough treatment of high-speed architectures
About the Author
About the Editors...
K. J. Ray Liu is an associate professor in the Electrical Engineering Department and Institute for Systems Research at the University of Maryland in College Park. Dr. Liu s research interests span all aspects of signal processing with application to image/video, wireless communications, networking, and medical biomedical technology. He has published more than one hundred papers, many of which are in archival journals, books, and book chapters. He has won many awards, including the IEEE Signal Processing Society s Senior Award for Best Paper in VLSI in 1993 and the National Science Foundation Young Investigator Award in 1994.
Kung Yao is a professor in the Electrical Engineering Department at the University of California in Los Angeles. His research and professional interests include systolic and VLSI algorithms, architectures and systems, digital and acoustic signal processing, microphone and sensor array, spectral analysis, and stochastic processes. Dr. Yao is a member of the Design and Implementation of Signal Processing Systems Technical Committee of the IEEE Signal Processing Society, and the Editorial Board of the Journal of VLSI Signal Processing. He received the IEEE Signal Processing Society s Senior Award for Best Paper in VLSI in 1993. Dr. Yao is a Fellow of IEEE.
Read an Excerpt
High-Performance VLSI Signal Processing Innovative Architectures and Algorithms, Volume 1, Algorithms and Architectures
John Wiley & SonsISBN: 0-7803-3468-X
Chapter OneArray Processors and Mapping
High Performance Signal Processing Via Systolic Array Architectures and Algorithms
In the last fifty years, the development of array processors for high performance signal processing has been closely interweaved with the development of high-performance digital computers. The earliest and most well known computer design was based on von Neumann architecture. It has a single processor executing single sequence of instructions (SI) on a single stream of data (SD) and is called a SISD machine.
This conceptual simplicity and associated technology led to the immense success of the digital computer revolution. In order to achieve higher computational throughput speed, the utilization of newer device technology with a faster switching speed has been an obvious approach. Often, this approach based only on device-technology improvement has not been sufficient. However, various alternative basic computer processing concepts and architectures have also been proposed.
In one approach, various forms of concurrent operations are used. Concurrency denotes the ability of a computational system to perform more than one operation at a given time and can be achieved through either parallelism or pipelining, or both. Parallelism utilizes concurrency by replicating some processing elements (PE). High throughput rate is achieved by having simultaneous operations of these functions on different parts of the program. Many parallel-processing, vector-data processor arrays operate in the single instruction (SI) and multiple data (MD) SIMD mode.
On the other hand, pipelining addresses concurrency by breaking some demanding part of the task into many smaller, simpler pieces, using many corresponding PEs, so that the processing can be performed consecutively. This digital pipe is arranged so that it is capable of processing the instructions and data independent of the number of PEs in the pipe. The high throughput rate in the pipe can be achieved by having fast PEs. A computer utilizing the pipelining feature is considered as a MISD machine since different parts of the single data (SD) stream are being processed simultaneously by multiple instructions (MI). Finally, multiple-instruction and multiple-data stream computers have many PEs with independent computational capabilities.
For many years, these general "parallel" processing computer architectures have been used only on specialized mainframe and minicomputer array processing computers for some very computational intensive off-line scientific, economic, meteorologic, and military applications. These "parallel" processing computers were almost always extremely expensive and were generally not available for dedicated real-time signal processing applications. Only with the advent of VLSI semiconductor fabrication technology have these parallel processing computer architectures been significant to high-performance modern signal processing. One of the most basic and important links between parallel computation/architecture and high throughput modern signal processing is the concept of systolic arrays.
The systolic array concept and term were coined by H. T. Kung and C. E. Leiserson to denote a simple class of concurrent processor, in which processed data move in a regular and periodic manner similar to the systolic pumping action of blood by the heart. The most basic idea of a systolic array is that once data are available, they are used effectively inside many PEs in the array to yield a higher throughput rate. Thus, a systolic array may exploit both the parallelism and pipelining capabilities of various algorithms. In the motivational and tutorial reprinted paper by H. T. Kung, a systolic array is required to satisfy the properties when there are only a few types of PEs in the array, with each type performing the same specific operation. All the operations are performed in a synchronous manner independent of the processed data. The only control data broadcast to the PEs are the synchronous clock signals, and the PEs communicate only with their nearest neighbors. These regular structure and local communication properties of a systolic array are consistent with efficient modern VLSI designs. A series of systolic designs for correlation/convolution was obtained intuitively. These examples demonstrate the basic property that a given signal processing algorithm can have many different systolic implementations with different hardware requirements and implications.
Various extensions of the earlier systolic array assumptions have since been made. These arrays may have the properties of wavefront array processing which allow PEs to start/end/control their own processing tasks dependent on the data. Some of the PEs can perform a limited number of different functions depending on the presence of some control data. Some PEs can communicate with a few nearby neighbors while the PEs at the edge of the array are allowed to have wrap-around communications. The reprinted paper by S. Y. Kung introduces the concept of asynchronous communication among PEs in wavefront array processing. While the original systolic array concept was applied only at the cellular PE level, the reprinted paper by H. T. Kung and Lam considers systolic array methodology to the PEs resulting in a two-level pipelined systolic array with a more efficient design.
The earlier work on mapping a signal processing algorithm onto a systolic array was performed intuitively. In 1982, a systematic approach based on space-time transformation was proposed for the systolic design of convolution. Since then, various formal systematic procedures for systolic designs of many classes of algorithms, often called dependence graph mapping techniques, have been proposed. One of the earliest works in this direction is the reprinted paper by Moldovan which considers the transformation of algorithms with loops into parallel forms suitable for systolic implementations. Examples with Fortran loop structure and LU decompositions are given. Another early systematic systolic design approach of Quinton introduces the concept of shift-invariance of the dependence graph, modeling the index variables of the algorithm. Uniform Recurrence Equations (URE) require only one computational element, while the Generalized URE (GURE) can use multiple computational elements. The reprinted paper by Quinton and Van Dongen extends Quinton's earlier techniques by using general systems of linear recurrences to define the class of algorithms suitable for systolic implementation. The reprinted paper by Rao and Kailath first critically reviews previously known techniques of URE and GURE and then introduces a general class of algorithms for systematic systolic designs, denoted as regular iterative algorithms (RIA). The concepts of PE scheduling and allocating in the systolic design were defined precisely, and the linear programming procedure for their computations was proposed. Variations and generalizations of these approaches have appeared in other works
Another recent systematic approach for efficient systolic design is to formally interpret the dependence graph of an algorithm as a lattice in a multidimensional integral space. The whole procedure of mapping an algorithm onto a systolic-type processor consists of two different but interdependent operations using space and time transformations. The former projects the dependence graph onto a lower dimensional structure which can then be mapped onto the physical array, while the latter yields the execution of each computation in the array. Recent Ph.D. thesis work on general systolic mapping techniques include.
As stated earlier, the systolic concept normally applied at the PE word level of a processor array can also be applied at the lower bit level. Various generalizations of this bit-level design have been demonstrated for correlation, convolution, Winograd matrix-vector multiplication, and IIR. In the reprinted paper by McCanny, McWhirter, and S. Y. Kung, the dependence graph technique based on the cut-set method was proposed to perform general bit-level systolic array designs. On the other hand, it is well known that many modern signal processing tasks are matrix-computational problems. These complex higher-level problems are also well suited for parallel/systolic processing. The reprinted paper by Moreno and Lang considers matrix computations on application-specific systolic-type array architectures. They proposed a multimesh graph method for matching the fine granularity of the architecture to that of the matrix algorithm for an efficient design.
In recent years, there have been many proposed schemes to perform systematic and efficient systolic array designs. One basic approach uses partitioning to split a large dimensional systolic array problem so that it can be implemented on a physically smaller fixed array, as proposed in . Other partitioning papers include. On the other hand, in many practical complex DSP problems, such as in systolic Kalman filtering and systolic eigen/singular value decompositions, the computational algorithms have several distinct stages which need to be processed sequentially. Dedicated systolic array hardware for implementing each stage yields a highly inefficient system. In the reprinted paper by Hwang and Hu, a multistage systolic-design technique has been proposed to map different phases of the algorithm expressed in several nested loops onto a single systolic array hardware. The critical issue of interfacing data flow and distribution between stages is addressed.
As the systolic design field becomes more mature, it is natural to consider optimized systolic designs. The reprinted paper by Wong and Delosme considers the derivation of time-optimal linear schedules of systolic designs for RIA algorithms. The time performance is measured as the product of the number of systolic cycles needed to perform some computation. The maximum time duration of a cycle as well as other optimizations are based on integer linear programming techniques. Many other generalizations and optimization criteria for systolic designs have been proposed in [10, 17, 30, 33, 37, 38, 70, 71, 81, 84].
From the beginning of systolic designs, there is continual interest in constructing CAD tools for automated VLSI synthesis of the processing array architecture. The goal is to start with some high level DSP algorithm specification (such as in C, Matlab, and others) and end up with a reasonably efficient systolic processor design, taking into consideration user-imposed constraints on throughput, chip area, I/O, power consumption, and so on. While much interesting work has been done on this problem, the design of a practical CAD program for the automated synthesis of systolic arrays at the multiprocessor level so far remains unsolved. Some recent work at the automated design of systolic array at the chip level include [45, 77].
In the last fifteen years, hundreds of technical papers on array processor architectures and mapping techniques have appeared. A detailed treatment of these topics is given in the book by S. Y. Kung in 1988. Many other books and edited chapters and tutorial articles dealing with these topics have appeared. New research results have appeared regularly in the proceedings of the VLSI Signal Processing Workshop and Applications Specific Array Processors.
Systolic architectures, which permit multiple computations for each memory access, can speed execution of compute-bound problems without increasing I/O requirements.
Why Systolic Architectures? H. T. Kung Carnegie-Mellon University
High-performance, special-purpose computer systems are typically used to meet specific application requirements or to off-load computations that are especially taxing to general-purpose computers. As hardware cost and size continue to drop and processing requirements become well-understood in areas such as signal and image processing, more special-purpose systems are being constructed. However, since most of these systems are built on an ad hoc basis for specific tasks, methodological work in this area is rare. Because the knowledge gained from individual experiences is neither accumulated nor properly organized, the same errors are repeated. I/O and computation imbalance is a notable example-often, the fact that I/O interfaces cannot keep up with device speed is discovered only after constructing a high-speed, special-purpose device.
We intend to help correct this ad hoc approach by providing a general guideline-specifically, the concept of systolic architecture, a general methodology for mapping high-level computations into hardware structures. In a systolic system, data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory, much as blood circulates to and from the heart. The system works like an automobile assembly line where different people work on the same car at different times and many cars are assembled simultaneously. An assembly line is always linear, however, and systolic systems are sometimes two-dimensional. They can be rectangular, triangular, or hexagonal to make use of higher degrees of parallelism. Moreover, to implement a variety of computations, data flow in a systolic system may be at multiple speeds in multiple directions-both inputs and (partial) results flow, whereas only results flow in classical pipelined systems. Generally speaking, a systolic system is easy to implement because of its regularity and easy to reconfigure (to meet various outside constraints) because of its modularity.
The systolic architectural concept was developed at Carnegie-Mellon University, and versions of systolic processors are being designed and built by several industrial and governmental organizations. This article reviews the basic principle of systolic architectures and explains why they should result in cost-effective, high-performance special-purpose systems for a wide range of problems.
Key architectural issues in designing special-purpose systems
Roughly, the cycle for developing a special-purpose system can be divided into three phases-task definition, design, and implementation. During task definition, some system performance bottleneck is identified, and a decision on whether or not to resolve it with special-purpose hardware is made. The evaluation required for task definition is most fundamental, but since it is often application-dependent, we will concentrate only on architectural issues related to the design phase and will assume routine implementation.
Simple and regular design. Cost-effectiveness has always been a chief concern in designing special-purpose systems; their cost must be low enough to justify their limited applicability. Costs can be classified as nonrecurring (design) and recurring (parts) costs. Part costs are dropping rapidly due to advances in integrated-circuit technology, but this advantage applies equally to both special-purpose and general-purpose systems. Furthermore, since special-purpose systems are seldom produced in large quantities, part costs are less important than design costs. Hence, the design cost of a special-purpose system must be relatively small for it to be more attractive than a general-purpose approach.
Fortunately, special-purpose design costs can be reduced by the use of appropriate architectures. If a structure can truly be decomposed into a few types of simple substructures or building blocks, which are used repetitively with simple interfaces, great savings can be achieved. This is especially true for VLSI designs where a single chip comprises hundreds of thousands of components. To cope with that complexity, simple and regular designs, similar to some of the techniques used in constructing large software systems, are essential. In addition, special-purpose systems based on simple, regular designs are likely to be modular and therefore adjustable to various performance goals-that is, system cost can be made proportional to the performance required. This suggests that meeting the architectural challenge for simple, regular designs yields cost-effective special-purpose systems.
Excerpted from High-Performance VLSI Signal Processing Innovative Architectures and Algorithms, Volume 1, Algorithms and Architectures Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.
Table of Contents
Array Processors and Mapping.
Transformations and Design Techniques.
System Architecture and Implementation.
Fault Tolerance in VLSI Signal Processing.