Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition / Edition 2 available in Hardcover
In this fully updated second edition of the highly acclaimed Managing Gigabytes, authors Witten, Moffat, and Bell continue to provide unparalleled coverage of state-of-the-art techniques for compressing and indexing data. Whatever your field, if you work with large quantities of information, this book is essential reading: an authoritative theoretical resource and a practical guide to meeting the toughest storage and access challenges. It covers the latest developments in compression and indexing and their application on the Web and in digital libraries. It also details dozens of powerful techniques supported by mg, the authors' own system for compressing, storing, and retrieving text, images, and textual images. mg's source code is freely available on the Web.
- Up-to-date coverage of new text compression algorithms such as block sorting, approximate arithmetic coding, and fast Huffman coding
- New sections on content-based index compression and distributed querying, with two new data structures for fast indexing
- New coverage of image coding, including descriptions of de facto standards in use on the Web (GIF and PNG), information on CALIC, the new proposed JPEG Lossless standard, and JBIG2
- New information on the Internet and WWW, digital libraries, web search engines, and agent-based retrieval
- Accompanied by a public domain system called MG which is a fully worked-out operational example of the advanced techniques developed and explained in the book
- New appendix on an existing digital library system that uses the MG software
Series: Morgan Kaufmann Series in Multimedia Information and Systems
Product dimensions: 7.67(w) x 9.54(h) x 1.34(d)
Read an Excerpt
Chapter 1: Overview
Document databases
A library is just one form of document database: a large collection of books, magazines, and newspapers, of which, at any given time, a particular user is interested in only a tiny fraction. As a very rough estimate, we might suppose that one printed page contains about 400 words, or, including formatting and punctuation, about 2,500 characters; then a 400-page book contains about one million characters. For example, the present book contains over 200,000 words, nearly 1,400,000 characters, excluding pictures. Continuing the calculation, if we assume that a 400-page book is 2 centimeters thick, then a library stores information at the rate of 50 million characters per linear meter. A book stack has two sides and might be five shelves high and 5 meters long, so it stores perhaps two and a half billion characters, or, in computer terms, 2.5 gigabytes. Even a small library has 10 or more stacks; a large one might have hundreds. In total, then, we might expect even a relatively small document collection to contain several billion characters.
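The back-of-envelope calculation above can be checked directly. The script below simply restates the chapter's assumed figures (page, book, and shelf dimensions are the chapter's illustrative estimates, not measurements):

```python
# Back-of-envelope library storage estimate, using the chapter's assumptions.
CHARS_PER_PAGE = 2_500          # ~400 words plus formatting and punctuation
PAGES_PER_BOOK = 400
BOOK_THICKNESS_M = 0.02         # a 400-page book is about 2 cm thick

chars_per_book = CHARS_PER_PAGE * PAGES_PER_BOOK        # one million characters
chars_per_metre = chars_per_book / BOOK_THICKNESS_M     # 50 million per linear metre

# A book stack: two sides, five shelves high, 5 metres long.
shelf_metres = 2 * 5 * 5
chars_per_stack = chars_per_metre * shelf_metres        # 2.5 billion characters

print(f"{chars_per_book:,.0f} characters per book")
print(f"{chars_per_stack / 1e9:.1f} gigabytes per stack")
```

Running it reproduces the figures in the text: one million characters per book and 2.5 gigabytes per stack.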
Document databases are so large, and so common, that it is well worthwhile to consider how they might be stored as efficiently as possible; that is what this book is about. It is possible to reduce significantly the amount of space required to store text on computers using compression techniques. These methods change the representation of a document so that it can be stored in less space, yet recovered quickly in its original form.
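This property of compression, that a smaller representation yields back the original exactly, is easy to demonstrate. The sketch below uses Python's general-purpose `zlib` library purely as an illustration; it is not one of the specialized text compression methods the book develops:

```python
import zlib

# Illustration of lossless compression: the document is stored in less
# space, yet recovered exactly in its original form on decompression.
text = ("Document databases are so large, and so common, that it is well "
        "worthwhile to consider how they might be stored efficiently. ") * 50

compressed = zlib.compress(text.encode("utf-8"), level=9)
recovered = zlib.decompress(compressed).decode("utf-8")

assert recovered == text    # lossless: the original is recovered intact
print(len(text.encode("utf-8")), "bytes ->", len(compressed), "bytes")
```

Natural-language text is highly redundant, so even a generic compressor shrinks it substantially; the methods in this book exploit that redundancy far more systematically.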
It is important to store documents efficiently in terms of storage space, but it is equally important that they can be located and retrieved efficiently; hence our interest in concordances. A major theme of the book is the combination of compression techniques with indexing techniques, which together address the two main problems in document retrieval: the space required to store large quantities of text, and the time needed to search it. Much has been written about solutions to each problem in isolation, but there are obstacles to combining compression and indexing that have, in the past, prevented their being used together. However, elegant ways to circumvent these obstacles have been devised, and it is these that are presented in this book. The most remarkable result is that it is possible, given a particular text, to produce a compressed and indexed version that is usually less than half the size of the original and yet can be searched extremely rapidly for any given combination of terms. We measure a computer system's speed in terms of the number of accesses made to disk, because this access time, measured in fractions of a second, dominates the total time involved in a search. Searches require just a few disk accesses, and so the technique is clearly satisfactory for all but the most impatient interactive user.
A document database system should be able to store more than just text. Images, usually in the form of diagrams or photographs, are an important part of many documents. The above rough estimate of the amount of storage needed for a document database conveniently ignores the cost of storing images. It is much more difficult to estimate the amount of space required for pictures than it is for text, but it is likely to be considerable. For example, the 175-odd pictures in the present book (they occur mainly in Chapters 6, 7, and 8, although figures like the ones in this chapter are also stored as pictures) total almost 40 Mbytes on the computer, which is about 40 times the size of the text in the book. However, to be fair, they are, for technical convenience, stored in an unnecessarily redundant representation, and perhaps these numbers should be reduced by a factor of around five to give a more realistic feeling for the magnitude of the space occupied by the pictures. Even so, they occupy several times as much space as the text.
Sometimes a document that is predominantly text must be stored as an image because it may be necessary to reproduce it later in its original form for legal or historical reasons. Storing it as plain text generally loses a host of information, from spacing and typeface details to illegible or nontextual marks. For example, a document database might include credit card slips, and an accurate facsimile of the slip might be needed for legal purposes, though a textual version could be more useful for the purpose of routine consultation. Moreover, any document database system must provide a way to cope with the vast amount of text that is already on paper, and by far the simplest way of doing this is to scan existing documents into the system, using an optical scanning device, and treat them as a succession of images.
For these reasons, it is important to consider how images can be compressed and indexed alongside the text. A particularly important kind of image in document databases is one made up primarily of text, and we call this a textual image. Examples include fax documents and archives that have been digitally scanned for long-term storage. Special techniques are available to store textual images effectively; these are discussed in later chapters.
In this book we take a fairly liberal view of what is meant by a "document database." We have already expanded the term to include images. Document databases are now beginning to include other material, such as sound and video recordings. Although we do not specifically look at techniques for incorporating these kinds of data, what is involved is really just an extension of the ideas used for incorporating still images into a document database, coupled with appropriate, tailor-made compression methods.
The book focuses on full-text retrieval techniques, as opposed to conventional databases in which there are only a small number of preselected keys (such as "account number" or the mysterious codes like "J6NHYQ" used by airlines to identify reservation records) that can be used to access the stored data. In a full-text system, every part of each record is indexed, so any part may be used as the basis for a query. Imagine trying to extract, from an airline database, a list of all the people on your street who have departures on the same day as yours, so that you can try to hitch a ride to the airport. Airline databases are not likely to be indexed in a way that supports this kind of query, yet this is exactly the type of query of which a full-text system is capable. An example of this sort of query is "find all the documents about meetings between United States presidents and New Zealand and Australian prime ministers in which defense treaties are discussed."
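The distinction the paragraph draws, every part of every record indexed versus a few preselected keys, can be sketched with a toy inverted index. This is an illustrative assumption for exposition, not the mg system's implementation (mg's compressed index structures are developed in later chapters):

```python
from collections import defaultdict

# Toy full-text index: every word of every document is indexed,
# so any word (not just a preselected key) can anchor a query.
docs = {
    1: "defense treaty discussed by presidents and prime ministers",
    2: "airline reservation records use a small number of preselected keys",
    3: "prime ministers met to discuss a defense treaty",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def query_all(*terms):
    """Return the documents containing every query term (conjunctive query)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(query_all("defense", "treaty"))   # documents 1 and 3 match
```

A conventional database could answer this only if "defense" and "treaty" happened to be predefined keys; the full-text index answers it for any combination of terms.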
Although full-text retrieval systems are a kind of very large database, the latter term is generally used to refer specifically to very large conventional databases, which form a major area of study in themselves. Full-text retrieval and database systems are also part of the larger field known as information retrieval, which can be defined loosely as the study of methods and structures used to represent and access information. Again, we do not intend to deal in full with this larger field but have drawn in the appropriate material.
Many of the ideas in this book have been incorporated into a public domain full-text retrieval system called mg, a suite of programs that can index, compress, and search large quantities of documents, including both text and images. The mg system uses some of the better techniques now available and is intended to give practitioners an idea of the kind of performance that can be achieved. Information about obtaining and using the mg system is provided in Appendix A.
In the remainder of this chapter we introduce some of the main issues that the book deals with, namely compression, indexes, images and textual images, and the mg system. Each of these is examined in more detail in the chapters that follow. Inspired by the biblical origins of indexing, the problems addressed in this book are lightheartedly expressed in the allegory of Figure 1.4: this is the order in which we develop our theme...
Table of Contents
1. OVERVIEW
2. TEXT COMPRESSION
3. INDEXING
4. QUERYING
5. INDEX CONSTRUCTION
6. IMAGE COMPRESSION
7. TEXTUAL IMAGES
8. MIXED TEXT AND IMAGES
9. IMPLEMENTATION
10. THE INFORMATION EXPLOSION
A. GUIDE TO THE MG SYSTEM
B. GUIDE TO THE NZDL
What People are Saying About This
The coverage of compression, file organizations, and indexing techniques for full text and document management systems is unsurpassed. Students, researchers, and practitioners will all benefit from reading this book.
The new edition of Witten, Moffat, and Bell not only has newer and better text search algorithms but much material on image analysis and joint image/text processing. If you care about search engines, you need this book: it is the only one with full details of how they work. The book is both detailed and enjoyable; the authors have combined elegant writing with top-grade programming.
This book is the Bible for anyone who needs to manage large data collections. It's required reading for our search gurus at Infoseek. The authors have done an outstanding job of incorporating and describing the most significant new research in information retrieval over the past five years into this second edition.