Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition

Hardcover (Print)

Overview

"This book is the Bible for anyone who needs to manage large data collections. It's required reading for our search gurus at Infoseek. The authors have done an outstanding job of incorporating and describing the most significant new research in information retrieval over the past five years into this second edition."
Steve Kirsch, Cofounder, Infoseek Corporation

"The new edition of Witten, Moffat, and Bell not only has newer and better text search algorithms but much material on image analysis and joint image/text processing. If you care about search engines, you need this book: it is the only one with full details of how they work. The book is both detailed and enjoyable; the authors have combined elegant writing with top-grade programming."
Michael Lesk, National Science Foundation

"The coverage of compression, file organizations, and indexing techniques for full text and document management systems is unsurpassed. Students, researchers, and practitioners will all benefit from reading this book."
Bruce Croft, Director, Center for Intelligent Information Retrieval at the University of Massachusetts

In this fully updated second edition of the highly acclaimed Managing Gigabytes, authors Witten, Moffat, and Bell continue to provide unparalleled coverage of state-of-the-art techniques for compressing and indexing data. Whatever your field, if you work with large quantities of information, this book is essential reading—an authoritative theoretical resource and a practical guide to meeting the toughest storage and access challenges. It covers the latest developments in compression and indexing and their application on the Web and in digital libraries. It also details dozens of powerful techniques supported by mg, the authors' own system for compressing, storing, and retrieving text, images, and textual images. mg's source code is freely available on the Web.

"...describes the state-of-the-art techniques for efficiently compressing, storing, and retrieving large documents comprising gigabytes of data...with new coverage of the Internet, WWW, digital libraries, and more."

Meet the Author

Ian H. Witten is a professor of computer science at the University of Waikato in New Zealand. He directs the New Zealand Digital Library research project. His research interests include information retrieval, machine learning, text compression, and programming by demonstration. He received an MA in Mathematics from Cambridge University, England; an MSc in Computer Science from the University of Calgary, Canada; and a PhD in Electrical Engineering from Essex University, England. He is a fellow of the ACM and of the Royal Society of New Zealand. He has published widely on digital libraries, machine learning, text compression, hypertext, speech synthesis and signal processing, and computer typography. He has written several books, the latest being Managing Gigabytes (1999) and Data Mining (2000), both from Morgan Kaufmann.

Read an Excerpt

Chapter 1: Overview

Document databases

A library is just one form of document database: a large collection of books, magazines, and newspapers, of which, at any given time, a particular user is interested in only a tiny fraction. As a very rough estimate, we might suppose that one printed page contains about 400 words, or, including formatting and punctuation, about 2,500 characters; then a 400-page book contains about one million characters. For example, the present book contains over 200,000 words, nearly 1,400,000 characters, excluding pictures. Continuing the calculation, if we assume that a 400-page book is 2 centimeters thick, then a library stores information at the rate of 50 million characters per linear meter. A book stack has two sides and might be five shelves high and 5 meters long, so it stores perhaps two and a half billion characters, or, in computer terms, 2.5 gigabytes. Even a small library has 10 or more stacks; a large one might have hundreds. In total, then, we might expect even a relatively small document collection to contain several billion characters.
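The rough estimate above can be reproduced in a few lines of arithmetic. The sketch below is only an illustration of that calculation: the constant and function names are invented for this example, and the input figures are the ones quoted in the paragraph.

```python
# Back-of-envelope library capacity, using the figures quoted in the text.
# All names here are illustrative assumptions, not taken from the book.
CHARS_PER_PAGE = 2_500      # about 400 words plus formatting and punctuation
PAGES_PER_BOOK = 400
BOOK_THICKNESS_CM = 2
SIDES_PER_STACK = 2
SHELVES_PER_SIDE = 5
STACK_LENGTH_M = 5


def chars_per_book():
    """Characters in one 400-page book."""
    return CHARS_PER_PAGE * PAGES_PER_BOOK


def chars_per_linear_meter():
    """Characters stored along one meter of shelf."""
    books_per_meter = 100 // BOOK_THICKNESS_CM
    return chars_per_book() * books_per_meter


def chars_per_stack():
    """Characters in one two-sided, five-shelf, 5-meter stack."""
    shelf_meters = SIDES_PER_STACK * SHELVES_PER_SIDE * STACK_LENGTH_M
    return chars_per_linear_meter() * shelf_meters


def chars_per_library(stacks):
    """Characters in a library with the given number of stacks."""
    return chars_per_stack() * stacks


if __name__ == "__main__":
    print("per book:        ", chars_per_book())
    print("per linear meter:", chars_per_linear_meter())
    print("per stack:       ", chars_per_stack())
    print("10-stack library:", chars_per_library(10))
```

Integer arithmetic is used throughout so that the intermediate figures match the round numbers in the text exactly.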

Document databases are so large, and so common, that it is well worthwhile to consider how they might be stored as efficiently as possible; that is what this book is about. It is possible to reduce significantly the amount of space required to store text on computers using compression techniques. These methods change the representation of a document so that it can be stored in less space, yet recovered quickly in its original form.
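As a minimal illustration of storing text in less space and recovering it exactly, the sketch below round-trips a string through Python's standard zlib module. This is an assumption made for the example: zlib is a general-purpose compressor, not one of the methods developed in the book, and the helper names and sample text are invented here.

```python
import zlib

# Illustrative helpers; zlib stands in for "a compression technique" and is
# not the scheme described in the book.


def compress_text(text):
    """Encode text as UTF-8 and compress it at the highest zlib level."""
    return zlib.compress(text.encode("utf-8"), 9)


def decompress_text(blob):
    """Invert compress_text, returning the original string."""
    return zlib.decompress(blob).decode("utf-8")


# Repetitive sample text, chosen so that the saving is easy to see.
sample = (
    "Document databases are so large, and so common, that it is well "
    "worthwhile to consider how they might be stored efficiently. "
) * 50

blob = compress_text(sample)
restored = decompress_text(blob)

print(len(sample.encode("utf-8")), "bytes before,", len(blob), "bytes after")
print("recovered exactly:", restored == sample)
```

The sample is deliberately repetitive, so the ratio it shows says nothing about what real documents achieve.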

It is important to store documents efficiently in terms of storage space, but it is equally important that they can be located and retrieved efficiently, hence our interest in concordances. A major theme of the book is the combination of compression techniques with indexing techniques, which together address the two main problems in document retrieval: the space required to store large quantities of text, and the time needed to search it. Much has been written about solutions to each of the problems in isolation, but there are obstacles to combining compression and indexing that have, in the past, prevented their being used together. However, elegant ways to circumvent the obstacles have been devised, and it is these that are presented in this book. The most remarkable result is that it is possible, given a particular text, to produce a compressed and indexed version that is usually less than half the size of the original and yet can be searched extremely rapidly for any given combination of terms. We measure a computer system's speed in terms of the number of accesses made to disk, because this access time, which is measured in fractions of a second, dominates the total time that is involved in a search. Searches require just a few disk accesses, and so the technique is clearly very satisfactory for all but the most impatient interactive user.

A document database system should be able to store more than just text. Images, usually in the form of diagrams or photographs, are an important part of many documents. The above rough estimate of the amount of storage needed for a document database conveniently ignores the cost of storing images. It is much more difficult to estimate the amount of space required for pictures than it is for text, but it is likely to be considerable. For example, the 175-odd pictures in the present book (they occur mainly in Chapters 6, 7, and 8, although figures like the ones in this chapter are also stored as pictures) total almost 40 Mbytes on the computer, which is about 40 times the size of the text in the book. However, to be fair, they are, for technical convenience, stored in an unnecessarily redundant representation, and perhaps these numbers should be reduced by a factor of around five to give a more realistic feeling for the magnitude of the space occupied by the pictures. Even so, they occupy several times as much space as the text.

Sometimes a document that is predominantly text must be stored as an image because it may be necessary to reproduce it later in its original form for legal or historical reasons. Storing it as plain text generally loses a host of information, from spacing and typeface details to illegible or nontextual marks. For example, a document database might include credit card slips, and an accurate facsimile of the slip might be needed for legal purposes, though a textual version could be more useful for the purpose of routine consultation. Moreover, any document database system must provide a way to cope with the vast amount of text that is already on paper, and by far the simplest way of doing this is to scan existing documents into the system, using an optical scanning device, and treat them as a succession of images.

For these reasons, it is important to consider how images can be compressed and indexed alongside the text. A particularly important kind of image in document databases is one made up primarily of text, and we call this a textual image. Examples include fax documents and archives that have been digitally scanned for long-term storage. Special techniques are available to store textual images effectively; these are discussed in later chapters.

In this book we take a fairly liberal view of what is meant by a "document database." We have already expanded the term to include images. Document databases are now beginning to include other material, such as sound and video recordings. Although we do not specifically look at techniques for incorporating these kinds of data, what is involved is really just an extension of the ideas used for incorporating still images into a document database, coupled with appropriate, tailor-made compression methods.

The book focuses on full-text retrieval techniques, as opposed to conventional databases in which there are only a small number of preselected keys (such as "account number" or the mysterious codes like "J6NHYQ" used by airlines to identify reservation records) that can be used to access the stored data. In a full-text system, every part of each record is indexed, so any part may be used as the basis for a query. Imagine trying to extract, from an airline database, a list of all the people on your street who have departures on the same day as yours, so that you can try to hitch a ride to the airport. Airline databases are not likely to be indexed in a way that supports this kind of query, yet this is exactly the type of query of which a full-text system is capable. An example of this sort of query is "find all the documents about meetings between United States presidents and New Zealand and Australian prime ministers in which defense treaties are discussed."
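The idea that every word of every record is indexed, so that any combination of words can serve as a query, can be sketched with a small in-memory inverted index. This is a toy under stated assumptions, not the mg system's data structures: the function names, the tokenization rule, and the three sample documents are all invented for the illustration.

```python
import re
from collections import defaultdict

# Toy full-text index; every name and document below is hypothetical.


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Return the sorted ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


docs = {
    1: "The president met the New Zealand prime minister to discuss a defense treaty.",
    2: "The prime minister opened a new library in Wellington.",
    3: "A defense treaty was signed by the president last year.",
}
index = build_index(docs)

print(search_all(index, "president defense treaty"))
print(search_all(index, "prime minister treaty"))
```

Because no field is singled out as a key, a query such as "prime minister treaty" is answered by intersecting the document sets for its terms, which a database indexed only on preselected keys could not do directly.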

Although full-text retrieval systems are a kind of very large database, the latter term is generally used to refer specifically to very large conventional databases, which form a major area of study in themselves. Full-text retrieval and database systems are also part of the larger field known as information retrieval, which can be defined loosely as the study of methods and structures used to represent and access information. Again, we do not intend to deal in full with this larger field but have drawn in the appropriate material.

Many of the ideas in this book have been incorporated into a public domain full-text retrieval system called mg, a suite of programs that can index, compress, and search large quantities of documents, including both text and images. The mg system uses some of the better techniques now available and is intended to give practitioners an idea of the kind of performance that can be achieved. Information about obtaining and using the mg system is provided in Appendix A.

In the remainder of this chapter we introduce some of the main issues that the book deals with, namely compression, indexes, images and textual images, and the mg system. Each of these is examined in more detail in the chapters that follow. Inspired by the biblical origins of indexing, the problems addressed in this book are lightheartedly expressed in the allegory of Figure 1.4: this is the order in which we develop our theme...

Table of Contents

PREFACE
1. OVERVIEW
2. TEXT COMPRESSION
3. INDEXING
4. QUERYING
5. INDEX CONSTRUCTION
6. IMAGE COMPRESSION
7. TEXTUAL IMAGES
8. MIXED TEXT AND IMAGES
9. IMPLEMENTATION
10. THE INFORMATION EXPLOSION
A. GUIDE TO THE MG SYSTEM
B. GUIDE TO THE NZDL
REFERENCES
INDEX
