Search Engines: Information Retrieval in Practice

Hardcover (Print)
Rent
Rent from BN.com
$53.52
(Save 60%)
Est. Return Date: 11/16/2014
Buy New
Buy New from BN.com
$117.43
Used and New from Other Sellers
Used and New from Other Sellers
from $123.22
Usually ships in 1-2 business days
(Save 7%)
Other sellers (Hardcover)
  • All (9) from $123.22   
  • New (7) from $123.22   
  • Used (2) from $149.94   

Overview

Search Engines: Information Retrieval in Practice is ideal for introductory information retrieval courses at the undergraduate and graduate level in computer science, information science and computer engineering departments. It is also a valuable tool for search engine and information retrieval professionals.

Written by a leader in the field of information retrieval, Search Engines: Information Retrieval in Practice , is designed to give undergraduate students the understanding and tools they need to evaluate, compare and modify search engines. Coverage of the underlying IR and mathematical models reinforce key concepts. The book’s numerous programming exercises make extensive use of Galago, a Java-based open source search engine.

Read More Show Less

Product Details

  • ISBN-13: 9780136072249
  • Publisher: Addison-Wesley
  • Publication date: 2/20/2009
  • Series: Alternative eText Formats Series
  • Edition description: New Edition
  • Pages: 552
  • Sales rank: 653,082
  • Product dimensions: 7.60 (w) x 9.20 (h) x 1.10 (d)

Meet the Author

W. Bruce Croft is a Distinguished Professor in the Department of Computer Science at the University of Massachusetts, Amherst, which he joined in 1979. In 1992, he became the Director of the Center for Intelligent Information Retrieval (CIIR), which combines basic research with technology transfer to a variety of government and industry partners. He has published more than 180 articles related to information retrieval. Dr. Croft was elected a Fellow of ACM in 1997, received the Research Award from the American Society for Information Science and Technology in 2000, and received the Gerard Salton Award from the ACM Special Interest Group in Information Retrieval (SIGIR) in 2003.

Donald Metzler is a Research Scientist in the Search and Computational Advertising group at Yahoo! Research in Santa Clara, CA. He obtained his Ph.D. from the University of Massachusetts in 2007. During his graduate studies he was awarded a Microsoft Live Labs Graduate Fellowship. His research interests include formal information retrieval models, web search, advertising, and machine learning.

Trevor Strohman is a software engineer in the Google search quality division. His Ph.D., from the University of Massachusetts Amherst, focused on high-performance text retrieval systems that are easily adaptable to fit specific retrieval applications. He has published papers and presented a tutorial at the top information retrieval conference, SIGIR. He is the creator of the Galago search engine, and the primary developer of the Indri search engine (www.lemurproject.org/indri). He has ten years of professional software development experience, including desktop, server, and web applications.

Read More Show Less

Table of Contents

1 Search Engines and Information Retrieval . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 What is Information Retrieval? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Search Engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.3 Search Engineers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.4 Book Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 Architecture of a Search Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.1 What is an Architecture? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.2 Basic Building Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.3 Breaking It Down . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.3.1 Text Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.3.2 Text Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.3.3 Index Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.3.4 User Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.3.5 Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.3.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.4 How Does It Really Work? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3 Crawls and Feeds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.1 Deciding what to search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.2 Crawling the Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.3 Directory Crawling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.4 Document Feeds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.5 The Conversion Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.6 Storing the Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.7 Detecting Duplicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.8 Removing Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4 Processing Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.1 From Words to Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.2 Text Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.2.1 Vocabulary Growth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.2.2 Estimating Database and Result Set Sizes . . . . . . . . . . . . . . . 57

4.3 Document Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.3.2 Tokenizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.3.3 Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.3.4 Stemming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.3.5 Phrases and N-grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.4 Document Structure and Markup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

4.5 Link Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

4.5.1 Anchor Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

4.5.2 PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

4.5.3 Link Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

4.6 Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

4.7 Internationalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

5 Ranking with Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

5.2 Abstract Model of Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

5.3 Inverted indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

5.3.1 Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

5.3.2 Counts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

5.3.3 Positions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

5.3.4 Fields and Extents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

5.3.5 Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

5.3.6 Ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

5.4 Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

5.4.1 Entropy and Ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

5.4.2 Delta Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

5.4.3 Bit-aligned codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

5.4.4 Byte-aligned codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

5.4.5 Looking ahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

5.4.6 Skipping and Skip Pointers . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

5.5 Auxiliary Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

5.6 Index Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

5.6.1 Simple Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

5.6.2 Merging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

5.6.3 Parallelism and Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 131

5.6.4 Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

5.7 Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

5.7.1 Document-at-a-time evaluation . . . . . . . . . . . . . . . . . . . . . . . 138

5.7.2 Term-at-a-time evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

5.7.3 Optimization techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

5.7.4 Structured queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

5.7.5 Distributed evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

5.7.6 Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

6 Queries and Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

6.1 Information Needs and Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

6.2 Query Transformation and Refinement . . . . . . . . . . . . . . . . . . . . . . . 162

6.2.1 Stopping and Stemming Revisited . . . . . . . . . . . . . . . . . . . . . 162

6.2.2 Spell Checking and Suggestions . . . . . . . . . . . . . . . . . . . . . . . 165

6.2.3 Query Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

6.2.4 Relevance Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

6.2.5 Context and Personalization . . . . . . . . . . . . . . . . . . . . . . . . . . 183

6.3 Showing the Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

6.3.1 Result Pages and Snippets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

6.3.2 Advertising and Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

6.3.3 Clustering the Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

6.4 Cross-Language Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196

7 Retrieval Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

7.1 Overview of Retrieval Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

7.1.1 Boolean Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

7.1.2 The Vector Space Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

7.2 Probabilistic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215

7.2.1 Information Retrieval as Classification . . . . . . . . . . . . . . . . . 216

7.2.2 The BM25 Ranking Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 221

7.3 Ranking based on Language Models . . . . . . . . . . . . . . . . . . . . . . . . . 224

7.3.1 Query Likelihood Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . 226

7.3.2 Relevance Models and Pseudo-Relevance Feedback . . . . . . 232

7.4 Complex Queries and Combining Evidence . . . . . . . . . . . . . . . . . . . 238

7.4.1 The Inference Network Model . . . . . . . . . . . . . . . . . . . . . . . . 239

7.4.2 The Galago Query Language . . . . . . . . . . . . . . . . . . . . . . . . . . 245

7.5 Web Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250

7.6 Machine Learning and Information Retrieval . . . . . . . . . . . . . . . . . . 255

7.6.1 Learning to Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256

7.6.2 Topic Models and Vocabulary Mismatch . . . . . . . . . . . . . . . . 259

7.7 Application-Based Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262

8 Evaluating Search Engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269

8.1 Why Evaluate? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269

8.2 The Evaluation Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271

8.3 Logging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277

8.4 Effectiveness Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280

8.4.1 Recall and Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280

8.4.2 Averaging and Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . 285

8.4.3 Focusing On The Top Documents . . . . . . . . . . . . . . . . . . . . . 290

8.4.4 Using Preferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293

8.5 Efficiency Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294

8.6 Training, Testing, and Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297

8.6.1 Significance Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297

8.6.2 Setting Parameter Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302

8.7 The Bottom Line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304

9 Classification and Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309

9.1 Classification and Categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310

9.1.1 Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312

9.1.2 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320

9.1.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328

9.1.4 Classifier and Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . 329

9.1.5 Spam, Sentiment, and Online Advertising . . . . . . . . . . . . . . 333

9.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343

9.2.1 Hierarchical and K-Means Clustering . . . . . . . . . . . . . . . . . . 344

9.2.2 K Nearest Neighbor Clustering . . . . . . . . . . . . . . . . . . . . . . . 354

9.2.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356

9.2.4 How to Choose K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357

9.2.5 Clustering and Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359

10 Social Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365

10.1 What is Social Search? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365

10.2 User Tags and Manual Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366

10.3 Searching With Communities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366

10.4 Filtering and Recommending . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366

10.4.1 Document Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366

10.4.2 Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375

10.5 Personalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380

10.6 Peer-to-Peer and Metasearch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380

10.6.1 Distributed search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380

10.6.2 P2P Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384

11 Beyond Bag of Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391

11.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391

11.2 Feature-Based Retrieval Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392

11.3 Term Dependence Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394

11.4 Structure Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399

11.4.1 XML Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401

11.5 Longer Questions, Better Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . 404

11.6 Words, Pictures, and Music . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408

11.7 One Search Fits All? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445

Read More Show Less

Customer Reviews

Be the first to write a review
( 0 )
Rating Distribution

5 Star

(0)

4 Star

(0)

3 Star

(0)

2 Star

(0)

1 Star

(0)

Your Rating:

Your Name: Create a Pen Name or

Barnes & Noble.com Review Rules

Our reader reviews allow you to share your comments on titles you liked, or didn't, with others. By submitting an online review, you are representing to Barnes & Noble.com that all information contained in your review is original and accurate in all respects, and that the submission of such content by you and the posting of such content by Barnes & Noble.com does not and will not violate the rights of any third party. Please follow the rules below to help ensure that your review can be posted.

Reviews by Our Customers Under the Age of 13

We highly value and respect everyone's opinion concerning the titles we offer. However, we cannot allow persons under the age of 13 to have accounts at BN.com or to post customer reviews. Please see our Terms of Use for more details.

What to exclude from your review:

Please do not write about reviews, commentary, or information posted on the product page. If you see any errors in the information on the product page, please send us an email.

Reviews should not contain any of the following:

  • - HTML tags, profanity, obscenities, vulgarities, or comments that defame anyone
  • - Time-sensitive information such as tour dates, signings, lectures, etc.
  • - Single-word reviews. Other people will read your review to discover why you liked or didn't like the title. Be descriptive.
  • - Comments focusing on the author or that may ruin the ending for others
  • - Phone numbers, addresses, URLs
  • - Pricing and availability information or alternative ordering information
  • - Advertisements or commercial solicitation

Reminder:

  • - By submitting a review, you grant to Barnes & Noble.com and its sublicensees the royalty-free, perpetual, irrevocable right and license to use the review in accordance with the Barnes & Noble.com Terms of Use.
  • - Barnes & Noble.com reserves the right not to post any review -- particularly those that do not follow the terms and conditions of these Rules. Barnes & Noble.com also reserves the right to remove any review at any time without notice.
  • - See Terms of Use for other conditions and disclaimers.
Search for Products You'd Like to Recommend

Recommend other products that relate to your review. Just search for them below and share!

Create a Pen Name

Your Pen Name is your unique identity on BN.com. It will appear on the reviews you write and other website activities. Your Pen Name cannot be edited, changed or deleted once submitted.

 
Your Pen Name can be any combination of alphanumeric characters (plus - and _), and must be at least two characters long.

Continue Anonymously

    If you find inappropriate content, please report it to Barnes & Noble
    Why is this product inappropriate?
    Comments (optional)