Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage / Edition 1

Hardcover (Print)

This text demonstrates how to extract knowledge by finding meaningful connections among data spread throughout the Web. Readers learn methods and algorithms from the fields of information retrieval, machine learning, and data mining which, when combined, provide a solid framework for mining the Web. The authors walk readers through the algorithms with the aid of examples and exercises. The text is divided into three parts. Part One, Web Structure Mining, presents basic concepts and techniques for extracting information from the Web. Readers learn how to collect and index Web documents as well as search and rank Web pages according to their textual content and hyperlink structure. Part Two, Web Content Mining, offers two approaches, clustering and classification, for organizing Web content. For both approaches, the authors set forth specific algorithms that enable readers to convert Web data into knowledge. Part Three, Web Usage Mining, demonstrates the application of data mining methods to uncover meaningful patterns of Internet usage.

Methods and algorithms are illustrated by simple examples. More than 100 exercises help readers assess their grasp of the material. Further, thirty-four hands-on analysis problems ask readers to use their new data mining expertise to solve real problems, working with large data sets. All the data sets needed for the examples, exercises, and analysis problems are available on the companion Web site. The extensive use of examples, along with the opportunity to test and apply data mining skills, makes this text ideal for graduate and upper-level undergraduate students in computer science and engineering. Web designers and researchers will find that this text gives them a new set of tools to further mine the Web for knowledge and move well beyond the capabilities of standard search engines.

About the Authors:
Zdravko Markov, PhD, is Associate Professor of Computer Science at Central Connecticut State University.

Daniel T. Larose, PhD, is Professor of Statistics in the Department of Mathematical Sciences at Central Connecticut State University.

Editorial Reviews

From the Publisher
"…it has to be noted that this book is an excellent resource for conducting Web mining lectures or single units within Data mining class. The data can be used for small as well as quite comprehensive business intelligence projects. The book's content is easy to access; even students with very basic statistical skills can get the flavor of the intriguing aspects of Web mining." (Journal of Statistical Software, April 2008)

"…highlight[s] the exciting research related to data mining the Web…a detailed summary of the current state of the art." (CHOICE, December 2007)

"I can say I really enjoyed reading this book…a great educational resource for students and teachers." (Information Retrieval, 2008)

Table of Contents

Preface
Web Structure Mining
Information Retrieval and Web Search
Web Challenges
Web Search Engines
Topic Directories
Semantic Web
Crawling the Web
Web Basics
Web Crawlers
Indexing and Keyword Search
Document Representation
Implementation Considerations
Relevance Ranking
Advanced Text Search
Using the HTML Structure in Keyword Search
Evaluating Search Quality
Similarity Search
Cosine Similarity
Jaccard Similarity
Document Resemblance
References
Exercises
Hyperlink-Based Ranking
Introduction
Social Networks Analysis
PageRank
Authorities and Hubs
Link-Based Similarity Search
Enhanced Techniques for Page Ranking
References
Exercises
Web Content Mining
Clustering
Introduction
Hierarchical Agglomerative Clustering
k-Means Clustering
Probability-Based Clustering
Finite Mixture Problem
Classification Problem
Clustering Problem
Collaborative Filtering (Recommender Systems)
References
Exercises
Evaluating Clustering
Approaches to Evaluating Clustering
Similarity-Based Criterion Functions
Probabilistic Criterion Functions
MDL-Based Model and Feature Evaluation
Minimum Description Length Principle
MDL-Based Model Evaluation
Feature Selection
Classes-to-Clusters Evaluation
Precision, Recall, and F-Measure
Entropy
References
Exercises
Classification
General Setting and Evaluation Techniques
Nearest-Neighbor Algorithm
Feature Selection
Naive Bayes Algorithm
Numerical Approaches
Relational Learning
References
Exercises
Web Usage Mining
Introduction to Web Usage Mining
Definition of Web Usage Mining
Cross-Industry Standard Process for Data Mining
Clickstream Analysis
Web Server Log Files
Remote Host Field
Date/Time Field
HTTP Request Field
Status Code Field
Transfer Volume (Bytes) Field
Common Log Format
Identification Field
Authuser Field
Extended Common Log Format
Referrer Field
User Agent Field
Example of a Web Log Record
Microsoft IIS Log Format
Auxiliary Information
References
Exercises
Preprocessing for Web Usage Mining
Need for Preprocessing the Data
Data Cleaning and Filtering
Page Extension Exploration and Filtering
De-Spidering the Web Log File
User Identification
Session Identification
Path Completion
Directories and the Basket Transformation
Further Data Preprocessing Steps
References
Exercises
Exploratory Data Analysis for Web Usage Mining
Introduction
Number of Visit Actions
Session Duration
Relationship between Visit Actions and Session Duration
Average Time per Page
Duration for Individual Pages
References
Exercises
Modeling for Web Usage Mining: Clustering, Association, and Classification
Introduction
Modeling Methodology
Definition of Clustering
The BIRCH Clustering Algorithm
Affinity Analysis and the A Priori Algorithm
Discretizing the Numerical Variables: Binning
Applying the A Priori Algorithm to the CCSU Web Log Data
Classification and Regression Trees
The C4.5 Algorithm
References
Exercises
Index