Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL


Paperback (Second Edition): $29.83 BN.com price (save 25%); $39.95 list price


NOOK Book (eBook): $17.99 BN.com price (save 43%); $31.95 list price

Overview

There's a wealth of data online, but sorting and gathering it by hand can be tedious and time consuming. Rather than click through page after endless page, why not let bots do the work for you?

Webbots, Spiders, and Screen Scrapers will show you how to create simple programs with PHP/CURL to mine, parse, and archive online data to help you make informed decisions. Michael Schrenk, a highly regarded webbot developer, teaches you how to develop fault-tolerant designs, how best to launch and schedule the work of your bots, and how to create Internet agents that:

  • Send email or SMS notifications to alert you to new information quickly
  • Search different data sources and combine the results on one page, making the data easier to interpret and analyze
  • Automate purchases, auction bids, and other online activities to save time

Sample projects for automating tasks like price monitoring and news aggregation will show you how to put the concepts you learn into practice.

This second edition of Webbots, Spiders, and Screen Scrapers includes tricks for dealing with sites that are resistant to crawling and scraping, writing stealthy webbots that mimic human search behavior, and using regular expressions to harvest specific data. As you discover the possibilities of web scraping, you'll see how webbots can save you precious time and give you much greater control over the data available on the Web.

This text first outlines the deficiencies of browsers, and then explains how these deficiencies can be exploited in the design and deployment of task-specific webbots. Readers will learn how to write stealthy webbots that read email, emulate online forms, auto-authenticate, manage cookies, and handle encryption.


Product Details

  • ISBN-13: 9781593273972
  • Publisher: No Starch Press San Francisco, CA
  • Publication date: 3/22/2012
  • Edition description: Second Edition
  • Edition number: 2
  • Pages: 392
  • Sales rank: 1,441,437
  • Product dimensions: 7.08 (w) x 9.06 (h) x 0.96 (d)

Meet the Author

Michael Schrenk develops webbots and spiders for clients across North America. He has written for Computerworld and Web Techniques magazines and has taught college courses on web usability and Internet marketing. He is also an occasional speaker at DEFCON.


Table of Contents

About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Old-School Client-Server Technology
The Problem with Browsers
What to Expect from This Book
About the Website
About the Code
Requirements
A Disclaimer (This Is Important)
Fundamental Concepts and Techniques
Chapter 1: What’s in It for You?
1.1 Uncovering the Internet’s True Potential
1.2 What’s in It for Developers?
1.3 What’s in It for Business Leaders?
1.4 Final Thoughts
Chapter 2: Ideas for Webbot Projects
2.1 Inspiration from Browser Limitations
2.2 A Few Crazy Ideas to Get You Started
2.3 Final Thoughts
Chapter 3: Downloading Web Pages
3.1 Think About Files, Not Web Pages
3.2 Downloading Files with PHP’s Built-in Functions
3.3 Introducing PHP/CURL
3.4 Installing PHP/CURL
3.5 LIB_http
3.6 Final Thoughts
Chapter 4: Basic Parsing Techniques
4.1 Content Is Mixed with Markup
4.2 Parsing Poorly Written HTML
4.3 Standard Parse Routines
4.4 Using LIB_parse
4.5 Useful PHP Functions
4.6 Final Thoughts
Chapter 5: Advanced Parsing with Regular Expressions
5.1 Pattern Matching, the Key to Regular Expressions
5.2 PHP Regular Expression Types
5.3 Learning Patterns Through Examples
5.4 Regular Expressions of Particular Interest to Webbot Developers
5.5 When Regular Expressions Are (or Aren’t) the Right Parsing Tool
5.6 Final Thoughts
Chapter 6: Automating Form Submission
6.1 Reverse Engineering Form Interfaces
6.2 Form Handlers, Data Fields, Methods, and Event Triggers
6.3 Unpredictable Forms
6.4 Analyzing a Form
6.5 Final Thoughts
Chapter 7: Managing Large Amounts of Data
7.1 Organizing Data
7.2 Making Data Smaller
7.3 Thumbnailing Images
7.4 Final Thoughts
Projects
Chapter 8: Price-Monitoring Webbots
8.1 The Target
8.2 Designing the Parsing Script
8.3 Initialization and Downloading the Target
8.4 Further Exploration
Chapter 9: Image-Capturing Webbots
9.1 Example Image-Capturing Webbot
9.2 Creating the Image-Capturing Webbot
9.3 Further Exploration
9.4 Final Thoughts
Chapter 10: Link-Verification Webbots
10.1 Creating the Link-Verification Webbot
10.2 Running the Webbot
10.3 Further Exploration
Chapter 11: Search-Ranking Webbots
11.1 Description of a Search Result Page
11.2 What the Search-Ranking Webbot Does
11.3 Running the Search-Ranking Webbot
11.4 How the Search-Ranking Webbot Works
11.5 The Search-Ranking Webbot Script
11.6 Final Thoughts
11.7 Further Exploration
Chapter 12: Aggregation Webbots
12.1 Choosing Data Sources for Webbots
12.2 Example Aggregation Webbot
12.3 Adding Filtering to Your Aggregation Webbot
12.4 Further Exploration
Chapter 13: FTP Webbots
13.1 Example FTP Webbot
13.2 PHP and FTP
13.3 Further Exploration
Chapter 14: Webbots That Read Email
14.1 The POP3 Protocol
14.2 Executing POP3 Commands with a Webbot
14.3 Further Exploration
Chapter 15: Webbots That Send Email
15.1 Email, Webbots, and Spam
15.2 Sending Mail with SMTP and PHP
15.3 Writing a Webbot That Sends Email Notifications
15.4 Further Exploration
Chapter 16: Converting a Website into a Function
16.1 Writing a Function Interface
16.2 Final Thoughts
Advanced Technical Considerations
Chapter 17: Spiders
17.1 How Spiders Work
17.2 Example Spider
17.3 LIB_simple_spider
17.4 Experimenting with the Spider
17.5 Adding the Payload
17.6 Further Exploration
Chapter 18: Procurement Webbots and Snipers
18.1 Procurement Webbot Theory
18.2 Sniper Theory
18.3 Testing Your Own Webbots and Snipers
18.4 Further Exploration
18.5 Final Thoughts
Chapter 19: Webbots and Cryptography
19.1 Designing Webbots That Use Encryption
19.2 A Quick Overview of Web Encryption
19.3 Final Thoughts
Chapter 20: Authentication
20.1 What Is Authentication?
20.2 Example Scripts and Practice Pages
20.3 Basic Authentication
20.4 Session Authentication
20.5 Final Thoughts
Chapter 21: Advanced Cookie Management
21.1 How Cookies Work
21.2 PHP/CURL and Cookies
21.3 How Cookies Challenge Webbot Design
21.4 Further Exploration
Chapter 22: Scheduling Webbots and Spiders
22.1 Preparing Your Webbots to Run as Scheduled Tasks
22.2 The Windows XP Task Scheduler
22.3 The Windows 7 Task Scheduler
22.4 Non-calendar-based Triggers
22.5 Final Thoughts
Chapter 23: Scraping Difficult Websites with Browser Macros
23.1 Barriers to Effective Web Scraping
23.2 Overcoming Webscraping Barriers with Browser Macros
23.3 Final Thoughts
Chapter 24: Hacking iMacros
24.1 Hacking iMacros for Added Functionality
24.2 Further Exploration
Chapter 25: Deployment and Scaling
25.1 One-to-Many Environment
25.2 One-to-One Environment
25.3 Many-to-Many Environment
25.4 Many-to-One Environment
25.5 Scaling and Denial-of-Service Attacks
25.6 Creating Multiple Instances of a Webbot
25.7 Managing a Botnet
25.8 Further Exploration
Larger Considerations
Chapter 26: Designing Stealthy Webbots and Spiders
26.1 Why Design a Stealthy Webbot?
26.2 Stealth Means Simulating Human Patterns
26.3 Final Thoughts
Chapter 27: Proxies
27.1 What Is a Proxy?
27.2 Proxies in the Virtual World
27.3 Why Webbot Developers Use Proxies
27.4 Using a Proxy Server
27.5 Types of Proxy Servers
27.6 Final Thoughts
Chapter 28: Writing Fault-Tolerant Webbots
28.1 Types of Webbot Fault Tolerance
28.2 Error Handlers
28.3 Further Exploration
Chapter 29: Designing Webbot-Friendly Websites
29.1 Optimizing Web Pages for Search Engine Spiders
29.2 Web Design Techniques That Hinder Search Engine Spiders
29.3 Designing Data-Only Interfaces
29.4 Final Thoughts
Chapter 30: Killing Spiders
30.1 Asking Nicely
30.2 Building Speed Bumps
30.3 Setting Traps
30.4 Final Thoughts
Chapter 31: Keeping Webbots out of Trouble
31.1 It’s All About Respect
31.2 Copyright
31.3 Trespass to Chattels
31.4 Internet Law
31.5 Final Thoughts
PHP/CURL Reference
Creating a Minimal PHP/CURL Session
Initiating PHP/CURL Sessions
Setting PHP/CURL Options
Executing the PHP/CURL Command
Closing PHP/CURL Sessions
Status Codes
HTTP Codes
NNTP Codes
SMS Gateways
Sending Text Messages
Reading Text Messages
A Sampling of Text Message Email Addresses


Customer Reviews

Average Rating: 5 (2 reviews)
  • Anonymous

    Posted April 4, 2014

    Great Book

    Definitely the missing link in how to automate internet activity. A must-have book in your Tech Library.

  • Posted March 5, 2012

    Automating data collection with your eyes closed...

    This is a review of the second edition of Michael Schrenk's book. (I received an early-release copy from the publisher and did not have an opportunity to read the first edition.) I thoroughly enjoyed this book. I found myself glued to the topic; I had heard about it many times before but never investigated it. This is "good stuff," and I missed out by not starting earlier. The author, Michael Schrenk, knows his stuff and is passionate about his craft, and it shows in the way he writes. Throughout the book, his excitement about how incredible this technology is, and about using these tools in creative ways, is contagious. I like to read books by authors who are enthusiastic about their subject matter, as opposed to just droning out facts and knowledge. Reading this book was exciting and addicting. Following along and tinkering with his examples was just plain fun. His excitement and ingenious way of looking at things rub off; even before I got to the real-world examples, the ideas started flowing. It's like I just discovered the next BIG THING, but I'm not going to share that here. He does a great job of explaining everything in step-by-step detail and then complements the explanations with photos and diagrams to aid comprehension. His code examples are simple, and it was easy to see what was going on. They are written in an imperative, or procedural, style as opposed to an object-oriented style, which, in my opinion, is better suited to teaching new or difficult concepts. It is also easier for a wider range of people with varying programming backgrounds to follow along. He also provides his own supplemental library (via the book's website) to simplify using cURL itself. Using his library, I was able to quickly get things up and running and see how everything works, and that is a good thing when learning something new. It sets you on a positive spin and leaves you with nothing but good stuff to say about the subject you just learned.

    In the end, would I recommend this book to others? Absolutely. It is just like learning the command line: once you start and see the benefits, you never look back.
