Spidering Hacks: 100 Industrial-Strength Tips and Techniques


by Morbus Iff, Tara Calishain
     
 


Overview

The Internet, with its profusion of information, has made us hungry for ever more, ever better data. Out of necessity, many of us have become pretty adept with search engine queries, but there are times when even the most powerful search engines aren't enough. If you've ever wanted your data in a different form than it's presented, or wanted to collect data from several sites and see it side-by-side without the constraints of a browser, then Spidering Hacks is for you.

Spidering Hacks takes you to the next level in Internet data retrieval, beyond search engines, by showing you how to create spiders and bots to retrieve information from your favorite sites and data sources. You'll no longer feel constrained by the way host sites think you want to see their data presented. You'll learn how to scrape and repurpose raw data so you can view it in a way that's meaningful to you.

Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content. By the time you finish Spidering Hacks, you'll be able to:

  • Aggregate and associate data from disparate locations, then store and manipulate the data as you like
  • Gain a competitive edge in business by knowing when competitors' products are on sale, and comparing sales ranks and product placement on e-commerce sites
  • Integrate third-party data into your own applications or web sites
  • Make your own site easier to scrape and more usable to others
  • Keep up-to-date with your favorite comic strips, news stories, stock tips, and more without visiting the site every day

Like the other books in O'Reilly's popular Hacks series, Spidering Hacks brings you 100 industrial-strength tips and tools from the experts to help you master this technology. If you're interested in data retrieval of any type, this book provides a wealth of data for finding a wealth of data.

Editorial Reviews

bn.com
The Barnes & Noble Review
The Web has more extraordinarily useful content than anyone can conceivably get their arms around. But you can organize and use a whole lot more of it than you’re using now. The secret is in writing two forms of programs: spiders and scrapers.

Both types of programs fetch goodies from the Internet. Kinda like your dog bringing in the morning paper, except you can choose exactly what you want fetched. Spiders typically follow links to grab entire pages, files, or even sites. Scrapers generally grab specific pieces of information from within individual pages or files. Often, they’re used together: you send a spider to find the right pages, then send a scraper to pull the right excerpts.
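
To make the spider/scraper distinction concrete, here is a minimal sketch. It is written in Python rather than the Perl and LWP tools the book uses, and it is not taken from the book. The `PAGES` dictionary is a made-up, in-memory stand-in for a web site so the example runs without network access; the page paths, the `<h1>` markup, and the function names `spider` and `scrape_titles` are all assumptions for illustration.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "site" (an assumption, so the sketch needs no network).
PAGES = {
    "/index.html": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "/a.html": '<h1>Page A</h1> <a href="/b.html">B</a>',
    "/b.html": '<h1>Page B</h1> <a href="/index.html">home</a>',
}


class LinkAndTitleParser(HTMLParser):
    """Collects link targets (for the spider) and <h1> text (for the scraper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.titles = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.titles.append(data.strip())


def spider(start):
    """Follow links breadth-first from `start`; return pages in visit order."""
    seen, queue, order = set(), [start], []
    while queue:
        path = queue.pop(0)
        if path in seen or path not in PAGES:
            continue
        seen.add(path)
        order.append(path)
        parser = LinkAndTitleParser()
        parser.feed(PAGES[path])
        queue.extend(parser.links)
    return order


def scrape_titles(path):
    """Pull one specific piece of information (the <h1> text) from one page."""
    parser = LinkAndTitleParser()
    parser.feed(PAGES[path])
    return parser.titles


if __name__ == "__main__":
    # Spider first to find the pages, then scrape each one for its excerpt.
    for page in spider("/index.html"):
        print(page, scrape_titles(page))
```

The two functions are deliberately separate: the spider only decides which pages to visit, while the scraper only extracts data from a page it is handed, mirroring the "send a spider, then send a scraper" pattern described above.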

Spiders and scrapers let you automatically pull together data from dozens of sites and present it any way you like. You could automatically keep up with anything that’s regularly posted on the Web (your favorite comic, your competitor’s new product introductions, the latest news from Iraq, new postings at your favorite blog).

There are millions of people who could benefit from writing these programs, but few of them know how. This stuff’s eminently learnable -- especially if you can already find your way around Perl. That’s where Spidering Hacks comes in.

This is the latest in O’Reilly’s Hacks series -- intended, in their words, to “reclaim the term ‘hacking’ for the good guys: innovators who explore and experiment, unearth shortcuts, create useful tools, and come up with fun things to try on your own.” We’ve raved about Google Hacks. Spidering Hacks is just as cool.

Kevin Hemenway and Tara Calishain first outline the basic concepts, techniques, and tools -- especially the Perl LWP modules for accessing web data, and automated tools like WWW::Mechanize. (They also cover the etiquette of spidering: how to “walk softly” on the sites you’re spidering, and respect requests not to spider.)
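
One common way a site expresses a request not to be spidered is a robots.txt file. The sketch below shows how a spider might check such rules before fetching a page. It uses Python's standard `urllib.robotparser` instead of the Perl modules covered in the book, and the robots.txt content, the `ExampleSpider` user-agent name, and the example.com URLs are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption; a real spider would
# download this file from the target site before crawling it).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_rules(text):
    """Parse robots.txt text into a rule set without touching the network."""
    rules = RobotFileParser()
    rules.parse(text.splitlines())
    return rules


rules = build_rules(ROBOTS_TXT)

# Ask before fetching: is this user agent allowed to request this URL?
allowed = rules.can_fetch("ExampleSpider", "http://example.com/comics/today.html")
blocked = rules.can_fetch("ExampleSpider", "http://example.com/private/data.html")

# "Walking softly" also means pausing between requests when the site asks.
delay = rules.crawl_delay("ExampleSpider")

if __name__ == "__main__":
    print(allowed, blocked, delay)
```

A polite spider would skip any URL for which `can_fetch` returns false and sleep for at least the requested delay between requests.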

You’ll learn how to deal with secured site access and site redesigns, and how to improve your programs with feedback and progress bars. Next, the authors show how to retrieve media files (for example, downloading movies from the Library of Congress, retrieving daily comics, and finding MP3 files associated with an M3U playlist).

There’s extensive coverage of spidering database applications: for example, aggregating multiple search engine results, archiving Yahoo! Groups messages, scraping e-commerce site product reviews, collecting specific TV listings, automatically finding blogs you’re interested in, tracking overnight express packages, and bargain hunting by automating the price comparison process.

As a bonus, the authors take you into some immensely useful hidden corners of the Web. For example, they show how to spider Lexical Freenet, which displays word relationships like puns, rhymes, concepts, relevant people, antonyms, and so forth. Add a simple spider, and you’ve built a truly amazing tool for any writer, librarian, or researcher.

There’s also more than a little fun here: for example, a spider that captures song lyrics and forwards them to a text-to-speech site that creates a .WAV file. Think of it as Robot Karaoke. So, use your imagination. What do you want to spider today?

Bill Camarda

Bill Camarda is a consultant, writer, and web/multimedia content developer. His 15 books include Special Edition Using Word 2000 and Upgrading & Fixing Networks for Dummies, Second Edition.

Product Details

ISBN-13: 9780596005771
Publisher: O'Reilly Media, Incorporated
Publication date: 10/28/2003
Pages: 428
Sales rank: 791,455
Product dimensions: 6.00(w) x 9.00(h) x 0.97(d)

Meet the Author

Kevin Hemenway, coauthor of Mac OS X Hacks, is better known as Morbus Iff, the creator of disobey.com, which bills itself as "content for the discontented." Publisher and developer of more home cooking than you could ever imagine, he'd love to give you a Fry Pan of Intellect upside the head. Politely, of course. And with love.

Tara Calishain is the creator of the site ResearchBuzz. She is an expert on Internet search engines and how they can be used effectively in business situations.
