The Web has more extraordinarily useful content than anyone could conceivably get their arms around. But you can organize and use a whole lot more of it than you’re using now. The secret lies in writing two kinds of programs: spiders and scrapers.
Both types of programs fetch goodies from the Internet. Kinda like your dog bringing in the morning paper, except you can choose exactly what you want fetched. Spiders typically follow links to grab entire pages, files, or even sites. Scrapers generally grab specific pieces of information from within individual pages or files. Often, they’re used together: you send a spider to find the right pages, then send a scraper to pull the right excerpts.
Spiders and scrapers let you automatically pull together data from dozens of sites and present it any way you like. You could automatically keep up with anything that’s regularly posted on the Web (your favorite comic, your competitor’s new product introductions, the latest news from Iraq, new postings at your favorite blog).
There are millions of people who could benefit from writing these programs, but few of them know how. This stuff’s eminently learnable -- especially if you can already find your way around Perl. That’s where Spidering Hacks comes in.
This is the latest in O’Reilly’s Hacks series -- intended, in their words, to “reclaim the term ‘hacking’ for the good guys: innovators who explore and experiment, unearth shortcuts, create useful tools, and come up with fun things to try on your own.” We’ve raved about Google Hacks. Spidering Hacks is just as cool.
Kevin Hemenway and Tara Calishain first outline the basic concepts, techniques, and tools -- especially the Perl LWP modules for fetching web data, and automation modules like WWW::Mechanize. (They also cover the etiquette of spidering: how to “walk softly” on the sites you’re spidering, and how to respect requests not to spider at all.)
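To give a taste of that “walk softly” advice, here’s a minimal sketch (my own illustration, not one of the book’s hacks) using LWP::RobotUA, the LWP user agent that checks a site’s robots.txt before each request and throttles itself. The bot name, contact address, and URL are all placeholders.

    #!/usr/bin/perl
    # A minimal "polite" spider. LWP::RobotUA fetches and obeys each
    # site's robots.txt, and it pauses between requests automatically.
    use strict;
    use warnings;
    use LWP::RobotUA;

    # Identify your bot honestly: a name plus a contact address.
    # Both values, and the URL below, are placeholders.
    my $ua = LWP::RobotUA->new('ExampleSpider/0.1', 'you@example.com');
    $ua->delay(1);    # wait at least one minute between requests

    my $response = $ua->get('http://www.example.com/');
    if ($response->is_success) {
        print $response->decoded_content;
    }
    else {
        warn 'Fetch failed: ', $response->status_line, "\n";
    }

Because LWP::RobotUA is a drop-in subclass of LWP::UserAgent, swapping it in is usually the cheapest way to make an existing spider polite.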
You’ll learn how to deal with secured site access and site redesigns, and how to improve your programs with feedback and progress bars. Next, the authors show how to retrieve media files (for example, downloading movies from the Library of Congress, retrieving daily comics, and finding the MP3 files listed in an M3U playlist).
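For a sense of how little code a media grab takes, here’s another sketch under the same caveat -- my illustration, with a placeholder URL and filename, not a hack from the book. LWP’s ':content_file' option streams the response body straight to disk, which matters when the file is a movie rather than a page of HTML.

    #!/usr/bin/perl
    # Saving a media file straight to disk with plain LWP. The
    # ':content_file' option streams the body to a file instead of
    # holding it all in memory -- handy for movies and MP3s.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new(agent => 'ExampleFetcher/0.1');

    # Placeholder URL and local filename.
    my $url  = 'http://www.example.com/media/sample.mp3';
    my $file = 'sample.mp3';

    my $response = $ua->get($url, ':content_file' => $file);
    die 'Download failed: ', $response->status_line, "\n"
        unless $response->is_success;

    print "Saved $url to $file\n";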
There’s extensive coverage of spidering database-backed applications: for example, aggregating results from multiple search engines, archiving Yahoo! Groups messages, scraping product reviews from e-commerce sites, collecting specific TV listings, automatically finding blogs you’re interested in, tracking overnight express packages, and bargain hunting by automating price comparisons.
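Underneath most of these hacks is the same two-step pattern: fetch a page, then pick the data you want out of its markup. Here’s a bare-bones sketch of the scraping step (again mine, not the book’s, with a placeholder URL), using HTML::TokeParser to pull the target and text of every link; a real hack would aim at the specific tags a given site uses.

    #!/usr/bin/perl
    # The scraping half of the spider/scraper pairing: fetch a page,
    # then walk its HTML with HTML::TokeParser, printing the target
    # and visible text of every link.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::TokeParser;

    my $ua = LWP::UserAgent->new(agent => 'ExampleScraper/0.1');
    my $response = $ua->get('http://www.example.com/');    # placeholder
    die $response->status_line, "\n" unless $response->is_success;

    my $html   = $response->decoded_content;
    my $parser = HTML::TokeParser->new(\$html);

    # Visit each <a> start tag; skip links without an href attribute.
    while (my $tag = $parser->get_tag('a')) {
        my $href = $tag->[1]{href} or next;
        my $text = $parser->get_trimmed_text('/a');
        print "$text -> $href\n";
    }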
As a bonus, the authors take you into some immensely useful hidden corners of the Web. For example, they show how to spider Lexical Freenet, which displays word relationships like puns, rhymes, concepts, relevant people, antonyms, and so forth. Add a simple spider, and you’ve built a truly amazing tool for any writer, librarian, or researcher.
There’s also more than a little fun here (for example, a spider that captures song lyrics and forwards them to a text-to-speech site that creates a .WAV file -- think of it as Robot Karaoke). So use your imagination: what do you want to spider today?

Bill Camarda
Bill Camarda is a consultant, writer, and web/multimedia content developer. His 15 books include Special Edition Using Word 2000 and Upgrading & Fixing Networks for Dummies, Second Edition.