Hacking RSS and Atom
By Leslie M. Orchard
John Wiley & Sons
ISBN: 0-7645-9758-2
Chapter One
Getting Ready to Hack
In This Chapter
* Taking a Crash Course in RSS and Atom Feeds
* Gathering Tools
What are RSS and Atom feeds? If you're reading this, it's pretty likely you've already seen links to feeds (things such as "Syndicate this Site" or the ubiquitous orange-and-white "RSS" buttons) starting to pop up on all of your favorite sites. In fact, you might already have secured a feed reader or aggregator and stopped visiting most of your favorite sites in person. The bookmarks in your browser have started gathering dust since you stopped clicking through them every day. And, if you're like some feed addicts, you're keeping track of what's new from more Web sites and news sources than you ever have before, or even thought possible.
If you're a voracious infovore like me and this story doesn't sound familiar, you're in for a treat. RSS and Atom feeds (collectively known as syndication feeds) are behind one of the biggest changes to sweep across the Web since the invention of the personal home page. These syndication feeds make it easy for machines to surf the Web, so you don't have to.
So far, syndication feed readers won't actually read or intelligently digest content on the Web for you, but they will let you know when there's something new to peruse and can collect it in an inbox, like email.
In fact, these feeds and their readers layer the Web with features not altogether different from email newsletters and Usenet newsgroups, but with much more control over what you receive and none of the spam. With the time you used to spend browsing through bookmarked sites checking for updates, you can now just get straight to reading new stuff presented directly. It's almost as though someone is publishing a newspaper tailored just for you.
From the publishing side of things, when you serve up your messages and content using syndication feeds, you make it so much easier for someone to keep track of your updates, and so much more likely that they will stay in touch, because once someone has subscribed to your feed, it's practically effortless to stay tuned in. As long as you keep pushing out things worthy of an audience's attention, syndication feeds make it easier to slip into their busy schedules and stay there.
Furthermore, the way syndication feeds slice up the Web into timely capsules of microcontent allows you to manipulate, filter, and remix streams of fluid online content in a way never seen before. With the right tools, you can work toward applications that help you digest content more cleverly and sift through the firehose of information available. You can gather resources and collectively republish, acting as the editorial newsmaster of your own personal news wire. You can train learning machines to filter for items that match your interests. And the possibilities offered by syndication will only expand as new kinds of information and new types of media are carried and referenced by feed items.
But that's enough gushing about syndication feeds. Let's get to work figuring out what these things are, under the hood, and how you can actually do some of the things promised earlier.
Taking a Crash Course in RSS and Atom Feeds
If you're already familiar with all the basics of RSS and Atom feeds, you can skip ahead to the section "Gathering Tools" later in this chapter. But, just in case you need to be brought up to speed, this section takes a quick tour of feed consumers, feed producers, and the basics of feed anatomy.
Catching Up with Feed Readers and Aggregators
One of the easiest places to start with an introduction to syndication feeds is with feed aggregators and readers, because the most visible results of feeds start there. Though you will be building your own aggregator soon enough, having some notion of what sorts of things other working aggregators do can certainly give you some ideas. It also helps to have other aggregators around as a source of comparison once you start creating some feeds.
For the most part, you'll find feed readers fall into categories such as the following:
* Desktop newscasts, headline tickers, and screensavers
* Personalized portals
* Mixed reverse-chronological aggregators
* Three-pane aggregators
Though you're sure to find many more shapes and forms of feed readers, these make a good starting point, and going through them, you can see a bit of the evolution of feed aggregators from heavily commercial and centralized apps to more personal desktop tools.
Desktop Headline Tickers and Screensavers
One of the most common buzzwords heard in the mid-1990s dot-com boom was "push." Microsoft introduced an early form of syndication feeds called Channel Definition Format (or CDF) and incorporated CDF into Internet Explorer in the form of Active Channels. These were managed from the Channel Bar, which contained selections from many commercial Web sites and online publications.
A company named PointCast, Inc., offered a "desktop newscast" that featured headlines and news on the desktop, as well as an animated screensaver populated with news content pulled from commercial affiliates and news wires. Netscape and Marimba teamed up to offer Netcaster, which provided many features similar to PointCast and Microsoft's offerings but used different technology to syndicate content.
These early feed readers emphasized mainly commercial content providers, although it was possible to subscribe to feeds published by independent and personal sites. Also, because these aggregators tended to present content with scrolling tickers, screensavers, and big and chunky user interfaces using lots of animation, they were only really practical for use in subscribing to a handful of feeds, maybe fewer than a dozen.
Feed readers of this form are still in use, albeit with less buzz and venture capital surrounding them. They're useful for light consumption of a few feeds, in either an unobtrusive or highly branded form, often in a role more like a desktop accessory than a full-on, attention-centric application. Figure 1-1 offers an example of such an accessory from the K Desktop Environment project, named KNewsTicker.
Personalized Portals
Although not quite as popular or common as they used to be, personalized portals were one of the top buzzworthy topics competing for interest with "push" technology back before the turn of the century. In the midst of the dot-com days, Excite, Lycos, Netscape, Microsoft, and Yahoo! were all players in the portal industry, and a Texas-based fish-processing company named Zapata even turned itself into an Internet startup, buying up a swath of Web sites to get into the game.
The idea was to pull together as many useful services and as much attractive content as possible into one place, which Web surfers would ideally use as their home page. This resulted in modular Web pages, with users able to pick and choose from a catalog of little components containing, among other things, headline links syndicated from other Web sites.
One of the more interesting contenders in this space was the My Netscape portal offered by, of course, Netscape. My Netscape was one of the first services to offer support for RSS feeds in their first incarnations. In fact, the original specification defining the RSS format in XML was drafted by team members at Netscape and hosted on their corporate Web servers.
Portals, with their aggregated content modules, are more information-dense than desktop tickers or screensavers. Headlines and resources are offered more directly, with less branding and presentation than with the previous "push" technology applications. So, with less window-dressing to get in the way, users can manageably pull together even more information sources into one spot.
The big portals aren't what they used to be, though, and even My Netscape has all but backed away from being a feed aggregator. However, feed aggregation and portal-like features can still be found on many popular community sites, assimilated as peripheral features. For example, the nerd news site Slashdot offers "slashbox" modules in a personalizable sidebar, many or most drawn from syndication feeds (see Figure 1-2).
Other Open Source Web community packages, such as Drupal (drupal.org) and Plone (plone.org), offer feed headline modules similar to those of the classic portals. But although you could build and host a portal-esque site just for yourself and friends, this form of feed aggregation still largely appears either on niche and special-interest community sites or on commercial sites aiming to capture surfers' home page preferences for marketing dollars.
In contrast, however, the next steps in the progression of syndication feed aggregator technology led to some markedly more personal tools.
Mixed Reverse-Chronological Aggregators
Wow, that's a mouthful, isn't it? "Mixed reverse-chronological aggregators." It's hard to come up with a more concise description, though. Maybe referring to these as "blog-like" would be better. These aggregators are among the first to treat syndication feeds as fluid streams of content, subject to mixing and reordering. The result, by design, is something not altogether unlike a modern blog. Content items are presented in order from newest to oldest, one after the other, all flowed into the same page regardless of their original sources.
And, just as important, these aggregators are personal aggregators. Radio UserLand from UserLand Software was one of the first of this form of aggregator (see Figure 1-3). Radio was built as a fully capable Web application server, yet it's intended to be installed on a user's personal machine. Radio allows the user to manage his or her own preferences and list of feed subscriptions, to be served up to a Web browser of choice from its own private Web server (see Figure 1-4).
The Radio UserLand application stays running in the background, and about once an hour it fetches and processes each subscribed feed from its respective Web site. New feed items that Radio hasn't seen before are stored away in its internal database. The next time the news aggregation page is viewed or refreshed, the newest found items appear in reverse-chronological order, with the freshest items first on the page.
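To make that fetch, store, and display cycle a little more concrete, here is a minimal sketch in Python. It is not how Radio UserLand is actually implemented; the names FeedItem and ItemStore are hypothetical, an in-memory dictionary stands in for the internal database, and the sketch assumes each item carries a unique identifier (called guid here) and a publication date.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FeedItem:
    """Hypothetical record for one feed item (not a real aggregator's schema)."""
    feed_title: str      # which subscription the item came from
    guid: str            # assumed unique identifier, used to spot items seen before
    title: str
    published: datetime


class ItemStore:
    """In-memory stand-in for an aggregator's internal database."""

    def __init__(self):
        self._items = {}  # guid -> FeedItem

    def add_new(self, fetched):
        """Store only the items not seen before and return them."""
        new_items = []
        for item in fetched:
            if item.guid not in self._items:
                self._items[item.guid] = item
                new_items.append(item)
        return new_items

    def newest_first(self):
        """Every stored item, from all feeds mixed together, freshest first."""
        return sorted(self._items.values(),
                      key=lambda item: item.published,
                      reverse=True)
```

In a scheme like this, a scheduler would call add_new with the freshly fetched items of each subscribed feed about once an hour, and the news page would be rendered from newest_first whenever it is viewed or refreshed.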
So for the first time, with this breed of aggregator, the whole thing lives on your own computer. There's no centralized delivery system or marketing-supported portal: aggregators like these put all the tools into your hands and become real personal tools. In particular, Radio comes not only with publishing tools to create a blog and associated RSS feeds, but also with a full development environment with its own scripting language and data storage, allowing the user-turned-hacker to reach into the tool to customize and extend the aggregator and its workings. After its first few public releases, Radio UserLand was quickly followed by a slew of inspired clones and variants, such as AmphetaDesk (disobey.com/amphetadesk/), but they all shared advances that brought the machinery of feed aggregation to the personal desktop.
And, finally, this form of feed aggregator was even more information-dense than desktop newscasters or portals that came before. Rather than presenting things with entertaining but time-consuming animation, or constrained to a mosaic of on-page headline modules, the mixed reverse-chronological display of feed items could scale to build a Web page as long as you could handle and would keep you constantly up to date with the latest feed items. So, the number of subscribed feeds you could handle was limited only by how large a page your browser could load and your ability to skim, scan, and read it.
Three-Pane Aggregators
This family of feed aggregators builds upon what I consider to be one of the chief advances of Radio UserLand and friends: feeds treated as fluid streams of items, subject to mixing, reordering, and many other manipulations. With the bonds of rigid headline collections broken, content items could now be treated like related but individual messages.
But, whereas Radio UserLand's aggregator recast feed items in a form akin to a blog, other offerings began to look at feed items more like email messages or Usenet postings. So, the next popular form of aggregator takes all the feed fetching and scanning machinery and uses the familiar user interface conventions of mail and newsgroup applications. Figure 1-5, Figure 1-6, Figure 1-7, and Figure 1-8 show some examples.
In this style of aggregator, one window pane displays subscriptions, another lists items for a selected subscription (or group of subscriptions), and the third pane presents the content of a selected feed item. Just like the mail and news readers that inspired them, these aggregators present feed items in a user interface that treats feeds as analogous to newsgroups, mailboxes, or folders. Extending this metaphor further, many of these aggregators have cloned or translated many of the message-management features of email or Usenet clients, such as filtering, searching, archiving, and even republishing items to a blog as analogous to forwarding email messages or crossposting on Usenet.
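The three panes can be thought of as three views over one nested structure: subscriptions contain items, and items contain content. The following Python sketch models just that selection logic, with no real user interface. The class names Item, Subscription, and ThreePaneView are made up for illustration and are not taken from any particular aggregator.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    title: str
    content: str


@dataclass
class Subscription:
    title: str
    items: list = field(default_factory=list)


class ThreePaneView:
    """Hypothetical model of the state behind a three-pane aggregator window."""

    def __init__(self, subscriptions):
        self.subscriptions = subscriptions
        self.selected_feed = None
        self.selected_item = None

    def subscription_pane(self):
        """Pane 1: the list of subscribed feeds."""
        return [sub.title for sub in self.subscriptions]

    def select_feed(self, index):
        self.selected_feed = self.subscriptions[index]
        self.selected_item = None  # choosing another feed clears the content pane

    def item_pane(self):
        """Pane 2: item titles for the selected feed."""
        if self.selected_feed is None:
            return []
        return [item.title for item in self.selected_feed.items]

    def select_item(self, index):
        if self.selected_feed is None:
            raise ValueError("select a feed before selecting an item")
        self.selected_item = self.selected_feed.items[index]

    def content_pane(self):
        """Pane 3: the content of the selected item."""
        return self.selected_item.content if self.selected_item else ""
```

Features such as filtering, searching, or archiving would then be operations on the item lists that feed the second pane, much as they are in a mail client.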
Aggregators from the Future
As the value of feed aggregation becomes apparent to more developers and tinkerers, you'll see an even greater diversity of variations and experiments with how to gather and present feed items. You can already find Web-based aggregators styled after Web email services, other applications with a mix of aggregation styles, and still more experimenting with novel ways of organizing and presenting feed items (see Figure 1-9 and Figure 1-10).
In addition, the content and structure of feeds are changing, encompassing more forms of content such as MP3 audio and calendar events. For these new kinds of content, different handling and new presentation techniques and features are needed. For example, displaying MP3 files in reverse-chronological order doesn't make sense, but queuing them up into a playlist for a portable music player does. Also, importing calendar events into planner software and a PDA makes more sense than displaying them as an email inbox (see Figure 1-11).
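As a rough illustration of the playlist idea, the sketch below pulls MP3 links out of a feed and writes them out as a simple playlist, one URL per line. It assumes an RSS 2.0 feed whose items carry enclosure elements with a type of audio/mpeg. The sample feed, the example.com URLs, and the function name mp3_playlist are invented for this example, and playing the oldest episode first is a design choice of the sketch rather than anything a feed requires.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed; a real aggregator would fetch this from the Web.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Audio Feed</title>
    <item>
      <title>Episode 2</title>
      <enclosure url="http://example.com/ep2.mp3" length="2048" type="audio/mpeg"/>
    </item>
    <item>
      <title>Text-only post</title>
    </item>
    <item>
      <title>Episode 1</title>
      <enclosure url="http://example.com/ep1.mp3" length="1024" type="audio/mpeg"/>
    </item>
  </channel>
</rss>"""


def mp3_playlist(feed_xml):
    """Return playlist text: one MP3 enclosure URL per line, oldest first."""
    root = ET.fromstring(feed_xml)
    urls = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        # Skip items with no enclosure or with a non-MP3 attachment.
        if enclosure is not None and enclosure.get("type") == "audio/mpeg":
            urls.append(enclosure.get("url"))
    # Assumption: the feed lists newest items first, so reverse for playback.
    urls.reverse()
    return "".join(url + "\n" for url in urls)
```

Saving the returned text to a file with an .m3u extension gives a basic playlist that many music players can open.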
The trend for feed aggregators is to continue to become even more personal, with more machine smarts and access from mobile devices. Also in the works are aggregators that take the form of intermediaries and routers, aggregating from one set of sources for the consumption of other aggregators: feeds go in, feeds come back out. Far removed from the top-heavy centralized models of managed desktop newscasts and portal marketing, feeds and aggregators are being used to build a layer of plumbing on top of the existing Web, through which content and information filter and flow into personal inboxes and news tools.
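The "feeds go in, feeds come back out" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name merge_feeds is hypothetical, the inputs are assumed to be RSS 2.0 documents, and every item is assumed to have a pubDate in the usual RFC 822 style with an explicit time zone such as GMT.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def merge_feeds(feed_xmls, title, link, description):
    """Combine the items of several RSS 2.0 feeds into one new RSS 2.0 feed."""
    items = []
    for feed_xml in feed_xmls:
        items.extend(ET.fromstring(feed_xml).iter("item"))

    # Newest first, judged by each item's pubDate (assumed present and zoned).
    items.sort(key=lambda item: parsedate_to_datetime(item.findtext("pubDate")),
               reverse=True)

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    channel.extend(items)  # reuse the parsed item elements unchanged
    return ET.tostring(rss, encoding="unicode")
```

Because the output is itself a feed, it can be handed to any other aggregator, or to another round of merging and filtering.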
Checking Out Feed Publishing Tools
There aren't as many feed publishing tools as there are tools that happen to publish feeds. For the most part, syndication feeds have been the product of an add-on, plug-in, or template used within an existing content management system (CMS). These systems (which range from multimillion-dollar enterprise packages to personal blogging tools) can generate syndication feeds from current content and articles right alongside the human-readable Web pages listing the latest headlines.
However, as the popularity and usage of syndication feeds have increased, more feed-producing tools have come about. For example, not all Web sites publish syndication feeds. So, some tinkerers have come up with scripts and applications that "scrape" existing pages intended for people, extract titles and content from those pages, and republish that information in the form of machine-readable syndication feeds, thus allowing even sites lacking feeds to be pulled into your personal subscriptions.
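A scraper of this kind can be sketched with nothing but the Python standard library. The example below assumes, purely for illustration, that the target page marks each headline as a link inside an h2 heading; real pages vary widely, so the extraction rule is the part you would adapt. The names HeadlineScraper and scrape_to_rss are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class HeadlineScraper(HTMLParser):
    """Collects (title, link) pairs from <a> tags found inside <h2> headings."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._in_heading = False
        self._href = None   # link currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_heading = True
        elif tag == "a" and self._in_heading:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.headlines.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "h2":
            self._in_heading = False


def scrape_to_rss(html, site_title, site_link):
    """Turn a page's headline links into a minimal RSS 2.0 document."""
    scraper = HeadlineScraper()
    scraper.feed(html)
    scraper.close()

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_link
    ET.SubElement(channel, "description").text = "Headlines scraped from " + site_link
    for title, link in scraper.headlines:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")
```

Run on a schedule against a page that lacks a feed, something like this yields a document that a feed reader can subscribe to, though it will break whenever the page layout changes.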
Excerpted from Hacking RSS and Atom by Leslie M. Orchard. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.