
Perl for Web Site Management

by John Callender, Linda Mui (Editor)

Checking links, batch editing HTML files, tracking users, and writing CGI scripts--these are the often tedious daily tasks that can be done much more easily with Perl, the scripting language that runs on almost all computing platforms. If you're more interested in streamlining your web activities than in learning a new programming language, Perl for Web Site Management is for you: it's not so much about learning Perl as it is about using Perl to do common web chores more efficiently. The secret is that, although becoming a Perl expert may be hard, most Perl scripts are relatively simple. Using Perl and other open source tools, you'll learn how to:

  • Incorporate a simple search engine
  • Write a simple CGI gateway
  • Convert multiple text files into HTML
  • Monitor log files
  • Track users as they navigate your site
Even if you don't have any programming background, this book will get you quickly past Perl's seemingly forbidding barrier of chops and chomps, execs and elsifs. You'll be able to put an end to using clunky tools, editing files tediously by hand, or relying on programmers and system administrators to do "the hard stuff" for you. Sure, you might learn a little bit about programming as well, and perhaps something about the role of open source tools on the Web. But the purpose of Perl for Web Site Management isn't to educate you--it's to empower you. Whether you're a developer, a designer, or simply a dabbler on the Web, this book is the plain-English, hands-on introduction to Perl you've been waiting for.

Editorial Reviews

The Barnes & Noble Review
Responsible for a web site? You may be following a very familiar trajectory. You never thought about becoming a programmer. But one day you need to publish a form, and you want the contents of your filled-out forms to be automatically emailed to you. You need a CGI script -- and there's nobody around to write one except you-know-who.

Perl for Web Site Management is the solution. This gentle Perl introduction focuses specifically on the skills most valuable to web pros. By Chapter 3, you're creating mail gateways using the magical CGI.pm module. By Chapter 5, you're parsing existing data to generate hundreds of web pages at once. Your Perl skills are already good enough to clean "dirty data," generate pages organized into categories, and automate the creation of a home page, links and all.

And you've barely started. You'll transform your log files into information you can actually understand and use. Identify dead links before they madden your visitors. Use SWISH-E to provide fast, powerful site search. Monitor your search engine positioning. Even use templates to rewrite your whole site at once. In short, you'll solve real problems, save oodles of time, write real programs. Yes, you. (Bill Camarda)

Bill Camarda is a consultant, writer, and web/multimedia content developer with nearly 20 years' experience in helping technology companies deploy and market advanced software, computing, and networking products and services. He served for nearly ten years as vice president of a New Jersey–based marketing company, where he supervised a wide range of graphics and web design projects. His 15 books include Special Edition Using Word 2000 and Upgrading & Fixing Networks For Dummies®, Second Edition.

Product Details

O'Reilly Media, Incorporated
Product dimensions: 7.00(w) x 9.19(h) x 1.05(d)

Read an Excerpt

Chapter 8: Parsing Web Access Logs

Web server access logs are an excellent source of information about what your site's visitors are up to. The information on separate visitors is all mixed together, though, and for all but the smallest sites the raw access logs are too large to read directly. What you need is log-analysis software to make the information in the log more easily accessible. You can buy commercial log-analysis software to do this, but Perl makes it easy to write your own. The next three chapters describe how to build such a home-grown log-analysis tool.

This chapter focuses on the first part of the process: extracting and storing the information we're interested in. We talk about log file structure, converting IP addresses, and creating regular expressions capable of parsing web access logs. We also talk about creating a suitable data structure for storing the extracted data, so we can answer interesting questions about what our site's visitors have been doing. Along the way we discuss the difficulty of identifying those visitors in the web server's log entries and devise an approach for extracting at least an approximate version of that information.

The example continues in Chapter 9, which focuses on how to do computations involving dates and times, and finishes in Chapter 10, which covers the specifics of how we manipulate the "visit" information from our logs, as well as the actual output of the finished report.

Log File Structure

Most web servers store their access log in what is called the "common log format." Each time a user requests a file from the server, a line containing the following fields is added to the end of the log file:

host
This is either the IP address or the corresponding hostname (like pm9-31.sba1.avtel.net) of the remote user requesting the page. For performance reasons, many web servers are configured not to do hostname lookups on the remote host. This means that all you end up with in the log file is a bunch of IP addresses. A bit later in this chapter, you'll develop a Perl script that you can use to convert those IP addresses into hostnames.

identd result
This is a field for logging the response returned by the remote user's identd server. Almost no one actually uses this; in every web log I've ever seen, this field is always just a dash (-).

authenticated username
If you are using basic HTTP authentication (which we'll be talking about in Chapter 19) to restrict access to some of your web documents, this is where the username of the authenticated user for this transaction will be recorded. Otherwise, it will be just a dash (-).

date and time
Next comes a date and time string inside square brackets, like: [06/Jul/1999:00:09:12 -0700]. That's the day of the month, the abbreviated month name, and the four-digit year, all separated by slashes. Next come the time (expressed in 24-hour format, so 11:30 P.M. would be 23:30:00) and a time-zone offset (in this example, -0700, because the web server this log was from was using Pacific Daylight Time, which is seven hours behind Universal Time/Greenwich Mean Time).

request
This is the actual request sent by the remote user, enclosed in double quotes. Normally it will look something like: "GET / HTTP/1.0". The GET part means it is a GET request (as opposed to a POST or a HEAD request). The next part is the path of the URL requested; in this case, the default page in the server's top-level directory, as indicated by a single slash (/). The last part of the request is the protocol being used, at the time of this writing typically HTTP/1.0 or HTTP/1.1.

status code
This is the status code returned by the server; by definition this will be a three-digit number. A status code of 200 means everything was handled okay, 304 means the document has not changed since the client last requested it, 404 means the document could not be found, and 500 indicates that there was some sort of server-side error. (More detail on the various status codes can be found in RFC 1945, which describes the HTTP/1.0 protocol. See http://www.w3.org/Protocols/rfc1945/rfc1945.)

bytes sent
The amount of data returned by the server, not counting the header line.

An extended version of this log format, often referred to as the "combined" format, includes two additional fields at the end:

referer
The referring page, if any, as reported by the remote user's browser. Note that referer is consistently misspelled (with a single "r" in the middle) in the HTTP specification, and in the name of the corresponding environment variable.

user agent
The user agent reported by the remote user's browser. Typically, this is a string describing the type and version of browser software being used.

Assuming you have control over your web server's configuration, or can get your ISP to modify it for you, the combined format's extra fields can provide some very interesting information about the users visiting your site. The log analysis script described in this chapter will work with either format, however.

Converting IP Addresses

Before we jump into log-file analysis, let's return briefly to the problem of doing hostname lookups on the IP addresses that most likely comprise the "host" entries in our web access logs. Example 8-1 gives a script, clf_lookup.plx, that does just that. (Like all the examples in this book, it is available for download from the book's web site, at http://www.elanus.net/book/.)

Example 8-1: A script to do hostname lookups on IP addresses in web access logs

#!/usr/bin/perl -w
# clf_lookup.plx
# given common or extended-format web logs on STDIN, outputs
# them with numeric IP addresses in the first (host) field converted
# to hostnames (where possible).

use strict;
use Socket;

my %hostname;
while (<>) {
    my $line = $_;
    my ($host, $rest) = split / /, $line, 2;
    if ($host =~ /^\d+\.\d+\.\d+\.\d+$/) {
        # looks vaguely like an IP address
        unless (exists $hostname{$host}) {
            # no key, so haven't processed this IP before
            $hostname{$host} = gethostbyaddr(inet_aton($host), AF_INET);
        }
        if ($hostname{$host}) {
            # only process IPs with successful lookups
            $line = "$hostname{$host} $rest";
        }
    }
    print $line;
}

The script itself is pretty simple, but it introduces some new concepts that are definitely worth learning about. The first new thing is this line:

use Socket;

Here we are importing a module called Socket.pm. Just as we did earlier, when we pulled in the CGI.pm module, we're doing this in order to let some more experienced programmers do our dirty work for us. Specifically, the use Socket declaration in this script means we'll be able to do DNS lookups (converting numeric IP addresses to hostnames) using just a few lines of code.

Thousands of Perl modules are available. Some are distributed as part of the Perl language itself; these are usually referred to as being in the standard distribution, or as standard modules. (CGI.pm and Socket.pm are in the standard distribution.) Others can be found at CPAN, the Comprehensive Perl Archive Network, which we'll be learning more about in Chapter 11. If you can't wait until then, though (which I can totally understand, CPAN being something like the world's biggest toy store for a Perl programmer), see the accompanying sidebar, "Using CPAN," for details on how you can jump the gun and start exploring CPAN on your own.

Using CPAN

CPAN, the Comprehensive Perl Archive Network, is the official place to (among other things) get Perl modules that are not included in the standard distribution (that is, that are not distributed automatically along with all recent versions of the language). The hardest part about dealing with CPAN, at least for a beginning programmer, is that it is so extensive. With user contributions from all over the world, it has grown like kudzu, spreading organically in all directions, defying efforts to organize its contents usefully for anyone unwilling to spend a significant amount of time studying it.

Of course, if you are spending much time at all programming with Perl, the time spent learning what's in CPAN will be repaid many times over by the time you save using other people's code to perform common tasks rather than reinventing the wheel.

In any event, the following resources will help you get started with CPAN:


  • The top-level overview of what's in CPAN, with links to more-specific starting points
  • The top-level page within the modules portion of CPAN, with pointers to various views of the modules
  • A long, annotated list of all the modules in CPAN
  • The CPAN search engine

The next thing in clf_lookup.plx is a my variable declaration for the %hostname hash. This is going to be used to cache hostname lookups while the script is running. That way, each IP address will have to be looked up only once rather than every time it appears in the log. It is important to initialize the %hostname hash out here, before the while loop that actually processes each line from the log file, because putting my %hostname inside the loop block would make it so that a new copy of the hash was created each time through the loop.

Let's get to the loop now. The beginning of the loop takes the form:

while (<>) {

Here we're beginning a while loop, which you'll remember means we're going to run a block of code repeatedly as long as whatever is inside those parentheses evaluates to a true value. But what a weird thing we've got inside that logical test. It looks somewhat like the angle-input operator we use to read lines from a filehandle, but there's no filehandle inside it.

What the <> (which is sometimes called the diamond operator) is doing is this: it looks at the @ARGV array (which you'll remember from Chapter 4 is the special variable holding your script's command-line arguments) and assumes that those arguments represent the names of one or more text files. The <> operator then returns the text from those files, one line at a time, so you can work with those lines in the body of your while block. It keeps feeding you lines of text until it has exhausted all of the first file mentioned in @ARGV, then goes on to the second file, and so on, until it has exhausted all the files mentioned in @ARGV. After the last line from the last file has been delivered, it returns undef (the undefined value), which is false, ending the loop.

You get an interesting extra feature with the <> operator. If you don't give your script any command-line arguments, such that there are no files mentioned in @ARGV, <> instead will read from standard input (that is, from the STDIN filehandle your script gets by default when it is started up). This in turn lets you do cool things like use your script in a shell pipeline to process the input or output for another program. In fact, we'll be using that feature with this script a little later.

Where does the <> operator put each line of text as it is working its way through the files mentioned in @ARGV? In the special scalar variable $_. As I mentioned previously, many of Perl's operators and functions are designed to work with $_ by default, and this ends up being really handy, because it lets you write certain common operations very quickly.

In this case, though, we're going to go ahead and stick the contents of $_ into something a little more memorable. That happens in the next line:

my $line = $_;

Next comes the following:

my($host, $rest) = split / /, $line, 2;
if ($host =~ /^\d+\.\d+\.\d+\.\d+$/) {
    # looks vaguely like an IP address

Here you are using the split function to take the current line from the log file and separate it into everything before the first space character (which goes into the scalar variable $host) and everything after the first space character (which goes into $rest). This takes advantage of an optional third argument to the split function, with that argument being a number telling split how many fields to split the string into (in this case, two, because we don't need to keep splitting once we've split off the first field).

Next comes an if statement with a regular expression in the logical test. With your new understanding of regular expressions it should be pretty easy to decipher the meaning of /^\d+\.\d+\.\d+\.\d+$/: it matches a string consisting of four sets of one or more numbers each, separated by periods. This is not the exact same thing as an IP address (in which the component numbers can fall only within a certain range); this pattern is naïve, in that it would accept as IP addresses things like 98765.1234.1.1, but it's close enough for our current purpose....

Meet the Author

John Callender is an independent consultant specializing in web development. He has been a teacher, writer, editor, and network administrator.

Customer Reviews


Perl for Web Site Management: 5 out of 5 based on 1 review.
Guest · More than 1 year ago
If you are just beginning to learn Perl and want to know where to go after O'Reilly's Learning Perl, this is the book for you. There is nothing like seeing Perl do useful things, AND understanding how it works to get excited about using a new programming language. Perl, being such a great first language, is useful right away. The author's clear and amusing style allows for easy reading and quick results. Wait until you complete chapter six and watch as directories fill up with well-formed HTML pages generated from multiple text files and you'll be hooked on this book, and Perl too! Geared towards a beginner or mid-level programmer with lots of useful code samples. A very good book.