Data Munging with Perl


by David Cross

Techniques for using Perl to recognize, parse, transform, and filter data.



Editorial Reviews

Doug Nickerson
I have been reading several introductory Perl books recently and thought Data Munging with Perl, by David Cross, looked like a good second Perl book. After all, what the author calls data munging (reading and writing data, converting data from one format to another) is firmly within the computing mainstream. And the Web has not given us any less data or any fewer data formats to deal with.

But what is "munging"? Perl's interpreted nature and obscure-looking syntax appeal to me. Perhaps it is this same bent that causes me veritable excitement upon reading the assertion in Chapter 2, that "most data munging tasks look like: read input, Munge (or process), write output."

In a day in which the prominent development approach consists of objects whose interactions are not known until run time, and where a hot development technique more resembles a new variety of computing buddy system, this assertion has an appealing historical ring to it. This is something COBOL programmers and programmers coding UNIX filters can agree upon.

Cross clearly believes that fiddling with data and converting among formats are still important in the life of the working programmer, and that Perl is the language for the task.

For instance, UNIX-style filters are high on the list of techniques Cross recommends for munging. Other tips are "don't throw anything away" (sometimes it pays to read in more data than you currently need), design your data structures well in the beginning, and "don't do too much processing in the input routine." The third means to leave something for the munging (processing) routine to do.

Data Munging with Perl's 12 chapters address progressively more "interesting" types of data, with strategies for dealing with each. Each section includes an introductory rationale followed by examples in Perl.

Cross sometimes refines these examples. My favorite section, "How Not To Parse HTML" in Chapter 8 (by annihilating everything between "<" and ">"), is followed in Chapter 9 by several Perl add-in modules that do the parsing for you.

Techniques are also presented from simple to complex. Reading data line-by-line into an array of strings or array of hashes may work for record-oriented data (Chapter 6), whereas parsing with an extension module (XML::Parser) works better for data with strict idiosyncratic structure.

Chapter 7 provides a discussion on reading binary with read() and unpack() that I found useful.

I also like the explanation of regular expressions in Chapter 4, which starts with simple examples that become more all-encompassing. For example, /regular expression/ matches "regular expression" and /[a-z]/ matches the lowercase letters.

The same goes for the explanation of parsers (page 159): tokens, rules, grammars, top-down and bottom-up parsing. This is lucid stuff for such a short space and a topic that has so much theory attached to it.

However, Data Munging with Perl suffers from the "same page syndrome."

While writing a Java book, I was beset by the urge (which I now think misguided) to write the "short history of programming languages" in an introductory chapter. You know the stuff: Languages evolved from ML to assembler, which gave way to high-level languages. My editor averred, "That's fine to include, Doug, just so everyone is on the same page."

This attitude stems from the goal of publishers to sell a book to a cross-over audience such as Intermediate to Advanced. The result is that there are conservatively hundreds of computer books on the market that say the same things.

And this syndrome also subjects readers to some strange contradictions. On page 139 of Data Munging, you have the author introducing ASCII text, what it is and how it takes more space to store than the same data in binary. But the Perl examples in this book can only be understood by a veteran. If I can read the Perl in this book without help, why would I not know about ASCII already?

But this minor flaw only annoyed me a little, making Data Munging a bit wordier than it might have been. With its narrow focus on a language of current interest, this book does not quite rise to the level of "Software Tools," but it still shows some good Perl programming, and provides convincing evidence of the value of data structures beyond the halls of academia along the way.

Pikes Peak Perl Mongers
I found the sample problems and the author's solutions to be very well done. I especially liked the design tips...
Well worth the price, and a good starting point for more advanced forays.
ACM Computing Reviews
A very good resource for programmers who want to learn more about data parsing, data filters, and data conversion...
Web Techniques
"The book's chapters are concise, the coverage is comprehensive, and the examples are plentiful and relevant. I've been using Perl's data munging capabilities heavily for many years, and I still picked up some useful new insights from Cross' book."
"Coders looking to transform data somehow and hackers who want to take advantage of Perl's unique features will improve their knowledge and understanding. If you find yourself working with files or records in Perl, this book will save you time and trouble."
"Munging" is a computer term referring to the process of data conversion. Perl is particularly well suited to data munging and this programmer's guide provides advice on how to most efficiently manipulate data using Perl. After the manipulation of unstructured, record-oriented, fixed-width, and binary data is explored, the work moves into the realms of hierarchical data structures and parsers such as HTML and XML parsing tools. Finally, a demonstration of how to write one's own parsers for data structures is provided. Annotation c. Book News, Inc., Portland, OR (

Product Details

Manning Publications Company
Product dimensions: 7.36(w) x 9.23(h) x 0.65(d)

Read an Excerpt

Chapter 1: Data, Data Munging, and Perl

1.3 Where does data come from? Where does it go?

As we saw in the previous section, the point of data munging is to take data in one format, carry out various transformations on it, and write it out in another format. Let's take a closer look at where the data might come from and where it might go.

First a bit of terminology. The place that you receive data from is known as your data source. The place where you send data to is known as your data sink.

Sources and sinks can take a number of different forms. Some of the most common ones that you will come across are:

  • Data files
  • Databases
  • Data pipes

Let's look at these data sources and sinks in more detail.

1.3.1 Data files

Probably the most common way to transfer data between systems is in a file. One application writes a file. This file is then transferred to a place where your data munging process can pick it up. Your process opens the file, reads in the data, and writes a new file containing the transformed data. This new file is then used as the input to another application elsewhere.
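The open, read, transform, write cycle described above can be sketched in a few lines of Perl. This is only a minimal illustration, not code from the book: the helper name munge_file and the idea of passing the transformation as a code reference are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the book): read $in_path line by line,
# pass each line through $munge (a code reference), and write the
# results to $out_path.
sub munge_file {
    my ($in_path, $out_path, $munge) = @_;

    open my $in,  '<', $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>', $out_path or die "Cannot write $out_path: $!";

    while (my $line = <$in>) {
        chomp $line;                          # drop the trailing newline
        print {$out} $munge->($line), "\n";   # write the transformed line
    }

    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}
```

A call such as munge_file('in.txt', 'out.txt', sub { uc $_[0] }) would write an upper-cased copy of the input; the file names here are placeholders.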

Data files are used because they represent the lowest common denominator between computer systems. Just about every computer system has the concept of a disk file. The exact format of the file will vary from system to system (even a plain ASCII text file has slightly different representations under UNIX and Windows) but handling that is, after all, part of the job of the data munger.
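As a small illustration of the UNIX and Windows difference mentioned above: UNIX text files end each line with a line feed, while Windows text files end each line with a carriage return followed by a line feed. A munging script running on UNIX that receives a Windows file may therefore want to strip both forms. The helper name strip_line_ending below is an assumption made for this sketch, not something taken from the book.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the book): remove a trailing line ending,
# whether it is the UNIX form (LF) or the Windows form (CR LF).
sub strip_line_ending {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;   # optional carriage return, then line feed, at end of string
    return $line;
}
```

Plain chomp would leave the carriage return behind in this situation, which is why the sketch uses a substitution instead.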

File transfer methods

Transferring files between different systems is also something that is usually very easy to achieve. Many computer systems implement a version of the File Transfer Protocol (FTP) which can be used to copy files between two systems that are connected by a network. A more sophisticated system is the Network File System (NFS) protocol, in which file systems from one computer can be viewed as apparently local file systems on another computer. Other common methods of transferring files are by using removable media (CD-ROMs, floppy disks, or tapes) or even as a MIME attachment to an email message.

Ensuring that file transfers are complete

One difficulty to overcome with file transfer is the problem of knowing whether a file is complete. You may have a process that sits on one system, monitoring a file system where your source file will be written by another process. Under most operating systems the file will appear as soon as the source process begins to write it. Your process shouldn't start to read the file until it has all been transferred. In some cases, people write complex systems which monitor the size of the file and trigger the reading process only once the file has stopped growing. Another common solution is for the writing process to write another small flag file once the main file is complete and for the reading process to check for the existence of this flag file. In most cases a much simpler solution is also the best: simply write the file under a different name and only rename it to the expected name once it is complete.
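The rename approach might look something like the following on the writing side. This is a sketch under stated assumptions rather than code from the book: the helper name write_then_rename and the ".tmp" suffix are invented for the example, and it assumes the temporary and final names are on the same file system, where rename is normally a single step.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the book): write the data under a
# temporary name, and give the file its expected name only once it is
# complete.
sub write_then_rename {
    my ($final_path, $content) = @_;
    my $tmp_path = "$final_path.tmp";   # assumed naming convention

    open my $out, '>', $tmp_path or die "Cannot write $tmp_path: $!";
    print {$out} $content;
    close $out or die "Cannot close $tmp_path: $!";

    # The reading process only ever looks for $final_path, so it never
    # sees a half-written file.
    rename $tmp_path, $final_path
        or die "Cannot rename $tmp_path to $final_path: $!";
    return;
}
```

The reading process then needs no special logic at all: if the file exists under its expected name, it is complete.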

Data files are most useful when there are discrete sets of data that you want to process in one chunk. This might be a summary of banking transactions sent to an accounting system at the end of the day. In a situation where a constant flow of data is required, one of the other methods discussed below might be more appropriate.

1.3.2 Databases

Databases are becoming almost as ubiquitous as data files. Of course, the term "database" means vastly differing things to different people. Some people who are used to a Windows environment might think of dBase or some similar nonrelational database system. UNIX users might think of a set of DBM files. Hopefully, most people will think of a relational database management system (RDBMS), whether it is a single-user product like Microsoft Access or Sybase Adaptive Server Anywhere, or a full multi-user product such as Oracle or Sybase Adaptive Server Enterprise.

Imposing structure on data

Databases have advantages over data files in that they impose structure on your data. A database designer will have defined a database schema, which defines the shape and type of all of your data objects. It will define, for example, exactly which data items are stored for each customer in the database, which ones are optional and which ones are mandatory. Many database systems also allow you to define relationships between data objects (for example, "each order must contain a customer identifier which must relate to an existing customer"). Modern databases also contain executable code which can define some of your business logic (for example, "when the status of an order is changed to 'delivered,' automatically create an invoice object relating to that order").

Of course, all of these benefits come at a price. Manipulating data within a database is potentially slower than equivalent operations on data files...

Meet the Author

Cross is the owner and managing director of Magnum Solutions, Ltd., an Internet and database consulting firm.
