Data Munging with Perl


Techniques for using Perl to recognize, parse, transform, and filter data.

"Data Munging with Perl" covers the basic paradigms of programming and discusses the many techniques specific to Perl. It examines standard data formats -- such as text, binary, HTML and XML -- before giving tips on creating and parsing new structured data formats. 5 line drawings, 5 tables.

Editorial Reviews

Doug Nickerson
I have been reading several introductory Perl books recently and thought Data Munging with Perl, by David Cross, looked like a good second Perl book. After all, what the author calls data munging (reading and writing data, converting data from one format to another) is firmly within the computing mainstream. And the Web has not given us any less data or any fewer data formats to deal with.

But what is "munging"? Perl's interpreted nature and obscure-looking syntax appeal to me. Perhaps it is this same bent that causes me veritable excitement upon reading the assertion in Chapter 2, that "most data munging tasks look like: read input, Munge (or process), write output."

In a day in which the prominent development approach consists of objects whose interactions are not known until run time, and where a hot development technique more resembles a new variety of computing buddy system, this assertion has an appealing historical ring to it. This is something COBOL programmers and programmers coding UNIX filters can agree upon.

Cross clearly believes that fiddling with data and converting among formats are still important in the life of the working programmer, and that Perl is the language for the task.

For instance, UNIX-style filters are high on the list of techniques Cross recommends for munging. Other tips are "don't throw anything away" (sometimes it pays to read in more data than you currently need), design your data structures well in the beginning, and "don't do too much processing in the input routine." The third means to leave something for the munging (processing) routine to do.

Data Munging with Perl's 12 chapters address progressively more "interesting" types of data, with strategies for dealing with each. Each section includes an introductory rationale followed by examples in Perl.

Cross sometimes refines these examples. My favorite section, "How Not To Parse HTML" in Chapter 8 (by annihilating everything between "<" and ">"), is followed in Chapter 9 by several Perl add-in modules that do the parsing for you.

Techniques are also presented from simple to complex. Reading data line-by-line into an array of strings or array of hashes may work for record-oriented data (Chapter 6), whereas parsing with an extension module (XML::Parser) works better for data with a strict, idiosyncratic structure.

Chapter 7 provides a discussion on reading binary with read() and unpack() that I found useful.

I also like the explanation of regular expressions in Chapter 4, which starts with simple examples that become more all-encompassing. For example, /regular expression/ matches the literal text "regular expression" and /[a-z]/ matches any single lowercase letter.

The same goes for the explanation of parsers (page 159): tokens, rules, grammars, top-down and bottom-up parsing. This is lucid stuff for such a short space and a topic that has so much theory attached to it.

However, Data Munging with Perl suffers from the "same page syndrome."

While writing a Java book, I was beset by the urge (which I now think misguided) to write the "short history of programming languages" in an introductory chapter. You know the stuff: Languages evolved from ML to assembler, which gave way to high-level languages. My editor averred, "That's fine to include Doug, just so everyone is on the same page."

This attitude stems from the goal of publishers to sell a book to a cross-over audience such as Intermediate to Advanced. The result is that there are conservatively hundreds of computer books on the market that say the same things.

And this syndrome also subjects readers to some strange contradictions. On page 139 of Data Munging, you have the author introducing ASCII text: what it is and how it takes more space to store than the same data in binary. But the Perl examples in this book can only be understood by a veteran. If I can read the Perl in this book without help, why would I not know about ASCII already?

But this minor flaw only annoyed me a little, making Data Munging a bit wordier than it might have been. With its narrow focus on a language of current interest, this book does not quite rise to the level of "Software Tools," but it still shows some good Perl programming, and provides convincing evidence of the value of data structures beyond the halls of academia along the way.

Pikes Peak Perl Mongers
I found the sample problems and the author's solutions to be very well done. I especially liked the design tips...
Well worth the price, and a good starting point for more advanced forays.
ACM Computing Reviews
A very good resource for programmers who want to learn more about data parsing, data filters, and data conversion...
Web Techniques
"The book's chapters are concise, the coverage is comprehensive, and the examples are plentiful and relevant. I've been using Perl's data munging capabilities heavily for many years, and I still picked up some useful new insights from Cross' book."
"Coders looking to transform data somehow and hackers who want to take advantage of Perl's unique features will improve their knowledge and understanding. If you find yourself working with files or records in Perl, this book will save you time and trouble."
"Munging" is a computer term referring to the process of data conversion. Perl is particularly well suited to data munging and this programmer's guide provides advice on how to most efficiently manipulate data using Perl. After the manipulation of unstructured, record-oriented, fixed-width, and binary data is explored, the work moves into the realms of hierarchical data structures and parsers such as HTML and XML parsing tools. Finally, a demonstration of how to write one's own parsers for data structures is provided. Annotation c. Book News, Inc., Portland, OR

Product Details

  • ISBN-13: 9781930110007
  • Publisher: Manning Publications Company
  • Publication date: 1/28/2001
  • Pages: 304
  • Product dimensions: 7.36 in. (w) x 9.23 in. (h) x 0.65 in. (d)

Read an Excerpt

Chapter 1: Data, Data Munging, and Perl

1.3 Where does data come from? Where does it go?

As we saw in the previous section, the point of data munging is to take data in one format, carry out various transformations on it, and write it out in another format. Let's take a closer look at where the data might come from and where it might go.

First a bit of terminology. The place that you receive data from is known as your data source. The place where you send data to is known as your data sink.
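As a rough illustration of this terminology (not taken from the book, and sketched in Python rather than Perl purely for illustration), a munging process can be written as a function that reads records from a source, transforms each one, and writes the result to a sink. The function name munge and the in-memory streams below are assumptions made for this sketch.

```python
import io
from typing import Callable, TextIO


def munge(source: TextIO, sink: TextIO, transform: Callable[[str], str]) -> int:
    """Read each line from the source, transform it, and write it to the sink.

    Returns the number of records processed.
    """
    count = 0
    for line in source:
        record = line.rstrip("\n")
        sink.write(transform(record) + "\n")
        count += 1
    return count


# In-memory streams stand in for a real data source and data sink here.
source = io.StringIO("alice\nbob\n")
sink = io.StringIO()
processed = munge(source, sink, str.upper)
```

Because the function only depends on file-like objects, the same code could in principle be pointed at a disk file, a pipe, or standard input and output.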

Sources and sinks can take a number of different forms. Some of the most common ones that you will come across are:

  • Data files
  • Databases
  • Data pipes

Let's look at these data sources and sinks in more detail.

1.3.1 Data files

Probably the most common way to transfer data between systems is in a file. One application writes a file. This file is then transferred to a place where your data munging process can pick it up. Your process opens the file, reads in the data, and writes a new file containing the transformed data. This new file is then used as the input to another application elsewhere.

Data files are used because they represent the lowest common denominator between computer systems. Just about every computer system has the concept of a disk file. The exact format of the file will vary from system to system (even a plain ASCII text file has slightly different representations under UNIX and Windows) but handling that is, after all, part of the job of the data munger.
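To make the UNIX and Windows difference concrete: UNIX text files end each line with a line feed (LF), while Windows text files use a carriage return followed by a line feed (CRLF). The sketch below is not from the book and uses Python only for illustration; the helper name normalize_newlines is an assumption. It shows one simple way a munging process might convert incoming data to UNIX line endings.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF and lone CR line endings to a single LF."""
    # Replace CRLF first so that its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


windows_text = b"name,amount\r\nalice,10\r\n"
unix_text = normalize_newlines(windows_text)
```

Working on bytes rather than decoded text keeps the conversion explicit and avoids any automatic newline translation by the runtime.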

File transfer methods

Transferring files between different systems is also something that is usually very easy to achieve. Many computer systems implement a version of the File Transfer Protocol (FTP) which can be used to copy files between two systems that are connected by a network. A more sophisticated system is the Network File System (NFS) protocol, in which file systems from one computer can be viewed as apparently local file systems on another computer. Other common methods of transferring files are by using removable media (CD-ROMs, floppy disks, or tapes) or even as a MIME attachment to an email message.

Ensuring that file transfers are complete

One difficulty to overcome with file transfer is the problem of knowing if a file is complete. You may have a process that sits on one system, monitoring a file system where your source file will be written by another process. Under most operating systems the file will appear as soon as the source process begins to write it. Your process shouldn't start to read the file until it has all been transferred. In some cases, people write complex systems which monitor the size of the file and trigger the reading process only once the file has stopped growing. Another common solution is for the writing process to write another small flag file once the main file is complete and for the reading process to check for the existence of this flag file. In most cases a much simpler solution is also the best: simply write the file under a different name and only rename it to the expected name once it is complete.
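The write-then-rename approach might look something like the following. This is a minimal sketch, not the book's code, and it is written in Python only for illustration; the function name write_then_rename, the ".partial" suffix, and the sample file name are all assumptions.

```python
import os
import tempfile
from typing import Iterable


def write_then_rename(final_path: str, lines: Iterable[str]) -> None:
    """Write to a temporary name, then rename once the file is complete."""
    partial_path = final_path + ".partial"  # hypothetical naming convention
    with open(partial_path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
    # The file is closed and complete here. On POSIX systems the rename is
    # atomic when both names are on the same file system.
    os.replace(partial_path, final_path)


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "transactions.txt")
write_then_rename(target, ["2001-01-28,100.00", "2001-01-28,-25.50"])
```

A reading process that only ever looks for the final name never sees a half-written file, so no size monitoring or flag file is needed.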

Data files are most useful when there are discrete sets of data that you want to process in one chunk. This might be a summary of banking transactions sent to an accounting system at the end of the day. In a situation where a constant flow of data is required, one of the other methods discussed below might be more appropriate.

1.3.2 Databases

Databases are becoming almost as ubiquitous as data files. Of course, the term "database" means vastly differing things to different people. Some people who are used to a Windows environment might think of dBase or some similar nonrelational database system. UNIX users might think of a set of DBM files. Hopefully, most people will think of a relational database management system (RDBMS), whether it is a single-user product like Microsoft Access or Sybase Adaptive Server Anywhere, or a full multi-user product such as Oracle or Sybase Adaptive Server Enterprise.

Imposing structure on data

Databases have advantages over data files in that they impose structure on your data. A database designer will have defined a database schema, which defines the shape and type of all of your data objects. It will define, for example, exactly which data items are stored for each customer in the database, which ones are optional and which ones are mandatory. Many database systems also allow you to define relationships between data objects (for example, "each order must contain a customer identifier which must relate to an existing customer"). Modern databases also contain executable code which can define some of your business logic (for example, "when the status of an order is changed to 'delivered,' automatically create an invoice object relating to that order").
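The three ideas in this paragraph (mandatory versus optional items, relationships between objects, and executable business logic) can be sketched with a small schema. The example below is an assumption-laden illustration, not the book's code: it uses Python's built-in sqlite3 module with an in-memory SQLite database, and the table, column, and trigger names are invented for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript(
    """
    CREATE TABLE customer (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,   -- mandatory item
        phone TEXT             -- optional item
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        status      TEXT NOT NULL DEFAULT 'new'
    );
    CREATE TABLE invoice (
        id       INTEGER PRIMARY KEY,
        order_id INTEGER NOT NULL REFERENCES orders(id)
    );
    -- Business logic: create an invoice when an order becomes 'delivered'.
    CREATE TRIGGER invoice_on_delivery
    AFTER UPDATE OF status ON orders
    WHEN NEW.status = 'delivered'
    BEGIN
        INSERT INTO invoice (order_id) VALUES (NEW.id);
    END;
    """
)

conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Alice')")
conn.execute("INSERT INTO orders (id, customer_id) VALUES (10, 1)")
conn.execute("UPDATE orders SET status = 'delivered' WHERE id = 10")
invoice_count = conn.execute("SELECT COUNT(*) FROM invoice").fetchone()[0]

# An order that refers to a customer who does not exist should be rejected.
try:
    conn.execute("INSERT INTO orders (id, customer_id) VALUES (11, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Here the database itself, rather than the munging code, refuses the order for the unknown customer and creates the invoice on delivery.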

Of course, all of these benefits come at a price. Manipulating data within a database is potentially slower than equivalent operations on data files...

Table of Contents

Data, Data Munging, and Perl
General Munging Practices
Useful Perl Idioms
Pattern Matching
Unstructured Data
Record-Oriented Data
Fixed-Width and Binary Data
Complex Data Formats
Building Your Own Parsers
Looking Back -- and Ahead