Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools

Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools by Jeffrey E. F. Friedl

Regular expressions are a powerful tool for manipulating text and data. If you don't use them yet, you will discover in this book a whole new world of mastery over your data. If you already use them, you'll appreciate this book's unprecedented detail and breadth of coverage. If you think you know all you need to know about regular expressions, this book is a stunning eye-opener.

With regular expressions, you can save yourself time and aggravation while dealing with documents, mail messages, log files, you name it: any type of text or data. For example, regular expressions can play a vital role in constructing a World Wide Web CGI script, which can involve text and data of all sorts.

Regular expressions are not a tool in and of themselves, but are included as part of a larger utility. The classic example is grep. These days, regular expressions can be found everywhere, such as in:

  • Scripting languages (including Perl, Tcl, awk, and Python)
  • Editors (including Emacs, vi, and Nisus Writer)
  • Programming environments (including Delphi and Visual C++)

While many of these tools originated on UNIX, they are now available for a wide variety of platforms, including DOS/Windows and MacOS, so you can use them in your home environment. Additionally, many favorite programming languages offer regular-expression libraries, so you can include support for them in your own programs, and yes, even applets.

There can be certain subtle, but valuable, ways to think when you're using regular expressions, and these can be taught. Jeffrey Friedl has spent years helping people on the Net understand and use regular expressions. In this book he leads you through the steps of knowing exactly how to craft a regular expression to get the job done.

Regular expressions are not used in a vacuum. In this book, a variety of tools are examined and used in an extensive array of examples, with a major focus on Perl. Perl is extremely well endowed with rich and expressive regular expressions. Yet what is power in the hands of an expert can be fraught with peril for the unwary. This book will help you navigate the minefield to becoming an expert.

Product Details

ISBN-13: 9781565922570
Publisher: O'Reilly Media, Incorporated
Publication date: 01/08/1997
Series: Nutshell Handbooks Series
Edition description: Older Edition
Pages: 368
Product dimensions: 7.05(w) x 9.19(h) x 0.85(d)

About the Author

Jeffrey Friedl was raised in the countryside of Rootstown, Ohio, and had aspirations of being an astronomer until one day he noticed a TRS-80 Model I sitting unused in the corner of the chem lab (bristling with a full 16K of RAM, no less). He eventually began using Unix (and regular expressions) in 1980, and earned degrees in Computer Science from Kent (BS) and the University of New Hampshire (MS). He did kernel development for Omron Corporation in Kyoto, Japan for eight years before moving in 1997 to Silicon Valley to apply his regular-expression know-how to financial news and data for a little-known company called "Yahoo!"

When faced with the daunting task of filling his copious free time, Jeffrey enjoys playing Ultimate Frisbee and basketball with friends at Yahoo!, programming his house, and feeding the squirrels and jays in his back yard. He also enjoys spending time with his wife Fumie, and preparing for the Fall 2002 release of their first "software project" together.

Read an Excerpt

Chapter 4: The Mechanics of Expression Processing

Now that we have some background under our belt, let's delve into the mechanics of how a regex engine really goes about its work. Here we don't care much about the Shine and Finish of the previous chapter; this chapter is all about the engine and the drive train, the stuff that grease monkeys talk about in bars. We'll spend a fair amount of time under the hood, so expect to get a bit dirty with some practical hands-on experience.

Start Your Engines!

Let's see how much I can milk this engine analogy for. The whole point of having an engine is so that you can get from Point A to Point B without doing much work. The engine does the work for you so you can relax and enjoy the Rich Corinthian Leather. The engine's primary task is to turn the wheels, and how it does that isn't really a concern of yours. Or is it?

Two Kinds of Engines

Well, what if you had an electric car? They've been around for a long time, but they aren't as common as gas cars because they're hard to design well. If you had one, though, you would have to remember not to put gas in it. If you had a gasoline engine, well, watch out for sparks! An electric engine more or less just runs, but a gas engine might need some babysitting. You can get much better performance just by changing little things like your spark plug gaps, air filter, or brand of gas. Do it wrong and the engine's performance deteriorates, or, worse yet, it stalls.

Each engine might do its work differently, but the end result is that the wheels turn. You still have to steer properly if you want to get anywhere, but that's an entirely different issue.

New Standards

Let's stoke the fire by adding another variable: the California Emissions Standards. Some engines adhere to California's strict pollution standards, and some engines don't. These aren't really different kinds of engines, just new variations on what's already around. The standard regulates a result of the engine's work, the emissions, but doesn't say one thing or the other about how the engine should go about achieving those cleaner results. So, our two classes of engine are divided into four types: electric (adhering and non-adhering) and gasoline (adhering and non-adhering).

Come to think of it, I bet that an electric engine can qualify for the standard without much change, so it's not really impacted very much - the standard just "blesses" the clean results that are already par for the course. The gas engine, on the other hand, needs some major tweaking and a bit of re-tooling before it can qualify. Owners of this kind of engine need to pay particular care to what they feed it - use the wrong kind of gas and you're in big trouble in more ways than one.

The impact of standards

Better pollution standards are a good thing, but they require that the driver exercise more thought and foresight (well, at least for gas engines, as I noted in the previous paragraph). Frankly, however, the standard doesn't impact most people since all the other states still do their own thing and don't follow California's standard... yet. It's probably just a matter of time.

Okay, so you realize that these four types of engines can be classified into three groups (the two kinds for gas, and electric in general). You know about the differences, and that in the end they all still turn the wheels. What you don't know is what the heck this has to do with regular expressions!

More than you might imagine.

Regex Engine Types

There are two fundamentally different types of regex engines: one called "DFA" (the electric engine of our story) and one called "NFA" (the gas engine). The details follow shortly (page 101), but for now just consider them names, like Bill and Ted. Or electric and gas.

Both engine types have been around for a long time, but like its gasoline counterpart, the NFA type seems to be used more often. Tools that use an NFA engine include Tcl, Perl, Python, GNU Emacs, ed, sed, vi, most versions of grep, and even a few versions of egrep and awk. On the other hand, a DFA engine is found in almost all versions of egrep and awk, as well as lex and flex. Table 4-1 on the next page lists a few common programs available for a wide variety of platforms and the regex engine that most versions use. A generic version means that it's an old tool with many clones; I have listed notably different clones that I'm aware of.

As Chapter 3 illustrated, 20 years of development with both DFAs and NFAs resulted in a lot of needless variety. Things were dirty. The POSIX standard came in to clean things up by specifying clearly which metacharacters an engine should support, and exactly the results you could expect from them. Superficial details aside, the DFAs (our electric engines) were already well suited to adhere to the standard, but the kind of results an NFA traditionally provided were quite different from the new standard, so changes were needed. As a result, broadly speaking, we have three types of regex engines:

  • DFA (POSIX or not; similar either way)
  • Traditional NFA
  • POSIX NFA

POSIX standardized the workings of over 70 programs, including traditional regex-wielding tools such as awk, ed, egrep, expr, grep, lex, and sed. Most of these tools' regex flavors had (and still have) the weak powers equivalent to a moped. So weak, in fact, that I don't find them interesting for discussing regular expressions. Although they can certainly be extremely useful tools, you won't find much mention of expr, ed, and sed in this book. Well, to be fair, some modern versions of these tools have been retrofitted with a more-powerful flavor. This is commonly done to grep, a direct regex sibling of sed, ed, and expr.

On the other hand, egrep, awk, and lex were normally implemented with the electric DFA engine, so the new standard primarily just confirmed the status quo: no big changes. However, there were some gas-powered versions of these programs which had to be changed if they wanted to be POSIX-compliant. The gas engines that passed the California Emission Standards tests (POSIX NFA) were fine in that they produced results according to the standard, but the necessary changes only enhanced their fickleness to proper tuning. Where before you might get by with slightly misaligned spark plugs, you now have absolutely no tolerance. Gasoline that used to be "good enough" now causes knocks and pings. But so long as you know how to maintain your baby, the engine runs smoothly. And cleanly....
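
To make the difference in results concrete, here is a minimal Python sketch (not taken from the book). Python is one of the NFA tools named above, so its `re` module shows the Traditional NFA behavior directly: with alternation, the first alternative that leads to a match wins. The helper `longest_leftmost_match` is a hypothetical stand-in for the result the standard asks for (the longest match at the earliest position); it is not a real DFA or POSIX engine, and it handles literal alternatives only. Both function names are assumptions made for illustration.

```python
import re


def traditional_nfa_match(alternatives, text):
    """Match with Python's backtracking engine: alternatives are tried in order."""
    pattern = "|".join(re.escape(alt) for alt in alternatives)
    found = re.search(pattern, text)
    return found.group(0) if found else None


def longest_leftmost_match(alternatives, text):
    """Hypothetical emulation of the longest-leftmost rule (literal strings only)."""
    for start in range(len(text) + 1):
        # Collect every alternative that matches at this starting position.
        candidates = [alt for alt in alternatives if text.startswith(alt, start)]
        if candidates:
            # Earliest position wins first; among those, the longest text wins.
            return max(candidates, key=len)
    return None


alternatives = ["nfa", "nfa not"]
text = "nfa not"

print(traditional_nfa_match(alternatives, text))   # nfa
print(longest_leftmost_match(alternatives, text))  # nfa not
```

Both approaches agree on where the match starts; they differ only in how much text is matched, which is the kind of result the standard pins down.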

Table of Contents

Why I Wrote This Book
Intended Audience
How to Read This Book
This Book, as a Story
This Book, as a Reference
The Introduction
The Details
Tool-Specific Information
Typographical Conventions
Personal Comments and Acknowledgments
Shoulders to Stand On
Other Thanks
In the Future
Chapter 1. Introduction to Regular Expressions
Solving Real Problems
Regular Expressions as a Language
The Filename Analogy
The Language Analogy
The Regular-Expression Frame of Mind
Searching Text Files: Egrep
Egrep Metacharacters
Start and End of the Line
Character Classes
Matching Any Character -- Dot
Word Boundaries
In a Nutshell
Optional Items
Other Quantifiers: Repetition
Ignoring Differences in Capitalization
Parentheses and Backreferences
The Great Escape
Expanding the Foundation
Linguistic Diversification
The Goal of a Regular Expression
A Few More Examples
Regular Expression Nomenclature
Improving on the Status Quo
Personal Glimpses
Chapter 2. Extended Introductory Examples
About the Examples
A Short Introduction to Perl
Matching Text with Regular Expressions
Toward a More Real-World Example
Side Effects of a Successful Match
Intertwined Regular Expressions
Modifying Text with Regular Expressions
Automated Editing
A Small Mail Utility
That Doubled-Word Thing
Chapter 3. Overview of Regular Expression Features and Flavors
A Casual Stroll Across the Regex Landscape
The World According to Grep
The Times They Are a Changin'
At a Glance
Care and Handling of Regular Expressions
Identifying a Regex
Doing Something with the Matched Text
Other Examples
Care and Handling: Summary
Engines and Chrome Finish
Chrome and Appearances
Engines and Drivers
Common Metacharacters
Character Shorthands
Strings as Regular Expressions
Class Shorthands, Dot, and Character Classes
Grouping and Retrieving
Guide to the Advanced Chapters
Tool-Specific Information
Chapter 4. The Mechanics of Expression Processing
Start Your Engines!
Two Kinds of Engines
New Standards
Regex Engine Types
From the Department of Redundancy Department
Match Basics
About the Examples
Rule 1: The Earliest Match Wins
The "Transmission" and the Bump-Along
Engine Pieces and Parts
Rule 2: Some Metacharacters Are Greedy
Regex-Directed vs. Text-Directed
NFA Engine: Regex-Directed
DFA Engine: Text-Directed
The Mysteries of Life Revealed
A Really Crummy Analogy
Two Important Points on Backtracking
Saved States
Backtracking and Greediness
More About Greediness
Problems of Greediness
Multi-Character "Quotes"
Greediness Always Favors a Match.
Is Alternation Greedy?
Uses for Non-Greedy Alternation
Greedy Alternation in Perspective
Character Classes vs. Alternation
"The Longest-Leftmost"
POSIX and the Longest-Leftmost Rule
Speed and Efficiency
DFA and NFA in Comparison
Practical Regex Techniques
Contributing Factors
Be Specific
Difficulties and Impossibilities
Watching Out for Unwanted Matches
Matching Delimited Text
Knowing Your Data and Making Assumptions
Additional Greedy Examples
Match Mechanics Summary
Some Practical Effects of Match Mechanics
Chapter 5. Crafting a Regular Expression
A Sobering Example
A Simple Change -- Placing Your Best Foot Forward
More Advanced -- Localizing the Greediness
Reality Check
A Global View of Backtracking
More Work for a POSIX NFA
Work Required During a Non-Match
Being More Specific
Alternation Can Be Expensive
A Strong Lead
The Impact of Parentheses
Internal Optimizations
First-Character Discrimination
Fixed-String Check
Simple Repetition
Needless Small Quantifiers
Length Cognizance
Match Cognizance
Need Cognizance
String/Line Anchors
Compile Caching
Testing the Engine Type
Basic NFA vs. DFA Testing
Traditional NFA vs. POSIX NFA Testing
Unrolling the Loop
Method 1: Building a Regex From Past Experiences
The Real "Unrolling the Loop" Pattern
Method 2: A Top-Down View
Method 3: A Quoted Internet Hostname
Unrolling C Comments
Regex Headaches
A Naïve View
Unrolling the C Loop
The Freeflowing Regex
A Helping Hand to Guide the Match
A Well-Guided Regex is a Fast Regex
The Many Twists and Turns of Optimizations
Chapter 6. Tool-Specific Information
Questions You Should Be Asking
Something as Simple as Grep...
In This Chapter
Differences Among Awk Regex Flavors
Awk Regex Functions and Operators
Tcl Regex Operands
Using Tcl Regular Expressions
Tcl Regex Optimizations
GNU Emacs
Emacs Strings as Regular Expressions
Emacs's Regex Flavor
Emacs Match Results
Benchmarking in Emacs
Emacs Regex Optimizations
Chapter 7. Perl Regular Expressions
The Perl Way
Regular Expressions as a Language Component
Perl's Greatest Strength
Perl's Greatest Weakness
A Chapter, a Chicken, and The Perl Way
An Introductory Example: Parsing CSV Text
Regular Expressions and The Perl Way
Perl Unleashed
Regex-Related Perlisms
Expression Context
Dynamic Scope and Regex Match Effects
Special Variables Modified by a Match
"Doublequotish Processing" and Variable Interpolation
Perl's Regex Flavor
Quantifiers -- Greedy and Lazy
String Anchors
Multi-Match Anchor
Word Anchors
Convenient Shorthands and Other Notations
Character Classes
Modification with \Q and Friends: True Lies
The Match Operator
Match-Operand Delimiters
Match Modifiers
Specifying the Match Target Operand
Other Side Effects of the Match Operator
Match Operator Return Value
Outside Influences on the Match Operator
The Substitution Operator
The Replacement Operand
The /e Modifier
Context and Return Value
Using /g with a Regex That Can Match Nothingness
The Split Operator
Basic Split
Advanced Split
Advanced Split's Match Operand
Scalar-Context Split
Split's Match Operand with Capturing Parentheses
Perl Efficiency Issues
"There's More Than One Way to Do It"
Regex Compilation, the /o Modifier, and Efficiency
Unsociable $ and Friends
The Efficiency Penalty of the /i Modifier
Substitution Efficiency Concerns
Regex Debugging Information
The Study Function
Putting It All Together
Stripping Leading and Trailing Whitespace
Adding Commas to a Number
Removing C Comments
Matching an Email Address
Final Comments
Notes for Perl4
Appendix A. Online Information
General Information
Mastering Regular Expressions
O'Reilly & Associates
OAK Archive's Virtual Software Library
The GNU Archive
Other Web Links
C Library Packages
Appendix B. Email Regex Program


This book is about a powerful tool called "regular expressions."

Here, you will learn how to use regular expressions to solve problems and get the most out of tools that provide them. Not only that, but much more: this book is about mastering regular expressions.

If you use a computer, you can benefit from regular expressions all the time (even if you don't realize it). When accessing World Wide Web search engines, with your editor, word processor, configuration scripts, and system tools, regular expressions are often provided as "power user" options. Languages such as Awk, Elisp, Expect, Perl, Python, and Tcl have regular-expression support built in (regular expressions are the very heart of many programs written in these languages), and regular-expression libraries are available for most other languages. For example, quite soon after Java became available, a regular-expression library was built and made freely available on the Web. Regular expressions are found in editors and programming environments such as vi, Delphi, Emacs, Brief, Visual C++, Nisus Writer, and many, many more. Regular expressions are very popular.

There's a good reason that regular expressions are found in so many diverse applications: they are extremely powerful. At a low level, a regular expression describes a chunk of text. You might use it to verify a user's input, or perhaps to sift through large amounts of data. On a higher level, regular expressions allow you to master your data. Control it. Put it to work for you. To master regular expressions is to master your data.
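
The two low-level uses just mentioned, verifying input and sifting through data, might look like the following in Python. This is only a sketch under stated assumptions: the ZIP-code rule, the sample log lines, and the function names `is_valid_zip` and `lines_mentioning` are all invented for illustration and do not come from the book.

```python
import re

# Assumed rule for illustration: five digits, optionally followed by
# a hyphen and four more digits.
ZIP_CODE = re.compile(r"\d{5}(?:-\d{4})?")


def is_valid_zip(user_input):
    """Verify input: the whole string must fit the description."""
    return ZIP_CODE.fullmatch(user_input) is not None


def lines_mentioning(pattern, lines):
    """Sift data: keep only the lines in which the pattern matches somewhere."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


# Made-up sample data.
log = [
    "10:01 login ok",
    "10:02 ERROR disk full",
    "10:03 error: retry",
    "10:04 logout",
]

print(is_valid_zip("44272"))     # True
print(is_valid_zip("4427"))      # False
print(lines_mentioning(r"(?i)\berror\b", log))
```

In both functions the regular expression is just a description of a chunk of text; what the surrounding program does with a match is a separate decision.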

Why I Wrote This Book

You might think that with their wide availability, general popularity, and unparalleled power, regular expressions would be employed to their fullest, wherever found. You might also think that they would be well documented, with introductory tutorials for the novice just starting out, and advanced manuals for the expert desiring that little extra edge.

Sadly, that hasn't been the case. Regular-expression documentation is certainly plentiful, and has been available for a long time. (I read my first regular-expression-related manual back in 1981.) The problem, it seems, is that the documentation has traditionally centered on the "low-level view" that I mentioned a moment ago. You can talk all you want about how paints adhere to canvas, and the science of how colors blend, but this won't make you a great painter. With painting, as with any art, you must touch on the human aspect to really make a statement. Regular expressions, composed of a mixture of symbols and text, might seem to be a cold, scientific enterprise, but I firmly believe they are very much creatures of the right half of the brain. They can be an outlet for creativity, for cunningly brilliant programming, and for the elegant solution.

I'm not talented at anything that most people would call art. I go to karaoke bars in Kyoto a lot, but I make up for the lack of talent simply by being loud. I do, however, feel very artistic when I can devise an elegant solution to a tough problem. In much of my work, regular expressions are often instrumental in developing those elegant solutions. Because it's one of the few outlets for the artist in me, I have developed somewhat of a passion for regular expressions. It is my goal in writing this book to share some of that passion.

Intended Audience

This book will interest anyone who has an opportunity to use regular expressions. In particular, if you don't yet understand the power that regular expressions can provide, you should benefit greatly as a whole new world is opened up to you. Many of the popular cross-platform utilities and languages that are featured in this book are freely available for MacOS, DOS/Windows, Unix, VMS, and more. Appendix A has some pointers on how to obtain many of them.

Anyone who uses GNU Emacs or vi, or programs in Perl, Tcl, Python, or Awk, should find a gold mine of detail, hints, tips, and understanding that can be put to immediate use. The detail and thoroughness is simply not found anywhere else.

Regular expressions are an idea -- one that is implemented in various ways by various utilities (many, many more than are specifically presented in this book). If you master the general concept of regular expressions, it's a short step to mastering a particular implementation. This book concentrates on that idea, so most of the knowledge presented here transcends the utilities used in the examples.
