Mastering Regular Expressions: Powerful Techniques for Perl and Other Toolsby Jeffrey E. F. Friedl
Regular expressions are a powerful tool for manipulating text and data. If you don't use them yet, you will discover in this book a whole new world of mastery over your data. If you already use them, you'll appreciate this book's unprecedented detail and breadth of coverage. If you think you know all you need to know about regular expressions, this book is a
Regular expressions are a powerful tool for manipulating text and data. If you don't use them yet, you will discover in this book a whole new world of mastery over your data. If you already use them, you'll appreciate this book's unprecedented detail and breadth of coverage. If you think you know all you need to know about regular expressions, this book is a stunning eye-opener.With regular expressions, you can save yourself time and aggravation while dealing with documents, mail messages, log files you name it any type of text or data. For example, regular expressions can play a vital role in constructing a World Wide Web CGI script, which can involve text and data of all sorts.Regular expressions are not a tool in and of themselves, but are included as part of a larger utility. The classic example is grep. These days, regular expressions can be found everywhere, such as in:
- Scripting languages (including Perl, Tcl, awk, and Python)
- Editors (including Emacs, vi, and Nisus Writer)
- Programming environments (including Delphi and Visual C++)
Staying Mentally Regular in the World According to Grep
In the forties, Warren McCulluch and Walter Pitts created neuron-level models of how the nervous system operates. The mathematician, Stephen Kleene, later described these models using his mathematical notation called regular expressions. Ken Thompson incorporated that system of notation into qed (the grandfather of the UNIX ed) and eventually into grep. Ever since that time, regular expressions have constantly seeped into UNIX and UNIX-like utilities. These are the regular expressions that are not used to explain nerves; these are ones used to get on them.
Grep, egrep, vi, sed, lex, awk, emacs, Perl, Tcl, and Python support regular expressions. In fact, regular expressions (regexes) are an essential part of these utilities. Unfortunately, regexes usually don't seem very important in the documentation. Man pages only casually mention regexes with usually no theoretical explanation and very limited practical discussion. This leaves most of us crafting regexes like we're playing the board game Battleship -- keep guessing until we sink a solution.
Mastering Regular Expressions is an important work about an often overlooked concept that permeates UNIX: the regular expression. It is clear and practical. The regular expression is explained by concept and not by rote. Friedl is trying to get us to think in regular expressions and not to blindly parrot (or pirate) his examples.
The book uses quick do-you-really-understand questions at key spots in the text. The book is constructed so that you have to turn the current page to see the answer to a question. Unfortunately, this prevents the "accidental" hint. Be honest. It's human nature to have wondering eyes when a question is too hard. If you can't answer a question in this book, then you know for a fact that you can't answer it. It's too hard to "accidentally" turn a page.
That's only one example of its thoughtful design. The typography is the best use of typography as a communication tool that I have seen. With code snippets like "if ($input =~ m/^([-+]?[0-9]+(\.[0-9]*)?)([CF])$/)", the typography takes you by the hand with a series of underlining, shading, and contrived characters that highlight important sections of the code. Whereas in most books the "Typographical Conventions Used" section is fluff, read and memorize this section before you cross the river regex with Friedl. You can't journey to the other side without it.
So, what exactly is a regular expression? What makes it regular? Don't look for those answers here. Friedl gives you practical concepts, not a theoretical framework. He defines regular expressions as "the key to powerful, flexible, and efficient text processing." He clarifies that with "regular expressions...allow you to describe and parse text." That is the closest thing to a definition of regular expressions in the book. That is also like defining the sun as our source of heat and light, but failing to mention that it is the ball of flame in the center of the solar system. The title of the book is not The Theory of Regular Expressions but Mastering Regular Expressions. Look for practical advice on crafting efficient regexes at your desk and not for conversation tidbits in the break room.--Dr. Dobb's Electronic Review of Computer Books
Read an Excerpt
Chapter 4: The Mechanics of Expression ProcessingNow that we have some background under our belt, let's delve into the mechanics of how a regex engine really goes about its work. Here we don't care much about the Shine and Finish of the previous chapter; this chapter is all about the engine and the drive train, the stuff that grease monkeys talk about in bars. We'll spend a fair amount of time under the hood, so expect to get a bit dirty with some practical hands-on experience.
Start Your Engines!
Let's see how much I can milk this engine analogy for. The whole point of having an engine is so that you can get from Point A to Point B without doing much work. The engine does the work for you so you can relax and enjoy the Rich Corinthian Leather. The engine's primary task is to turn the wheels, and how it does that isn't really a concern of yours. Or is it?
Two Kinds of Engines
Well, what if you had an electric car? They've been around for a long time, but they aren't as common as gas cars because they're hard to design well. If you had one, though, you would have to remember not to put gas in it. If you had a gasoline engine, well, watch out for sparks! An electric engine more or less just runs, but a gas engine might need some babysitting. You can get much better performance just by changing little things like your spark plug gaps, air filter, or brand of gas. Do it wrong and the engine's performance deteriorates, or, worse yet, it stalls.
Each engine might do its work differently, but the end result is that the wheels turn. You still have to steer properly if you want to get anywhere, but that's an entirely different issue.
Let's stoke the fire by adding another variable: the California Emissions Standards. Some engines adhere to California's strict pollution standards, and some engines don't. These aren't really different kinds of engines, just new variations on what's already around. The standard regulates a result of the engine's work, the emissions, but doesn't say one thing or the other about how the engine should go about achieving those cleaner results. So, our two classes of engine are divided into four types: electric (adhering and non-adhering) and gasoline (adhering and non-adhering).
I Come to think of it, I bet that an electric engine can qualify for the standard without much change, so it's not really impacted very much - the standard just "blesses" the clean results that are already par for the course. The gas engine, on the other hand, needs some major tweaking and a bit of re-tooling before it can qualify. Owners of this kind of engine need to pay particular care to what they feed it -use the wrong kind of gas and you're in big trouble in more ways than one.
The impact of standards
Better pollution standards are a good thing, but they require that the driver exercise more thought and foresight (well, at least for gas engines, as I noted in the previous paragraph). Frankly, however, the standard doesn't impact most people since all the other states still do their own thing and don't follow California's standard... yet. It's probably just a matter of time.
Okay, so you realize that these four types of engines can be classified into three groups (the two kinds for gas, and electric in general). You know about the differences, and that in the end they all still turn the wheels. What you don't know is what the heck this has to do with regular expressions!
More than you might imagine.
Regex Engine Types
There are two fundamentally different types of regex engines: one called "DFA" (the electric engine of our story) and one called "NFA" (the gas engine). The details follow shortly ( 101), but for now just consider them names, like Bill and Ted. Or electric and gas.
Both engine types have been around for a long time, but like its gasoline counterpart, the NFA type seems to be used more often. Tools that use an NFA engine include Tcl, Perl, Python, GNU Emacs, ed, sed, vi, most versions of grep, and even a few versions of egrep and awk. On the other hand, a DFA engine is found in almost all versions of egrep and awk, as well as lex and flex. Table 4-1 on the next page lists a few common programs available for a wide variety of platforms and the regex engine that most versions use. A generic version means that it's an old tool with many clones-I have listed notably different clones that I'm aware of.
As Chapter 3 illustrated, 20 years of development with both DFAs and NFAs resulted in a lot of needless variety. Things were dirty. The POSIX standard came in to clean things up by specifying clearly which metacharacters an engine should support, and exactly the results you could expect from them. Superficial details aside, the DFAs (our electric engines) were already well suited to adhere to the standard, but the kind of results an NFA traditionally provided were quite different from the new standard, so changes were needed. As a result, broadly speaking, we have three types of regex engines:
- DFA (POSIX or not-similar either way)
- Traditional NFA
- POSIX NFA
POSIX standardized the workings of over 70 programs, including traditional regex-wielding tools such as awk, ed, egrep, expr, grep, lex, and sed. Most of these tools' regex flavor had (and still have) the weak powers equivalent to a moped. So weak, in fact, that I don't find them interesting for discussing regular expressions. Although they can certainly be extremely useful tools, you won't find much mention of expr, ed, and sed in this book. Well, to be fair, some modern versions of these tools have been retrofitted with a more-powerful flavor. This is commonly done to grep, a direct regex sibling of sed, ed, and expr.
On the other hand, egrep, awk, and lex were normally implemented with the electric DFA engine, so the new standard primarily just confirmed the status quo-no big changes. However, there were some gas-powered versions of these programs which had to be changed if they wanted to be POSIX-compliant. The gas engines that passed the California Emission Standards tests (POSIX NFA) were fine in that they produced results according to the standard, but the necessary changes only enhanced their fickleness to proper tuning. Where before you might get by with slightly misaligned spark plugs, you now have absolutely no tolerance. Gasoline that used to be "good enough" now causes knocks and pings. But so long as you know how to maintain your baby, the engine runs smoothly. And cleanly....
Meet the Author
Jeffrey Friedl was raised in the countryside of Rootstown, Ohio, and had aspirations of being an astronomer until one day he noticed a TRS-80 Model I sitting unused in the corner of the chem lab (bristling with a full 16K of RAM, no less). He eventually began using Unix (and regular expressions) in 1980, and earned degrees in Computer Science from Kent (BS) and the University of New Hampshire (MS). He did kernel development for Omron Corporation in Kyoto, Japan for eight years before moving in 1997 to Silicon Valley to apply his regular-expression know-how to financial news and data for a little-known company called "Yahoo!"
When faced with the daunting task of filling his copious free time, Jeffrey enjoys playing Ultimate Frisbee and basketball with friends at Yahoo!, programming his house, and feeding the squirrels and jays in his back yard. He also enjoys spending time with his wife Fumie, and preparing for the Fall 2002 release of their first "software project" together.
and post it to your social network
Most Helpful Customer Reviews
See all customer reviews >