LEX and YACC

by Doug Brown, John Levine, Tony Mason

Overview

This book shows you how to use two Unix utilities, lex and yacc, in program development. These tools help programmers build compilers and interpreters, but they also have a wider range of applications. The second edition contains completely revised tutorial sections for novice users and reference sections for advanced users. This edition is twice the size of the first and has an expanded index. The following material has been added:

  • Each utility is explained in a chapter that covers basic usage and simple, stand-alone applications
  • How to implement a full SQL grammar, with full sample code
  • Major MS-DOS and Unix versions of lex and yacc are explored in depth, including AT&T lex and yacc, Berkeley yacc, Berkeley/GNU Flex, GNU Bison, MKS lex and yacc, and Abraxas PCYACC

Product Details

ISBN-13: 9781565920002
Publisher: O'Reilly Media, Incorporated
Publication date: 10/28/1992
Series: O'Reilly Nutshell Series
Edition description: Second Edition
Pages: 388
Product dimensions: 6.02(w) x 9.08(h) x 0.95(d)

Read an Excerpt


Chapter 3: Using Yacc

The previous chapter concentrated on lex alone. In this chapter we turn our attention to yacc, although we use lex to generate our lexical analyzers. Where lex recognizes regular expressions, yacc recognizes entire grammars. Lex divides the input stream into pieces (tokens) and then yacc takes these pieces and groups them together logically.
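
lex handles the token side of this division of labor. As a rough illustration, a minimal lex specification for the kind of calculator input used later in this chapter might look like the sketch below; the NUMBER and NAME token names and the yylval convention are assumptions here, borrowed from the parsers developed later in the chapter.


%{
#include <stdlib.h>
#include "y.tab.h"      /* token definitions generated by yacc -d */
extern int yylval;      /* value of the current token, assumed to be an int */
%}

%%
[0-9]+                  { yylval = atoi(yytext); return NUMBER; }
[a-zA-Z][a-zA-Z0-9]*    { return NAME; }
[ \t]                   ;                    /* skip blanks and tabs */
\n                      { return 0; }        /* newline marks logical end of input */
.                       { return yytext[0]; }  /* =, +, - pass through as themselves */
%%
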

In this chapter we create a desk calculator, starting with simple arithmetic, then adding built-in functions, user variables, and finally user-defined functions.

Grammars

Yacc takes a grammar that you specify and writes a parser that recognizes valid "sentences" in that grammar. We use the term "sentence" here in a fairly general way: for a C language grammar, the sentences are syntactically valid C programs.

As we saw in Chapter 1, a grammar is a series of rules that the parser uses to recognize syntactically valid input. For example, here is a version of the grammar we'll use later in this chapter to build a calculator.


      statement -> NAME = expression

      expression -> NUMBER + NUMBER
                 |  NUMBER - NUMBER


The vertical bar, "|", means there are two possibilities for the same symbol, i.e., an expression can be either an addition or a subtraction. The symbol to the left of the -> is known as the left-hand side of the rule, often abbreviated LHS, and the symbols to the right are the right-hand side, usually abbreviated RHS. Several rules may have the same left-hand side; the vertical bar is just a shorthand for this. Symbols that actually appear in the input and are returned by the lexer are terminal symbols or tokens, while those that appear on the left-hand side of some rule are non-terminal symbols or non-terminals. Terminal and non-terminal symbols must be different; it is an error to write a rule with a token on the left side.

The usual way to represent a parsed sentence is as a tree. For example, if we parsed the input "fred = 12 + 13" with this grammar, the tree would look like Figure 3-1. "12 + 13" is an expression, and "fred = expression" is a statement. A yacc parser doesn't actually create this tree as a data structure, although it is not hard to do so yourself.


(Figure 3-1: parse tree)
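
If you do want the tree as a data structure, the usual approach is to build it yourself in the rule actions. A minimal sketch in C, using a hypothetical node type and constructor (yacc generates nothing like this for you):


#include <stdlib.h>

/* A hypothetical parse-tree node; not part of anything yacc produces. */
struct node {
    char        *label;            /* e.g. "statement", "=", "12" */
    struct node *left, *right;
};

struct node *new_node(char *label, struct node *left, struct node *right)
{
    struct node *n = malloc(sizeof(struct node));
    n->label = label;
    n->left  = left;
    n->right = right;
    return n;
}


Each rule's action would then call something like new_node() to combine the pieces it has just recognized.
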

Every grammar includes a start symbol, the one that has to be at the root of the parse tree. In this grammar, statement is the start symbol.
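
In a yacc specification you can name the start symbol explicitly in the definition section; if you don't, yacc uses the left-hand side of the first rule. For this grammar the declaration would be:


%start statement
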

Recursive Rules

Rules can refer directly or indirectly to themselves; this important ability makes it possible to parse arbitrarily long input sequences. Let's extend our grammar to handle longer arithmetic expressions:


      expression -> NUMBER
                 |  expression + NUMBER
                 |  expression - NUMBER


Now we can parse a sequence like "fred = 14 + 23 - 11 + 7" by applying the expression rules repeatedly, as in Figure 3-2. Yacc can parse recursive rules very efficiently, so we will see recursive rules in nearly every grammar we use.
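
In yacc's own input syntax, with the operator characters written as quoted literals, the recursive rules take a form like this sketch:


expression:   NUMBER
            | expression '+' NUMBER
            | expression '-' NUMBER
            ;


Note that the recursion is on the left; yacc handles left-recursive rules like this in a bounded amount of parser stack space, which is one reason they are preferred.
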

Shift/Reduce Parsing

A yacc parser works by looking for rules that might match the tokens seen so far. When yacc processes a grammar, it creates a set of states, each of which reflects a possible position in one or more partially parsed rules. As the parser reads tokens, each time it reads a token that doesn't complete a rule it pushes the token onto an internal stack and switches to a new state reflecting the token it just read. This action is called a shift. When it has found all the symbols that constitute the right-hand side of a rule, it pops the right-hand side symbols off the stack, pushes the left-hand side symbol onto the stack, and switches to a new state reflecting the new symbol on the stack. This action is called a reduction, since it usually reduces the number of items on the stack. (Not always, since it is possible to have rules with empty right-hand sides.) Whenever the parser reduces a rule, it executes user code associated with that rule. This is how you actually do something with the material that the parser parses.
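
For example, the reductions for the recursive expression rules might carry actions like the following sketch (not the book's calculator code); $$ is the value of the left-hand side symbol, and $1 and $3 are the values of the first and third right-hand side symbols:


expression:   expression '+' NUMBER   { $$ = $1 + $3; }
            | expression '-' NUMBER   { $$ = $1 - $3; }
            | NUMBER                  { $$ = $1; }
            ;
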

Let's look at how it parses the input "fred = 12 + 13" using the simple rules in Figure 3-1. The parser starts by shifting tokens onto the internal stack one at a time:


fred
fred =
fred = 12
fred = 12 +
fred = 12 + 13

(Figure 3-2: A parse using recursive rules)

At this point it can reduce the rule "expression -> NUMBER + NUMBER" so it pops the 12, the plus, and the 13 from the stack and replaces them with expression:


fred = expression

Now it reduces the rule "statement -> NAME = expression", so it pops fred, =, and expression and replaces them with statement. We've reached the end of the input and the stack has been reduced to the start symbol, so the input was valid according to the grammar.

What Yacc Cannot Parse

Although yacc's parsing technique is general, you can write grammars which yacc cannot handle. It cannot deal with ambiguous grammars, ones in which the same input can match more than one parse tree. It also cannot deal with grammars that need more than one token of lookahead to tell whether it has matched a rule. Consider this extremely contrived example:


      phrase -> cart_animal AND CART
              | work_animal AND PLOW

      cart_animal -> HORSE | GOAT

      work_animal -> HORSE | OX

This grammar isn't ambiguous, since there is only one possible parse tree for any valid input, but yacc can't handle it because it requires two symbols of lookahead. In particular, in the input "HORSE AND CART" it cannot tell whether HORSE is a cart_animal or a work_animal until it sees CART, and yacc cannot look that far ahead.
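
Written out in yacc's input syntax, the grammar would look roughly like the following sketch. Rather than silently producing a wrong parser, yacc should report a conflict for it: after shifting HORSE and seeing AND, the parser cannot decide whether to reduce HORSE to cart_animal or to work_animal.


%token HORSE GOAT OX AND CART PLOW
%%
phrase:       cart_animal AND CART
            | work_animal AND PLOW
            ;
cart_animal:  HORSE
            | GOAT
            ;
work_animal:  HORSE
            | OX
            ;
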

If we changed the first rule to this:

      phrase -> cart_animal CART
              | work_animal PLOW


yacc would have no trouble, since it can look one token ahead to see whether an input of HORSE is followed by CART, in which case the horse is a cart_animal, or by PLOW, in which case it is a work_animal.

In practice, these rules are not as complex and confusing as they may seem here. One reason is that yacc knows exactly what grammars it can parse and what it cannot. If you give it one that it cannot handle, it will tell you, so there is no problem of overcomplex parsers silently failing. Another reason is that the grammars that yacc can handle correspond pretty well to ones that people really write. As often as not, a grammatical construct that confuses yacc will confuse people as well, so if you have some latitude in your language design you should consider changing the language to make it more understandable both to yacc and to its users.

For more information on shift/reduce parsing, see Chapter 8. For a discussion of what yacc has to do to turn your specification into a working C program, see the classic compiler text by Aho, Sethi, and Ullman, Compilers: Principles, Techniques and Tools, Addison-Wesley, 1986, often known as the "dragon book" because of the cover illustration.

A Yacc Parser

A yacc grammar has the same three-part structure as a lex specification. (Lex copied its structure from yacc.) The first section, the definition section, handles control information for the yacc-generated parser (from here on we will call it the parser), and generally sets up the execution environment in which the parser will operate. The second section contains the rules for the parser, and the third section is C code copied verbatim into the generated C program.
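
Schematically, the three sections are laid out like this (a sketch of the layout only, not a grammar from the book):


%{
/* literal block: C declarations copied near the top of the generated parser */
%}
/* rest of the definition section: %token, %start, and similar declarations */

%%

/* rules section: grammar rules and their C actions */

%%

/* user subroutines section: C code copied verbatim into the generated parser */
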

We'll first write a parser for the simplest grammar, the one in Figure 3-1, and then extend it to be more useful and realistic.

The Definition Section

The definition section includes declarations of the tokens used in the grammar, the types of values used on the parser stack, and other odds and ends. It can also include a literal block, C code enclosed in %{ %} lines. We start our first parser by declaring two symbolic tokens.


%token NAME NUMBER

You can use single-quoted characters as tokens without declaring them, so we don't need to declare "=", "+", or "-"....
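
Putting these pieces together, a specification for the simple grammar of Figure 3-1 would look something like the following sketch (the excerpt is cut short here; the book goes on to develop its own, fuller version):


%token NAME NUMBER

%%
statement:    NAME '=' expression
            ;

expression:   NUMBER '+' NUMBER
            | NUMBER '-' NUMBER
            ;
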

Meet the Author

Doug Brown is a consultant/contractor in Beaverton, Oregon. He has been developing software for circuit simulation, synthesis, and testing since 1977. Doug coauthored lex & yacc, an O'Reilly & Associates Nutshell Handbook. He received an M.S. in electrical engineering from the University of Illinois at Urbana-Champaign in 1976.

John R. Levine writes, lectures, and consults on Unix and compiler topics. He moderates the comp.compilers discussion group on Usenet. He worked on the Unix versions of Lotus 1-2-3 and the Norton Utilities and was one of the architects of AIX for the IBM RT PC. He received a Ph.D. in computer science from Yale in 1984.

Tony Mason is currently a member of the AFS development team at Transarc Corporation, a small start-up company specializing in distributed systems software. Previously, he worked with the Distributed Systems Group at Stanford University in the area of distributed operating systems and data communications. He received a B.S. in mathematics from the University of Chicago in 1987.
