Using LR-parser generators
Introduction:
Most LR-parser generators use an extended version of the SLR construction called LALR (1). The “LA” in the abbreviation is short for “look ahead” and the (1) indicates that the look ahead is one symbol, i.e., the next input symbol.
We have chosen to present the SLR construction instead of the LALR(1) construction for several reasons:
- It is simpler.
- In practice, LALR(1) handles only a few more grammars than SLR.
- When a grammar is in the SLR class, the parse-table produced by an SLR parser generator is identical to the table produced by an LALR(1) parser generator.
- Understanding of SLR principles is sufficient to know how to handle a grammar rejected by a LALR(1) parser generator by adding precedence declarations or by rewriting the grammar.
In short, knowledge of SLR parsing is sufficient when using LALR(1) parser generators.
Most LR-parser generators organize their input in several sections:
- Declarations of the terminals and non-terminals used.
- Declaration of the start symbol of the grammar.
- Declarations of operator precedence.
- The productions of the grammar.
- Declaration of various auxiliary functions and data-types used in the actions (see below).
Declarations and actions
Each nonterminal and terminal is declared and associated with a data-type. For a terminal, the data-type is used to hold the values that are associated with the tokens that come from the lexer, e.g., the values of numbers or names of identifiers. For a nonterminal, the type is used for the values that are built for the non-terminals during parsing (at reduce-actions).
While, conceptually, parsing a string produces a syntax tree for that string, parser generators usually allow more control over what is actually produced. This is done by assigning an action to each production. The action is a piece of program text that is used to calculate the value of a production that is being reduced by using the values associated with the symbols on the right-hand side. For example, by putting appropriate actions on each production, the numerical value of an expression may be calculated as the result of parsing the expression. Indeed, compilers can be made such that the value produced during parsing is the compiled code of a program. For all but the simplest compilers it is, however, better to build some kind of syntax representation during parsing and then later operate on this representation.
The syntax trees described in section 3.3.1 are not always optimally suitable for compilation. They contain a lot of redundant information: Parentheses, keywords used for grouping purposes only, and so on. They also reflect structures in the grammar that are only introduced to eliminate ambiguity or to get the grammar accepted by a parser generator (such as left-factorisation or elimination of left recursion).
Hence, abstract syntax is commonly used. Abstract syntax keeps the essence of the structure of the text but omits the irrelevant details. An abstract syntax tree is a tree structure where each node corresponds to one or more nodes in the (concrete) syntax tree. For example, the concrete syntax tree shown in figure 3.12 may be represented by the following abstract syntax tree:
Here the names PlusExp, MulExp and NumExp may be constructors in a data-type, they may be elements from an enumerated type used as tags in a union-type or they may be names of subclasses of an Exp class. The names indicate which production is chosen, so there is no need to keep the subtrees that are implied by the choice of production, such as the subtree from figure 3.12 that holds the symbol . Likewise, the sequence of nodes Exp, Exp2, Exp3, 2 at the left of figure 3.12 are combined to a single node NumExp(2) that includes both the choice of productions for Exp, Exp2 and Exp3 and the value of the terminal node. In short, each node in the abstract syntax tree corresponds to one or more nodes in the concrete syntax tree.