Context-free grammars
Introduction: Like regular expressions, context-free grammars describe sets of strings, i.e., languages. Additionally, a context-free grammar also defines structure on the strings in the language it defines. A language is defined over some alphabet, for example the set of tokens produced by a lexer or the set of alphanumeric characters. The symbols in the alphabet are called terminals.
A context-free grammar recursively defines several sets of strings. Each set is denoted by a name, which is called a non terminal. The set of non terminals is disjoint from the set of terminals. One of the non terminals are chosen to denote the language described by the grammar. This is called the start symbol of the grammar. The sets are described by a number of productions. Each production describes some of the possible strings that are contained in the set denoted by a non-terminal. A production has the form
N -> X1 . . . . Xn
where N is a non-terminal and X1 : : :Xn are zero or more symbols, each of which is either a terminal or a non-terminal. The intended meaning of this notation is to say that the set denoted by N contains strings that are obtained by concatenating strings from the sets denoted by X1 . . . . . Xn. In this setting, a terminal denotes a singleton set, just like alphabet characters in regular expressions. We will, when no confusion is likely, equate a non-terminal with the set of strings it denotes some examples:
A-> a
says that the set denoted by the non-terminal A contains the one-character string a.
A-> aA
says that the set denoted by A contains all strings formed by putting an a in front of a string taken from the set denoted by A. Together, these two productions indicate that A contains all non-empty sequences of as and is hence (in the absence of other productions) equivalent to the regular expression a .
We can define a grammar equivalent to the regular expression a* by the two productions
B ->
B -> aB
where the first production indicates that the empty string is part of the set B.
Productions with empty right-hand sides are called empty productions. These are sometimes written with an e on the right hand side instead of leaving it empty. So far, we have not described any set that could not just as well have been described using regular expressions. Context-free grammars are, however, capable of expressing much more complex languages. In section 2.10, we noted that the language {anbn | n ≥ 0} is not regular. It is, however, easily described by the grammar
S ->
S -> aSb
The second production ensures that the as and bs are paired symmetrically around the middle of the string, so they occur in equal number. The examples above have used only one non-terminal per grammar. When several non-terminals are used, we must make it clear which of these is the start symbol. By convention (if nothing else is stated), the non-terminal on the left-hand side of the first production is the start symbol. As an example, the grammar
T -> R
T -> aTa
R -> b
R -> bR
has T as start symbol and denotes the set of strings that start with any number of as followed by a non-zero number of bs and then the same number of as with which it started.
Sometimes, a shorthand notation is used where all the productions of the same non-terminal are combined to a single rule, using the alternative symbol (j) from regular expressions to separate the right-hand sides. In this notation, the above grammar would read
T-> R | aTa
R -> b | bR
There are still four productions in the grammar, even though the arrow symbol -> is only used twice.
Each sub-expression of the regular expression is numbered and sub-expression si is assigned a non-terminal Ni. The productions for Ni depend on the shape of si as shown in the table above.