Syntax analysis and Predictive parsing
Syntax analysis:
The syntax analysis phase of a compiler will take a string of tokens produced by the lexer, and from this construct a syntax tree for the string by finding a derivation of the string from the start symbol of the grammar.
This can be done by guessing derivations until the right one is found, but random guessing is hardly an effective method. Even so, some parsing techniques are based on “guessing” derivations. However, these make sure, by looking at the string, that they will always guess right. These are called predictive parsing methods. Predictive parsers always build the syntax tree from the root down to the leaves and are hence also called (deterministic) top-down parsers.
Other parsers go the other way: They search for parts of the input string that matches right-hand sides of productions and rewrite these to the left-hand non-terminals, at the same time building pieces of the syntax tree. The syntax tree is eventually completed when the string has been rewritten (by inverse derivation) to the start symbol. Also here, we wish to make sure that we always pick the “right” rewrites, so we get deterministic parsing. Such methods are called bottom-up parsing methods.
We will in the next sections first look at predictive parsing and later at a bottom up parsing method called SLR parsing.
Predictive parsing:
If we look at the left-derivation in string aabbbcc, we see that, to the left of the rewritten non terminals, there are only terminals. These terminals correspond to a prefix of the string that is being parsed. In a parsing situation, this prefix will be the part of the input that has already been read. The job of the parser is now to choose the production by which the leftmost unexpanded non terminal should be rewritten. Our aim is to be able to make this choice deterministically based on the next unmatched input symbol.
If we look at the third line in string aabbbcc, we have already read two as and (if the input string is the one shown in the bottom line) the next symbol is a b. Since the right-hand side of the production
T -> aTc
starts with an a, we obviously cannot use this. Hence, we can only rewrite T using the production
T -> R
We are not quite as lucky in the next step. None of the productions for R start with a terminal symbol, so we can not immediately choose a production based on this. As the grammar is ambiguous, it should not be a surprise that we cannot always choose uniquely. If we instead use the unambiguous grammar we can immediately choose the second production for R. When all the bs are read and we are at the following c, we choose the empty production for R and match the remaining input with the rest of the derived string.
If we can always choose a unique production based on the next input symbol, we are able to do predictive parsing without backtracking.